Newsgroups: comp.os.386bsd.bugs
Path: sserve!newshost.anu.edu.au!munnari.oz.au!news.Hawaii.Edu!ames!agate!doc.ic.ac.uk!pipex!uknet!mcsun!sun4nl!eur.nl!pk
From: pk@cs.few.eur.nl (Paul Kranenburg)
Subject: VM deadlocks
Message-ID: <1993Apr20.073502.24319@cs.few.eur.nl>
Keywords: virtual memory, deadlock
Sender: news@cs.few.eur.nl
Reply-To: pk@cs.few.eur.nl
Organization: Erasmus University Rotterdam
Date: Tue, 20 Apr 1993 07:35:02 GMT
Lines: 243

A few days ago, I mentioned two scenarios that may bring the kernel into
a deadlocked state. I'll elaborate on them a bit.

Both deadlocks result from VM objects with dirty pages entering the object
cache. Objects enter the object cache if their reference count drops to
zero and they have the 'can_persist' bit set. Normally, the persist bit is
turned off when the object is going to be written to. But there appear to
be circumstances under which the bit is not cleared when it ought to be.
One such case arises when one of the object's resident pages is written
behind the VM system's back (e.g. by exploiting another bug, such as
running "main() { read(0, 0, 1); }"). Another example (see code below)
involves mmap()ing a file read-only, and then using mprotect() to enable
writing on the mapped addresses.

Let's consider the first case. The program's text is mapped read-only by
the execve(2) system call. The program bug causes a protection violation;
the kernel bug causes that violation to be missed (the read() completes,
modifying the text segment). The program exits, the text object's refcount
goes to zero and the object enters the object cache. All of the object's
pages are put on the inactive list, during which the pmap module is
consulted. The hardware has noticed the inadvertent write to one of the
pages, so this page's clean bit is now turned off.

This dirty page now hangs around on the inactive list for a bit, until the
pageout daemon decides it is time for this page to be recycled. Since the
page is not clean, a pager (in this case the vnode pager) is called upon
to write the page's contents to its proper backing store. The VFS write
routine is called to do the dirty work. This routine decides that the
associated VM object can no longer be cached and notifies the VM object
layer of this fact. Since the refcount was already zero, the object gets
destroyed here and now. But then it is noticed that a paging operation is
in progress on this object (this is in vm_object_terminate()) and the
current process is put to sleep waiting for the pageout to finish.
Remember that all this takes place in the context of the pager process.
The result is a paging daemon that will sleep forever.

In the second case, a similar sequence of events leads to a deadlock on an
IO buffer. I'll be brief this time:

- mmap(..., PROT_READ, ...) marks the VM object as cacheable
- next, mprotect(..., PROT_WRITE) fails to undo that
- at process exit, the object enters the cache with dirty pages
- another process comes along, opens the same file and writes it
- VOP_WRITE() wants to uncache the object still associated with the
  vnode; unfortunately, ufs_write() does so only after acquiring an IO
  buffer with balloc()
- vm_object_terminate() wants to clean the dirty pages of the object, so
  it calls the pager's put routine, which in turn calls VOP_WRITE()
- ufs_write() is thus entered again (recursively) and wants the IO buffer
  again, which is now locked
- the process is put to sleep in bread(), never to wake up again

The first case clearly demonstrates that it takes only one misbehaving
process to bring the pageout daemon to its knees.
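For concreteness, here is a compilable version of the one-liner quoted
above. This is only a sketch: it is harmless on a kernel where read()
properly checks access to the user buffer, and triggers the scenario only
where that check is missing.

	/*
	 * 'Case 1' trigger (sketch).  read(2) is asked to deposit one
	 * byte at address 0, i.e. inside the read-only program text; a
	 * kernel lacking the access check completes the read and
	 * silently dirties a text page.  Feed it one byte on stdin.
	 */
	#include <unistd.h>

	main()
	{
		read(0, (char *)0, 1);
		return 0;
	}

When such a program exits, the dirtied text page goes into the object
cache along with its 'can_persist' object, and events unfold as described
above.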
Once the pageout daemon is wedged, it won't take very long before you're
left gazing at a not-so-responsive screen.

What can be done about this? I propose the patch to vm/vm_object.c below
as a safeguard against dirty objects that want to go into the object
cache. Other than this, the kernel code must be carefully screened so that
the 'can_persist' field gets turned off in a timely manner whenever an
object loses its read-only status (vm_map_protect() is a candidate, as is
vm_fault()); a rough sketch of such a check appears after the vm_object.c
patch below. I've also changed ufs_write() to take care not to call
vnode_uncache() while it has buffers locked.

-pk

-----------------------------------------------------------------------
This program demonstrates 'case 2'; run it twice in a row and watch its
ps(1) status go to 'D'.

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define SIZE	4096

struct stat statbuf;

main()
{
	char	*ad;
	int	i, j;
	int	fd;

	fd = open("xxx", O_RDWR|O_CREAT, 0666);
	if (fd == -1) {
		perror("open");
		exit(1);
	}
#if 0
	if (ftruncate(fd, SIZE) < 0)
		perror("ftruncate");
#else
	/* Extend the file to SIZE bytes by writing its last byte */
	if (lseek(fd, SIZE-1, 0) < 0)
		perror("lseek");
	if (write(fd, "", 1) != 1)
		perror("write");
#endif
	if (fstat(fd, &statbuf) < 0)
		perror("stat");
	printf("statbuf.st_size = %d\n", statbuf.st_size);

	/* Map read-only, then upgrade to read/write with mprotect() */
	ad = mmap(0, SIZE, PROT_READ, MAP_FILE|MAP_SHARED, fd, 0);
	if ((int)ad == -1) {
		perror("mmap");
		exit(0);
	}
	if (mprotect(ad, SIZE, PROT_READ|PROT_WRITE) < 0)
		perror("mprotect");

	/* Dirty every page of the mapping */
	for (j = 0; j < SIZE; j++)
		ad[j] = 1;

	/* munmap(ad, SIZE); */
	printf("Done\n");
	return 0;
}

---------------------------------------------------------------------------
Kernel patches to prevent VM object cache related deadlocks.

------- ufs_vnops.c -------
*** /tmp/da12474	Tue Apr 20 02:54:41 1993
--- ufs/ufs_vnops.c	Sat Apr 17 11:55:43 1993
***************
*** 564,569 ****
--- 564,571 ----
  	flags = 0;
  	if (ioflag & IO_SYNC)
  		flags = B_SYNC;
+ 
+ 	(void) vnode_pager_uncache(vp);
  	do {
  		lbn = lblkno(fs, uio->uio_offset);
  		on = blkoff(fs, uio->uio_offset);
***************
*** 580,586 ****
  			vnode_pager_setsize(vp, ip->i_size);
  		}
  		size = blksize(fs, ip, lbn);
- 		(void) vnode_pager_uncache(vp);
  		n = MIN(n, size - bp->b_resid);
  		error = uiomove(bp->b_un.b_addr + on, n, uio);
  		if (ioflag & IO_SYNC)
--- 582,587 ----

------- nfs_bio.c -------
*** /tmp/da12489	Tue Apr 20 02:55:56 1993
--- nfs/nfs_bio.c	Sat Apr 17 12:01:48 1993
***************
*** 248,253 ****
--- 248,254 ----
  	 */
  	biosize = VFSTONFS(vp->v_mount)->nm_rsize;
  	np->n_flag |= NMODIFIED;
+ 	vnode_pager_uncache(vp);
  	do {
  		nfsstats.biocache_writes++;
  		lbn = uio->uio_offset / biosize;

------- vm_object.c -------
*** /tmp/da12499	Tue Apr 20 02:57:08 1993
--- vm/vm_object.c	Mon Apr 19 22:53:53 1993
***************
*** 248,254 ****
--- 248,274 ----
  	 */
  	if (object->can_persist) {
+ 		register vm_page_t	p;
+ 		/*
+ 		 * Check for dirty pages in object
+ 		 * Print warning as this may signify kernel bugs
+ 		 * pk@cs.few.eur.nl - 4/15/93
+ 		 */
+ 		p = (vm_page_t) queue_first(&object->memq);
+ 		while (!queue_end(&object->memq, (queue_entry_t) p)) {
+ 			VM_PAGE_CHECK(p);
+ 
+ 			if (pmap_is_modified(VM_PAGE_TO_PHYS(p)) ||
+ 							!p->clean) {
+ 
+ 				printf("vm_object_dealloc: persistent object %x isn't clean\n", object);
+ 				goto cant_persist;
+ 			}
+ 
+ 			p = (vm_page_t) queue_next(&p->listq);
+ 		}
+ 
  		queue_enter(&vm_object_cached_list, object,
  			vm_object_t, cached_list);
  		vm_object_cached++;
***************
*** 260,265 ****
--- 280,286 ----
  		vm_object_cache_trim();
  		return;
  	}
+ cant_persist:;
 
  	/*
  	 * Make sure no one can look us up now.
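As promised above, here is a rough sketch of the 'can_persist' screening
that vm_map_protect() could perform when write permission is added to a
mapping. This is an untested illustration, not one of the patches; the
map entry layout and locking calls follow the surrounding Mach-derived VM
code, but treat the details as assumptions:

	/*
	 * Untested sketch: when vm_map_protect() grants VM_PROT_WRITE
	 * on a map entry backed by a VM object, make sure that object
	 * can no longer linger in the object cache with dirty pages.
	 */
	if ((new_prot & VM_PROT_WRITE) &&
	    entry->object.vm_object != NULL) {
		vm_object_lock(entry->object.vm_object);
		entry->object.vm_object->can_persist = FALSE;
		vm_object_unlock(entry->object.vm_object);
	}

A similar clause on the write-fault path in vm_fault() would cover the
protection upgrade that the mprotect() demo above exploits.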
------- vm_page.c -------
*** /tmp/da12539	Tue Apr 20 03:02:23 1993
--- vm/vm_page.c	Thu Apr  8 16:48:52 1993
***************
*** 555,560 ****
--- 555,568 ----
  	spl = splimp();				/* XXX */
  	simple_lock(&vm_page_queue_free_lock);
+ 	if (	object != kernel_object &&
+ 		object != kmem_object &&
+ 		vm_page_free_count <= vm_page_free_reserved) {
+ 
+ 		simple_unlock(&vm_page_queue_free_lock);
+ 		splx(spl);
+ 		return(NULL);
+ 	}
  	if (queue_empty(&vm_page_queue_free)) {
  		simple_unlock(&vm_page_queue_free_lock);
  		splx(spl);
---------------------------------------------------------------------------
--