Newsgroups: comp.os.386bsd.bugs
Path: sserve!newshost.anu.edu.au!munnari.oz.au!news.Hawaii.Edu!ames!agate!doc.ic.ac.uk!pipex!uknet!mcsun!sun4nl!eur.nl!pk
From: pk@cs.few.eur.nl (Paul Kranenburg)
Subject: VM deadlocks
Message-ID: <1993Apr20.073502.24319@cs.few.eur.nl>
Keywords: virtual memory, deadlock
Sender: news@cs.few.eur.nl
Reply-To: pk@cs.few.eur.nl
Organization: Erasmus University Rotterdam
Date: Tue, 20 Apr 1993 07:35:02 GMT
Lines: 243

A few days ago, I mentioned two scenarios that may bring the kernel into
a deadlocked state. I'll elaborate on them a bit.

Both deadlocks result from VM objects with dirty pages entering the object
cache. Objects enter the object cache if their reference count drops to
zero and they have the 'can_persist' bit set. Normally, the persist bit is
turned off when the object is going to be written to. But there appear to
be circumstances under which the bit is not cleared when it ought to be.
One such case arises when one of the object's resident pages is written
behind the VM system's back (e.g. by exploiting another bug, such as
running "main() { read(0, 0, 1); }"). Another example (see code below)
involves mmap()ing a file read-only, and then using mprotect() to enable
writing on the mapped addresses.

Let's consider the first case. The program's text is mapped read-only by
the execve(2) system call. The program bug causes a protection violation;
the kernel bug causes that violation to be missed (the read() completes,
modifying the text segment). The program exits, the text object's refcount
goes to zero and the object enters the object cache. All of the object's
pages are put on the inactive list, during which the pmap module is
consulted. The hardware has noticed the inadvertent write to one of the
pages, so this page's clean bit is now turned off.

This dirty page now hangs around on the inactive list for a bit, until the
pageout daemon decides it is time for this page to be recycled. Since the
page is not clean, a pager (in this case the vnode pager) is called upon
to write the page's contents to its proper backing store. The VFS write
routine is called to do the dirty work. This routine decides that the
associated VM object can no longer be cached and notifies the VM object
layer of this fact. Since the refcount was already zero, the object gets
destroyed here and now. But then it is noticed that a paging operation is
in progress on this object (this is in vm_object_terminate()) and the
current process is put to sleep waiting for the pageout to finish.
Remember that all this takes place in the context of the pager process.
The result is a paging daemon that will sleep forever.

In the second case, a similar sequence of events leads to a deadlock on an
IO buffer. I'll be brief this time:

- mmap(..., PROT_READ, ...) marks the VM object as cacheable
- next, mprotect(..., PROT_WRITE) fails to undo that
- at process exit, the object enters the cache with dirty pages
- another process comes along, opens the same file and writes it
- VOP_WRITE() wants to uncache the object still associated with the
  vnode; unfortunately, ufs_write() does so only after acquiring an IO
  buffer with balloc()
- vm_object_terminate() wants to clean the dirty pages of the object, so
  it calls the pager's put routine, which in turn calls VOP_WRITE()
- ufs_write() is thus entered again (recursively) and wants the IO buffer
  again, which is now locked
- the process is put to sleep in bread(), never to wake up again

The first case clearly demonstrates that it takes only one misbehaving
process to bring the pageout daemon to its knees.
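For concreteness, here is a compilable version of the one-liner quoted
above. This is only a sketch: it is harmless on a kernel where read()
properly checks access to the user buffer, and triggers the scenario only
where that check is missing.

	/*
	 * 'Case 1' trigger (sketch).  read(2) is asked to deposit one
	 * byte at address 0, i.e. inside the read-only program text; a
	 * kernel lacking the access check completes the read and
	 * silently dirties a text page.  Feed it one byte on stdin.
	 */
	#include <unistd.h>

	main()
	{
		read(0, (char *)0, 1);
		return 0;
	}

When such a program exits, the dirtied text page goes into the object
cache along with its 'can_persist' object, and events unfold as described
above.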
Once the pageout daemon is wedged, it won't take very long before you're
left gazing at a not-so-responsive screen.

What can be done about this? I propose the patch to vm/vm_object.c below
as a safeguard against dirty objects that want to go into the object
cache. Other than this, the kernel code must be carefully screened so that
the 'can_persist' field gets turned off in a timely manner whenever an
object loses its read-only status (vm_map_protect() is a candidate, as is
vm_fault()); a rough sketch of such a check appears after the vm_object.c
patch below. I've also changed ufs_write() to take care not to call
vnode_uncache() while it has buffers locked.

-pk

-----------------------------------------------------------------------
This program demonstrates 'case 2'; run it twice in a row and watch its
ps(1) status go to 'D'.

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define SIZE	4096

struct stat statbuf;

main()
{
	char	*ad;
	int	i, j;
	int	fd;

	fd = open("xxx", O_RDWR|O_CREAT, 0666);
	if (fd == -1) {
		perror("open");
		exit(1);
	}
#if 0
	if (ftruncate(fd, SIZE) < 0)
		perror("ftruncate");
#else
	/* Extend the file to SIZE bytes by writing its last byte */
	if (lseek(fd, SIZE-1, 0) < 0)
		perror("lseek");
	if (write(fd, "", 1) != 1)
		perror("write");
#endif
	if (fstat(fd, &statbuf) < 0)
		perror("stat");
	printf("statbuf.st_size = %d\n", statbuf.st_size);

	/* Map read-only, then upgrade to read/write with mprotect() */
	ad = mmap(0, SIZE, PROT_READ, MAP_FILE|MAP_SHARED, fd, 0);
	if ((int)ad == -1) {
		perror("mmap");
		exit(0);
	}
	if (mprotect(ad, SIZE, PROT_READ|PROT_WRITE) < 0)
		perror("mprotect");

	/* Dirty every page of the mapping */
	for (j = 0; j < SIZE; j++)
		ad[j] = 1;

	/* munmap(ad, SIZE); */
	printf("Done\n");
	return 0;
}

---------------------------------------------------------------------------
Kernel patches to prevent VM object cache related deadlocks.

------- ufs_vnops.c -------
*** /tmp/da12474	Tue Apr 20 02:54:41 1993
--- ufs/ufs_vnops.c	Sat Apr 17 11:55:43 1993
***************
*** 564,569 ****
--- 564,571 ----
  	flags = 0;
  	if (ioflag & IO_SYNC)
  		flags = B_SYNC;
+ 
+ 	(void) vnode_pager_uncache(vp);
  	do {
  		lbn = lblkno(fs, uio->uio_offset);
  		on = blkoff(fs, uio->uio_offset);
***************
*** 580,586 ****
  			vnode_pager_setsize(vp, ip->i_size);
  		}
  		size = blksize(fs, ip, lbn);
- 		(void) vnode_pager_uncache(vp);
  		n = MIN(n, size - bp->b_resid);
  		error = uiomove(bp->b_un.b_addr + on, n, uio);
  		if (ioflag & IO_SYNC)
--- 582,587 ----

------- nfs_bio.c -------
*** /tmp/da12489	Tue Apr 20 02:55:56 1993
--- nfs/nfs_bio.c	Sat Apr 17 12:01:48 1993
***************
*** 248,253 ****
--- 248,254 ----
  	 */
  	biosize = VFSTONFS(vp->v_mount)->nm_rsize;
  	np->n_flag |= NMODIFIED;
+ 	vnode_pager_uncache(vp);
  	do {
  		nfsstats.biocache_writes++;
  		lbn = uio->uio_offset / biosize;

------- vm_object.c -------
*** /tmp/da12499	Tue Apr 20 02:57:08 1993
--- vm/vm_object.c	Mon Apr 19 22:53:53 1993
***************
*** 248,254 ****
--- 248,274 ----
  	 */
  	if (object->can_persist) {
+ 		register vm_page_t	p;
+ 		/*
+ 		 * Check for dirty pages in object
+ 		 * Print warning as this may signify kernel bugs
+ 		 * pk@cs.few.eur.nl - 4/15/93
+ 		 */
+ 		p = (vm_page_t) queue_first(&object->memq);
+ 		while (!queue_end(&object->memq, (queue_entry_t) p)) {
+ 			VM_PAGE_CHECK(p);
+ 
+ 			if (pmap_is_modified(VM_PAGE_TO_PHYS(p)) ||
+ 							!p->clean) {
+ 
+ 				printf("vm_object_dealloc: persistent object %x isn't clean\n", object);
+ 				goto cant_persist;
+ 			}
+ 
+ 			p = (vm_page_t) queue_next(&p->listq);
+ 		}
+ 
  		queue_enter(&vm_object_cached_list, object,
  			vm_object_t, cached_list);
  		vm_object_cached++;
***************
*** 260,265 ****
--- 280,286 ----
  		vm_object_cache_trim();
  		return;
  	}
+ cant_persist:;
 
  	/*
  	 * Make sure no one can look us up now.
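As promised above, here is a rough sketch of the 'can_persist' screening
that vm_map_protect() could perform when write permission is added to a
mapping. This is an untested illustration, not one of the patches; the
map entry layout and locking calls follow the surrounding Mach-derived VM
code, but treat the details as assumptions:

	/*
	 * Untested sketch: when vm_map_protect() grants VM_PROT_WRITE
	 * on a map entry backed by a VM object, make sure that object
	 * can no longer linger in the object cache with dirty pages.
	 */
	if ((new_prot & VM_PROT_WRITE) &&
	    entry->object.vm_object != NULL) {
		vm_object_lock(entry->object.vm_object);
		entry->object.vm_object->can_persist = FALSE;
		vm_object_unlock(entry->object.vm_object);
	}

A similar clause on the write-fault path in vm_fault() would cover the
protection upgrade that the mprotect() demo above exploits.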
------- vm_page.c -------
*** /tmp/da12539	Tue Apr 20 03:02:23 1993
--- vm/vm_page.c	Thu Apr  8 16:48:52 1993
***************
*** 555,560 ****
--- 555,568 ----
  	spl = splimp();				/* XXX */
  	simple_lock(&vm_page_queue_free_lock);
+ 	if (	object != kernel_object &&
+ 		object != kmem_object &&
+ 		vm_page_free_count <= vm_page_free_reserved) {
+ 
+ 		simple_unlock(&vm_page_queue_free_lock);
+ 		splx(spl);
+ 		return(NULL);
+ 	}
  	if (queue_empty(&vm_page_queue_free)) {
  		simple_unlock(&vm_page_queue_free_lock);
  		splx(spl);
---------------------------------------------------------------------------
--