Newsgroups: comp.unix.bsd
Path: sserve!manuel.anu.edu.au!munnari.oz.au!sgiblab!zaphod.mps.ohio-state.edu!wupost!cs.utexas.edu!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: Repeat of the question about VFS and VOP_SEEK()
Message-ID: <1992Oct27.181215.23644@fcom.cc.utah.edu>
Keywords: VOP_SEEK VOP_READ VOP_WRITE VOP_RDWR
Sender: news@fcom.cc.utah.edu
Organization: Weber State University (Ogden, UT)
References: <BwLp9z.8J2@flatlin.ka.sub.org> <1992Oct25.121136.26473@fcom.cc.utah.edu> <1992Oct26.213408.21184@Veritas.COM>
Date: Tue, 27 Oct 92 18:12:15 GMT
Lines: 161

In article <1992Oct26.213408.21184@Veritas.COM> craig@Veritas.COM (Craig Harmer) writes:
>In article <1992Oct25.121136.26473@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
>}    An ASSERT() is used to insure the behaviour conforms to the
>}    agreed upon [in POSIX 1003.1-1988] vnode interface regarding
>}    the preservation of atomicity in reads and writes.  This
>}    necessarily disallows calls to ufs_rdwr(), since the ufs_ilock()
>}    there would then become recursive.
>}
>}Clearly, if we can agree that POSIX compliant behaviour is what mandates the
>}atomicity of reads and writes (the part I inserted and put in brackets), then
>}we can agree that POSIX behaviour mandated the split.
>
>i don't see how atomicity guarantees demand separate read/write
>interfaces.  imagine this code:
>
>ufs_rdwr(vp, uiop, type)
>    struct vnode *vp;
>    struct uio *uiop;
>    int type;
>{
>    ufs_ilock(VTOI(vp));
>
>    if (type == READ) {
>        ufs_read(vp, uiop);
>    } else {
>        ufs_write(vp, uiop);
>    }
>
>    ufs_iunlock(VTOI(vp));
>}

Apparently, with a loop-back VFS or some other mechanism supported by POSIX
semantics, this could go infinitely recursive.
I believe what you were looking for instead of "ufs_ilock" and "ufs_iunlock"
was "vop_rwlock" followed by "vop_rwunlock", as documented in:

    UNIX(R) System V/386 Release 4 Version 3 Programmer's Guide: Writing File System Types

Atomicity is discussed in some minor detail in this document.

>assuming the inode lock is not released in ufs_read() or ufs_write()
>how is this not atomic with respect to other read and write requests?
>
>i don't see what POSIX has to do with the splitting of VOP_RDWR
>at the vnode interface layer.

Candidly, I'm only parroting the comments on this one.  I believe it has to
do with maintaining POSIX compliance in an environment where kernel
preemption is allowed.  I seriously doubt that it was "change for change's
sake", since the previous interface used "vop_rdwr".

>also, inode locks in SVR4.0 are recursive, at least for UFS and VxFS.

I think the point in the AT&T document was with regard to vnode locks, not
inode locks.

>}|> >Thus perhaps the best answer is that the interface is ill defined.  In
>}|> >the previous post referenced above, I referred to the illogicality of
>}|> >making the call, since a seek offset is an artifact of an open file
>}|> >descriptor, and is not an attribute of an inode or vnode in most of
>}|> >the current implementations.  I also pointed out a potentially valid use
>}|> >for passing the seek down: predictive read ahead.  The problem here is
>}|> >that either the read, the seek, or the open would have to be attributed
>}|> >to flag the descriptor for predictive behaviour if this is to be a
>}|> >successful optimization.
>
>the seek offsets are passed down because the file system independent
>layer doesn't presume to know the range of valid seek offsets for a
>file system type.  this gives the file system specific code an
>opportunity to complain when the seek *system call* is made.
>lseek() can return an error if it needs to.

This is perhaps a valid contention, although I might dispute it on the
grounds that lseek() is defined to take a long argument.  I could definitely
see mounting, for instance, a file system that only supported 16 bits for
the file length, and used the other 16 bits for, as an example, promiscuous
selection of namespace for a multinamespace file system (to support
something like resource forks directly).  Doing this is extremely
questionable, since the semantics of the lookup mechanisms (file name
parameters to open() or creat() and returns from getdents()) aren't set up
to handle such promiscuous naming.

Another potential use is on non-holey file systems (like DOS) to allocate
real space for what UFS would consider a "hole" in a file.

These reasons combined are probably sufficient to warrant passing down the
lseek() with VOP_SEEK(), so I withdraw my objections (although this will
cause seek operations to be slower by a dereference and a function call per
reference, for no benefit of any kind for any of the current file systems
supported by 386BSD).

By this same token, we should provide a VOP_IGET() so as to allow future
separation of the directory entry management and inode management "layers",
which aren't currently layered at all.  We could see a great deal of benefit
from that with very little effort.

>}I think I can safely say the benefits of predictive read ahead are
>}questionable unless there is a cooperative mechanism which obviates the
>}need to use lseek() to communicate the read ahead.  I can see the designers
>}leaving it in there for some future "smarter NFS", but nothing in user
>}space currently requires nor could benefit from predictive read ahead
>}implemented this way.
>
>if you're talking about using lseek() to "request" a read ahead, that's
>silly.  lseek() already has a set of semantics associated with it, and
>adding new ones would confuse the issue.
>invent a new system call or
>convince USL to add the asynchronous I/O system calls originally planned
>for SVR4.0.

That's why I said "the benefits of predictive read ahead are questionable
.... [ if you ] ... need to use lseek() to communicate the read ahead" or,
even more plainly, "predictive read ahead using VOP_SEEK() to inform the
file system to do it is a dumb idea".

>finally, read-ahead (and write-behind) are useful for applications
>that don't perform any buffering of their own.  a common application
>behavior in Unix is to read an entire file sequentially, or to
>truncate a file, write it sequentially, and close it.  file systems
>that detect this behavior and modify their behavior appropriately
>can provide significant performance improvements.

This is like arguing against external pagers.  I think the prediction
heuristics belong in the application, not in the file system; who is better
placed to judge the future behaviour of the application?

A certain amount of buffering is already done: reads in UFS are in terms of
one or more blocks; they are *never* in smaller increments.  The main cost
in getting the buffered data out is the copyout across the user/kernel
boundary, and the expensive part of this is the page mapping.  An
application that does a read() a character at a time is going to bottleneck
on the read() calls themselves, not on getting data from the disk to the
kernel buffer.

The largest benefit here is that which can be gained from user-space caching
and copying across the user/kernel boundary in page multiples (at best) or
cache buffer size multiples (at worst, if the cache buffer element size is
not some multiple of the page size).  This minimizes the block reads to
disk, and minimizes the page mapping which must be done to get the data from
kernel to user space.
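
As a rough illustration of the user-space caching described above, here is a
minimal sketch in C of a reader that refills its buffer in page-size
multiples, so that a character-at-a-time consumer makes one read() per
buffer rather than one per byte.  The names pgbuf, pgbuf_init, pgbuf_getc,
pgbuf_free and pg_round_up are hypothetical and appear nowhere in this
thread.  The buffer comes from malloc(), so it is sized in page multiples
but is not guaranteed to be page aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical user-space read buffer; all names here are illustrative. */
struct pgbuf {
    int     fd;    /* descriptor being read */
    char   *data;  /* buffer, a whole number of pages long */
    size_t  cap;   /* capacity in bytes (a multiple of the page size) */
    size_t  len;   /* number of valid bytes currently buffered */
    size_t  pos;   /* index of the next byte to hand to the caller */
};

/* Round n up to the next multiple of pagesz (pagesz must be nonzero). */
static size_t pg_round_up(size_t n, size_t pagesz)
{
    return ((n + pagesz - 1) / pagesz) * pagesz;
}

/* Set up a buffer of at least `hint` bytes, rounded up to whole pages.
 * Returns 0 on success, -1 on failure. */
static int pgbuf_init(struct pgbuf *b, int fd, size_t hint)
{
    long pagesz = sysconf(_SC_PAGESIZE);

    assert(b != NULL);
    if (pagesz <= 0)
        return -1;
    b->cap = pg_round_up(hint ? hint : 1, (size_t)pagesz);
    b->data = malloc(b->cap);
    if (b->data == NULL)
        return -1;
    b->fd = fd;
    b->len = 0;
    b->pos = 0;
    return 0;
}

/* Return the next byte (0..255), or -1 on end of file or error.
 * The kernel is entered only when the buffer has been drained. */
static int pgbuf_getc(struct pgbuf *b)
{
    if (b->pos == b->len) {
        ssize_t n = read(b->fd, b->data, b->cap);
        if (n <= 0)
            return -1;
        b->len = (size_t)n;
        b->pos = 0;
    }
    return (unsigned char)b->data[b->pos++];
}

/* Release the buffer; the descriptor is left open for the caller. */
static void pgbuf_free(struct pgbuf *b)
{
    free(b->data);
    b->data = NULL;
}
```

A caller would invoke pgbuf_init() once on an open descriptor, loop on
pgbuf_getc() until it returns -1, and then call pgbuf_free().  This is
essentially what stdio buffering already does; the only point of the sketch
is that the refill size is chosen as a page multiple.
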

I'll agree that application-directed read-ahead and write-behind are good
things, but unless you have pages mapped in user and kernel space at the
same time (like a shmctl'ed shared memory segment), I see little if any
benefit in kernel based predictive read-ahead for the examples you have
given.  This doesn't even address the overhead in "detecting the behaviour"
as a means of employing the heuristic.


                                        Terry Lambert
                                        terry@icarus.weber.edu
                                        terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of
my present or previous employers.
--
-------------------------------------------------------------------------------
                                        "I have an 8 user poetic license" - me
 Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------