Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!harbinger.cc.monash.edu.au!news.cs.su.oz.au!metro!metro!munnari.OZ.AU!news.ecn.uoknor.edu!news.eng.convex.com!newshost.convex.com!bcm.tmc.edu!news.msfc.nasa.gov!newsfeed.internetmci.com!swrinde!sdd.hp.com!hamblin.math.byu.edu!park.uvsc.edu!usenet
From: Terry Lambert <terry@lambert.org>
Newsgroups: comp.unix.bsd.freebsd.misc,comp.os.linux.development.system
Subject: Re: The better (more suitable) Unix?? FreeBSD or Linux
Date: 20 Feb 1996 23:20:49 GMT
Organization: Utah Valley State College, Orem, Utah
Lines: 160
Message-ID: <4gdl0h$qnc@park.uvsc.edu>
References: <4er9hp$5ng@orb.direct.ca> <JRICHARD.96Feb9101113@paradise.zko.dec.com> <4fnd50$h1f@news.ox.ac.uk> <4frg0s$1jv@park.uvsc.edu> <4g9loc$si0@news.ox.ac.uk>
NNTP-Posting-Host: hecate.artisoft.com
Xref: euryale.cc.adfa.oz.au comp.unix.bsd.freebsd.misc:14123 comp.os.linux.development.system:17746

mbeattie@sable.ox.ac.uk (Malcolm Beattie) wrote:

[ ... anecdote: fsck placed crap in a file after a crash ... ]

] >Clearly, someone answered "no" to "Clear?" during the fsck after
] >the crash.
]
] Not "clearly" at all and, in fact, wrong. Please stick to the
] technical explanations you excel at and stop with the aspersion
] casting.

The UFS storage and recovery algorithm in the sync case can not result
in what you saw. Either the recovery tool was incorrectly used, or the
port was incorrectly performed. I went with the most likely scenario.

Can you tell me which revision of UFS from what source the OSF/1 UFS
implementation was derived? There is a well known async operation that
should be sync in the Net/2 UFS implementation; it's possible that
they used that code. Since the location and fix are well known, I find
it extremely unlikely that this is the explanation.

[ ... ]

] Since we're supposed to be talking filesystems here, I'll try
] to ask an intelligent question.
] Under Digital UNIX 3.2c, AdvFS is still funnelled to the master CPU
] on an SMP machine (fixed in 4.0, I believe). Is that because it's
] intrinsically hard to make a bitfile/extent-based filesystem
] SMP-safe or just because DEC are lazy?

This is an SMP granularity problem. Having worked on a UFS derived FS
in several SMP kernels (UnixWare 2.x, Unisys 60xx SVR4 ES/MP, and
Solaris 2.3), and currently being involved in kernel multithreading
and SMP work on FreeBSD, I can still only guess. There are several
possibilities. DEC being lazy is the *least* likely scenario.

[ Note: the following information is dated; the Solaris information
  was inferred from header files and debugging and may not be totally
  accurate ]

High grain parallelism is hard. The issues are nearly identical to
those faced in kernel preemption for Real Time and Kernel
Multithreading. The main issue is reentrancy. If you use sync updates
to order your metadata, each pending sync update is an outstanding
synchronization block.

To deal with this, you can either divorce the ordering from the file
system requests, using Delayed Ordered Writes, as USL did with UnixWare
2.x, or soft updates, as in Ganger/Patt. Or you can go to a graphical
lock manager that can compute transitive closure over the graph to
detect when a request would cause a deadlock to occur.

As far as I know, the only OS to implement the graph solution at this
time is Chorus, a European-originated microkernel which competes with
MACH (its biggest claim to fame is that, unlike MACH, it avoids most
of the protection domain crossing in its IPC, which introduces other
problems). Chorus is available to educational institutions for a $1000
license fee, last time I checked. Since both USL and Novell are
separately pursuing Chorus based technology, it's currently a hot
thing for potential employees at both places.
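The graph approach boils down to keeping a wait-for graph and testing,
before a requester blocks, whether the new edge would close a cycle.
Here's a minimal sketch in C; all names are mine for illustration, not
anything from the Chorus sources:

```c
/*
 * Sketch of a graph-based lock manager's deadlock test: before
 * letting a requester block on a lock holder, walk the wait-for
 * graph to see whether the new edge would close a cycle.
 * Illustrative only; not Chorus's actual implementation.
 */
#include <assert.h>
#include <string.h>

#define MAXLOCKERS 32

/* waits_for[a][b] != 0 means locker a is blocked waiting on locker b. */
char waits_for[MAXLOCKERS][MAXLOCKERS];

/* Depth-first walk: is 'target' reachable from 'from'? */
static int reachable(int from, int target, char *seen)
{
	int i;

	if (from == target)
		return 1;
	seen[from] = 1;
	for (i = 0; i < MAXLOCKERS; i++)
		if (waits_for[from][i] && !seen[i] &&
		    reachable(i, target, seen))
			return 1;
	return 0;
}

/*
 * Would letting 'waiter' block on 'holder' deadlock?  It would,
 * exactly when 'holder' already (transitively) waits on 'waiter'.
 */
int would_deadlock(int waiter, int holder)
{
	char seen[MAXLOCKERS];

	memset(seen, 0, sizeof seen);
	return reachable(holder, waiter, seen);
}
```

When the test fires, the manager can fail or restart the request
instead of blocking, so the cycle never forms in the first place.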
Without a divorce of the ordering from the FS proper, the file system
is either handled as a black box, with a single reentrancy mutex (this
is how non-MPized Solaris FS's must operate), or it is handled using
medium granularity, with discrete locking of "dangerous" routines
causing more synchronization than "less dangerous" routines.

Sequent's UFS locks (or used to lock) file system reentrancy, using
the "black box". You can see this by running multiple finds on the
same tree and watching the processor utilization; like DEC, they
thread through a single processor.

Medium grain locking is what Unisys uses in the SVR4 ES/MP. This is
not as satisfying as high grain parallelism, but has the very real
advantage of exposing the internals to the potential file system
author. The Unisys model is probably the easiest to implement. The
locking is done in the vncalls.c file in the kernel, and the
reentrancy is on a VOP call basis, with knowledge of the order
dependencies in the underlying FS implementations. This makes it
slightly less general, but most easily supported.

High grain parallelism is possible without divorce. This is the
Solaris 2.3 approach. Sun has been very bad about documenting
internals for file system authors, but there is some exposure of the
underlying model, both in the University of Washington Usenix papers
(ftp.sage.usenix.org) and in their /usr/include/sys header files (the
kernel multithreading is most enlightening in t_lock.h, and vnode.h
and fs/ufs_* also provide some tantalizing clues).

The USL UFS implementation is high grain parallelism using a divorce
of the underlying I/O using delayed ordered writes. This means that
the VOPs can run to completion, and the I/O ordering is the issue.
Unlike soft updates, the USL Delayed Ordered Writes (Patent Pending)
result in flat-graph lists of I/O's which must be ordered.
The write clustering code has a hard time picking "the optimal"
approach using DOW, since it can not reorder the ordered ops, only the
unordered ones (those which are handled in a traditional UFS as
"async"). Still, a DOW-based UFS yields ~160% of the performance on a
uniprocessor machine, even after the SMP synchronization is taken into
account. Soft updates, on the other hand, are reported to yield
"within 5% of memory speed".

Extent based file systems have their own problems, specifically the
use of a single allocation pointer. It's possible to add a processor
domain abstraction and allow multiple extent pointers. SMP VMS
actually does this with the VMS file system. This is very similar to
the VM technique of per processor page pools, as used by Sequent and
described by Vahalia in "UNIX Internals: The New Frontiers" (ISBN
0-13-101908-2, Prentice Hall). Actually, Vahalia prefers SLAB
allocation, but there is no reason the two techniques can't be
combined.

I suspect DEC hasn't done this yet because it's intrinsically hard.
Maybe their next release can go for soft updates, thus leap-frogging
DOW (and avoiding the patent issues).

Look for high grain SMP in FreeBSD as code and time permit. The
current approach is to go from low to high grain parallelism
incrementally, using a "mutex push-down" approach to gradually
increase the parallelism. The end intent is a graph solution similar
to that in Chorus, using a hierarchical lock management mechanism.

Feel free to point the DEC people at the Ganger/Patt paper as well, if
you think it will help.

					Regards,
					Terry Lambert
					terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.