Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!harbinger.cc.monash.edu.au!newsroom.utas.edu.au!munnari.OZ.AU!news.hawaii.edu!ames!purdue!lerc.nasa.gov!magnus.acs.ohio-state.edu!math.ohio-state.edu!howland.reston.ans.net!newsfeed.internetmci.com!news.kei.com!nntp.coast.net!col.hp.com!sdd.hp.com!hamblin.math.byu.edu!park.uvsc.edu!usenet
From: Terry Lambert <terry@lambert.org>
Newsgroups: comp.unix.bsd.freebsd.misc,comp.os.linux.development.system
Subject: Re: The better (more suitable)Unix?? FreeBSD or Linux
Date: 13 Feb 1996 02:10:10 GMT
Organization: Utah Valley State College, Orem, Utah
Lines: 90
Message-ID: <4foru3$art@park.uvsc.edu>
References: <4er9hp$5ng@orb.direct.ca> <4f9skh$2og@dyson.iquest.net> <4fg8fe$j9i@pell.pell.chi.il.us> <311C5EB4.2F1CF0FB@freebsd.org> <4fjodg$o8k@venger.snds.com>
NNTP-Posting-Host: hecate.artisoft.com
Xref: euryale.cc.adfa.oz.au comp.unix.bsd.freebsd.misc:14180 comp.os.linux.development.system:17794

grif@hill.ucr.edu (Michael Griffith) wrote:
[ ... async flag for UFS ... ]
] Cool.  It should be the new default.

No.

] How about a proof?
]
] Claim:
]     Writing metadata synchronously and data asynchronously
]     can put a filesystem in a state that has undetectable
]     errors.
]
] Proof:
]
]     Given,
]
]     * A metadata item (inode) M that initially references
]       no data blocks.
]     * Data blocks D0 .. DN with random (junk) values.
]     * Buffer cache equivalents of D0 .. DN, D0' .. DN'.
]
]     We assume (in proceeding toward a contradiction) that
]     writing sync meta and async data can never leave a filesystem
]     in a state where it has undetectable errors.
]
]     Perform a sync metadata and async data update on M.  In
]     memory, the contents of D0' .. DN' are updated to new
]     values.  On disk, M is synchronously (immediately) updated
]     to refer to D0 .. DN.  Turn off the power before D0'
]     .. DN' can be asynchronously (lazily) written back to
]     replace D0 .. DN on disk.  The in-memory contents of D0'
]     .. DN' are lost and the old on-disk values D0 .. DN
]     remain.
]     M now references old (junk) values in D0 .. DN.
]     Because the contents of D0 .. DN are indistinguishable
]     from the contents of D0' .. DN', the filesystem has
]     produced an undetectable error.
]
]     We have reached a contradiction.  Therefore the assumption
]     is false and the claim holds.

This is a false cause argument.  It assumes that a block can be
reused before the block allocation bitmap has been updated, which
is false.

The block allocation bitmap is synchronously updated.

If I free a block, the block cannot be reused until the entry for
it has been marked free.

If I allocate a block, it must be (a) a new block [no conflict] or
(b) a reused block [which is marked free on disk].

If I write the file metadata such that it refers to the block, I
must first have allocated the block.

Thus I can never be in a state where the validity of invalid data
is in question.

In the case of a block write and a metadata update without a block
allocation bitmap update, the "allocated" block pointed to by the
inode will be taken away from the inode by the cleaner process
because the bitmap shows that it is not allocated (this is subject
to administrative override during a verbose fsck; the default
action is to free the block).

Thus the file system can never have more than one referential
integrity flaw at the time of a crash, based on a failed metadata
update.

The data in the file may very well be lost.  This is why the
O_WRITESYNC flag exists, and is expected to be implemented, for
applications which care, through a multistage commit or via logging
(see any introductory text to database theory for confirmation).

The point of crash recovery tools on non-journalling, non-logging
file systems is to restore referential integrity, not file system
contents.
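
The ordering argument above can be played out in a toy model.  This
is only a sketch under stated assumptions, not how UFS or fsck are
actually implemented: the Disk class, the single inode, block
number 7 and the step names "bitmap", "inode" and "data" are all
hypothetical, invented for illustration.  It crashes one block
append after each step, runs a cleaner pass, and reports whether the
inode and the bitmap agree afterwards.

```python
from dataclasses import dataclass, field

JUNK = "junk"  # stale on-disk contents of a block never written


@dataclass
class Disk:
    """Hypothetical on-disk state: one inode, a bitmap, data blocks."""
    bitmap: set = field(default_factory=set)           # blocks marked allocated
    inode_blocks: list = field(default_factory=list)   # blocks the inode references
    data: dict = field(default_factory=dict)           # block number -> contents


def append_block(disk, block, contents, order, crash_after):
    """Apply the on-disk steps of one append in `order`, stopping
    after `crash_after` steps to simulate a power loss."""
    steps = {
        "bitmap": lambda: disk.bitmap.add(block),
        "inode": lambda: disk.inode_blocks.append(block),
        "data": lambda: disk.data.__setitem__(block, contents),
    }
    for name in order[:crash_after]:
        steps[name]()


def fsck(disk):
    """Restore referential integrity only; return the repairs made."""
    repairs = []
    # Inode points at a block the bitmap says is free: take it away.
    for block in list(disk.inode_blocks):
        if block not in disk.bitmap:
            disk.inode_blocks.remove(block)
            repairs.append(("unlink", block))
    # Bitmap marks a block that the inode does not reference: free it.
    for block in sorted(disk.bitmap - set(disk.inode_blocks)):
        disk.bitmap.discard(block)
        repairs.append(("free", block))
    return repairs


def consistent(disk):
    """Referential integrity: inode references and bitmap agree."""
    return set(disk.inode_blocks) == disk.bitmap


def crash_outcomes(order):
    """Crash after 0..len(order) steps of one append, then run fsck."""
    outcomes = []
    for crash_after in range(len(order) + 1):
        disk = Disk(data={7: JUNK})
        append_block(disk, 7, "new", order, crash_after)
        repairs = fsck(disk)
        outcomes.append((crash_after, repairs, consistent(disk), disk.data[7]))
    return outcomes


if __name__ == "__main__":
    for order in (("bitmap", "inode", "data"), ("inode", "bitmap", "data")):
        for outcome in crash_outcomes(order):
            print(order, outcome)
```

In this model every crash point ends with the inode and the bitmap
in agreement after the cleaner pass, with at most one repair needed.
What the model does not recover is the block's contents: a crash
after the bitmap and inode steps but before the data step leaves the
file pointing at stale data.  That is the case an application has to
cover itself with synchronous writes or a log.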

On journalling or logging file systems, a second level of crash
recovery is possible, guaranteeing a consistent state for files
undergoing access during crash, though some transactions may still
be lost without hardware support and kernel-internal *synchronous*
use of the supporting hardware.

Such file state recovery is *secondary* and *inferior* to the
process of recovering referential integrity for the file system
as a whole.


                                        Terry Lambert
                                        terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.