*BSD News Article 61921

Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!harbinger.cc.monash.edu.au!newsroom.utas.edu.au!munnari.OZ.AU!news.hawaii.edu!ames!purdue!lerc.nasa.gov!magnus.acs.ohio-state.edu!math.ohio-state.edu!howland.reston.ans.net!newsfeed.internetmci.com!news.kei.com!nntp.coast.net!col.hp.com!sdd.hp.com!hamblin.math.byu.edu!park.uvsc.edu!usenet
From: Terry Lambert <terry@lambert.org>
Newsgroups: comp.unix.bsd.freebsd.misc,comp.os.linux.development.system
Subject: Re: The better (more suitable)Unix?? FreeBSD or Linux
Date: 13 Feb 1996 02:10:10 GMT
Organization: Utah Valley State College, Orem, Utah
Lines: 90
Message-ID: <4foru3$art@park.uvsc.edu>
References: <4er9hp$5ng@orb.direct.ca> <4f9skh$2og@dyson.iquest.net> <4fg8fe$j9i@pell.pell.chi.il.us> <311C5EB4.2F1CF0FB@freebsd.org> <4fjodg$o8k@venger.snds.com>
NNTP-Posting-Host: hecate.artisoft.com
Xref: euryale.cc.adfa.oz.au comp.unix.bsd.freebsd.misc:14180 comp.os.linux.development.system:17794

grif@hill.ucr.edu (Michael Griffith) wrote:
[ ... async flag for UFS ... ]

] Cool.  It should be the new default.

No.

] How about a proof?
] 
] Claim:
] 	Writing metadata synchronously and data asynchronously
] 	can put a filesystem in a state that has undetectable 
] 	errors.
] 
] Proof:
] 
] 	Given, 
] 
] 	* A metadata item (inode) M that initially references
] 	  no data blocks.
] 	* Data blocks D0 .. DN with random (junk) values.
] 	* Buffer cache equivalents of D0 .. DN, D0' .. DN'.
] 
] 	We assume (in proceeding toward a contradiction) that
] 	writing sync meta and async data can never leave a filesystem
] 	in a state where it has undetectable errors.
] 
] 	Perform a sync metadata and async data update on M.  In
] 	memory, the contents of D0' .. DN' are updated to new
] 	values.  On disk, M is synchronously (immediately) updated
] 	to refer to D0 .. DN.  Turn off the power before D0'
] 	.. DN' can be asynchronously (lazily) written back to
] 	replace D0 .. DN on disk.  The in-memory contents of D0'
] 	.. DN' are lost and the old on-disk values D0 .. DN
] 	remain. M now references old (junk) values in D0 .. DN.
] 	Because the contents of D0 .. DN are indistinguishable
] 	from the contents of D0' ..  DN', the filesystem has
] 	produced an undetectable error.  
] 
] 	We have reached a contradiction.  Therefore the assumption
] 	is false and the claim holds.

This is a false cause argument.  It assumes that a block can be
reused before the block allocation bitmap has been updated, which
is false.  The block allocation bitmap is synchronously updated.

If I free a block, the block can not be reused until the entry
for it has been marked free.

If I allocate a block, it must be (a) a new block [no conflict] or
(b) a reused block [which is marked free on disk].

If I write the file metadata such that it refers to the block,
I must first have allocated the block.  Thus I can never be in
a state where invalid data could be mistaken for valid data.
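
The ordering rule can be sketched as a toy model in Python.  This is
not FFS code: the Disk class, its fields, and allocate_and_link are
names invented for illustration, and a crash is modelled simply by
stopping after a given number of synchronous writes.

```python
class Disk:
    """Toy on-disk state: an allocation bitmap and one inode's block list."""

    def __init__(self, nblocks):
        self.bitmap = [False] * nblocks   # True = marked allocated on disk
        self.inode_blocks = []            # block numbers the inode refers to


def allocate_and_link(disk, block, crash_after=None):
    """Apply the two synchronous writes in order; stop early to model a crash.

    crash_after=0:    power fails before any write reaches the disk
    crash_after=1:    power fails after the bitmap write only
    crash_after=None: both writes complete
    """
    steps = [
        lambda: disk.bitmap.__setitem__(block, True),   # 1. mark allocated
        lambda: disk.inode_blocks.append(block),        # 2. inode refers to it
    ]
    for done, step in enumerate(steps):
        if crash_after is not None and done >= crash_after:
            break
        step()
    return disk


def inode_refs_are_allocated(disk):
    """Invariant: every block the inode refers to is marked allocated."""
    return all(disk.bitmap[b] for b in disk.inode_blocks)


# Check the invariant at every possible crash point.
for crash_point in (0, 1, None):
    d = allocate_and_link(Disk(8), block=3, crash_after=crash_point)
    print(crash_point, d.bitmap[3], d.inode_blocks, inode_refs_are_allocated(d))
```

In this model the invariant holds at every crash point.  The worst
outcome is a block marked allocated that no inode refers to yet.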

In the case of a block write and a metadata update without a
block allocation bitmap update, the "allocated" block pointed to
by the inode will be taken away from the inode by the cleaner
process because the bitmap shows that it is not allocated (this
is subject to administrative override during a verbose fsck;
the default action is to free the block).
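
A minimal sketch of that cleaning rule, in Python.  It models only
the behaviour described in the paragraph above, not any real fsck:
the function name clean and the keep_unallocated switch are
hypothetical, and what an administrative override actually does
(here: keep the reference and mark the block allocated) is an
assumption made for the example.

```python
def clean(bitmap, inode_blocks, keep_unallocated=False):
    """Return the inode's block list after a toy consistency pass.

    A reference to a block whose bitmap bit is clear is dropped by
    default.  keep_unallocated=True stands in for the administrator
    overriding that default; the block is then kept and marked
    allocated (an assumption of this sketch).
    """
    kept = []
    for b in inode_blocks:
        if bitmap[b]:
            kept.append(b)          # consistent reference, keep it
        elif keep_unallocated:
            bitmap[b] = True        # override: repair the bitmap instead
            kept.append(b)
        # otherwise the reference is dropped and the block stays free
    return kept


bitmap = [False] * 8
bitmap[2] = True                    # only block 2 is marked allocated
print(clean(bitmap, [2, 5]))        # block 5 is taken away from the inode
```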

Thus the file system can never have more than one referential
integrity flaw at the time of a crash, based on a failed metadata
update.

The data in the file may very well be lost.  This is why the
O_WRITESYNC flag exists, and is expected to be implemented,
for applications which care, through a multistage commit or
via logging (see any introductory text to database theory
for confirmation).
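
For an application that cares, a synchronous write might look like
the Python sketch below.  It assumes that the flag called
O_WRITESYNC above corresponds to what POSIX spells O_SYNC; the
function name write_durably is invented for the example.  Where the
platform has no O_SYNC the flag is omitted, and the fsync() call at
the end asks for the same flush in one step.

```python
import os


def write_durably(path, payload):
    """Write payload, asking the kernel to flush it before returning.

    Assumption: O_SYNC is the flag meant by "O_WRITESYNC" in the post.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_SYNC", 0)
    fd = os.open(path, flags, 0o600)
    try:
        view = memoryview(payload)
        while len(view) > 0:
            written = os.write(fd, view)   # may be a short write
            view = view[written:]
        os.fsync(fd)                       # also covers platforms without O_SYNC
    finally:
        os.close(fd)
```

This covers only the single-file case.  A multistage commit or a
log needs more than one such write, issued in a fixed order.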


The point of crash recovery tools on non-journalling, non-logging
file systems is to restore referential integrity, not file system
contents.  On journalling or logging file systems, a second level
of crash recovery is possible, guaranteeing a consistent state
for files undergoing access during a crash, though some transactions
may still be lost without hardware support and kernel-internal
*synchronous* use of the supporting hardware.  Such file state
recovery is *secondary* and *inferior* to the process of recovering
referential integrity for the file system as a whole.
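
The second level of recovery can be illustrated with a toy log
replay in Python.  The record layout ("write" and "commit" tuples)
and the function name replay are invented for this sketch and do not
describe any particular journalling file system.

```python
def replay(log):
    """Rebuild state from a toy log, applying only committed transactions."""
    pending = {}    # txid -> list of (key, value) not yet committed
    state = {}
    for record in log:
        kind, txid = record[0], record[1]
        if kind == "write":
            pending.setdefault(txid, []).append((record[2], record[3]))
        elif kind == "commit":
            for key, value in pending.pop(txid, []):
                state[key] = value
    return state    # writes without a commit record are discarded


log = [
    ("write", 1, "a", "x"),
    ("commit", 1),
    ("write", 2, "a", "y"),    # crash happened before the commit of 2
]
print(replay(log))
```

Transaction 2 is lost, but the recovered state is a consistent one:
it is the state as of the last commit that reached the log.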


                                        Terry Lambert
                                        terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.