Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!newshost.nla.gov.au!act.news.telstra.net!psgrain!newsfeed.internetmci.com!in1.uu.net!news.reference.com!cnn.nas.nasa.gov!gizmo.nas.nasa.gov!not-for-mail
From: truesdel@gizmo.nas.nasa.gov (Dave Truesdell)
Newsgroups: comp.unix.bsd.freebsd.misc,comp.os.linux.development.system
Subject: Re: The better (more suitable) Unix?? FreeBSD or Linux
Date: 20 Mar 1996 19:07:09 -0800
Organization: A InterNetNews test installation
Lines: 114
Message-ID: <4iqh4t$s1v@gizmo.nas.nasa.gov>
References: <4gejrb$ogj@floyd.sw.oz.au> <4hirl9$nr7@gizmo.nas.nasa.gov> <Dnu8FD.CK2@pe1chl.ampr.org> <4iajie$9fn@gizmo.nas.nasa.gov> <DoBs05.B19.0.-s@cs.vu.nl>
NNTP-Posting-Host: gizmo.nas.nasa.gov
X-Newsreader: NN version 6.5.0 #61 (NOV)
Xref: euryale.cc.adfa.oz.au comp.unix.bsd.freebsd.misc:15779 comp.os.linux.development.system:19756

philip@cs.vu.nl (Philip Homburg) writes:

>In article <4iajie$9fn@gizmo.nas.nasa.gov>,
>Dave Truesdell <truesdel@gizmo.nas.nasa.gov> wrote:

>%First case: Restoring a large filesystem on a large machine.
>%Here's an example of one of those 8 hour restores I mentioned.  The setup: a
>%500GB disk array, mounted async; 1GB memory (>500MB of it allocated to the
>%buffer cache); ~1.5 million i-nodes to restore; running the restore in single
>%user (no update daemon running).  If the restore had been running for several
>%hours, and a hardware glitch crashed the machine, what state do you think the
>%filesystem would be in?  In this situation, data blocks, which are written
>%only once, would age quickly and get flushed to disk as new data came in.
>%How about indirect blocks?  They can be updated multiple times as files
>%grow, so they don't age quite as fast.  Directory blocks?  They can get
>%written multiple times, as new files and directories are created, so they
>%don't age quite so fast either, and are less likely to get flushed to disk.
>%The same is true for inode blocks.  So, what situation are you left with?
>%Unless all the metadata gets written to disk, you may have most of your data
>%safely on disk, but if the metadata hasn't been flushed, you may not know
>%which i-nodes have been allocated, which data blocks have been allocated,
>%which data blocks belong to which i-nodes, etc.

>OK, 8 hours for 500GB and ~1.5 million i-nodes.
>Filesystem throughput: 500GB / (8*3600 seconds) = 17MB per second
>	(quite impressive).
>Average file size: 500GB / 1.5 million i-nodes = 333KB per file.
>Files per second: 1.5 million i-nodes / (8*3600 seconds) = 52 files/second.

>At these speeds I don't see why you expect blocks to be cached for a long
>time.  Furthermore, a filesystem that implements async metadata updates can
>still provide a synchronous sync(2) system call.
>Even an asynchronous sync system call which only writes all data to disk
>would be sufficient in this case.

I didn't expect *data* blocks to be cached for a long time, but what does
speed have to do with anything?  As I pointed out above, different buffers
age at different rates depending on how often they are modified, and how fast
buffers age determines what stays in the cache and what gets flushed to disk.

However, I'm still trying to figure out the point of the comment about the
existence of a sync(2) system call.  The only thing syncing will do is reduce
the extent of the damage, not eliminate it.  Inconsistencies simply start to
build up again after the sync completes.  And, on fast systems, 60 seconds
between syncs can be a very, very long time.
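To put that window in concrete terms, here's a minimal sketch of a
traditional update(8)-style loop, assuming the classic behaviour where
sync(2) only schedules dirty buffers for writing rather than waiting for
them; the 30-second interval is illustrative, not anything specific to the
systems discussed above.

    /*
     * Minimal sketch of an update(8)-style daemon, assuming the classic
     * behaviour: sync(2) only schedules dirty buffers for writing, and
     * anything dirtied after the call stays at risk until the next pass.
     * The 30-second interval is illustrative, not taken from any system
     * mentioned in this thread.
     */
    #include <unistd.h>

    int
    main(void)
    {
            for (;;) {
                    sync();         /* schedule every dirty buffer for I/O */
                    sleep(30);      /* the window in which damage rebuilds */
            }
    }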
>%BTW, Just to see what would happen, I tried to run an fsck on the partial
>%filesystem.  After what seemed like several hundred screens of just about
>%every error that fsck could detect, it finally dumped core.

>That only tells something about the quality of the fsck implementation...

It also tells us how messed up the filesystem was.

>%Here's a thought experiment.  Let's take a small filesystem, with only one
>%non-zero length file in it.  Call it file "A".  Delete file "A" and create a
>%second non-zero length file named "B".  Now, crash the system, without
>%completely syncing.  When you go back and examine that filesystem, what will
>%you find?  Will you find file "A" still in existence and intact?  Will you
>%find file "B" in existence and intact?  What would you find if one of "A"'s
>%blocks had been reused by "B"?  If the integrity of the metadata is not
>%maintained, you could find file "A" with a chunk of "B"'s data in it.  The
>%situation gets worse if the reused block is an indirect block.  How would
>%the system interpret data that overwrote an indirect block?

>This does not `destroy' your filesystem: fsck will (should) duplicate
>all blocks shared by multiple files.

Can you tell us of one fsck that does duplicate shared blocks?  And how does
it handle cases where the blocks are of different types: a file data block
versus an indirect block or a directory block?  And how will the filesystem
interpret the data in that block at a later time?  Last time I looked,
indirect blocks and file data blocks didn't have any magic cookies hidden
inside to distinguish them.

You don't seem to realize that having a block allocated to multiple files is
a "Bad Thing".  It means that "Bad Things" have already happened to the
filesystem.

>%How many of those systems didn't attempt to maintain consistent metadata?
>%I've run V6 on a PDP-11/34 in a half meg of RAM, using a pair of RK05's for
>%a whopping 10MB of filesystem.  I've written trillion-byte files as part of
>%testing new modifications to filesystem code.  I've tested filesystems that
>%claimed to be great improvements over the FFS, that I've been able to trash
>%(the filesystem could *NOT* be repaired) simply by writing two large files
>%simultaneously.  I've seen many people who think they've invented a "better"
>%filesystem, and I've seen how often they've been wrong.

>How do you define `could *NOT* be repaired'?

It was an extent-based filesystem that a *former* vendor (who will remain
nameless) included in their OS release.  Its equivalent of fsck was built
into the kernel.  (There were *no* external tools for examining and repairing
the filesystem other than those we wrote ourselves.)  Writing the two files
on a freshly created filesystem quickly crashed the OS, and left it in a
state where the OS crashed again whenever the filesystem was mounted.  Our
best guess about what happened was that writing two files simultaneously
caused a large number of small extents to be allocated, and that when the
system attempted to coalesce them into larger extents it stomped on its own
data structures.

Filesystems are hard to get right.  And if you get cocky and try to get away
with cutting corners, Mr. Murphy will be happy to teach you a lesson when you
can least afford it.

> How do you destroy old, untouched files by creating new files or by deleting
> files?

It's easy: all the filesystem code has to do is do something wrong, or do it
in the wrong order.  Imagine the chaos when you attempt to delete a file and
one of its indirect blocks has already been overwritten with something else.
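For what it's worth, the easy half of what Philip wants fsck to do is
noticing that a block has been claimed twice; a pass-1-style bitmap scan is
enough for that.  Below is a rough sketch of just that detection step; the
names (NBLOCKS, claim()) are invented for illustration and aren't taken from
any real fsck.  The hard half, deciding which claimant keeps the block or
copying it for each, is exactly where the missing "magic cookies" hurt, since
nothing in the block itself says whether it holds file data, indirect
pointers, or directory entries.

    /*
     * Rough sketch of duplicate-block *detection* only, pass-1 style.
     * NBLOCKS and claim() are invented for illustration; they are not
     * from any particular fsck.  Resolving the duplicate (deciding who
     * keeps the block, or copying it for both claimants) is not shown.
     */
    #include <stdio.h>

    #define NBLOCKS 8192                    /* blocks in a toy filesystem */
    static unsigned char claimed[NBLOCKS / 8];

    int
    claim(unsigned long blk, unsigned long ino)
    {
            if (claimed[blk / 8] & (1 << (blk % 8))) {
                    printf("DUP: block %lu also claimed by inode %lu\n",
                        blk, ino);
                    return (-1);            /* a "Bad Thing" already happened */
            }
            claimed[blk / 8] |= 1 << (blk % 8);
            return (0);
    }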
>					Philip Homburg
--
T.T.F.N.,
Dave Truesdell	truesdel@nas.nasa.gov/postmaster@nas.nasa.gov
	Wombat Wrestler/Software Packrat/Baby Wrangler/Newsmaster/Postmaster