Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!newshost.nla.gov.au!act.news.telstra.net!psgrain!newsfeed.internetmci.com!in1.uu.net!news.reference.com!cnn.nas.nasa.gov!gizmo.nas.nasa.gov!not-for-mail
From: truesdel@gizmo.nas.nasa.gov (Dave Truesdell)
Newsgroups: comp.unix.bsd.freebsd.misc,comp.os.linux.development.system
Subject: Re: The better (more suitable) Unix?? FreeBSD or Linux
Date: 20 Mar 1996 19:07:09 -0800
Organization: A InterNetNews test installation
Lines: 114
Message-ID: <4iqh4t$s1v@gizmo.nas.nasa.gov>
References: <4gejrb$ogj@floyd.sw.oz.au> <4hirl9$nr7@gizmo.nas.nasa.gov> <Dnu8FD.CK2@pe1chl.ampr.org> <4iajie$9fn@gizmo.nas.nasa.gov> <DoBs05.B19.0.-s@cs.vu.nl>
NNTP-Posting-Host: gizmo.nas.nasa.gov
X-Newsreader: NN version 6.5.0 #61 (NOV)
Xref: euryale.cc.adfa.oz.au comp.unix.bsd.freebsd.misc:15779 comp.os.linux.development.system:19756

philip@cs.vu.nl (Philip Homburg) writes:

>In article <4iajie$9fn@gizmo.nas.nasa.gov>,
>Dave Truesdell <truesdel@gizmo.nas.nasa.gov> wrote:

>%First case: Restoring a large filesystem on a large machine.
>%Here's an example of one of those 8 hour restores I mentioned.  The setup: a
>%500GB disk array, mounted async; 1GB memory (>500MB of it allocated to the
>%buffer cache); ~1.5 million i-nodes to restore; running the restore in single
>%user (no update daemon running).  If the restore had been running for several
>%hours, and a hardware glitch crashed the machine, what state do you think the
>%filesystem would be in?  In this situation, data blocks, which are written
>%only once, would age quickly and get flushed to disk as new data came in.
>%How about indirect blocks?  They can be updated multiple times as files
>%grow, so they don't age quite as fast.  Directory blocks?  They can get
>%written multiple times, as new files and directories are created, so they
>%don't age quite so fast either, and are less likely to get flushed to disk.
>%The same is true for inode blocks.  So, what situation are you left with?
>%Unless all the metadata gets written to disk, you may have most of your data
>%safely on disk, but if the metadata hasn't been flushed, you may not know
>%which i-nodes have been allocated, which data blocks have been allocated,
>%which data blocks belong to which i-nodes, etc.

>OK, 8 hours for 500GB and ~1.5 million i-nodes.
>Filesystem throughput: 500GB / (8*3600 seconds) = 17MB per second
>	(quite impressive).
>Average file size: 500GB / 1.5 million i-nodes = 333KB per file.
>Files per second: 1.5 million i-nodes / (8*3600 seconds) = 52 files/second.

>At these speeds I don't see why you expect blocks to be cached for a long
>time.  Furthermore, a filesystem that implements async metadata updates can
>still provide a synchronous sync(2) system call.
>Even an asynchronous sync system call which only writes all data to disk
>would be sufficient in this case.

I didn't expect *data* blocks to be cached for a long time, but what does
speed have to do with anything?  As I pointed out above, different buffers
age at different rates depending on how often they are modified, and how fast
buffers age determines what stays in the cache and what gets flushed to disk.

However, I'm still trying to figure out the point of the comment about the
existence of a sync(2) system call.  The only thing syncing will do is reduce
the extent of the damage, not eliminate it.  Inconsistencies simply start to
build up again after the sync completes.  And, on fast systems, 60 seconds
between syncs can be a very, very long time.
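To put that window in concrete terms, here's a minimal sketch of a
traditional update(8)-style loop, assuming the classic behaviour where
sync(2) only schedules dirty buffers for writing rather than waiting for
them; the 30-second interval is illustrative, not anything specific to the
systems discussed above.

    /*
     * Minimal sketch of an update(8)-style daemon, assuming the classic
     * behaviour: sync(2) only schedules dirty buffers for writing, and
     * anything dirtied after the call stays at risk until the next pass.
     * The 30-second interval is illustrative, not taken from any system
     * mentioned in this thread.
     */
    #include <unistd.h>

    int
    main(void)
    {
            for (;;) {
                    sync();         /* schedule every dirty buffer for I/O */
                    sleep(30);      /* the window in which damage rebuilds */
            }
    }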
>%BTW, Just to see what would happen, I tried to run an fsck on the partial
>%filesystem.  After what seemed like several hundred screens of just about
>%every error that fsck could detect, it finally dumped core.

>That only tells something about the quality of the fsck implementation...

It also tells us how messed up the filesystem was.

>%Here's a thought experiment.  Let's take a small filesystem, with only one
>%non-zero length file in it.  Call it file "A".  Delete file "A" and create a
>%second non-zero length file named "B".  Now, crash the system, without
>%completely syncing.  When you go back and examine that filesystem, what will
>%you find?  Will you find file "A" still in existence and intact?  Will you
>%find file "B" in existence and intact?  What would you find if one of "A"'s
>%blocks had been reused by "B"?  If the integrity of the metadata is not
>%maintained, you could find file "A" with a chunk of "B"'s data in it.  The
>%situation gets worse if the reused block is an indirect block.  How would
>%the system interpret data that overwrote an indirect block?

>This does not `destroy' your filesystem: fsck will (should) duplicate
>all blocks shared by multiple files.

Can you tell us of one fsck that does duplicate shared blocks?  And how does
it handle cases where the blocks are of different types: a file data block
versus an indirect block or a directory block?  And how will the filesystem
interpret the data in that block at a later time?  Last time I looked,
indirect blocks and file data blocks didn't have any magic cookies hidden
inside to distinguish them.

You don't seem to realize that having a block allocated to multiple files is
a "Bad Thing".  It means that "Bad Things" have already happened to the
filesystem.

>%How many of those systems didn't attempt to maintain consistent metadata?
>%I've run V6 on a PDP-11/34 in a half meg of RAM, using a pair of RK05's for
>%a whopping 10MB of filesystem.  I've written trillion-byte files as part of
>%testing new modifications to filesystem code.  I've tested filesystems that
>%claimed to be great improvements over the FFS, that I've been able to trash
>%(the filesystem could *NOT* be repaired) simply by writing two large files
>%simultaneously.  I've seen many people who think they've invented a "better"
>%filesystem, and I've seen how often they've been wrong.

>How do you define `could *NOT* be repaired'?

It was an extent-based filesystem that a *former* vendor (who will remain
nameless) included in their OS release.  Its equivalent of fsck was built
into the kernel.  (There were *no* external tools for examining and repairing
the filesystem other than those we wrote ourselves.)  Writing the two files
on a freshly created filesystem quickly crashed the OS, and left it in a
state where the OS crashed again whenever the filesystem was mounted.  Our
best guess about what happened was that writing two files simultaneously
caused a large number of small extents to be allocated, and that when the
system attempted to coalesce them into larger extents it stomped on its own
data structures.

Filesystems are hard to get right.  And if you get cocky and try to get away
with cutting corners, Mr. Murphy will be happy to teach you a lesson when you
can least afford it.

> How do you destroy old, untouched files by creating new files or by deleting
> files?

It's easy: all the filesystem code has to do is do something wrong, or do it
in the wrong order.  Imagine the chaos when you attempt to delete a file and
one of its indirect blocks has already been overwritten with something else.
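For what it's worth, the easy half of what Philip wants fsck to do is
noticing that a block has been claimed twice; a pass-1-style bitmap scan is
enough for that.  Below is a rough sketch of just that detection step; the
names (NBLOCKS, claim()) are invented for illustration and aren't taken from
any real fsck.  The hard half, deciding which claimant keeps the block or
copying it for each, is exactly where the missing "magic cookies" hurt, since
nothing in the block itself says whether it holds file data, indirect
pointers, or directory entries.

    /*
     * Rough sketch of duplicate-block *detection* only, pass-1 style.
     * NBLOCKS and claim() are invented for illustration; they are not
     * from any particular fsck.  Resolving the duplicate (deciding who
     * keeps the block, or copying it for both claimants) is not shown.
     */
    #include <stdio.h>

    #define NBLOCKS 8192                    /* blocks in a toy filesystem */
    static unsigned char claimed[NBLOCKS / 8];

    int
    claim(unsigned long blk, unsigned long ino)
    {
            if (claimed[blk / 8] & (1 << (blk % 8))) {
                    printf("DUP: block %lu also claimed by inode %lu\n",
                        blk, ino);
                    return (-1);            /* a "Bad Thing" already happened */
            }
            claimed[blk / 8] |= 1 << (blk % 8);
            return (0);
    }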
>					Philip Homburg
--
T.T.F.N.,
Dave Truesdell	truesdel@nas.nasa.gov/postmaster@nas.nasa.gov
	Wombat Wrestler/Software Packrat/Baby Wrangler/Newsmaster/Postmaster