Path: euryale.cc.adfa.oz.au!newshost.carno.net.au!harbinger.cc.monash.edu.au!munnari.OZ.AU!news.ecn.uoknor.edu!feed1.news.erols.com!news.idt.net!enews.sgi.com!news.corp.sgi.com!fido.asd.sgi.com!tilt.engr.sgi.com!rcc
From: rcc@tilt.engr.sgi.com (Ray Chen)
Newsgroups: comp.unix.bsd.freebsd.misc,comp.unix.bsd.bsdi.misc,comp.sys.sgi.misc
Subject: Re: no such thing as a "general user community"
Date: 28 Mar 1997 22:53:59 GMT
Organization: Silicon Graphics, Inc., Mountain View, CA
Lines: 131
Message-ID: <5hhi67$1gl@fido.asd.sgi.com>
References: <331BB7DD.28EC@net5.net> <5hfh2l$i13@flea.best.net> <5hfl3n$a3t@fido.asd.sgi.com> <5hh5n2$9q8@flea.best.net>
NNTP-Posting-Host: tilt.engr.sgi.com
Xref: euryale.cc.adfa.oz.au comp.unix.bsd.freebsd.misc:38028 comp.unix.bsd.bsdi.misc:6501 comp.sys.sgi.misc:29515

Ok, I've stayed out of this so far because I've been busy fixing bugs
and making some of those performance improvements people have been
grumping about us not doing :-) but I've got to jump in now.

In article <5hh5n2$9q8@flea.best.net>, Matt Dillon <dillon@flea.best.net> wrote:
> I don't think a log based filesystem will be much of a win over FFS
> if FFS is improved a bit. Really, only two improvements need to be
> made: (1) ordered I/O, which will fix the file create/delete rate
> problem, and (2) sparse sorted directories, which effectively solves
> the linear file lookup problem.

You know, I have a fair amount of respect for Matt and the work he's
done, and in particular, I'd like more info about the paging behavior
and 16K I/Os. I haven't looked at the IRIX paging code, so I may be
slandering my compatriots in the VM area, but I'm willing to believe
that IRIX doesn't do as good a job as it should once it starts to do
sustained paging. That's not something our customers typically want to
do -- once you start sustained paging, you're off the performance curve
anyway, and most of our customers configure their machines for maximum
performance.
And I have some suspicions about the 16K I/Os, but I need to do some
poking around first. Matt, if you can give me more data to make it
easier for us to reproduce the scenarios you've seen, I'll try to see
that the problems are fixed. Just because most of our customers don't
do sustained paging, for example, that doesn't mean IRIX shouldn't do
it well.

But I'm sorry. The filesystem comments are just flat-out wrong. FFS
will *always* be slower doing file creates than XFS or, for that
matter, any good journalling filesystem.

The fundamental problem with FFS is that to guarantee safety, when you
do file deletion/creation, the writes have to be ordered. However you
order the directory update and the inode deallocation, the first set
of updates has to hit the disk before the second set does. Otherwise,
you can get nice anomalies like a file changing into a named pipe
because the inode happened to be a named pipe before it was deleted
and reallocated as a file.

With a good journalling filesystem, each transaction commit should
require 1 log write. And with XFS, we do better than that because we
play fancy games so we can batch most of our transaction commits. That
way, file creations and deletions typically run without requiring any
synchronous I/O. The commit records get batched and written out
asynchronously. If you crash before the record is written out, you
lose the file creation, but your filesystem is still consistent.

There's no way an FFS will ever be able to do that. Sure, you can run
in async mode. But if you crash, even if the async writes are properly
ordered, unless all of them went out, you've probably got an
inconsistent filesystem. File creation in an FFS-style filesystem
requires 2 writes to go out, for example: a directory update and the
inode allocation. To be consistent, either both have to go out or
neither. But the writes typically have to go to 2 different places on
the drive, so you can't guarantee that both make it out or none.
You write one out, you go for the other, something goes wrong, and now
you've got an inconsistent filesystem.

A journalling filesystem lets you record a set of updates to the
filesystem and effectively issue them as one write. At that point,
even if the write spans multiple disk sectors, you *can* guarantee
that the write looks atomic. That lets you guarantee a consistent
filesystem.

Life isn't all roses in journalling-filesystem land. You still need to
flush the filesystem updates as the log wraps around (circular logs,
after all). But those can go out async, and if you do the right
things, they can also be batched or clustered to further increase your
I/O efficiency. Do it right and you can beat an FFS-style filesystem
under *any* load. XFS mostly does it right. But I know we can do
better, and we're working on it.

Plus, the code can get *very* intricate. Larry McVoy said 45,000
lines. He was only off by a factor of 2 :-). The real number is
something like 100,000 lines of kernel code. And that doesn't include
the XFS guaranteed-rate I/O mechanism. That's just the filesystem.

Enough about journalling vs. FFS. People have talked about XFS's main
claims to fame. I'd like to set the record straight. XFS's main claims
to fame are the S-words: speed, scalability, safety.

Speed: we're fast. We hit >300 MB/sec the first day we shipped, and
that number's been going up ever since. As the I/O hardware gets
faster, so do we. We've done >500 MB/sec for over a year now. And that
was with (now) 4-year-old hardware. Now that the new stuff is out,
stay tuned :-).

Scalability: we can work on big files and filesystems. 80 GB
filesystems are routine. So are 12 GB files. We work on large
directories. Put a million files into a directory. The filesystem
still runs fast.

Safety: if your computer crashes for some reason, the 80 GB filesystem
recovers in < 15 seconds, and it's just fine.
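To make the FFS half of the argument concrete, here's a toy model of
that two-write crash window (my own sketch, not actual FFS code): a
file create has to update a directory block and an inode block that
live in different places on the disk, and a crash between the two
writes leaves a directory entry pointing at an inode that was never
written.

```python
# Toy model of the FFS crash window: a file create must update two
# separate on-disk structures (directory block and inode block), and
# nothing guarantees that both writes land.

disk = {"directory": {}, "inodes": {}}

def create_file(name, inum, crash_between_writes=False):
    # Write 1: the directory entry, pointing at the inode number.
    disk["directory"][name] = inum
    if crash_between_writes:
        return  # simulated power failure: write 2 never happens
    # Write 2: the inode itself, marked allocated as a regular file.
    disk["inodes"][inum] = {"type": "file", "allocated": True}

def is_consistent():
    # Every directory entry must point at an allocated inode.
    return all(i in disk["inodes"] for i in disk["directory"].values())

create_file("a.txt", 7)
assert is_consistent()

create_file("b.txt", 8, crash_between_writes=True)
# The directory now names inode 8, but inode 8 was never written:
# a dangling entry, i.e. an inconsistent filesystem needing fsck.
assert not is_consistent()
```

Reversing the write order doesn't help; then a crash leaves an
allocated inode that no directory entry reaches, which is the mirror
image of the same inconsistency.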
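And here's the journalling half as an equally simplified sketch (again
my own toy model, not XFS code): each transaction's updates plus its
commit marker go out as one logical log write, and recovery replays
only committed records, so a crash can lose the last file creation but
can never leave the metadata half-applied.

```python
# Toy write-ahead log: each transaction (a batch of metadata updates
# plus a commit marker) is appended to the log as one atomic record.
# Recovery replays committed records only.

log = []

def log_create(name, inum, committed=True):
    # One logical log write: both updates and the commit marker together.
    log.append({"updates": [("dirent", name, inum), ("inode", inum)],
                "committed": committed})

def recover():
    disk = {"directory": {}, "inodes": {}}
    for rec in log:
        if not rec["committed"]:
            continue  # uncommitted record: skip the whole transaction
        for update in rec["updates"]:
            if update[0] == "dirent":
                disk["directory"][update[1]] = update[2]
            else:
                disk["inodes"][update[1]] = {"allocated": True}
    return disk

log_create("a.txt", 7)
log_create("b.txt", 8, committed=False)  # crash before the commit hit disk

disk = recover()
# "a.txt" survives; "b.txt" is lost entirely -- but every directory
# entry still points at an allocated inode, so no fsck is needed.
assert "a.txt" in disk["directory"]
assert "b.txt" not in disk["directory"]
assert all(i in disk["inodes"] for i in disk["directory"].values())
```

Recovery here is a single pass over the log, which is why a journalled
filesystem can come back in seconds regardless of filesystem size.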
> The cool thing is that both features can be added while maintaining
> full compatibility. Add in a somewhat better kernel locking mechanism
> and poof, you are done.
>
> Now, of course you still have to fsck an ffs filesystem verses a log
> or journaling filesystem. While this is an important difference, I
> consider it minor if the rest of the machine is reliable (i.e. doesn't
> crash very often), especially if FFS is further adjusted to set the
> clean bit on mounted filesystems that have been synced up and are idle.
>
> -Matt

We have 24x7 customers running high-availability configurations who
would disagree with you about fsck. They don't *ever* want to run fsck
on a 40 GB filesystem. If they crash, they want to be back up fast.
fsck is just too slow.

Matt, all the above aside, I'd be grateful if you could forward Larry
or myself more concrete data about some of the stuff you've seen,
including the paging and 16K I/O problem. I'd like to see those
problems fixed.

	Ray Chen
	rcc@sgi.com
--
Raymond C. Chen, PhD          rcc@sgi.com
Member of Technical Staff     Silicon Graphics, Inc. (SGI)
High-End Operating Systems
Generic Disclaimer: I speak only for me.