*BSD News Article 92291



Path: euryale.cc.adfa.oz.au!newshost.carno.net.au!harbinger.cc.monash.edu.au!munnari.OZ.AU!news.ecn.uoknor.edu!feed1.news.erols.com!news.idt.net!enews.sgi.com!news.corp.sgi.com!fido.asd.sgi.com!tilt.engr.sgi.com!rcc
From: rcc@tilt.engr.sgi.com (Ray Chen)
Newsgroups: comp.unix.bsd.freebsd.misc,comp.unix.bsd.bsdi.misc,comp.sys.sgi.misc
Subject: Re: no such thing as a "general user community"
Date: 28 Mar 1997 22:53:59 GMT
Organization: Silicon Graphics, Inc., Mountain View, CA
Lines: 131
Message-ID: <5hhi67$1gl@fido.asd.sgi.com>
References: <331BB7DD.28EC@net5.net> <5hfh2l$i13@flea.best.net> <5hfl3n$a3t@fido.asd.sgi.com> <5hh5n2$9q8@flea.best.net>
NNTP-Posting-Host: tilt.engr.sgi.com
Xref: euryale.cc.adfa.oz.au comp.unix.bsd.freebsd.misc:38028 comp.unix.bsd.bsdi.misc:6501 comp.sys.sgi.misc:29515

Ok, I've stayed out of this so far because I've been busy
fixing bugs and making some of those performance improvements
people have been grumping about us not doing :-) but I've got
to jump in now.

In article <5hh5n2$9q8@flea.best.net>,
Matt Dillon <dillon@flea.best.net> wrote:
>    I don't think a log based filesystem will be much of a win over FFS
>    if FFS is improved a bit.  Really, only two improvements need to be
>    made:  (1) ordered I/O, which will fix the file create/delete rate
>    problem, and (2) sparse sorted directories, which effectively solves
>    the linear file lookup problem.

You know, I have a fair amount of respect for Matt and the work he's done
and in particular, I'd like more info about the paging behavior and 16K
I/Os.  I haven't looked at the IRIX paging code so I may be slandering my
compatriots in the VM area but I'm willing to believe that IRIX doesn't
do as good a job as it should once it starts to do sustained paging.
That's not something our customers typically want to do -- once you
start sustained paging, you're off the performance curve anyway and
most of our customers configure their machines for maximum performance.
And I have some suspicions about the 16K I/Os but I need to do some
poking around first.

Matt, if you can give me more data to make it easier for us to
reproduce the scenarios you've seen, I'll try and see that the
problems are fixed.  Just because most of our customers don't do
sustained paging for example, that doesn't mean IRIX shouldn't do
it well.

But I'm sorry.  The filesystem comments are just flat-out wrong.

FFS will *always* be slower doing file creates than XFS or for that
matter, any good journalling filesystem.

The fundamental problem with FFS is that to guarantee safety, when
you do file deletion/creation, the writes have to be ordered.  However
you schedule the directory update and the inode deallocation, the first
set of updates has to hit the disk before the second set does.
Otherwise, you can get nice anomalies like a file changing to a named
pipe because the inode happened to be a named pipe before it was deleted
and reallocated as a file.
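To make the hazard concrete, here's a toy Python sketch (purely
hypothetical -- nothing to do with the real FFS code) of a create where
the directory entry reaches disk but the inode rewrite doesn't:

```python
# Simulated on-disk state.  Inode 7 used to be a named pipe; it was
# freed and is now being reallocated for a new regular file.
disk_inodes = {7: "named pipe"}
disk_dir = {}

def create_file(name, inum, crash_after_dirent=False):
    # Wrong order: the directory entry goes out first, the inode
    # rewrite second.  A crash in between loses the inode write.
    disk_dir[name] = inum
    if crash_after_dirent:
        return                      # simulated crash: inode write lost
    disk_inodes[inum] = "regular file"

create_file("report.txt", 7, crash_after_dirent=True)
# After "reboot", the new name points at the stale inode contents:
print(disk_dir["report.txt"], disk_inodes[7])   # 7 named pipe
```

The file you just "created" comes back as a named pipe -- exactly the
anomaly above.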

With a good journalling filesystem, each transaction commit should require
1 log write.  And with XFS, we do better than that because we play fancy
games so we can batch most of our transaction commits.  That way, file
creations and deletion typically run without requiring any synchronous I/O.
The commit records get batched and written out asynchronously.  If you
crash before the record is written out, you lose the file creation but
your filesystem is still consistent.

There's no way an FFS will ever be able to do that.  Sure you can run
in async mode.  But if you crash, even if the async writes are properly
ordered, unless all of them went out, you've probably got an inconsistent
filesystem.

File creation in an FFS-style filesystem, for example, requires 2
writes to go out:  a directory update and the inode allocation.  And to
be consistent, they have to both go out or neither.  But the writes
typically have to go to 2 different places on the drive.  So you
can't guarantee that both make it out or neither does.  You write one
out, you go for the other, something goes wrong, and now you've got an
inconsistent filesystem.

A journalling filesystem lets you record a set of updates to the
filesystem and effectively issue them as one write.  At that point,
even if the write spans multiple disk sectors, you *can* guarantee
that the write looks atomic.  That lets you guarantee a consistent
filesystem.
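Here's a minimal replay sketch in Python (again hypothetical, not XFS
code) showing why a logged transaction applies whole or not at all:

```python
def replay(log):
    # Rebuild filesystem state from the journal.  A transaction's
    # records only count if its COMMIT record made it to the log.
    fs, txn = {}, []
    for rec in log:
        if rec == "COMMIT":
            fs.update(txn)      # whole transaction, all or nothing
            txn = []
        else:
            txn.append(rec)
    return fs                   # records after the last COMMIT are dropped

# Crash mid-transaction: the second directory entry was logged but its
# COMMIT never hit the disk.
log = [("dir:report.txt", 7), "COMMIT", ("dir:notes.txt", 8)]
print(replay(log))              # {'dir:report.txt': 7}
```

The torn update simply vanishes on replay, so even a multi-sector
update looks atomic to the recovered filesystem.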

Life isn't all roses in journalling filesystem land.  You still
need to flush the filesystem updates as the log wraps around
(circular logs, after all).  But those can go out async.  And
if you do the right things, they can also be batched or clustered
to further increase your I/O efficiency.  Do it right and you can
beat an FFS-style filesystem under *any* load.  XFS mostly does it
right.  But I know we can do better and we're working on it.
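A toy Python model of the wrap-around constraint (made-up names, not
the real log code):

```python
LOG_SIZE = 4
log = [None] * LOG_SIZE
head = tail = 0     # monotonic counters; tail marks the oldest live record
flushes = 0

def checkpoint():
    # Flush (asynchronously, in real life) the in-place metadata the
    # oldest records describe, so their log space can be reused.
    global tail, flushes
    flushes += 1
    tail = head

def append(rec):
    global head
    if head - tail == LOG_SIZE:     # about to wrap over live records
        checkpoint()
    log[head % LOG_SIZE] = rec
    head += 1

for i in range(6):
    append(("update", i))
print(flushes)                      # 1 -- one checkpoint at the wrap
```

The checkpoint cost is amortized across every record in the log, which
is why it doesn't show up as per-operation latency.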

Plus the code can get *very* intricate.  Larry McVoy said 45,000 lines.
He was only off by a factor of 2 :-).  The real number is something
like 100,000 lines of kernel code.  And that doesn't include the
XFS guaranteed rate I/O mechanism.  That's just the filesystem.

Enough about journalling vs. FFS.  People have talked about XFS's
main claims to fame.  I'd like to set the record straight.

XFS's main claims to fame are the S-words:  speed, scalability, safety.

Speed:  we're fast.  We hit >300 MB/sec the first day we shipped and
that number's been going up ever since.  As the I/O hardware gets
faster, so do we.  We've done >500 MB/sec for something like a
year now.  And that was with (now) 4-year-old hardware.  Now that
the new stuff is out, stay tuned :-).

Scalability:  we can work on big files and filesystems.  80 GB
filesystems are routine.  So are 12 GB files.  We work on large
directories.  Put a million files into a directory.  The filesystem
still runs fast.

Safety:  if your computer crashes for some reason, the 80 GB filesystem 
recovers in < 15 seconds and it's just fine.

>    The cool thing is that both features can be added while maintaining
>    full compatibility.  Add in a somewhat better kernel locking mechanism
>    and poof, you are done.
>
>    Now, of course you still have to fsck an ffs filesystem verses a log
>    or journaling filesystem.  While this is an important difference, I
>    consider it minor if the rest of the machine is reliable (i.e. doesn't
>    crash very often), especially if FFS is further adjusted to set the 
>    clean bit on mounted filesystems that have been synced up and are idle.
>
>						-Matt

We have 24x7 customers running high-availability configurations who
would disagree with you about fsck.  They don't *ever* want to run
fsck on a 40 GB filesystem.  If they crash, they want to be back up
fast.  fsck is just too slow.

Matt, all the above aside, I'd be grateful if you could forward
Larry or myself more concrete data about some of the stuff you've
seen, including the paging and 16K I/O problem.  I'd like to see
those problems fixed.

	Ray Chen
	rcc@sgi.com

--
Raymond C. Chen, PhD                 rcc@sgi.com
Member of Technical Staff            Silicon Graphics, Inc. (SGI)
High-End Operating Systems           Generic Disclaimer:  I speak only for me.