Newsgroups: comp.os.386bsd.development
Path: sserve!newshost.anu.edu.au!munnari.oz.au!news.Hawaii.Edu!ames!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: Compressing file system ?
Message-ID: <1993Aug9.233341.20631@fcom.cc.utah.edu>
Sender: news@fcom.cc.utah.edu
Organization: Weber State University, Ogden, UT
References: <EgN6R_K00WB5JOZpRc@andrew.cmu.edu>
Date: Mon, 9 Aug 93 23:33:41 GMT
Lines: 98

In article <EgN6R_K00WB5JOZpRc@andrew.cmu.edu> "Alex R.N. Wetmore" <aw2t+@andrew.cmu.edu> writes:

[ ... compressed FS ... ]

>LKM's would seem to be the way to do this.  On the other hand, it seems
>that having shared libraries would give a large enough increase in disk
>space, without all of the headaches of compression (although shared libs
>have enough headaches themselves).

Yes and no (OK, I jumped in 8-)).

LKM's provide the ability to load junk into the kernel; there is still a
piece missing, and that's the ability to attribute the file and to do the
compression on block boundaries (as has been suggested).  File attribution
is the most serious issue, and almost requires a Rosenthal or Heidemann
style of vnode stacking to handle.  You *could* load layers as kernel
modules *after* that condition was met, however.

The use of an NFS file system for compressing/decompressing executables is
broken in several ways.  The most important (and most obvious) is that the
current handling of ETXTBUSY, which comes from using the image as the
program swap store, is broken, broken, broken.  Specifically, there is no
way on an NFS server to tell whether the remote machine opened the
executable to copy it, to write it, or to execute it.  The vnode getting
the flagging is an alias generated by NFS, referring to the local buffer
cache pages in any case.  If I am decompressing a file from a block buffer
and then copying the block buffer out "over the wire" (which may be a local
loopback mount), I have the problem of maintaining attribution state
between the client and the server based on the file's intended use on the
"remote machine".  An NFS implementation would work, then, but it would
work poorly.

It should be obvious that for a consistent external view, the decompression
must occur on the NFS server, but for consistent flagging of modification,
the decompression must take place on the client.  A trade-off must be made,
and it will be expensive.

An existing implementation allows reference-delay flagging of executables,
such that the decompression occurs on reference, and the recompression
occurs over an amount of idle time in which no file blocks are requested or
written.  This is clearly broken for many NFS uses (the most glaring being
a remote swap partition; the second most glaring being as backing store for
an executable, where the page requests to the server are timed out because
of the frequency of client cache hits, but the decompress/compress cycles
occur while the file is in use because not all pages are in the client
cache, or can be flushed from the client cache by large non-hit activities,
like an "ls -R" of some partition).  A partial fix that any sane person
would stuff into this code immediately is to disallow the recompress while
the file has an open reference instance; a sketch of such a guard follows.
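To make that concrete, here is a minimal sketch (not code from any of the
implementations mentioned above) of what such a guard might look like in a
compression layer's idle-recompression pass.  The cnode structure, its
fields, and cnode_recompress_blocks() are invented for illustration; only
the guard itself is the point.

#include <sys/param.h>
#include <sys/errno.h>
#include <sys/vnode.h>

/*
 * Per-file state a compression layer might keep.  All of this (the
 * cnode, its fields, CN_DECOMPRESSED) is made up for the example.
 */
struct cnode {
	struct vnode	*cn_vnode;	/* underlying vnode */
	int		 cn_idle_ticks;	/* ticks since last block read/write */
	int		 cn_flags;
};
#define	CN_DECOMPRESSED	0x0001		/* currently expanded on disk */

static int cnode_recompress_blocks(struct cnode *);	/* hypothetical */

/*
 * Called from the layer's idle sweep: recompress only files that are
 * expanded, idle long enough, and -- the partial fix above -- not
 * currently referenced at all (open, mmap'd, or in use as text).
 */
static int
cnode_try_recompress(struct cnode *cp, int idle_threshold)
{
	struct vnode *vp = cp->cn_vnode;

	if ((cp->cn_flags & CN_DECOMPRESSED) == 0)
		return (0);		/* nothing to do */

	/* v_usecount/VTEXT as in the CSRG vnode; adjust to taste. */
	if (vp->v_usecount > 0 || (vp->v_flag & VTEXT))
		return (EBUSY);		/* open reference: leave it alone */

	if (cp->cn_idle_ticks < idle_threshold)
		return (EAGAIN);	/* not idle long enough yet */

	return (cnode_recompress_blocks(cp));
}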
I have also seen a suggestion that the compression be on block boundaries;
this would be acceptable if a different block handling mechanism were used,
such that the cache blocks were all 8k or so and the on-disk blocks were
divided into low-penalty fragments.  With the exception of the last block
in the file, the on-disk blocks would all be considered to contain 8k worth
of data, and the in-core buffers would be filled from smaller-than-expected
on-disk blocks through a decompressing fbread and a compressing fbwrite
that copied to a separate block from the vmcache (a rough sketch of such a
read/write pair is tacked on at the end of this post).  This would imply
that no unification was possible, or, if unification was to be done, that
there must be a change in the vnode flags to include whether or not the
file is compressed.  This is in line with Peter's "chmod +c" suggestion,
and with the treatment of cache buffers as compressed or uncompressed based
on which side of the block I/O interface they were on.  I arbitrarily
picked 8k to keep the minimum fragment size at 1k after it had been halved.

The main problem with this approach is that the fragmentation and
allocation policies on the FS must be drastically changed; for instance,
since frags are used for all file contents, cylinder grouping must be
thrown out the window.  In addition, fragments must no longer be considered
second class citizens, and this would certainly interfere with their
ability to be shared between files.

This would also get around the mmap problem, as the 8k buffers hung off the
vnode as "buffer cache" would be in uncompressed format and could be
considered atomic for the purposes of mapping.

Perhaps the best implementation of a compressed file scheme I have seen was
the one in the UCLA Ficus implementation in BSD4.4; it has many of the
attributes I have tagged as "desirable" (noting that the aesthetics are
dictated by teeny, tiny disks rather than by clean kernel implementation or
fast access).

I'd encourage anyone who wants to get into building file systems to do so;
there are very few live projects of this type out there, and I can count
the number of commercial file system vendors (that's all they do) on one
hand.  I'd also point you to ftp.ucla.com (where you can get the Heidemann
papers) and to the various proceedings of Usenix for the past 4 years,
where there have been many papers on this and other file system topics.

Happy hacking!


					Terry Lambert
					terry@icarus.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.
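(The promised sketch of the decompressing fbread / compressing fbwrite
shape, tacked on here at the end.  None of this is FFS code; cfs_fbread(),
cfs_fbwrite(), the cblk header, and the squeeze()/expand() helpers are all
invented names, and a real layer would read the header fragment first
rather than the whole worst-case extent.  In-core buffers are always a full
CFS_BSIZE; on-disk blocks are whatever smaller number of 1k frags the data
squeezed into.)

#include <sys/param.h>		/* roundup(), size_t, daddr_t, u_long */
#include <sys/errno.h>

#define	CFS_BSIZE	8192	/* logical (in-core) block size */
#define	CFS_FSIZE	1024	/* on-disk fragment size */

struct cblk {			/* per-block on-disk header */
	u_long	cb_clen;	/* compressed length, in bytes */
	/* cb_clen bytes of compressed data follow */
};

/* Hypothetical helpers: raw fragment I/O and a compressor of your choice. */
int	frag_read(daddr_t blkno, void *dst, size_t nbytes);
int	frag_write(daddr_t blkno, const void *src, size_t nbytes);
size_t	squeeze(const void *src, size_t slen, void *dst, size_t dlen);
size_t	expand(const void *src, size_t slen, void *dst, size_t dlen);

/*
 * Fill a full-sized in-core buffer from a smaller-than-expected
 * on-disk block (the decompressing "fbread").
 */
int
cfs_fbread(daddr_t blkno, char *cache_buf)	/* cache_buf: CFS_BSIZE bytes */
{
	char ondisk[sizeof(struct cblk) + CFS_BSIZE];
	struct cblk *cb = (struct cblk *)ondisk;
	int error;

	if ((error = frag_read(blkno, ondisk, sizeof(ondisk))) != 0)
		return (error);
	if (cb->cb_clen > CFS_BSIZE ||
	    expand(ondisk + sizeof(*cb), cb->cb_clen,
		cache_buf, CFS_BSIZE) != CFS_BSIZE)
		return (EIO);		/* corrupt, or never compressed */
	return (0);
}

/*
 * Compress a full-sized in-core buffer into a separate staging block
 * before it crosses the block I/O interface (the compressing "fbwrite");
 * the caller allocates roundup(cb_clen, CFS_FSIZE) worth of frags.
 * Incompressible blocks would need a "stored as-is" escape; elided here.
 */
int
cfs_fbwrite(daddr_t blkno, const char *cache_buf)
{
	char staging[sizeof(struct cblk) + CFS_BSIZE];
	struct cblk *cb = (struct cblk *)staging;

	cb->cb_clen = squeeze(cache_buf, CFS_BSIZE,
	    staging + sizeof(*cb), CFS_BSIZE);
	return (frag_write(blkno, staging,
	    sizeof(*cb) + roundup(cb->cb_clen, CFS_FSIZE)));
}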