*BSD News Article 3797

Xref: sserve comp.unix.bsd:3844 comp.sys.ibm.pc.hardware:29366
Newsgroups: comp.unix.bsd,comp.sys.ibm.pc.hardware
Path: sserve!manuel!munnari.oz.au!mips!mips!darwin.sura.net!wupost!uunet!mcsun!sunic!aun.uninett.no!barsoom!barsoom!tih
From: tih@barsoom.nhh.no (Tom Ivar Helbekkmo)
Subject: Re: Disklabeling (was Re: Another Adaptec Question)
Message-ID: <tih.714115743@barsoom>
Sender: news@barsoom.nhh.no (USENET News System)
Organization: Norwegian School of Economics
References: <1992Aug16.144341.24052@Informatik.TU-Muenchen.DE> <15776@star.cs.vu.nl> <1992Aug17.173807.2309@Informatik.TU-Muenchen.DE> <#79m_a.alm@netcom.com>
Distribution: world,fj,spec
Date: Tue, 18 Aug 1992 05:29:03 GMT
Lines: 133

alm@netcom.com (Andrew Moore) writes:

>To return to 386BSD and disklabeling a SCSI drive: 
>am I correct in assuming that the specification of # of cylinders,
>sectors per cylinder, etc does not matter at all, so long as the total
>number of blocks is correct?

No, you're not...  It *does* matter to the file system.  Although the
following was written while I was trying to figure out how to set up
SCSI disks correctly for Ultrix, it's relevant to any system using
the Berkeley Fast File System.  This is a summary I posted to the net
in comp.unix.ultrix a while back:

I recently asked how to correctly set up disk partitions for the SCSI
disks connected to our DECstations, specifying some of the problems I
had understanding what was right and wrong.  I've had several very
interesting responses, and feel that I've learned quite a bit of
useful stuff here...  Thanks go to Klaus Steinberger, Walter Wong,
Mike Mitchell, and especially to Stefan Esser, who took the time to
explain a lot of details to me in our email correspondence.

Anyway -- to recap my situation, I wanted to make sure I partitioned
my disks so partition boundaries were placed at cylinder boundaries,
and their sizes worked out properly to an integral number of cylinder
groups, 16 cylinders per group being the default number.  Looking at
the /etc/disktab entries for the disks helped me little, since DEC
obviously hadn't cared about this in their setup, and multiplying
sectors/track by tracks/cylinder by cylinders didn't work out to the
specified total number of sectors on the disks anyway!
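As a rough sketch of the arithmetic I was attempting here (in Python;
the geometry figures are made up for illustration, and the function
names are hypothetical rather than taken from any real tool):

```python
# Hypothetical helpers for checking a disktab-style geometry.
# All numbers below are illustrative, not those of any real disk.

def geometry_capacity(sectors_per_track, tracks_per_cylinder, cylinders):
    """Total sectors implied by the stated geometry."""
    return sectors_per_track * tracks_per_cylinder * cylinders

def is_cylinder_aligned(offset, sectors_per_cylinder):
    """True if a partition offset falls exactly on a cylinder boundary."""
    return offset % sectors_per_cylinder == 0

def round_down_to_cylinder_groups(size, sectors_per_cylinder,
                                  cylinders_per_group=16):
    """Largest size <= `size` that is a whole number of cylinder groups."""
    group = sectors_per_cylinder * cylinders_per_group
    return (size // group) * group

spt, tpc, cyls = 36, 15, 1000          # assumed example geometry
spc = spt * tpc                        # 540 sectors per cylinder
print(geometry_capacity(spt, tpc, cyls))          # 540000
print(round_down_to_cylinder_groups(100000, spc)) # 95040
print(is_cylinder_aligned(95040, spc))            # True
```

If the first figure disagrees with the total sector count in the
disktab entry, the stated geometry cannot be the whole story.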

Well, it turns out that the situation is more complex than that...

The BSD Fast File System uses certain heuristics to allocate disk
blocks within a partition.  Some of these are supposed to increase
data security (against accidental loss), some to make file access more
efficient.  For instance:

- Groups of inodes are allocated in each cylinder group in the
  partition, and attempts are made to keep file data blocks near the
  inodes describing them.  (Efficiency)

- Each cylinder group has a redundant copy of the superblock, which is
  staggered by one track per cylinder group, to keep them on different
  platters.  Inodes follow the superblock copy, to stagger those as
  well.  (Security)

- File data blocks are allocated for rotational contiguity.  The
  optimal block is not necessarily the one following the previous one;
  if the system is not fast enough to schedule a new disk transfer in
  time, a "rotationally later" block is selected.  If the optimal block
  on the disk is already taken, the file system tries the same block
  (or one following it as closely as possible) on another track in
  the same cylinder instead, and so on.  (Efficiency)

There's more to it than this, of course, but note the assumptions
being made here: The file system needs to know the correct disk
geometry: cylinders, heads, and sectors per track.  The product of the
last two of these must be the correct number of sectors per cylinder.
It must also know the rotational speed of the disk, and it assumes
that sectors within tracks are numbered in parallel, so that sector 0
of each track in a cylinder passes the read/write head at the same
time.
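To make these assumptions concrete, here is a simplified sketch (in
Python) of the kind of calculation they permit.  This is not the
actual file system code; the function names and the example figures
(15 heads, 36 sectors per track, 3600 rpm, a 4 ms delay) are
assumptions chosen for illustration only:

```python
import math

def to_chs(sector, heads, sectors_per_track):
    """Map a linear sector number to (cylinder, head, sector),
    assuming every cylinder holds exactly heads * sectors_per_track."""
    sectors_per_cylinder = heads * sectors_per_track
    cylinder, rest = divmod(sector, sectors_per_cylinder)
    head, sec = divmod(rest, sectors_per_track)
    return cylinder, head, sec

def sectors_passing(delay_ms, rpm, sectors_per_track):
    """Sectors that rotate past the head during delay_ms milliseconds,
    rounded up: the gap to leave before a 'rotationally later' block."""
    return math.ceil(delay_ms * rpm * sectors_per_track / 60000)

print(to_chs(1000, heads=15, sectors_per_track=36))  # (1, 12, 28)
print(sectors_passing(4, rpm=3600, sectors_per_track=36))  # 9
```

Both results are only meaningful if the geometry and rotational speed
fed in are the real ones, and if sector 0 lines up across all tracks
of a cylinder.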

Guess what?  This doesn't hold true for SCSI disks!  These disks tend
to do quite a bit of optimization of their own, behind the file
system's back...  For instance:

- Tracks are usually rotationally staggered, to optimize the time to
  sequentially get from the last sector of one track to the first sector
  of the (logically) next one.  This counteracts the rotational delay
  optimization in the file system.

- Spare sectors (for bad block replacement) are usually allocated on
  a per-cylinder basis.  This is a good strategy for optimal disk
  utilization and effective relocation, but it breaks the file
  system's calculation of where cylinder boundaries are, since heads
  multiplied by sectors per track does not equal (usable) sectors
  per cylinder.

- Large SCSI disks tend to use zone bit recording, which means that
  there are more sectors per track on the outer tracks than on the
  inner ones.  Such disks then lie to the file system about geometry,
  giving it something that works out close to the correct size of the
  disk.  Again, this ruins the file system's attempt to intelligently
  use cylinder boundary information, which is guaranteed to be wrong.
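The effect of per-cylinder spares can be sketched as follows (in
Python).  The model is an assumption: the drive is taken to map
logical sectors linearly onto the usable sectors of each cylinder,
with no defects remapped yet, and the figures (15 heads, 36 sectors
per track, 6 spares per cylinder) are invented for illustration:

```python
def real_position(logical_sector, heads, sectors_per_track,
                  spares_per_cylinder):
    """Return (real cylinder, offset within it) for a logical sector,
    given that each cylinder loses spares_per_cylinder sectors."""
    usable = heads * sectors_per_track - spares_per_cylinder
    return divmod(logical_sector, usable)

def assumed_boundary(n, heads, sectors_per_track):
    """Logical sector where the file system believes cylinder n starts."""
    return n * heads * sectors_per_track

heads, spt, spares = 15, 36, 6
for n in (1, 100):
    start = assumed_boundary(n, heads, spt)
    print(n, real_position(start, heads, spt, spares))
# 1 (1, 6)
# 100 (101, 66)
```

Under these assumptions, what the file system takes to be the start
of cylinder 100 actually falls 66 sectors into cylinder 101, and the
error grows with every cylinder.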

So, what do you do if you want optimal performance from a SCSI disk?
Well, as long as the disk does not do zone bit recording, there may
be hope.  SCSI disks can be reparameterized and reformatted.  However,
the number of parameters that you can change varies from disk to disk.
(See the man page entry on 'rzdisk' for more information on how to
examine and change these parameters.)

- If you can set track skew and cylinder skew parameters to zero, thus
  reorienting the geometry of the disk to what the file system expects,
  you can get the timing calculations to work.

- If you can make the disk allocate spare sectors on a per-track basis,
  you can make the cylinder boundary calculations work right, by
  using, say, one spare per track, and telling the file system that
  the disk has one less sector per track than it really does.  (The
  file system doesn't know about spares; it counts usable sectors.)
  This means that tracks with more than one fault will be reallocated
  to the spare cylinders you reserve at the end of the disk, but that
  can't very well be helped.

- If spare sectors can only be allocated on a per-cylinder basis, a
  hack is still possible!  According to Stefan Esser, you can specify
  (through /etc/disktab and/or mkfs) that the disk has only one head,
  with a rather large number of sectors per track (the number of
  actually usable sectors per real cylinder).  He notes, however, that
  a patch to the ufs_alloc() function in the file system is necessary,
  because, as shipped from DEC, it can't handle this large number of
  sectors per track.
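The last two workarounds amount to simple arithmetic on the geometry
you hand to the file system.  A minimal sketch in Python follows; the
function names and the example drive (15 heads, 36 physical sectors
per track, 6 spares per cylinder) are hypothetical, and the returned
dictionaries are not in disktab format:

```python
def per_track_spare_geometry(heads, physical_spt, spares_per_track=1):
    """Geometry to report when spares are reserved on every track:
    the file system only counts usable sectors."""
    usable_spt = physical_spt - spares_per_track
    return {"heads": heads,
            "sectors_per_track": usable_spt,
            "sectors_per_cylinder": heads * usable_spt}

def one_head_geometry(heads, physical_spt, spares_per_cylinder):
    """The one-head hack: present each real cylinder as a single long
    track containing only its usable sectors."""
    usable = heads * physical_spt - spares_per_cylinder
    return {"heads": 1,
            "sectors_per_track": usable,
            "sectors_per_cylinder": usable}

print(per_track_spare_geometry(15, 36))
# {'heads': 15, 'sectors_per_track': 35, 'sectors_per_cylinder': 525}
print(one_head_geometry(15, 36, 6))
# {'heads': 1, 'sectors_per_track': 534, 'sectors_per_cylinder': 534}
```

In both cases the point is that heads times sectors per track, as
reported, equals the number of usable sectors in a real cylinder.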

It would seem, then, that the correct choice, when you need high disk
throughput, is to get a disk that does not do zone bit recording, and
that can be reparameterized to use a non-staggered layout with a spare
sector per track.  This will normally mean smaller disks, and thus an
increased number of drives to achieve the same storage space -- which
isn't too bad anyway if you're really into speed; e.g. two optimized
drives on each of two SCSI controllers should be much better than two
non-optimized, bigger drives on one controller.

I expect, though, that future versions of the BSD Fast File System
will have knowledge of SCSI disks, and how to use them effectively.
I understand that Sun has already made such changes, resulting in
noticeable improvements.

-tih
--
Tom Ivar Helbekkmo, NHH, Bergen, Norway.  Telephone: +47-5-959205
Postmaster for domain nhh.no.   Internet mail: tih@barsoom.nhh.no