Xref: sserve comp.unix.bsd:7620 comp.benchmarks:2332 comp.arch:27997 comp.arch.storage:673
Path: sserve!manuel.anu.edu.au!munnari.oz.au!sgiblab!swrinde!cs.utexas.edu!sun-barr!olivea!sgigate!odin!sgi!igor!jbass
From: jbass@igor.tamri.com (John Bass)
Newsgroups: comp.unix.bsd,comp.benchmarks,comp.arch,comp.arch.storage
Subject: Disk performance issues, was IDE vs SCSI-2 using iozone
Message-ID: <1992Nov7.102940.12338@igor.tamri.com>
Date: 7 Nov 92 10:29:40 GMT
Sender: jbass@dmsd.com
Organization: DMS Design
Lines: 317

Copyright 1992, DMS Design, all rights reserved.

There should be something to learn or debate in this posting for everyone!

(Most of what follows is from a talk I gave this summer at SCO Forum, and is material I have for several years been trying to pound into systems designers & integrators everywhere. Many thanks to doug@sco.com for letting/paying me to learn so much about PC UNIX systems over the last 3 years and give the interesting IDE vs SCSI SCO Forum presentation. Too bad the politics prevented me from using it to vastly improve their filesystem and disk I/O.)

There are a number of significant issues to get right when comparing IDE vs SCSI-2 if you want to avoid comparing apples to space ships -- this topic is loaded with traps. I am not a new kid on the block ... I've been around/in the computer biz since '68, UNIX since '75, and SASI/SCSI from the dark days, including the design of several SCSI hostadapters (the 3-day MAC hack published in DDJ, a WD1000-emulating SCSI hostadapter for a proprietary bus, and a single board computer design). I understand UNIX/disk systems performance better than most.

For years people have generally claimed SCSI to be faster than ESDI/IDE for all the wrong reasons ... this was mostly due to the fact that SCSI drives implemented lookahead caching of a reasonable length before caching appeared in WD1003 interface compatible controllers (MFM, RLL, ESDI, IDE). Today, nearly all have lookahead caching. Properly done, ESDI/IDE should be slightly faster than an equivalent SCSI implementation -- meaning hostadapters & drives of equivalent technology. PC architecture and I/O subsystem architecture issues are the real answer to this question ... it is not only a drive technology issue.

First, dumb IDE adapters are WD1003-WAH interface compatible, which means:

1) The transfer between memory and controller/drive is done in software, i.e. a tight IN16/WRITE16 or READ16/OUT16 loop. The IN and OUT instructions run at 286-6MHz bus speeds on all ISA machines, 286 to 486 ... time invariant ... about 900ns per 16 bits. This will be referred to as Programmed I/O, or PIO.

2) The controller interrupts once per 512-byte sector, and the driver must transfer the sector (a sketch of this loop follows the list). On drives much faster than 10mbit/sec the processor is completely saturated during disk I/O interrupts and no concurrent user processing takes place. This is fine for DOS, but causes severe performance problems for all multitasking OS's. (Poor man's disconnect/reconnect; allows multiple concurrent drives.)

3) The DMA chips on the motherboard are much slower, since they lose additional bus cycles to request/release the bus, which is why WD/IDE boards are PIO.

4) Since data transfers hog the processor, the OS/application are not able to keep up with the disk data rate, and WILL LOSE REVS (miss interleave).

5) For sequential disk reads with lookahead caching, the system is completely CPU bound for drives above 10mbit/sec. All writes, and reads without lookahead caching, lose one rev per request unless a very large low level interleave is present. Even taking this into account, 1:1 is ALMOST ALWAYS the best performing interleave, due to multiple sector requests at the filesystem and paging interfaces.
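For anyone who hasn't stared at one of these drivers, here is a minimal sketch in C (my own illustration, not code from any shipping driver) of the per-sector PIO loop described in 1) and 2) above. The port addresses are the conventional primary AT task-file registers; the inb()/inw() accessors are stand-in stubs for whatever the real platform provides (usually inline assembly):

/*
 * Minimal sketch of the WD1003/IDE programmed I/O read path: the host
 * CPU itself moves every 16-bit word of every sector through the data
 * register, once per per-sector interrupt.  Port numbers are the usual
 * primary AT values; inb()/inw() are hypothetical stubs.
 */
#include <stdint.h>
#include <stddef.h>

#define ATA_DATA    0x1F0       /* 16-bit data register                */
#define ATA_STATUS  0x1F7       /* status register                     */
#define ATA_ST_BSY  0x80        /* controller busy                     */
#define ATA_ST_DRQ  0x08        /* data request: a sector is buffered  */

static uint8_t  inb(uint16_t port) { (void)port; return ATA_ST_DRQ; }
static uint16_t inw(uint16_t port) { (void)port; return 0; }

/* Copy one 512-byte sector from the drive's buffer into memory. */
static void pio_read_sector(uint16_t *dst)
{
    while (inb(ATA_STATUS) & ATA_ST_BSY)        /* poll: no DMA here   */
        ;
    while (!(inb(ATA_STATUS) & ATA_ST_DRQ))
        ;
    for (size_t i = 0; i < 512 / sizeof(uint16_t); i++)
        dst[i] = inw(ATA_DATA);     /* one slow ISA I/O cycle per word */
}

int main(void)
{
    uint16_t sector[256];
    pio_read_sector(sector);    /* in a driver this runs at interrupt time */
    return 0;
}

At something like 900ns per word that is over 200us of bus-bound CPU time per sector -- multiply by the interrupt rate of a fast drive and there is nothing left over for user processes.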
There will be a strong market for high performance IDE hostadapters in the future, for UNIX, Novell, OS/2 and NT ... adapters which are NOT PIO via IN/OUT instructions. Both ISA memory mapped and Bus Mastering IDE host adapters should appear soon .... some vendors are even building RAID IDE hostadapters. I hope this article gets enough press to make endusers/VARs aware enough to start asking for the RIGHT technology.

While the ISA bus used like this is a little slow, it is fast enough to handle the same transfer rates and number of drives as SCSI. With a smart IDE drive ... we could also implement DEC RP05 style rotational position sensing to improve disk queue scheduling ... worth another 15%-30% in performance. Local bus IDE may prove one solution.

Secondly, SCSI hostadapters for the PC come in a wide variety of flavors:

1) (Almost) Fast Bus Mastering (ok, 3rd party DMA) adapters like the Adaptec 1542 and WD7000 series controllers. These are generally expensive compared to dumb IDE, but allow full host CPU concurrency with the disk data transfers (486 and cached 386 designs more so than simple 386 designs).

2) Dumb (or smart) PIO hostadapters like the ST01, which are very cheap and are CPU hogs with poor performance just like IDE, for all the same reasons plus a few. These are common with cheap CD-ROM and scanner SCSI adapters.

What the market really needs are some CHEAP but very dumb IDE and SCSI adapters that are only a bus-to-bus interface with a fast Bus Mastering DMA for each drive. In theory these would be a medium sized gate array for IDE, plus a 53C?? for SCSI, and cost about $40 IDE and $60 SCSI. For 486 systems they would blow the sockets off even the fastest adapters built today, since the 486 has faster CPU resources to follow the SCSI protocol -- more so than what we find on the fast adapters, and certainly faster than the crawlingly slow 8085/Z80 adapters. With such adapters IDE would be both faster and cheaper than SCSI -- maybe we would see more IDE tapes and CD-ROMs. Certainly the product's firmware development would be shorter than any SCSI-II effort.

All IDE and SCSI drives have a microprocessor which oversees the bus and drive operation. Generally this is a VERY SLOW 8 bit micro ... an 8048, Z80, or 8085 core/class CPU. The IDE bus protocol is MUCH simpler than SCSI-2, which allows IDE drives to be more responsive. Some BIG/FAST/EXPENSIVE SCSI drives are starting to use 16 bit micros to get the performance up.

First generation SCSI drives often had 6-10ms of command overhead in the drive ... limiting performance to two or three commands per revolution, which had better be multiple sectors to get any reasonable performance. SCO's AFS uses 8-16k clusters for this reason; ditto for the Berkeley filesystem. The fastest drives today still have command overhead times of near/over 1ms (partly command decode, the rest the status/msg_out/disconnect/select sequence). Most are still in the 2-5ms range. What was fine in microcontrollers for SASI/SCSI-I ... is impossibly slow with the increased firmware requirements for a conforming SCSI-II!

High performance hostadapters on the EISA and MCA platforms are appearing that have fast 16 bit micros ... and the current prices reflect not only the performance .... but reduced volumes as well.
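To put numbers on the command overhead point, here is a trivial back-of-the-envelope program (mine, using assumed values: a ~10mbit/sec media rate of about 1250KB/sec and the overhead figures above) showing why per-command firmware overhead forces big clusters:

/*
 * Rough throughput bound when per-command overhead dominates:
 * at most (1000 / overhead_ms) commands per second reach the media,
 * so each command had better carry a lot of data.  All figures are
 * illustrative assumptions, not measurements.
 */
#include <stdio.h>

int main(void)
{
    const double media_kb_s    = 1250.0;            /* ~10mbit/sec drive   */
    const double overhead_ms[] = { 10.0, 6.0, 2.0, 1.0 };

    printf("%10s %14s %14s\n", "ovhd (ms)", "512B req KB/s", "8KB clu KB/s");
    for (int i = 0; i < 4; i++) {
        double cmds_per_sec = 1000.0 / overhead_ms[i];
        double single  = cmds_per_sec * 0.5;        /* one sector/command  */
        double cluster = cmds_per_sec * 8.0;        /* 8KB cluster/command */
        if (single  > media_kb_s) single  = media_kb_s;
        if (cluster > media_kb_s) cluster = media_kb_s;
        printf("%10.1f %14.0f %14.0f\n", overhead_ms[i], single, cluster);
    }
    return 0;
}

With 10ms of overhead and single-sector requests you get about 50KB/sec out of a drive whose media can deliver 25 times that; with 8KB clusters the same drive looks respectable again, which is exactly why AFS and the Berkeley filesystem cluster.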
The Adaptec 1542 and WD7000 hostadapters also use very slow microprocessors (an 8085 and a Z80 respectively) and also suffer from command overhead problems. For this reason the Adaptec 1542 introduced queuing of multiple requests inside the hostadapter, to minimize the delays between requests that left the drive idle. For single process disk accesses ... this works just fine ... for multiple processes, the disk queue sorting breaks down and generates terrible seeking and a performance reduction of about 60%, unless very special disk sort and queuing steps are taken.

Specifically, this means that steps should be taken to allow the process associated with the current request to lock the heads on track during successive I/O completions and filesystem read ahead operations, to make use of the data in the lookahead cache. IE ... keep other processes' requests out of the hostadapter! (A rough sketch of such an admission policy appears below.) Allowing other regions' requests into the queue flushes the lookahead cache when the sequential stream is broken.

Lookahead caches are very good things ... but fragile ... the filesystem, disksort, and driver must all attempt to preserve locality long enough to allow them to work. This is a major problem for many UNIX systems ... DOS is usually easy .... single process, mostly reads.

Other than the several extent based filesystems (SGI) ... the balance of the UNIX filesystems fail to maintain locality during allocation of blocks in a file ... some, like the BSD filesystem and SCO's AFS, manage short runs ... but not good enough. Log structured filesystems without extensive cache memory and late binding suffer the same problem.

Some controllers are attempting to partition the cache to minimize lookahead cache flushing ... for a small number of active disk processes/regions. For DOS this is ideal, and handles the multiple read/write case as well. With UNIX, at some point the number of active regions will exceed the number of cache partitions, and the resulting cache flushing creates a step discontinuity in throughput: a reduction with severe hysteresis-induced performance problems.

There are several disk related step discontinuity/hysteresis problems that cause unstable performance walls or limits in most UNIX systems, even today. Poor filesystem designs, partitioning strategies, and paging strategies are at the top of my list; they prevent current systems from degrading linearly with load ... nearly every vendor has done one or more stupid things to create limiting walls in the performance curve due to a step discontinuity with a large hysteresis function. Too much development, without the oversight of a skilled systems architect.

One final note on caching hostadapters ... the filesystem should make better use of any memory devoted to caching, compared with ANY hostadapter. Unless there is some unreasonable restriction, the OS should yield better performance with the additional buffer cache space than the adapter would. What flows in/out of the OS buffer cache is generally completely redundant if cached in the hostadapter. If the hostadapter shows gains by using some statistical caching vs the FIFO cache in UNIX, then the UNIX cache should be able to get better gains by incorporating the same statistical cache in addition to the FIFO cache. (If you are running a BINARY DISTRIBUTION and can not modify the OS, then if the hostadapter cache shows a win ... you are stuck and have no choice except to use it.)
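Here is the rough admission-policy sketch promised above -- my own illustration with made-up structures, not code from any UNIX driver. The idea is simply that while a sequential run is in flight, only requests that continue that run are passed down to the hostadapter; everything else waits in the OS queue so the drive's lookahead cache survives:

/*
 * Sketch of an admission policy that keeps other processes' requests
 * out of the hostadapter while one process's sequential run is active.
 * All structures and names here are hypothetical.
 */
#include <stdbool.h>
#include <stdint.h>

struct dreq {
    uint32_t blkno;          /* starting block of the request          */
    uint32_t nblks;          /* length in blocks                       */
    int      pid;            /* process that issued it                 */
};

struct adapter {
    int      active_pid;     /* owner of the run now in the adapter    */
    uint32_t next_blkno;     /* block that would continue that run     */
    int      inflight;       /* requests queued inside the adapter     */
};

/* Hand this request to the adapter now, or park it in the OS queue
 * until the current sequential run completes?                         */
static bool admit(const struct adapter *a, const struct dreq *r)
{
    if (a->inflight == 0)
        return true;                    /* adapter idle: anything goes */
    if (r->pid == a->active_pid && r->blkno == a->next_blkno)
        return true;                    /* continues the active run    */
    return false;                       /* would break locality: hold  */
}

/* Track the run an admitted request extends. */
static void note_admitted(struct adapter *a, const struct dreq *r)
{
    a->active_pid = r->pid;
    a->next_blkno = r->blkno + r->nblks;
    a->inflight++;
}

int main(void)
{
    struct adapter a    = { .active_pid = -1, .next_blkno = 0, .inflight = 0 };
    struct dreq    seq1 = { 1000, 16, 42 };     /* process 42, sequential    */
    struct dreq    seq2 = { 1016, 16, 42 };     /* continues the run         */
    struct dreq    far  = { 900000, 16, 7 };    /* another process, far away */

    if (admit(&a, &seq1)) note_admitted(&a, &seq1);
    bool cont = admit(&a, &seq2);   /* true: same run, send it down      */
    bool held = admit(&a, &far);    /* false: parked until the run ends  */
    (void)cont; (void)held;
    return 0;
}

A real driver would of course bound how long other processes can be starved, and would feed the held-back requests through disksort once the run drains.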
Systems vendors should look at caching adapter performance results and reflect winning caching strategies back into the OS for even bigger improvements. There should never be a viable UNIX caching controller market ... DOS needs all the help it can get.

Now for a few interesting closing notes ...

Even the fastest 486 PC UNIX systems are filesystem CPU bound to between 500KB and 2.5MB/sec ... drive subsystems faster than this are largely useless (a waste of money) ... especially most RAID designs. Doing page flipping (not bcopy) to put the data into user space can improve things, if aligned properly, inside well behaved applications.

The single most critical issue for 486 PC UNIX applications under X/Motif is disk performance -- both filesystem and paging. Today's systems are only getting about 10% of the disk bandwidth available -- and required -- to provide acceptable performance for anything other than trivial applications. The current UNIX filesystem designs and paging algorithms are a hopeless bottleneck for even uni-processor designs ... MP designs REQUIRE much better if they are going to support multiple Xterm sessions running significant X/Motif applications like FrameMaker or Island Write. RAID performance gains are not enough to make up for the poor filesystem/paging algorithms. With the current filesystem/paging bottleneck, neither MP nor Xterms are cost effective technologies.

For small systems, the Berkeley 4.2BSD and Sprite LFS filesystems both fail to maintain head locality ... and as a result overwork the head positioner, resulting in lower performance and early drive failure. With the student facility model, a large number of users with small quotas consume the entire disk and present a largely random access pattern to the filesystem. With such a model there is no penalty for spreading files evenly across cylinder groups or cyclically across the drive ... in fact it helps minimize excessively long worst case seeks at deep queue lengths and results in linear degradation without step discontinuities.

Workstations and medium sized development systems/servers have little if any queue depth, and locality/sequential allocation of data within a file is ABSOLUTELY CRITICAL to filesystem AND exec/paging performance. Compaction of the referenced disk area, by migration of frequently accessed files to cylinder and rotationally optimal locations, is also ABSOLUTELY necessary -- by order of reference (directories in search paths, include files, X/Motif font and resource files) -- to get control of response times for huge applications. With this model, less than 5% of the disk is referenced in any day, and less than 10% in any week ... 85% has reference times greater than 30 days, if ever.

Any partitioning of the disk is absolutely WRONG ... paging should occur inside active filesystems to maintain locality of reference (no seeking). For zone-recorded drives the active region should be the outside tracks -- highest transfer rates and performance ... ditto for fixed frequency drives, but there for highest reliability due to lowest bit density. This is counter to the center-of-the-disk theory ... which is generally wrong for other reasons as well (when files are sorted by probability of access and age).

In fact the filesystem should span multiple spindles, with replication of critical/high use files across multiple drives, cylinder locations, and rotational positions -- for both performance and redundancy.
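As a sketch of what replication across spindles buys at read time (hypothetical structures, not any existing filesystem), picking which copy of a hot block to read becomes a cheap load balancing decision -- the same decision late binding makes for writes, as discussed next:

/*
 * Choose which replica of a block to read: prefer the spindle with the
 * shallowest queue, and break ties on estimated seek distance.
 * Illustrative only; all structures are made up.
 */
#include <stdint.h>
#include <stdlib.h>

struct drive {
    int      qdepth;        /* requests already queued on this spindle */
    uint32_t head_cyl;      /* current (estimated) head cylinder       */
};

struct replica {
    int      drive;         /* index into the drive table              */
    uint32_t cyl;           /* cylinder of this copy                   */
};

static int pick_replica(const struct drive *drives,
                        const struct replica *reps, int nreps)
{
    int best = 0;
    for (int i = 1; i < nreps; i++) {
        const struct drive *di = &drives[reps[i].drive];
        const struct drive *db = &drives[reps[best].drive];
        long seek_i = labs((long)di->head_cyl - (long)reps[i].cyl);
        long seek_b = labs((long)db->head_cyl - (long)reps[best].cyl);
        if (di->qdepth < db->qdepth ||
            (di->qdepth == db->qdepth && seek_i < seek_b))
            best = i;
    }
    return best;
}

int main(void)
{
    struct drive   drives[2] = { { 3, 900 }, { 0, 120 } }; /* drive 1 is idle */
    struct replica reps[2]   = { { 0, 910 }, { 1, 800 } };
    int which = pick_replica(drives, reps, 2);  /* -> 1: shallower queue wins */
    (void)which;
    return 0;
}

Queue depth is compared first and seek distance only breaks ties, on the theory that a busy spindle costs more than a long seek.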
Late binding of buffer data to disk addresses (preferably long after close) allows not only best fit allocation, but also choice of the least loaded drive queue at the time the request is queued ... not long before. Mirroring, striping, and most forms of RAID are poor ways to gain redundancy and load balancing by comparison. Dynamic migration to & retrieval from a jukebox optical disk or tape backing store both solves the backup problem and provides transparent recovery across media/drive failures. Strict write ordering and checkpointing should make FSCK obsolete on restarts ... just a tool for handling media failures and OS corruptions.

I spent the last 2-1/2 years trying to get SCO to accept a contract to increase filesystem/paging performance by 300%-600% using the above ideas plus a few more, but failed due to NIH, lack of marketing support for performance enhancements, and some confusion/concern over the lack of portability of such development to the next generation OS platform (ACE/OSF/SVR4/???). I even proposed doing the development on XENIX without any SCO investment, just a royalty position if the product was marketable -- marketing didn't want possible XENIX performance enhancements to outshine the ODT performance efforts, which, if I got even close to my goals, would be TINY in comparison. Focused on POSIX, X/Motif, and networked ODT desktops, they seem to have lost sight of a small, low-cost, minimalist character based platform for point of sale and traditional terminal based VARs. I know character based apps are the stone age ... but Xterms just aren't cheap yet. Shrink wrapped OS's mean fewer Systems Programmer development jobs!

With the success of SCO on PC platforms, there are only three viable UNIX teams left ... USL/UNIVEL, SunSoft, and SCO. The remaining vendors have too few shipments and margins too tight to fund any significant development effort beyond simply porting and maintaining a relatively stock USL product -- unless subsidized by hardware sales (HP, Apple, IBM, Everex, NCR .... etc). A big change from 10 years ago, when a strong UNIX team was a key part of every systems vendor's success.

With the emerging USL Destiny and Microsoft NT battle, price for a minimalist UNIX application platform will be EVERYTHING, to offset the DOS/Windows camp's compatibility and market lead on desktops. UNIX's continued existence outside a few niche markets lies largely in USL's ability to expand distribution channels for Destiny without destroying SCO, SUN, and the other traditional UNIX suppliers ... failure to do so will give the press even more ammunition that "UNIX is dead". It may require that USL's royalties be reduced to the point that UNIVEL, SCO & SUN can profitably re-sell Destiny in the face of an all-out price war with Microsoft NT.

A tough problem for USL in both distribution channel strategy and revenue capture ... Microsoft has a direct retail presence (1 or 2 levels of distribution vs 2, 3 or 4), which improves its margins significantly over USL (and its ability to cut cost). SCO and SunSoft are going to see per unit profits tumble 50% to 80% without a corresponding increase in sales in the near term -- many ENDUSERS/VARs buying the current bundled products can get along quite well with the minimalist Destiny product as an applications platform. Developers (a much smaller market) will continue buying the whole bundle. Microsoft is VERY large & profitable; sadly, USL, SCO & SUN only wish to be. It might be time to think about building NT experience and freeware clones.
I hope our universities can focus on training NT systems jocks; there are going to be more than enough UNIX guys for quite some time .... even if USL pulls it off and only loses 60% of the market to NT. Fortunately for the hardware guys, design efforts will be up for NT-optimized high performance 486 & RISC systems! And everybody is going to go crazy for a while.

John Bass, Sr. Engineer, DMS Design                      (415) 615-6706
UNIX Consultant -- Development, Porting, Performance by Design