Xref: sserve comp.unix.bsd:7620 comp.benchmarks:2332 comp.arch:27997 comp.arch.storage:673
Path: sserve!manuel.anu.edu.au!munnari.oz.au!sgiblab!swrinde!cs.utexas.edu!sun-barr!olivea!sgigate!odin!sgi!igor!jbass
From: jbass@igor.tamri.com (John Bass)
Newsgroups: comp.unix.bsd,comp.benchmarks,comp.arch,comp.arch.storage
Subject: Disk performance issues, was IDE vs SCSI-2 using iozone
Message-ID: <1992Nov7.102940.12338@igor.tamri.com>
Date: 7 Nov 92 10:29:40 GMT
Sender: jbass@dmsd.com
Organization: DMS Design
Lines: 317

Copyright 1992, DMS Design, all rights reserved.

There should be something to learn or debate in this posting for everyone!

(Most of what follows is from a talk I gave this summer at SCO Forum, and is material I have for several years been trying to pound into systems designers & integrators everywhere. Many thanks to doug@sco.com for letting/paying me to learn so much about PC UNIX systems over the last 3 years and give the interesting IDE vs SCSI SCO Forum presentation. Too bad the politics prevented me from using it to vastly improve their filesystem and disk I/O.)

There are a number of significant issues to get right when comparing IDE vs SCSI-2 if you want to avoid comparing apples to space ships -- this topic is loaded with traps. I am not a new kid on the block ... I've been around/in the computer biz since '68, UNIX since '75, and SASI/SCSI from the dark days, including the design of several SCSI hostadapters (the 3-day MAC hack published in DDJ, a WD1000-emulating SCSI hostadapter for a proprietary bus, and a single board computer design). I understand UNIX/disk systems performance better than most.

For years people have generally claimed SCSI to be faster than ESDI/IDE for all the wrong reasons ... this was mostly due to the fact that SCSI drives implemented lookahead caching of a reasonable length before caching appeared in WD1003 interface compatible controllers (MFM, RLL, ESDI, IDE). Today, nearly all have lookahead caching. Properly done, ESDI/IDE should be slightly faster than an equivalent SCSI implementation -- meaning hostadapters & drives of equivalent technology. PC architecture and I/O subsystem architecture issues are the real answer to this question ... it is not only a drive technology issue.

First, dumb IDE adapters are WD1003-WAH interface compatible, which means:

1) The transfer between memory and controller/drive is done in software, i.e. a tight IN16/WRITE16 or READ16/OUT16 loop. The IN and OUT instructions run at 286-6MHz bus speeds on all ISA machines, 286 to 486 ... time invariant ... about 900ns per 16 bits. This will be referred to as Programmed I/O, or PIO.

2) The controller interrupts once per 512-byte sector, and the driver must transfer the sector (a sketch of this loop follows the list). On drives much faster than 10mbit/sec the processor is completely saturated during disk I/O interrupts and no concurrent user processing takes place. This is fine for DOS, but causes severe performance problems for all multitasking OS's. (Poor man's disconnect/reconnect; allows multiple concurrent drives.)

3) The DMA chips on the motherboard are much slower, since they lose additional bus cycles to request/release the bus, which is why WD/IDE boards are PIO.

4) Since data transfers hog the processor, the OS/application are not able to keep up with the disk data rate, and WILL LOSE REVS (miss interleave).

5) For sequential disk reads with lookahead caching, the system is completely CPU bound for drives above 10mbit/sec. All writes, and reads without lookahead caching, lose one rev per request unless a very large low level interleave is present. Even taking this into account, 1:1 is ALMOST ALWAYS the best performing interleave, due to multiple sector requests at the filesystem and paging interfaces.
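For anyone who hasn't stared at one of these drivers, here is a minimal sketch in C (my own illustration, not code from any shipping driver) of the per-sector PIO loop described in 1) and 2) above. The port addresses are the conventional primary AT task-file registers; the inb()/inw() accessors are stand-in stubs for whatever the real platform provides (usually inline assembly):

/*
 * Minimal sketch of the WD1003/IDE programmed I/O read path: the host
 * CPU itself moves every 16-bit word of every sector through the data
 * register, once per per-sector interrupt.  Port numbers are the usual
 * primary AT values; inb()/inw() are hypothetical stubs.
 */
#include <stdint.h>
#include <stddef.h>

#define ATA_DATA    0x1F0       /* 16-bit data register                */
#define ATA_STATUS  0x1F7       /* status register                     */
#define ATA_ST_BSY  0x80        /* controller busy                     */
#define ATA_ST_DRQ  0x08        /* data request: a sector is buffered  */

static uint8_t  inb(uint16_t port) { (void)port; return ATA_ST_DRQ; }
static uint16_t inw(uint16_t port) { (void)port; return 0; }

/* Copy one 512-byte sector from the drive's buffer into memory. */
static void pio_read_sector(uint16_t *dst)
{
    while (inb(ATA_STATUS) & ATA_ST_BSY)        /* poll: no DMA here   */
        ;
    while (!(inb(ATA_STATUS) & ATA_ST_DRQ))
        ;
    for (size_t i = 0; i < 512 / sizeof(uint16_t); i++)
        dst[i] = inw(ATA_DATA);     /* one slow ISA I/O cycle per word */
}

int main(void)
{
    uint16_t sector[256];
    pio_read_sector(sector);    /* in a driver this runs at interrupt time */
    return 0;
}

At something like 900ns per word that is over 200us of bus-bound CPU time per sector -- multiply by the interrupt rate of a fast drive and there is nothing left over for user processes.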
There will be a strong market for high performance IDE hostadapters in the future, for UNIX, Novell, OS/2 and NT ... adapters which are NOT PIO via IN/OUT instructions. Both ISA memory mapped and Bus Mastering IDE host adapters should appear soon .... some vendors are even building RAID IDE hostadapters. I hope this article gets enough press to make endusers/VARs aware enough to start asking for the RIGHT technology.

While the ISA bus used like this is a little slow, it is fast enough to handle the same transfer rates and number of drives as SCSI. With a smart IDE drive ... we could also implement DEC RP05 style rotational position sensing to improve disk queue scheduling ... worth another 15%-30% in performance. Local bus IDE may prove one solution.

Secondly, SCSI hostadapters for the PC come in a wide variety of flavors:

1) (Almost) Fast Bus Mastering (ok, 3rd party DMA) adapters like the Adaptec 1542 and WD7000 series controllers. These are generally expensive compared to dumb IDE, but allow full host CPU concurrency with the disk data transfers (486 and cached 386 designs more so than simple 386 designs).

2) Dumb (or smart) PIO hostadapters like the ST01, which are very cheap and are CPU hogs with poor performance just like IDE, for all the same reasons plus a few. These are common with cheap CD-ROM and scanner SCSI adapters.

What the market really needs are some CHEAP but very dumb IDE and SCSI adapters that are only a bus-to-bus interface with a fast Bus Mastering DMA for each drive. In theory these would be a medium sized gate array for IDE, plus a 53C?? for SCSI, and cost about $40 IDE and $60 SCSI. For 486 systems they would blow the sockets off even the fastest adapters built today, since the 486 has faster CPU resources to follow the SCSI protocol -- more so than what we find on the fast adapters, and certainly faster than the crawlingly slow 8085/Z80 adapters. With such adapters IDE would be both faster and cheaper than SCSI -- maybe we would see more IDE tapes and CD-ROMs. Certainly the product's firmware development would be shorter than any SCSI-II effort.

All IDE and SCSI drives have a microprocessor which oversees the bus and drive operation. Generally this is a VERY SLOW 8 bit micro ... an 8048, Z80, or 8085 core/class CPU. The IDE bus protocol is MUCH simpler than SCSI-2, which allows IDE drives to be more responsive. Some BIG/FAST/EXPENSIVE SCSI drives are starting to use 16 bit micros to get the performance up.

First generation SCSI drives often had 6-10ms of command overhead in the drive ... limiting performance to two or three commands per revolution, which had better be multiple sectors to get any reasonable performance. SCO's AFS uses 8-16k clusters for this reason; ditto for the Berkeley filesystem. The fastest drives today still have command overhead times of near/over 1ms (partly command decode, the rest the status/msg_out/disconnect/select sequence). Most are still in the 2-5ms range. What was fine in microcontrollers for SASI/SCSI-I ... is impossibly slow with the increased firmware requirements for a conforming SCSI-II!

High performance hostadapters on the EISA and MCA platforms are appearing that have fast 16 bit micros ... and the current prices reflect not only the performance .... but reduced volumes as well.
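To put numbers on the command overhead point, here is a trivial back-of-the-envelope program (mine, using assumed values: a ~10mbit/sec media rate of about 1250KB/sec and the overhead figures above) showing why per-command firmware overhead forces big clusters:

/*
 * Rough throughput bound when per-command overhead dominates:
 * at most (1000 / overhead_ms) commands per second reach the media,
 * so each command had better carry a lot of data.  All figures are
 * illustrative assumptions, not measurements.
 */
#include <stdio.h>

int main(void)
{
    const double media_kb_s    = 1250.0;            /* ~10mbit/sec drive   */
    const double overhead_ms[] = { 10.0, 6.0, 2.0, 1.0 };

    printf("%10s %14s %14s\n", "ovhd (ms)", "512B req KB/s", "8KB clu KB/s");
    for (int i = 0; i < 4; i++) {
        double cmds_per_sec = 1000.0 / overhead_ms[i];
        double single  = cmds_per_sec * 0.5;        /* one sector/command  */
        double cluster = cmds_per_sec * 8.0;        /* 8KB cluster/command */
        if (single  > media_kb_s) single  = media_kb_s;
        if (cluster > media_kb_s) cluster = media_kb_s;
        printf("%10.1f %14.0f %14.0f\n", overhead_ms[i], single, cluster);
    }
    return 0;
}

With 10ms of overhead and single-sector requests you get about 50KB/sec out of a drive whose media can deliver 25 times that; with 8KB clusters the same drive looks respectable again, which is exactly why AFS and the Berkeley filesystem cluster.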
The Adaptec 1542 and WD7000 hostadapters also use very slow microprocessors (an 8085 and a Z80 respectively) and also suffer from command overhead problems. For this reason the Adaptec 1542 introduced queuing of multiple requests inside the hostadapter, to minimize the delays between requests that left the drive idle. For single process disk accesses ... this works just fine ... for multiple processes, the disk queue sorting breaks down and generates terrible seeking and a performance reduction of about 60%, unless very special disk sort and queuing steps are taken.

Specifically, this means that steps should be taken to allow the process associated with the current request to lock the heads on track during successive I/O completions and filesystem read ahead operations, to make use of the data in the lookahead cache. IE ... keep other processes' requests out of the hostadapter! (A rough sketch of such an admission policy appears below.) Allowing other regions' requests into the queue flushes the lookahead cache when the sequential stream is broken.

Lookahead caches are very good things ... but fragile ... the filesystem, disksort, and driver must all attempt to preserve locality long enough to allow them to work. This is a major problem for many UNIX systems ... DOS is usually easy .... single process, mostly reads.

Other than the several extent based filesystems (SGI) ... the balance of the UNIX filesystems fail to maintain locality during allocation of blocks in a file ... some, like the BSD filesystem and SCO's AFS, manage short runs ... but not good enough. Log structured filesystems without extensive cache memory and late binding suffer the same problem.

Some controllers are attempting to partition the cache to minimize lookahead cache flushing ... for a small number of active disk processes/regions. For DOS this is ideal, and handles the multiple read/write case as well. With UNIX, at some point the number of active regions will exceed the number of cache partitions, and the resulting cache flushing creates a step discontinuity in throughput: a reduction with severe hysteresis-induced performance problems.

There are several disk related step discontinuity/hysteresis problems that cause unstable performance walls or limits in most UNIX systems, even today. Poor filesystem designs, partitioning strategies, and paging strategies are at the top of my list; they prevent current systems from degrading linearly with load ... nearly every vendor has done one or more stupid things to create limiting walls in the performance curve due to a step discontinuity with a large hysteresis function. Too much development, without the oversight of a skilled systems architect.

One final note on caching hostadapters ... the filesystem should make better use of any memory devoted to caching, compared with ANY hostadapter. Unless there is some unreasonable restriction, the OS should yield better performance with the additional buffer cache space than the adapter would. What flows in/out of the OS buffer cache is generally completely redundant if cached in the hostadapter. If the hostadapter shows gains by using some statistical caching vs the FIFO cache in UNIX, then the UNIX cache should be able to get better gains by incorporating the same statistical cache in addition to the FIFO cache. (If you are running a BINARY DISTRIBUTION and can not modify the OS, then if the hostadapter cache shows a win ... you are stuck and have no choice except to use it.)
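Here is the rough admission-policy sketch promised above -- my own illustration with made-up structures, not code from any UNIX driver. The idea is simply that while a sequential run is in flight, only requests that continue that run are passed down to the hostadapter; everything else waits in the OS queue so the drive's lookahead cache survives:

/*
 * Sketch of an admission policy that keeps other processes' requests
 * out of the hostadapter while one process's sequential run is active.
 * All structures and names here are hypothetical.
 */
#include <stdbool.h>
#include <stdint.h>

struct dreq {
    uint32_t blkno;          /* starting block of the request          */
    uint32_t nblks;          /* length in blocks                       */
    int      pid;            /* process that issued it                 */
};

struct adapter {
    int      active_pid;     /* owner of the run now in the adapter    */
    uint32_t next_blkno;     /* block that would continue that run     */
    int      inflight;       /* requests queued inside the adapter     */
};

/* Hand this request to the adapter now, or park it in the OS queue
 * until the current sequential run completes?                         */
static bool admit(const struct adapter *a, const struct dreq *r)
{
    if (a->inflight == 0)
        return true;                    /* adapter idle: anything goes */
    if (r->pid == a->active_pid && r->blkno == a->next_blkno)
        return true;                    /* continues the active run    */
    return false;                       /* would break locality: hold  */
}

/* Track the run an admitted request extends. */
static void note_admitted(struct adapter *a, const struct dreq *r)
{
    a->active_pid = r->pid;
    a->next_blkno = r->blkno + r->nblks;
    a->inflight++;
}

int main(void)
{
    struct adapter a    = { .active_pid = -1, .next_blkno = 0, .inflight = 0 };
    struct dreq    seq1 = { 1000, 16, 42 };     /* process 42, sequential    */
    struct dreq    seq2 = { 1016, 16, 42 };     /* continues the run         */
    struct dreq    far  = { 900000, 16, 7 };    /* another process, far away */

    if (admit(&a, &seq1)) note_admitted(&a, &seq1);
    bool cont = admit(&a, &seq2);   /* true: same run, send it down      */
    bool held = admit(&a, &far);    /* false: parked until the run ends  */
    (void)cont; (void)held;
    return 0;
}

A real driver would of course bound how long other processes can be starved, and would feed the held-back requests through disksort once the run drains.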
Systems vendors should look at caching adapter performance results and reflect winning caching strategies back into the OS for even bigger improvements. There should never be a viable UNIX caching controller market ... DOS needs all the help it can get.

Now for a few interesting closing notes ...

Even the fastest 486 PC UNIX systems are filesystem CPU bound to between 500KB and 2.5MB/sec ... drive subsystems faster than this are largely useless (a waste of money) ... especially most RAID designs. Doing page flipping (not bcopy) to put the data into user space can improve things, if aligned properly, inside well behaved applications.

The single most critical issue for 486 PC UNIX applications under X/Motif is disk performance -- both filesystem and paging. Today's systems are only getting about 10% of the disk bandwidth available -- and required -- to provide acceptable performance for anything other than trivial applications. The current UNIX filesystem designs and paging algorithms are a hopeless bottleneck for even uni-processor designs ... MP designs REQUIRE much better if they are going to support multiple Xterm sessions running significant X/Motif applications like FrameMaker or Island Write. RAID performance gains are not enough to make up for the poor filesystem/paging algorithms. With the current filesystem/paging bottleneck, neither MP nor Xterms are cost effective technologies.

For small systems, the Berkeley 4.2BSD and Sprite LFS filesystems both fail to maintain head locality ... and as a result overwork the head positioner, resulting in lower performance and early drive failure. With the student facility model, a large number of users with small quotas consume the entire disk and present a largely random access pattern to the filesystem. With such a model there is no penalty for spreading files evenly across cylinder groups or cyclically across the drive ... in fact it helps minimize excessively long worst case seeks at deep queue lengths and results in linear degradation without step discontinuities.

Workstations and medium sized development systems/servers have little if any queue depth, and locality/sequential allocation of data within a file is ABSOLUTELY CRITICAL to filesystem AND exec/paging performance. Compaction of the referenced disk area, by migration of frequently accessed files to cylinder and rotationally optimal locations, is also ABSOLUTELY necessary -- by order of reference (directories in search paths, include files, X/Motif font and resource files) -- to get control of response times for huge applications. With this model, less than 5% of the disk is referenced in any day, and less than 10% in any week ... 85% has reference times greater than 30 days, if ever.

Any partitioning of the disk is absolutely WRONG ... paging should occur inside active filesystems to maintain locality of reference (no seeking). For zone-recorded drives the active region should be the outside tracks -- highest transfer rates and performance ... ditto for fixed frequency drives, but there for highest reliability due to lowest bit density. This is counter to the center-of-the-disk theory ... which is generally wrong for other reasons as well (when files are sorted by probability of access and age).

In fact the filesystem should span multiple spindles, with replication of critical/high use files across multiple drives, cylinder locations, and rotational positions -- for both performance and redundancy.
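As a sketch of what replication across spindles buys at read time (hypothetical structures, not any existing filesystem), picking which copy of a hot block to read becomes a cheap load balancing decision -- the same decision late binding makes for writes, as discussed next:

/*
 * Choose which replica of a block to read: prefer the spindle with the
 * shallowest queue, and break ties on estimated seek distance.
 * Illustrative only; all structures are made up.
 */
#include <stdint.h>
#include <stdlib.h>

struct drive {
    int      qdepth;        /* requests already queued on this spindle */
    uint32_t head_cyl;      /* current (estimated) head cylinder       */
};

struct replica {
    int      drive;         /* index into the drive table              */
    uint32_t cyl;           /* cylinder of this copy                   */
};

static int pick_replica(const struct drive *drives,
                        const struct replica *reps, int nreps)
{
    int best = 0;
    for (int i = 1; i < nreps; i++) {
        const struct drive *di = &drives[reps[i].drive];
        const struct drive *db = &drives[reps[best].drive];
        long seek_i = labs((long)di->head_cyl - (long)reps[i].cyl);
        long seek_b = labs((long)db->head_cyl - (long)reps[best].cyl);
        if (di->qdepth < db->qdepth ||
            (di->qdepth == db->qdepth && seek_i < seek_b))
            best = i;
    }
    return best;
}

int main(void)
{
    struct drive   drives[2] = { { 3, 900 }, { 0, 120 } }; /* drive 1 is idle */
    struct replica reps[2]   = { { 0, 910 }, { 1, 800 } };
    int which = pick_replica(drives, reps, 2);  /* -> 1: shallower queue wins */
    (void)which;
    return 0;
}

Queue depth is compared first and seek distance only breaks ties, on the theory that a busy spindle costs more than a long seek.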
Late binding of buffer data to disk addresses (preferably long after close) allows not only best fit allocation, but also choice of the least loaded drive queue at the time the request is queued ... not long before. Mirroring, striping, and most forms of RAID are poor ways to gain redundancy and load balancing by comparison. Dynamic migration to & retrieval from a jukebox optical disk or tape backing store both solves the backup problem and provides transparent recovery across media/drive failures. Strict write ordering and checkpointing should make FSCK obsolete on restarts ... just a tool for handling media failures and OS corruptions.

I spent the last 2-1/2 years trying to get SCO to accept a contract to increase filesystem/paging performance by 300%-600% using the above ideas plus a few more, but failed due to NIH, lack of marketing support for performance enhancements, and some confusion/concern over the lack of portability of such development to the next generation OS platform (ACE/OSF/SVR4/???). I even proposed doing the development on XENIX without any SCO investment, just a royalty position if the product was marketable -- marketing didn't want possible XENIX performance enhancements to outshine the ODT performance efforts, which, if I got even close to my goals, would be TINY in comparison. Focused on POSIX, X/Motif, and networked ODT desktops, they seem to have lost sight of a small, low-cost, minimalist character based platform for point of sale and traditional terminal based VARs. I know character based apps are the stone age ... but Xterms just aren't cheap yet. Shrink wrapped OS's mean fewer Systems Programmer development jobs!

With the success of SCO on PC platforms, there are only three viable UNIX teams left ... USL/UNIVEL, SunSoft, and SCO. The remaining vendors have too few shipments and margins too tight to fund any significant development effort beyond simply porting and maintaining a relatively stock USL product -- unless subsidized by hardware sales (HP, Apple, IBM, Everex, NCR .... etc). A big change from 10 years ago, when a strong UNIX team was a key part of every systems vendor's success.

With the emerging USL Destiny and Microsoft NT battle, price for a minimalist UNIX application platform will be EVERYTHING, to offset the DOS/Windows camp's compatibility and market lead on desktops. UNIX's continued existence outside a few niche markets lies largely in USL's ability to expand distribution channels for Destiny without destroying SCO, SUN, and the other traditional UNIX suppliers ... failure to do so will give the press even more ammunition that "UNIX is dead". It may require that USL's royalties be reduced to the point that UNIVEL, SCO & SUN can profitably re-sell Destiny in the face of an all-out price war with Microsoft NT.

A tough problem for USL in both distribution channel strategy and revenue capture ... Microsoft has a direct retail presence (1 or 2 levels of distribution vs 2, 3 or 4), which improves its margins significantly over USL (and its ability to cut cost). SCO and SunSoft are going to see per unit profits tumble 50% to 80% without a corresponding increase in sales in the near term -- many ENDUSERS/VARs buying the current bundled products can get along quite well with the minimalist Destiny product as an applications platform. Developers (a much smaller market) will continue buying the whole bundle. Microsoft is VERY large & profitable; sadly, USL, SCO & SUN only wish to be. It might be time to think about building NT experience and freeware clones.
I hope our universities can focus on training NT systems jocks; there are going to be more than enough UNIX guys for quite some time .... even if USL pulls it off and only loses 60% of the market to NT. Fortunately for the hardware guys, design efforts will be up for NT-optimized high performance 486 & RISC systems! And everybody is going to go crazy for a while.

John Bass, Sr. Engineer, DMS Design                      (415) 615-6706
UNIX Consultant -- Development, Porting, Performance by Design