Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!harbinger.cc.monash.edu.au!news.rmit.EDU.AU!news.unimelb.EDU.AU!munnari.OZ.AU!spool.mu.edu!howland.reston.ans.net!cs.utexas.edu!venus.sun.com!nntp-hub2.barrnet.net!news.Stanford.EDU!plastique.Stanford.EDU!mgbaker
From: mgbaker@cs.Stanford.EDU (Mary G. Baker)
Newsgroups: comp.os.linux.development.system,comp.unix.bsd.386bsd.misc,comp.unix.bsd.bsdi.misc,comp.unix.bsd.netbsd.misc,comp.unix.bsd.freebsd.misc
Subject: The Lai/Baker paper, benchmarks, and the world of free UNIX
Date: 11 Apr 1996 04:00:37 GMT
Organization: Computer Science Department, Stanford University.
Lines: 488
Distribution: world
Message-ID: <4ki055$60l@Radon.Stanford.EDU>
NNTP-Posting-Host: plastique.stanford.edu
Originator: mgbaker@plastique.Stanford.EDU
Xref: euryale.cc.adfa.oz.au comp.os.linux.development.system:20936 comp.unix.bsd.386bsd.misc:508 comp.unix.bsd.bsdi.misc:3069 comp.unix.bsd.netbsd.misc:2846 comp.unix.bsd.freebsd.misc:16944

Dear Folks,

I'm writing to give you my comments and proposals concerning the recent discussion of my paper (co-authored with Kevin Lai) about benchmarking FreeBSD, Linux and Solaris on the Pentium. Please don't interpret my slowness to respond as lack of interest. I've been waiting until I've probably received all the email I'm likely to receive on the subject, so that I could be more sure of addressing as many of the relevant issues as possible. For those of you who want to look into this yourselves, the paper is available at http://mosquitonet.Stanford.EDU/mosquitonet.html.

First off, I want to thank Jordan Hubbard very much for posting his kind apology for his previous comments to the comp.unix.bsd.freebsd.misc newsgroup. This was a very gracious thing for him to do, and I include his apology at the end of this message, since others may be interested in reading it. (He didn't realize his previous comments had been cross-posted to a number of other newsgroups, so people reading only those groups will not have had the opportunity to read his followup.) I address some of his remaining criticisms and suggestions in this note. In particular, I've offered to do the remaining NFS benchmarks that we did not do before. I also want to thank all of you who forwarded his comments to me, since I would not have known about the discussion otherwise.

Additionally, I thank Frank Durda, Alan Cox, and Torbjorn Granlund for directly sending us their comments, suggestions and concerns in a very kindly manner. I address the issues they brought up as well.

The rest of this (long) note is organized as follows: First, I'll answer some of the many questions we've had regarding our purpose in doing these benchmarks, our choice of systems and system versions, and our methodology. The second section is where I address Jordan's and others' comments and criticisms about our benchmarks, our paper, and our presentation of the paper at USENIX. This second section contains a list of the mistakes, inaccuracies and omissions in our work that we so far know about. Those who've done much benchmarking work will already know that benchmarking is always an on-going process, for several reasons. First, the systems we benchmark are continually changing, so any past performance results may not apply in the future. Second, while we are all (I hope) as careful as possible to be fair, to avoid measurement/methodology mistakes, and to interpret results correctly, I still learn something about the actual process of benchmarking after every benchmarking study.
It seems there's almost always more information one should have collected or another way to collect the information one got. Finally, the last section gives a few concluding remarks I have about the whole subject of free UNIXes and such.

I know this is a long note. I'll try to construct it so that the various sections make sense in isolation. That way you can skip to whatever, if anything, interests you.

I. Why we did what we did
-----------------------------

1) Purpose of the benchmarks:

Several people have asked about the purpose of the paper. Our "hidden agenda," if we really had one, was to determine just how much trouble my research group would be in if we used a freely-distributed OS as a base upon which to build. We're working on mobile and wireless computing issues, and we want to be able to distribute any kernel and driver stuff we end up doing. It may surprise you to hear that we were warned by a number of respected researchers that we would be laughed off the block if we used a non-commercial OS for our research base. We were told that these systems are "toy systems" in many ways. Please remember, though, that this was 2 years ago when I was starting up my project, and the world has changed a lot since then. Perhaps the warning made more sense then.

The particular problem we were warned about is that other researchers would be unable to interpret our performance results, because non-commercial OSes are somehow "uncalibrated" and may not behave in well-understood ways. We were warned about Linux particularly, since it is not derived from as respected a parent system as BSD UNIX.

So we set out to see just how much trouble we'd be in and compared the systems in some ways that were of interest to us, although they are nowhere near to being exhaustive. We published our results, at a particular snapshot in time, because we got tons of queries from people wanting to know the answers. We didn't post anything about our project (as far as I remember), so this must have happened mostly by word of mouth. There seemed to be enough interest to send the paper to USENIX.

We found the performance differences between systems (on our benchmarks, our hardware, using the particular versions we used, and at that point in time) to be much less than we'd expected. For us, on the benchmarks we cared about for our particular project, the performance differences weren't enough to dictate a choice between systems. That doesn't mean they aren't significant to others or that other benchmarks wouldn't show significant differences. By the time of the presentation at USENIX, we felt we had at least some evidence to show that the free systems we looked at are definitely not toys. My research group decided it was fine to go ahead and use one of them.

2) Choice of OSes:

Several people have asked why we picked the systems we picked for benchmarking -- both which OSes and which versions of those OSes. As we carefully explain in the paper, we had the following constraints:

a) The system must be easily available (over the net or off of a CDROM) for installation.

b) It must be cheap (under $100 in the summer of 1995). I hadn't much research funding at the time, so this really was a consideration. (My group is better off now, fortunately.)
c) The system had to run on our hardware: an Intel Pentium P54C-100MHz system with an Intel Plato motherboard, 32 megabytes of main memory, two 2-gigabyte disks (one a Quantum Empire 2100 SCSI disk, the other an HP 3725 SCSI disk), a standard 10-Megabit/second Ethernet card (3Com Etherlink III 3c509), and an NCR 53c810 PCI SCSI controller with no on-board caches.

d) The system had to have a sufficiently large user community that we could locate it easily.

This left us with Linux, FreeBSD and Solaris. Several other obvious choices didn't run on our SCSI controller at that time.

3) Why we picked the versions of the OSes we picked:

The decision for which we have received the most flak is our choice of versions of the systems to benchmark. To be entirely fair, we felt that we had to choose the most recent major release that was commonly available for regular users at our cut-off date of October 31, 1995. This meant we did not test unreleased, beta, or development versions. We used Linux 1.2.8 from the Slackware Distribution, FreeBSD version 2.0.5R, and Solaris version 2.4. This meant we did not use the beta version of Solaris 2.5, which upset quite a few people, since a lot of work was put into improving that version over the previous one. We also upset some Linux users, since a great many users run the 1.3.X development versions rather than the "official stable" versions. However, it didn't seem fair to go with a development version for one system and not another.

4) Our methodology:

We did not have source code available for all three systems (the Solaris source code was too expensive for us), so we used what some call the "black box" approach to benchmarking. We usually attempted to explain curious results through external testing and further measurement rather than through investigations of kernel code, profiling, or the use of funny counters on the Pentium processor.

We measured elapsed wall-clock time for the benchmarks, rather than system time, etc. This was simply because, to us, in our environment, we really only cared about how long the benchmarks took. I should point out that there are lots of arguments against this approach, although I'm aware of arguments against most of the other approaches that I know about as well.

In most cases, we ran the systems single-user. As John Dyson points out, this totally ignores scalability issues. Since some of the worst problems with systems only show up under conditions of load, this means we're missing a lot of the most useful information. For our context switch benchmarks, at least, we did measure benchmark time for increasing numbers of ready processes. This is how we uncovered the very interesting behavior of Solaris 2.4 when the number of ready processes increases above 32 and again above 64. (There are other issues concerning our context switch benchmark, though, that I'll mention in the section below; a rough sketch of the general measurement idea appears at the end of this section.)

In our own selfish outlook on the world, however, we're hoping not to run our systems under much load. We'll see over time if we're so lucky. For those of you choosing systems to run for web servers and such, our lack of information about behavior under load is understandably disappointing.
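For readers who want a concrete picture of the style of measurement involved, here is a minimal sketch of a pipe-based context switch timing loop. This is not our actual harness -- the names and the iteration count are purely illustrative -- but it shows the general approach: two processes bounce a one-byte token across a pair of pipes, and elapsed wall-clock time is what gets reported.

    /*
     * Minimal sketch of a pipe-based context switch measurement.
     * This is NOT the actual Lai/Baker benchmark code; names and the
     * iteration count are illustrative.  Two processes bounce a one-byte
     * token across a pair of pipes, so each round trip costs (at least)
     * two context switches plus two pipe read/write pairs.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/time.h>
    #include <sys/wait.h>

    #define ITERATIONS 10000

    static double seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void)
    {
        int p2c[2], c2p[2];        /* parent-to-child and child-to-parent pipes */
        char token = 'x';
        double start, elapsed;
        int i;

        if (pipe(p2c) < 0 || pipe(c2p) < 0) {
            perror("pipe");
            exit(1);
        }

        if (fork() == 0) {          /* child: echo each token back */
            for (i = 0; i < ITERATIONS; i++) {
                read(p2c[0], &token, 1);
                write(c2p[1], &token, 1);
            }
            exit(0);
        }

        start = seconds();          /* parent: send token, wait for echo */
        for (i = 0; i < ITERATIONS; i++) {
            write(p2c[1], &token, 1);
            read(c2p[0], &token, 1);
        }
        elapsed = seconds() - start;
        wait(NULL);

        /* The pipe operations themselves are timed separately (see
           section II below) and subtracted from this figure. */
        printf("%.2f microseconds per round trip\n",
               elapsed * 1e6 / ITERATIONS);
        return 0;
    }

Extending this idea to larger numbers of ready processes (for example, by chaining processes in a ring of pipes) is what exposes scheduler behavior such as the Solaris jumps at 32 and 64 processes mentioned above.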
II. Response to comments, criticisms, etc.
-------------------------------------------

1) Failure to benchmark against the full range of servers:

Jordan's major technical criticism, if I don't misinterpret him, is that while we tested Linux, FreeBSD and Solaris clients running the same benchmark against a Linux 1.2.8 file server, we did not test their interoperability against all three systems running as servers. I think this is a very valid point. Therefore, I offered to do the remaining benchmarks. He also proposes this possibility in his followup note.

This brings up the issue again of which versions of the OSes to use, etc. I don't think it makes sense to use the same ones we used before, since at this point they're so old. FreeBSD has had major releases since then, and so has Solaris. My guess is that the "X" numbers in the 1.3.X versions of Linux are getting so large that they may do another non-development release pretty soon. (The performance of the system has certainly changed substantially.) At that point, I think it would be fair to everyone to do at least the NFS benchmarks and probably some of the others as well. Should we do this sooner? I don't know. Maybe. I welcome suggestions. Perhaps if we only post our results to the net, it wouldn't be considered unfair to use a development version of Linux?

2) Failure to let FreeBSD, Linux, and Solaris developers examine our measurement methodology before publication:

We actually explicitly avoided this. We somehow thought it would be easier to prove we had no bias for one system over another if we didn't consult anyone in the actual development groups. It's not that we thought anyone in one of the development groups would lobby us for something unfair, but that we thought the outside world might see this as being inappropriate. Instead, we consulted a bunch of other people (whom we acknowledge in the paper).

*However*, I've decided Jordan is right. If we do this again, we'll just make it available to all of the development groups and hope they take an equal interest so that it's fairly reviewed by all. (Besides, we *did* talk to some of the Solaris developers to figure out what was going on with the jump in context switch times at 32 and 64 processes. We didn't consult them about benchmark methodology, since that was against our rules, but we did ask them if they knew of any reasons for the behavior we discovered.)

3) Dismissing the importance of the benchmarks:

One of Jordan's other comments to me on the phone was that our presentation of the paper at USENIX angered him because we presented the benchmarks and then said that they didn't matter and that we'd use some other metric for choosing between systems. This isn't exactly how we meant this to come across, and I apologize for how it sounded. What we meant to say was that the performance differences on the benchmarks we most care about for our environment (and on our hardware at the time we ran them and using the versions we used) weren't great enough, for us, to dictate which system we should choose. That's why we ended up using some very subjective and impossible-to-calibrate non-metrics for picking a system to run.

4) Our claims about "support" for the different systems:

Our comparisons of "support" for the different systems upset Jordan greatly, because he points out that most people assume the word "support" means "tech support." I bet he's right, and I'm sorry I didn't think of that. I certainly didn't mean to imply anything about anyone providing tech support for FreeBSD. We really didn't look at that at all. Tech support was one of the things we were interested in, but not in the usual sense.
What we meant by this was a list of several items, although most of it boils down to "size of contributing user community and on-line resources." For example, we considered on-line repositories of lots of free software for the systems to be important to us. In our paper we list both FreeBSD and Linux as doing very well this way. If our presentation somehow came across as indicating otherwise, my sincere apologies. People make new contributions to both systems all the time, so the relative differences may well have changed since our examination of this.

At the time we did our benchmarks, there seemed to be more hardware drivers available for Linux, with FreeBSD being next, and Solaris last. Since Solaris is a commercial system and doesn't give its source code away for free, it's hard to blame them for this situation, but it was important to us. We needed a system with a high likelihood that new bizarre hardware would be supported in a timely manner. I can't say whether this is a good or bad way to choose a system, but it mattered to us.

The other thing we (Kevin) did, which I'd pretty much forgotten about when I talked with Jordan, was to scan the newsgroups for answers to our installation problems. The idea was that if it seemed to be a problem that others might have, somebody would already have asked about it on the newsgroup if there were a large enough user community. We didn't have much luck that way with Solaris. Perhaps this indicates there isn't as large a Solaris x86 user community, but maybe it's because they offer commercial support contracts and so people don't use the newsgroup as much for this sort of thing. With Linux we found our questions and their answers on the newsgroup. This was usually true for FreeBSD as well. The only exception was for our troubles partitioning the disk under FreeBSD. (There seemed to be lots of people with this problem, given the number of postings, but we were unable to understand the solution from the postings. I assume somebody responded individually to each of the people posting this problem; it's just that this didn't help for the timid amongst us who tend to read rather than write to the newsgroups. Or maybe the answer is really there and we just failed to locate or interpret it correctly.)

Although in our phone conversation Jordan suggested a similar experiment for testing levels of support, I told him that I personally doubt this is a solid metric for making such comparisons. I feel that it depends too much on what sort of problem you encounter with which system. We therefore did not report any of this in our paper! We concentrated instead on availability of drivers, free compilers, etc.

5) Errors regarding hardware supported by FreeBSD:

Thanks to Frank Durda for pointing this out. We incorrectly stated that version 2.0.5R of FreeBSD did not support our Panasonic/Creative Labs CD-ROM drive. This is untrue for 2.0.5R. It was true for a much older version (1.22) that we used. I'm ashamed to say that we assumed it was still true of 2.0.5R, since the documentation for the older version stated:

"FreeBSD does NOT support drives connected to a Sound Blaster or non-SCSI SONY or Panasonic drives. A general rule of thumb when selecting a CDROM drive for FreeBSD use is to buy a very standard SCSI model; they cost more, but deliver very solid performance in return. Do not be fooled by very cheap drives that, in turn, deliver VERY LOW performance! As always, you get what you pay for."
We interpreted this to mean that FreeBSD felt our drive was inferior and therefore didn't (wouldn't) support it. We should have double-checked with the new version!

6) Errors regarding the Linux TCP implementation:

Thanks to Alan Cox for pointing this out. We stated that the Linux TCP implementation uses a 1-frame window across the loopback interface. We said this merely because our traces of some of the network stuff looked like this was the case. Alan pointed out the error and showed us some other reasons for the behavior we observed.

7) Context switch benchmark problems:

There are problems with our context switch benchmarks, in that they measure not just context switch time but other activities as well. Our results include the overhead of the pipe operations used, as well as some cache behavior. They also don't say anything about behavior with processes of different sizes, etc. We mention some of these problems in the paper, but it can still be misleading. We attempted to find some other, better ways to measure context switching but came across problems for each of them in the different systems. (We do measure pipe overhead separately, so it can be subtracted out. In fact, the graph we presented at USENIX differs from the one in the paper, since we subtracted out the pipe overhead for the presentation.)
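For the curious, the pipe-overhead baseline can be measured with a loop like the sketch below. Again, this is illustrative only, not our actual code: a single process writes a byte into a pipe and immediately reads it back, so the data passes through the same pipe code paths but no context switch is required. Subtracting this per-iteration time from the two-process measurement gives a better, though still imperfect, estimate of the switch cost.

    /*
     * Illustrative baseline for the pipe overhead mentioned above (not
     * our actual measurement code).  A single process writes a byte into
     * a pipe and immediately reads it back, exercising the pipe
     * read/write paths without any context switch.  The per-iteration
     * time here is what gets subtracted out.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/time.h>

    #define ITERATIONS 10000

    int main(void)
    {
        int fd[2];
        char token = 'x';
        struct timeval t0, t1;
        double elapsed;
        int i;

        if (pipe(fd) < 0) {
            perror("pipe");
            exit(1);
        }

        gettimeofday(&t0, NULL);
        for (i = 0; i < ITERATIONS; i++) {
            write(fd[1], &token, 1);   /* byte is buffered in the pipe... */
            read(fd[0], &token, 1);    /* ...so this read never blocks */
        }
        gettimeofday(&t1, NULL);

        elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%.2f microseconds of pipe overhead per round trip\n",
               elapsed * 1e6 / ITERATIONS);
        return 0;
    }

Even with this subtraction, the cache effects mentioned above remain: bouncing between two processes touches two sets of working data, while the single-process baseline does not.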
8) Memory performance measurement problems:

Torbjorn Granlund has pointed out some interesting things about the cache interface on the Pentium. We're still looking into that, so I can't say anything useful about it yet.

Section III: Conclusions
--------------------------

I have just a couple of things to say in conclusion. Before all this I hadn't read many of the relevant newsgroups other than to scan for articles on particular subjects. I was struck in the last couple of days by the amount of back and forth between users of the different systems. Maybe that's inevitable, and competition is very good for systems, but the tenor of some of the discussions seemed potentially destructive.

One of the reasons (in my humble opinion) that a lot of the world is running Windows is a missed opportunity on the part of commercial UNIX vendors. A while back they had a chance to get together and push the system into all sorts of places it hadn't been before. Instead, they ended up fragmenting and arguing and charging high prices for a system that some folks just wanted to run on a personal computer. I think market share for everyone was reduced because of this.

It's the analogy of pie slices. You can argue over the size of your respective pie slices, or you can increase the size of the pie for everyone. The latter is what I'd love to see. I think the free UNIXes have an opportunity to increase the size of the pie, for themselves and for commercial vendors. Maybe it's the last chance...

And part of the issue is looking at what characteristics are important in a system for much of the world. Performance is important, or we wouldn't have done our benchmarks, but it's not the only thing. Sometimes those of us in academic environments (and others as well) get too wrapped up in performance issues, since they're so interesting to look at. But maybe it would be better if many of us considered other characteristics, such as packaging and documentation, not merely important, but truly as sexy as performance.

Please send any flames to me directly. I admit to being a bit of a wimp in dealing with on-line flaming -- it's a character flaw, but it's probably not changing real soon.

Thanks very much for your consideration,

Mary
Stanford University
mgbaker@cs.stanford.edu
http://plastique.stanford.edu/~mgbaker

**************************************************************************

Jordan's followup note:

Return-Path: jkh@FreeBSD.org
Received: from Sunburn.Stanford.EDU (Sunburn.Stanford.EDU [171.64.67.178]) by plastique.Stanford.EDU (8.7.3/8.7.3) with ESMTP id PAA05442 for <mgbaker@plastique.stanford.edu>; Mon, 8 Apr 1996 15:57:49 -0700 (PDT)
Received: from time.cdrom.com (time.cdrom.com [204.216.27.226]) by Sunburn.Stanford.EDU (8.7.1/8.7.1) with ESMTP id PAA14146 for <mgbaker@cs.stanford.edu>; Mon, 8 Apr 1996 15:57:47 -0700 (PDT)
Posted-Date: Mon, 8 Apr 1996 15:57:47 -0700 (PDT)
Received: from time.cdrom.com (localhost [127.0.0.1]) by time.cdrom.com (8.7.5/8.6.9) with SMTP id PAA13139 for <mgbaker@cs.stanford.edu>; Mon, 8 Apr 1996 15:57:30 -0700 (PDT)
Sender: jkh@time.cdrom.com
Message-ID: <316999D7.167EB0E7@FreeBSD.org>
Date: Mon, 08 Apr 1996 15:57:27 -0700
From: "Jordan K. Hubbard" <jkh@FreeBSD.org>
Organization: Walnut Creek CDROM
X-Mailer: Mozilla 3.0b2 (X11; I; FreeBSD 2.2-CURRENT i386)
MIME-Version: 1.0
Newsgroups: comp.unix.bsd.freebsd.misc
CC: mgbaker@cs.stanford.edu
Subject: My recent comments on the Baker/Lai paper at USENIX
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Some of you may recall a recent message of mine where I called the benchmarking paper presented at the '96 USENIX "a travesty", "utter garbage" and used other highly emotionally-charged labels of that nature in denouncing what I felt to be an unfair presentation.

Well, Mary Baker called me on the phone today and was, quite understandably, most upset with my characterization of her paper and the fact that I did not even see fit to present my issues with it to her personally, resulting in her having to hear about this 3rd-hand.

While it must be said that I still have some issues with the testing matrix they used (most specifically for NFS client/server testing) and do not feel that the paper met *my* highly perfectionist standards for those who would dare to tread the lava flow that is public systems benchmarking, I do fully understand and acknowledge that both the words I used and the forum I used to express them (this newsgroup) were highly inappropriate and merit a full apology to Mary and her group. I can scarcely mutter about scientific integrity from one side of my mouth while issuing angry and highly emotional statements from the other, especially in a forum that gives the respondent little fair opportunity to respond.

Mary, my apologies to you and your group. I was totally out of line.

Those who know me also know that I've spent the last 3 years of my life working very hard on FreeBSD and, as the Public Relations guy, struggling to close the significant PR gap we're faced with. When I see something which I feel to be unfair to FreeBSD, I react rather strongly and sometimes excessively. I've no excuse for this, but it nonetheless sometimes occurs.

This last USENIX was one of many highs and lows for us, the highs coming from contacts established or renewed and the lows coming from presentations like Larry McVoy's lmbench talk, which started as a reasonable presentation of lmbench itself and turned into a full pulpit-pounding propaganda piece for Linux, using comparisons that weren't even *fair*, much less accurate.
I did confront Larry over this one personally, and only mention it at all because it immediately followed the Baker/Lai talk and is no doubt significantly responsible for my coming away from that two-hour session with a temporary hatred for anyone even mentioning the word "benchmark" - a degree of vitriol which no doubt unfairly biased me more heavily against the B/L talk than I would have been otherwise. Overall, USENIX was quite a bit more about highs than lows, and it's really only this session that stood out on the low side for me.

As I told Mary, I do hope that future benchmarking attempts will at least pay us (in the free software world overall) the courtesy of a contact before the measurement runs, not to give us a chance to unfairly bias the results in any way (which would only reflect badly on us) but simply to review the testing methodologies used and make comments where we feel certain changes might be made to improve the objectivity and fairness of the results.

I do think that benchmarking is important and that many types of useful "real world" data can be derived from it - our very own John Dyson puts significant amounts of time and effort into running benchmarks with Linux and other operating systems so as to "keep FreeBSD honest" in our own performance, looking for areas where they've made improvements which we have not. People like Mary Baker perform a very useful service in attempting to do the same thing for the academic world, and I definitely do not want my initially harsh words to discourage her and her students from doing further objective analysis of free and commercial operating systems - quite the contrary. It's a dirty, thankless job, but somebody has to do it.

If she and her students would care to run another comparative benchmark suite using the more advanced technologies that have since been released by both the Linux and *BSD groups, endeavouring also to expand the base of hardware that's being tested (since PCs are a notoriously mixed bag when it comes to issues like this), I would be pleased to extend every cooperation from the FreeBSD Project.

---
- Jordan Hubbard
  President, FreeBSD Project