Path: euryale.cc.adfa.oz.au!newshost.carno.net.au!harbinger.cc.monash.edu.au!munnari.OZ.AU!news.mel.connect.com.au!news.syd.connect.com.au!news.bri.connect.com.au!fjholden.OntheNet.com.au!not-for-mail
From: Tony Griffiths <tonyg@OntheNet.com.au>
Newsgroups: comp.unix.bsd.freebsd.misc,comp.unix.bsd.bsdi.misc,comp.sys.sgi.misc
Subject: Re: no such thing as a "general user community"
Date: Mon, 07 Apr 1997 10:37:33 +1000
Organization: On the Net (ISP on the Gold Coast, Australia)
Lines: 118
Message-ID: <334841CD.16B4@OntheNet.com.au>
References: <331BB7DD.28EC@net5.net> <5hnam9$393@hoopoe.psc.edu> <5hp7p3$1qb@fido.asd.sgi.com> <5hqc45$hlm@flea.best.net> <5i397n$eva@nyheter.chalmers.se>
Reply-To: tonyg@OntheNet.com.au
NNTP-Posting-Host: swanee.nt.com.au
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Mailer: Mozilla 3.0 (WinNT; I)
To: Mats Olsson <matso@dtek.chalmers.se>
Xref: euryale.cc.adfa.oz.au comp.unix.bsd.freebsd.misc:38601 comp.unix.bsd.bsdi.misc:6603 comp.sys.sgi.misc:29738

Mats Olsson wrote:
> 
> In article <5hqc45$hlm@flea.best.net>,
> Matt Dillon <dillon@flea.best.net> wrote:
> > Larry, no matter what the results, you can't seriously be advocating
> > that testing two OS's on two different platforms is scientific (!).
> > Well? Yes? No?
> 
> That depends on whether the difference between the platforms is
> significant in light of the results and what you are trying to show.
> I.e., if the results are very similar and you are trying to show that
> one OS is better than the other, then the differences between the
> systems must be carefully analyzed to see if they are significant.
> 
> So, testing two OSes on two different platforms isn't necessarily
> bad science. How the collected data is used can be bad science.

Anyone doing benchmarks has to be EXTREMELY careful about the environment!
Even the "smallest" difference can have unforeseen consequences.

A 'real life' example that cost me some skin off my back...

(a) Benchmark a DECsystem-10 (actually a DECSYSTEM-20 but loading TOPS-10
    instead of TOPS-20! What's TOPS, I hear you say? That's another story).
    Run the customer's tests over a weekend and collect all the printouts,
    console logs, etc.

(b) Customer likes the result and buys the system.

(c) System is installed at the customer site and lucky me gets sold with it.
    First job is to re-run the benchmarks to "prove" that we didn't cheat
    the first time round. This is where things start to go wrong!!!

    The first thing to note is that two of the three disks delivered are
    NOT the same as in the original benchmark. In fact, they are 2.5 times
    bigger and 3 times faster in transfer rate! You little ripper, you say!
    Bigger, faster disks at the same price. What a nice vendor DEC is!!!

(d) Run the benchmarks on the new system and everything goes swimmingly
    EXCEPT for the "Interactive Responsiveness" test. It goes from 0.8s to
    1.2s! A blink of an eye, I say... A 50% increase, says the customer.
    Both of us are right, but the customer refuses to pay the last $200,000
    of the contract until the 'problem' is fixed!!!

Ok, so now it's a matter of poring over the printouts and logs to see what
is happening. After several days it hits me between the eyes... On the
original benchmark, the system averaged 2%-3% idle time over a 1 hr
benchmark run. On the new system, idle is 0% over the same period. The
bigger, faster disks are allowing more jobs to run in a shorter time and,
as a consequence, the compute queue is now MUCH deeper. This is what is
causing the increase in response time (i.e. the decrease in responsiveness).
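(For the curious, here is a rough back-of-the-envelope sketch of that effect
in Python. It is a toy round-robin model under assumed parameters, not the
real TOPS-10 scheduler, and the queue depths and timings are invented purely
to show how a deeper compute queue stretches interactive response time.)

    # Hypothetical illustration only -- a toy round-robin model, not TOPS-10.
    QUANTUM = 0.020  # 20 ms time slice (50 Hz clock, as above)

    def response_time(queue_depth, quanta_per_job=7, burst=0.020):
        """Crude estimate: a short interactive burst waits for one full
        pass of the compute queue, where each competing job may burn up
        to quanta_per_job quanta, and then runs itself."""
        return queue_depth * quanta_per_job * QUANTUM + burst

    print(response_time(queue_depth=5))   # ~0.72 s -- some idle time left
    print(response_time(queue_depth=8))   # ~1.14 s -- saturated, 0% idle

Nothing in the toy model is specific to the hardware; the only thing that
changes is how many runnable jobs sit ahead of an interactive request, which
is exactly what the extra throughput from the faster disks altered. (The
quanta_per_job knob also hints at the scheduler tuning described next.)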
I report this to the customer, written up nice and scientifically.
Unofficially, their technical people agree with me, but the benchmark still
"fails" to meet their requirements, so the $200,000 is still outstanding.
My management/sales person is becoming nervous and my blood pressure is
going up considerably.

More time spent exercising the grey matter... OK, if the compute queue is
too deep, why not reduce the period of time that processes can stay in the
high priority queue (PQ1) before they are forced into the lower priority
queue (PQ2)? So, I take the system again and reduce the number of quanta
that a process can have while in PQ1 from the default of 7 (7 x 20ms in a
50Hz country) to only 1. Lo and behold, the 'responsiveness' test now goes
from 1.2s down to 0.6s while all other tests still meet or exceed the
original benchmark. Great! Report the results to the customer, assuming
that they will be delighted (and pay the outstanding monies!).

Nope!!! You've changed the parameters, so "Do not pass GO, do not collect
200,000". At a blood pressure of 250 over 180, I charge out and head back
to the DEC office swearing and cursing. My recommendation to management is
to pull the plug on the system (which is at this time in full productive
use) and refund the money already paid. This does not go down too well, as
can be imagined. ;-)

Finally, after several calming cups of coffee, an idea comes to us. The
"You've changed the parameters" quote from the customer hits us between the
eyes... Yes we have: we've given them bigger, faster disks than in the
original benchmark. Swap them out and we should then be able to reproduce
the original results within the desired limits (-5% <-> +5%).

Arrange a meeting with the customer and tell them of our intended "fix" to
the responsiveness 'problem'. After a short pause, they agree to accept the
system as it now stands and WAIVE the benchmark requirements of system
acceptance and, btw, thank you for doing all this tuning work to determine
the optimal operating parameters!!!

I am ready to KILL, KILL, KILL.

> 
> /Mats

The moral of the story is... "DON'T CHANGE ANYTHING WHEN BENCHMARKING!"
Even the slightest difference in h/w or s/w will jump up and bite you on
the bum!

Tony