Path: euryale.cc.adfa.oz.au!newshost.carno.net.au!harbinger.cc.monash.edu.au!munnari.OZ.AU!news.mel.connect.com.au!news.syd.connect.com.au!phaedrus.kralizec.net.au!news.mel.aone.net.au!grumpy.fl.net.au!news.webspan.net!newsfeeds.sol.net!hammer.uoregon.edu!arclight.uoregon.edu!enews.sgi.com!ames!cnn.nas.nasa.gov!news
From: Hugh LaMaster <lamaster@nas.nasa.gov>
Newsgroups: comp.unix.bsd.freebsd.misc,comp.arch,comp.benchmarks,comp.sys.super
Subject: Re: benchmarking discussion at Usenix?
Date: Fri, 17 Jan 1997 15:45:52 -0800
Organization: NASA Ames Research Center
Lines: 72
Distribution: inet
Message-ID: <32E00F30.15FB@nas.nasa.gov>
References: <5am7vo$gvk@fido.asd.sgi.com> <32D3EE7E.794B@nas.nasa.gov> <32D53CB1.41C6@mti.sgi.com> <32DAD735.59E2@nas.nasa.gov> <32DD6761.167E@mti.sgi.com>
NNTP-Posting-Host: jeeves.nas.nasa.gov
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Mailer: Mozilla 3.0 (X11; U; IRIX 5.2 IP12)
Xref: euryale.cc.adfa.oz.au comp.unix.bsd.freebsd.misc:34166 comp.arch:62495 comp.benchmarks:18949 comp.sys.super:6869

Dror Maydan wrote:
>
> Hugh LaMaster wrote:
> > same on all three.  Numbers, anyone?

BTW, there are some more recent numbers in the following online paper
(there seems to be no information on whether it has received further
publication; it appears to have been done as a class project):

"Cache Behaviour of the SPEC95 Benchmark Suite",
Sanjoy Dasgupta and Edouard Servan-Schreiber
http://http.cs.berkeley.edu/~dasgupta/paper/rep/rep.html

The paper looks at a subset of SPEC95 on a SPARC.  It suggests that
32-Byte block sizes are optimal for SPEC95 with small (< 128KB)
caches on the machine in question (presumably with 32-bit addresses).
It appears to me from this data that a 64-bit-address machine would
likely do better with 64-Byte blocks, since the optimum is already
leaning in that direction.  Larger caches also do better with larger
blocks, so machines with large unified L1/L2 caches would likely do
better with larger blocks.  In short, it looks like the vendors have
probably already done a pretty good job of optimizing their machines
to run SPEC95.  [Surprise, surprise.]

> My point was that different machines do have different line sizes, and
> the differences are quite large.  On the SGI R10000, the secondary line
> size is 128 Bytes.  On some IBM Power 2's, the line size is 256 Bytes.
> I'm pretty sure that some other vendors use 32 Byte line sizes.
> Why different vendors use different line sizes is probably related to
> both system issues and to which types of applications they try to
> optimize.

We seem to be in raging agreement up to this point.

> But, it is irrelevant to the benchmarking issue.

I still like to think that microbenchmarks like lmbench and STREAM,
larger benchmarks like SPEC95, and full-sized application performance
could be correlated, and even "understood", starting from basic
machine performance.  So, I think I disagree with the above
statement.

> The issue
> is that lmbench measures the latency for fetching a single pointer.  On
> such a benchmark a large-line machine will look relatively worse
> compared to the competition than if instead one used a benchmark that
> measured the latency of fetching a cache line.

Certainly true.  Of course, in some cases, the machines which have
long main-memory latencies are *also* the machines with poor
bandwidth.

> Now which benchmark is "better".  I think both are interesting.  Which
> is more relevant to a typical integer application?  I don't know.
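(For concreteness: the "single pointer" measurement discussed above
is a dependent pointer chase.  The following is a rough sketch of the
idea, in the spirit of lmbench's lat_mem_rd, but it is my own
illustration, not the actual lmbench source; the array size, stride,
and iteration count are made up.)

  /*
   * Sketch of a dependent pointer-chase latency loop.  Every load
   * depends on the previous one, so the loop time is pure
   * load-to-use latency, and a long-line machine gets no credit
   * for the extra bytes each miss brings in.
   */
  #include <stdio.h>
  #include <stdlib.h>

  #define NPTRS  (1 << 20)               /* 8 MB of pointers, 64-bit */
  #define STRIDE (128 / sizeof(char *))  /* one pointer per 128-Byte line */

  int main(void)
  {
      char **ring = malloc(NPTRS * sizeof(char *));
      char **p;
      long i;

      /* Link one pointer per cache line into a ring, so once the
         array exceeds the cache, every dereference misses.  (A real
         benchmark would randomize the order to defeat prefetch.) */
      for (i = 0; i < NPTRS; i += STRIDE)
          ring[i] = (char *)&ring[(i + STRIDE) % NPTRS];

      p = ring;
      for (i = 0; i < 10000000; i++)  /* time this loop externally;    */
          p = (char **)*p;            /* time/iteration ~ miss latency */

      printf("%p\n", (void *)p);      /* keep the chase live */
      return 0;
  }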
I don't think there is any doubt that the latencies of the (entire)
memory hierarchy are a major determinant of "integer" performance.
For engineering and scientific codes, the picture is murkier.  Some
codes are pretty much 100% bandwidth-determined.  Others are not much
different from "integer" codes [assuming you have a modern, fast FP
implementation].  It is actually the middle ground that is most
"interesting": the codes which can't be trivially transformed into
contiguous memory references, which have independent computed
indices, and so on.  This is the area where "concurrency", as
distinguished from the bandwidth:latency ratio, gets interesting.
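To make that middle ground concrete, here is a sketch (with made-up
array names and sizes) of the two extremes: a STREAM-style triad,
which is unit-stride and bandwidth-limited, and a gather through a
computed index array, where the misses are independent and what
matters is how many of them the machine can keep in flight at once:

  #define N 1000000

  void triad(double *a, const double *b, const double *c, double s)
  {
      long i;
      /* Unit-stride streams: trivially prefetchable, so sustained
         bandwidth, not latency, sets the speed. */
      for (i = 0; i < N; i++)
          a[i] = b[i] + s * c[i];
  }

  void gather(double *a, const double *b, const long *idx, double s)
  {
      long i;
      /* Computed indices: each b[idx[i]] can miss anywhere, but the
         misses are mutually independent, so the achievable overlap
         of outstanding misses, the "concurrency", sets the speed. */
      for (i = 0; i < N; i++)
          a[i] = s * b[idx[i]] + a[i];
  }

A machine with plenty of bandwidth but little tolerance for multiple
outstanding misses can fly through the first loop and crawl through
the second, which is exactly why the two kinds of benchmarks need not
agree.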