Path: euryale.cc.adfa.oz.au!newshost.carno.net.au!harbinger.cc.monash.edu.au!munnari.OZ.AU!news.ecn.uoknor.edu!feed1.news.erols.com!howland.erols.net!news.mathworks.com!enews.sgi.com!ames!cnn.nas.nasa.gov!news
From: Hugh LaMaster <lamaster@nas.nasa.gov>
Newsgroups: comp.unix.bsd.freebsd.misc,comp.arch,comp.benchmarks,comp.sys.super
Subject: Re: benchmarking discussion at Usenix?
Date: Wed, 08 Jan 1997 10:59:10 -0800
Organization: NASA Ames Research Center
Lines: 84
Distribution: inet
Message-ID: <32D3EE7E.794B@nas.nasa.gov>
References: <5am7vo$gvk@fido.asd.sgi.com>
NNTP-Posting-Host: jeeves.nas.nasa.gov
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Mailer: Mozilla 3.0 (X11; U; IRIX 5.2 IP12)
CC: hlamaster@mail.arc.nasa.gov
Xref: euryale.cc.adfa.oz.au comp.unix.bsd.freebsd.misc:33839 comp.arch:62312 comp.benchmarks:18759 comp.sys.super:6844

Larry McVoy wrote:
> I'll be at Usenix and I thought I might put together a little BOF to
> discuss benchmarking issues, if people are interested.  Topics could
> include
>
> . lmbench 2.0 changes
>         - finer grain accuracy
>         - web benchmarks
>         - scaling/load
>         - multi pointer memory latency

IMHO, lmbench currently attempts to do exactly the right thing wrt what
is commonly understood as memory latency.  I would vehemently argue
against *replacing* it with multipointer memory latency.  "Multipointer
memory latency" is another way of referring to one type of concurrency.

Question:  Aside from (pure) latency (currently measured by lmbench)
and bandwidth (measured by lmbench and STREAM), does concurrency
matter?

Answer:  I think so.  There are a number of cases of interest.
"Bandwidth" is (often) based on stride-1 concurrency.  Also
interesting: concurrency with stride-N, and concurrency on
gather/scatter.
Putting everything in units of time (seconds x 10^-N):

  latency (lmbench):       fetch random address or datum
  stride-1 (1/bandwidth):  fetch&process time/unit of contiguous data
  stride-N (1/bandwidth):  fetch&process every N-th datum
  gather/scatter (1/B-W):  fetch&process random data
  subword (1/bandwidth):   fetch&process 8/16/(32) data bits within word

For some machines, the bandwidth doesn't vary much, with roughly
constant bandwidth over all possibilities; on some machines, extra
load/store paths allow 2-3X improvements; on some machines, subword
instructions (e.g. VIS and so on) vastly speed up "vector/parallel"
operations within a word.

A major battle of the early 80's was the CDC Cyber 205 vs. Cray-1/S.
The Cyber 205 had greater stride-1 and gather/scatter bandwidth, the
Cray-1/S better latency and stride-N performance.  Each machine had
its applications where it outperformed the other.

All these types of bandwidths and concurrencies are worthwhile to
examine systematically.  Most "scalar" code, including compilers,
tends to be dominated by latency, while many engineering, scientific,
and graphics/image processing applications tend to be more
bandwidth-intensive.

> . osbench
>         - Steve Kleiman suggested that I (or someone) grab a big hunk
>           of OS code and port it to userland and call it osbench.  This
>           is an interesting idea.

An interesting idea.  Various papers over the years (e.g. Alan J.
Smith at U.C. Berkeley?) have noted that when real hardware is
instrumented, operating systems tend to have much higher cache miss
rates than predicted by applications.  By porting the OS code to
user-land, hopefully at least that much of OS behavior could be
captured.  [There are obviously a lot of difficult-to-simulate OS
activities as well...]

> . freespec
>         - I'm unhappy about the current spec.  I'd like to build a
>           freeSpec97 that is similar to spec (uses the same tests)
>           but has lmbench style reporting rules (cc -O/f77 -O) and
>           is free of any charges.  Any interest?
Great idea.

> . others?

Hank Dietz (et al.) of Purdue is keeping the "global aggregate
function" flame alive - these are functions which are globally shared
by N parallel processors/processes, such as barrier synch, and bitwise
functions such as broadcast, AND, OR, NAND, NOR - and pointing out how
important these are (and how useful they are/were on the machines
which implemented them quickly in hardware).  See the following
Website for details:

   http://garage.ecn.purdue.edu/~papers/Arch/

A useful benchmark would be to compute these times as a function of N
processes.  A portable reference version using SysV IPC would set the
reference behavior and an upper bound on performance.  [Some hardware
implementations provide shockingly good performance compared to
others.]