Path: euryale.cc.adfa.oz.au!newshost.carno.net.au!harbinger.cc.monash.edu.au!munnari.OZ.AU!news.ecn.uoknor.edu!feed1.news.erols.com!howland.erols.net!news.mathworks.com!enews.sgi.com!ames!cnn.nas.nasa.gov!news
From: Hugh LaMaster <lamaster@nas.nasa.gov>
Newsgroups: comp.unix.bsd.freebsd.misc,comp.arch,comp.benchmarks,comp.sys.super
Subject: Re: benchmarking discussion at Usenix?
Date: Wed, 08 Jan 1997 10:59:10 -0800
Organization: NASA Ames Research Center
Lines: 84
Distribution: inet
Message-ID: <32D3EE7E.794B@nas.nasa.gov>
References: <5am7vo$gvk@fido.asd.sgi.com>
NNTP-Posting-Host: jeeves.nas.nasa.gov
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Mailer: Mozilla 3.0 (X11; U; IRIX 5.2 IP12)
CC: hlamaster@mail.arc.nasa.gov
Xref: euryale.cc.adfa.oz.au comp.unix.bsd.freebsd.misc:33839 comp.arch:62312 comp.benchmarks:18759 comp.sys.super:6844

Larry McVoy wrote:
> I'll be at Usenix and I thought I might put together a little BOF to
> discuss benchmarking issues, if people are interested.  Topics could
> include
>
> . lmbench 2.0 changes
>         - finer grain accuracy
>         - web benchmarks
>         - scaling/load
>         - multi pointer memory latency

IMHO, lmbench currently attempts to do exactly the right thing wrt what
is commonly understood as memory latency.  I would vehemently argue
against *replacing* it with multipointer memory latency.  "Multipointer
memory latency" is another way of referring to one type of concurrency.

Question:  Aside from (pure) latency (currently measured by lmbench)
and bandwidth (measured by lmbench and STREAM), does concurrency
matter?

Answer:  I think so.  There are a number of cases of interest.
"Bandwidth" is (often) based on stride-1 concurrency.  Also
interesting: concurrency with stride-N, and concurrency on
gather/scatter.
Putting everything in units of time (seconds x 10^-N):

  latency (lmbench):       fetch random address or datum
  stride-1 (1/bandwidth):  fetch&process time/unit of contiguous data
  stride-N (1/bandwidth):  fetch&process every N-th datum
  gather/scatter (1/B-W):  fetch&process random data
  subword (1/bandwidth):   fetch&process 8/16/(32) data bits within word

For some machines, the bandwidth doesn't vary much, with roughly
constant bandwidth over all possibilities; on some machines, extra
load/store paths allow 2-3X improvements; on some machines, subword
instructions (e.g. VIS and so on) vastly speed up "vector/parallel"
operations within a word.

A major battle of the early 80's was the CDC Cyber 205 vs. Cray-1/S.
The Cyber 205 had greater stride-1 and gather/scatter bandwidth, the
Cray-1/S better latency and stride-N performance.  Each machine had
its applications where it outperformed the other.

All these types of bandwidths and concurrencies are worthwhile to
examine systematically.  Most "scalar" code, including compilers,
tends to be dominated by latency, while many engineering, scientific,
and graphics/image processing applications tend to be more
bandwidth-intensive.

> . osbench
>         - Steve Kleiman suggested that I (or someone) grab a big hunk
>           of OS code and port it to userland and call it osbench.  This
>           is an interesting idea.

An interesting idea.  Various papers over the years (e.g. Alan J.
Smith at U.C. Berkeley?) have noted that when real hardware is
instrumented, operating systems tend to have much higher cache miss
rates than predicted by applications.  By porting the OS code to
user-land, hopefully at least that much of OS behavior could be
captured.  [There are obviously a lot of difficult-to-simulate OS
activities as well...]

> . freespec
>         - I'm unhappy about the current spec.  I'd like to build a
>           freeSpec97 that is similar to spec (uses the same tests)
>           but has lmbench style reporting rules (cc -O/f77 -O) and
>           is free of any charges.  Any interest?
Great idea.

> . others?

Hank Dietz (et al.) of Purdue is keeping the "global aggregate
function" flame alive - these are functions which are globally shared
by N parallel processors/processes, such as barrier synch, and bitwise
functions such as broadcast, AND, OR, NAND, NOR - and pointing out how
important these are (and how useful they are/were on the machines
which implemented them quickly in hardware).  See the following
Website for details:

   http://garage.ecn.purdue.edu/~papers/Arch/

A useful benchmark would be to compute these times as a function of N
processes.  A portable reference version using SysV IPC would set the
reference behavior and an upper bound on performance.  [Some hardware
implementations provide shockingly good performance compared to
others.]