Path: euryale.cc.adfa.oz.au!newshost.carno.net.au!harbinger.cc.monash.edu.au!munnari.OZ.AU!news.mel.connect.com.au!news.mel.aone.net.au!grumpy.fl.net.au!news.webspan.net!www.nntp.primenet.com!nntp.primenet.com!news.sprintlink.net!news-peer.sprintlink.net!howland.erols.net!feed1.news.erols.com!news-xfer.netaxs.com!news.structured.net!uunet!in3.uu.net!192.70.231.3!spstimes.sps.mot.com!newsdist.sps.mot.com!risc.sps.mot.com!ibmoto.com!news
From: Skipper Smith <skipper@ibmoto.com>
Newsgroups: comp.unix.bsd.freebsd.misc,comp.arch,comp.benchmarks,comp.sys.super
Subject: Re: benchmarking discussion at Usenix?
Date: Thu, 16 Jan 1997 13:36:14 -0500
Organization: IBM/Motorola Somerset Design Center
Lines: 77
Distribution: inet
Message-ID: <32DE751E.3F54@ibmoto.com>
References: <5am7vo$gvk@fido.asd.sgi.com> <32D3EE7E.794B@nas.nasa.gov> <32D53CB1.41C6@mti.sgi.com> <32DAD735.59E2@nas.nasa.gov>
NNTP-Posting-Host: tigerlily.ibmoto.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Mailer: Mozilla 2.01 (X11; I; AIX 2)
Xref: euryale.cc.adfa.oz.au comp.unix.bsd.freebsd.misc:34131 comp.arch:62465 comp.benchmarks:18909 comp.sys.super:6863

Hugh LaMaster wrote:
> 
> Dror Maydan wrote:
> 
> > One more interesting category is the latency accessing objects bigger
> > than 4 bytes.  On many cache based machines accessing everything in a
> > cache line is just as fast as accessing one element.  I've never seen
> > measurements, but my guess is that many data elements in compilers are
> > bigger than 4 bytes; i.e., spatial locality works for compilers.
> 
> Well, optimum cache line sizes have been studied extensively.
> I'm sure there must be tables in H&P et al. showing hit rate
> as a function of line size and total cache size.  For reasonably
> large caches, I think the optimum used to be near 16 Bytes for
> 32-bit byte-addressed machines.  I don't know that I have seen more
> recent tables for 64-bit code on, say, Alpha, but my guess is that
> 32 bytes is probably superior to 16 bytes given the larger address
> sizes, not to mention alignment considerations.  Just a guess.
> Also, we often (but not always) have two levels of cache now,
> and sometimes three, and the optimum isn't necessarily the
> same on all three.  Numbers, anyone?

There are at least a couple of other concerns that must be addressed
when looking at optimal cache block lengths.  For example:

1) Interrupt latency
2) Data bus width
3) Cache blocking

1) No matter what, your cache sector (a piece of a block which can be
valid independently of the rest of the block) cannot be so large that
attempts to load it cause interrupt latency to become excessive.  This
is at odds with the benefits derived from locality of reference: since
data near what you are currently fetching is more likely to be used in
the near future, there is an obvious benefit to fetching as much data
at one time as you can, particularly since bursting makes it much more
efficient to continue a memory access than to start a new one.

2) Since worst-case interrupt latency puts an upper limit on how much
time we are willing to have the bus in use, the amount of data we can
associate with each sector depends more on bus width than on whether a
processor is 64-bit or 32-bit.
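Just to make the arithmetic concrete, here is a throwaway C fragment
(mine alone, purely illustrative, not taken from any real design) that
reproduces the sector sizes given in the next paragraph, since a
sector is simply the bus width times the number of beats in the burst:

    #include <stdio.h>

    int main(void)
    {
        /* Candidate data bus widths (bits) and burst lengths (beats);
           illustrative values only. */
        int bus_bits[] = { 32, 64, 128 };
        int beats[]    = { 4, 8 };
        int i, j;

        for (i = 0; i < 3; i++)
            for (j = 0; j < 2; j++)
                printf("%3d-bit bus, %d beats -> %3d byte sector\n",
                       bus_bits[i], beats[j],
                       (bus_bits[i] / 8) * beats[j]);
        return 0;
    }

The longer the burst, the longer the bus is tied up, which is where
the interrupt latency ceiling above comes in.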
If we presume that 4 or 8 beats (data accesses) is the optimal number
(acceptable interrupt latency + the most local data feasible), then a
32-bit data bus will yield 16 or 32 byte sectors, a 64-bit data bus
will yield 32 or 64 byte sectors, and a 128-bit data bus will yield 64
or 128 byte sectors.  If that is adequate to reach our cache
organization goals, fine; otherwise we might assign more than one
sector to a block, permitting additional data to be associated with
one tag and brought in on a "time-available" basis.  See the PowerPC
601 (or, for that matter, the MC68030) as an example of a chip that
used two sectors per block to achieve its cache organization goals
while still keeping interrupt latency at an acceptable level.

3) Finally, it must be remembered that while locality of reference is
the rule, there are likely to be many exceptions in any given group of
algorithms.  Because of this, caches need a way to deal with the fact
that the cache is blocked while a sector is being loaded.  If you go
off and do an 8 or 16 beat access because you need one byte in that
sector, how long should you be expected to twiddle your thumbs waiting
for access to a different block in the cache (or to the bus) when the
stride of your memory accesses takes your next reference outside that
sector?  While this time can be minimized by implementing cache load
buffers, those bring up challenges of their own whose impacts get
harder to avoid at each level.

I don't have any hard numbers, but the industry seems to have decided
that cache sector sizes equal to 4 or 8 data beats (with the bulk
choosing 4... 8 is usually an available choice or a side effect) are
best and that, when organizational purposes require it, multiple
sectors per block are acceptable but generally avoided.  Therefore,
look to your bus width to determine what the "line" size should be.

-- 
Skipper Smith                     Somerset Design Center
All opinions are my own and not those of my employer