Path: euryale.cc.adfa.oz.au!newshost.carno.net.au!harbinger.cc.monash.edu.au!munnari.OZ.AU!news.mel.connect.com.au!news.mel.aone.net.au!grumpy.fl.net.au!news.webspan.net!www.nntp.primenet.com!nntp.primenet.com!news.sprintlink.net!news-peer.sprintlink.net!howland.erols.net!feed1.news.erols.com!news-xfer.netaxs.com!news.structured.net!uunet!in3.uu.net!192.70.231.3!spstimes.sps.mot.com!newsdist.sps.mot.com!risc.sps.mot.com!ibmoto.com!news
From: Skipper Smith <skipper@ibmoto.com>
Newsgroups: comp.unix.bsd.freebsd.misc,comp.arch,comp.benchmarks,comp.sys.super
Subject: Re: benchmarking discussion at Usenix?
Date: Thu, 16 Jan 1997 13:36:14 -0500
Organization: IBM/Motorola Somerset Design Center
Lines: 77
Distribution: inet
Message-ID: <32DE751E.3F54@ibmoto.com>
References: <5am7vo$gvk@fido.asd.sgi.com> <32D3EE7E.794B@nas.nasa.gov> <32D53CB1.41C6@mti.sgi.com> <32DAD735.59E2@nas.nasa.gov>
NNTP-Posting-Host: tigerlily.ibmoto.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Mailer: Mozilla 2.01 (X11; I; AIX 2)
Xref: euryale.cc.adfa.oz.au comp.unix.bsd.freebsd.misc:34131 comp.arch:62465 comp.benchmarks:18909 comp.sys.super:6863

Hugh LaMaster wrote:
> 
> Dror Maydan wrote:
> 
> > One more interesting category is the latency accessing objects bigger
> > than 4 bytes.  On many cache based machines accessing everything in a
> > cache line is just as fast as accessing one element.  I've never seen
> > measurements, but my guess is that many data elements in compilers are
> > bigger than 4 bytes; i.e., spatial locality works for compilers.
> 
> Well, optimum cache line sizes have been studied extensively.
> I'm sure there must be tables in H&P et al. showing hit rate
> as a function of line size and total cache size.  For reasonably
> large caches, I think the optimum used to be near 16 Bytes for
> 32-bit byte-addressed machines.  I don't know that I have seen more
> recent tables for 64-bit code on, say, Alpha, but my guess is that
> 32 bytes is probably superior to 16 bytes given the larger address
> sizes, not to mention alignment considerations.  Just a guess.
> Also, we often (but not always) have two levels of cache now,
> and sometimes three, and the optimum isn't necessarily the
> same on all three.  Numbers, anyone?

There are at least a couple of other concerns that must be addressed
when looking at optimal cache block lengths.  For example:

1) Interrupt latency
2) Data bus width
3) Cache blocking

1) No matter what, your cache sector (a piece of a block which can be
valid independently of the rest of the block) cannot be so large that
attempts to load it cause interrupt latency to become excessive.  This
is at odds with the benefits derived from locality of reference: since
data near what you are currently fetching is more likely to be used in
the near future, there is an obvious benefit to fetching as much data
at one time as you can, particularly since bursting makes it much more
efficient to continue a memory access than to start a new one.

2) Since worst-case interrupt latency puts an upper limit on how much
time we are willing to have the bus in use, the amount of data we can
associate with each sector depends more on bus width than on whether a
processor is 64-bit or 32-bit.
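Just to make the arithmetic concrete, here is a throwaway C fragment
(mine alone, purely illustrative, not taken from any real design) that
reproduces the sector sizes given in the next paragraph, since a
sector is simply the bus width times the number of beats in the burst:

    #include <stdio.h>

    int main(void)
    {
        /* Candidate data bus widths (bits) and burst lengths (beats);
           illustrative values only. */
        int bus_bits[] = { 32, 64, 128 };
        int beats[]    = { 4, 8 };
        int i, j;

        for (i = 0; i < 3; i++)
            for (j = 0; j < 2; j++)
                printf("%3d-bit bus, %d beats -> %3d byte sector\n",
                       bus_bits[i], beats[j],
                       (bus_bits[i] / 8) * beats[j]);
        return 0;
    }

The longer the burst, the longer the bus is tied up, which is where
the interrupt latency ceiling above comes in.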
If we presume that 4 or 8 beats (data accesses) is the optimal number
(acceptable interrupt latency + the most local data feasible), then a
32-bit data bus will yield 16 or 32 byte sectors, a 64-bit data bus
will yield 32 or 64 byte sectors, and a 128-bit data bus will yield 64
or 128 byte sectors.  If that is adequate to reach our cache
organization goals, fine; otherwise we might assign more than one
sector to a block, permitting additional data to be associated with
one tag and brought in on a "time-available" basis.  See the PowerPC
601 (or, for that matter, the MC68030) as an example of a chip that
used two sectors per block to achieve its cache organization goals
while still keeping interrupt latency at an acceptable level.

3) Finally, it must be remembered that while locality of reference is
the rule, there are likely to be many exceptions in any given group of
algorithms.  Because of this, caches need a way to deal with the fact
that the cache is blocked while a sector is being loaded.  If you go
off and do an 8 or 16 beat access because you need one byte in that
sector, how long should you be expected to twiddle your thumbs waiting
for access to a different block in the cache (or to the bus) when the
stride of your memory accesses takes your next reference outside that
sector?  While this time can be minimized by implementing cache load
buffers, those bring up challenges of their own whose impacts get
harder to avoid at each level.

I don't have any hard numbers, but the industry seems to have decided
that cache sector sizes equal to 4 or 8 data beats (with the bulk
choosing 4... 8 is usually an available choice or a side effect) are
best and that, when organizational purposes require it, multiple
sectors per block are acceptable but generally avoided.  Therefore,
look to your bus width to determine what the "line" size should be.

-- 
Skipper Smith                     Somerset Design Center
All opinions are my own and not those of my employer