Newsgroups: comp.os.386bsd.misc
Path: sserve!newshost.anu.edu.au!harbinger.cc.monash.edu.au!msuinfo!agate!library.ucla.edu!europa.eng.gtefsd.com!MathWorks.Com!panix!zip.eecs.umich.edu!umn.edu!csus.edu!netcom.com!hasty
From: hasty@netcom.com (Amancio Hasty Jr)
Subject: Pentium secrets
Message-ID: <hastyCsqJ3L.L1o@netcom.com>
Organization: Netcom Online Communications Services (408-241-9760 login: guest)
Distribution: comp.os.386bsd.misc
Date: Sun, 10 Jul 1994 17:26:09 GMT
Lines: 249

Hi,

I am re-posting this article with the hope that it can further improve FreeBSD's performance :)

Happy Reading,
Amancio

------------------------ start hacking --------------------------------

Article 17295 of comp.sys.intel:
Xref: netcom.com comp.sys.intel:17295
Path: netcom.com!csus.edu!csulb.edu!nic-nac.CSU.net!usc!cs.utexas.edu!asuvax!chnews!ornews.intel.com!news.jf.intel.com!news.jf.intel.com!glew
From: glew@ichips.intel.com (Andy Glew)
Newsgroups: comp.sys.intel
Subject: Re: "Pentium Secrets"
Date: 10 Jul 1994 06:52:35 GMT
Organization: Intel Corp., Hillsboro, Oregon
Lines: 216
Message-ID: <GLEW.94Jul9235235@pdx007.intel.com>
References: <2ucnjf$hrk@hearst.cac.psu.edu> <Cs6Fws.9s@murdoch.acc.Virginia.EDU> <2utmvh$nl4@vkhdib01.hda.hydro.com>
NNTP-Posting-Host: pdx007.intel.com
In-reply-to: terjem@hda.hydro.com's message of 30 Jun 1994 05:59:13 GMT

Now that Terje Mathisen has published in Byte most of the details about the Pentium(tm) processor performance counters - a facility that has come to be called EMON, standing for Event Monitoring - I'd like to add a few notes to protect Intel's interests.

Note that I am *NOT* doing this as an official representative of Intel. I write the following to try and prevent people from writing non-portable code that will cause both of us headaches.

(1) One of the biggest reasons for EMON being kept "secret" was that Intel does not want to get forced into a compatibility corner by EMON. I.e.
we want to have the freedom to change the EMON counters in arbitrary ways in the future, e.g. by changing event codes, or by taking statistics that are meaningless on one processor and replacing them with things more useful.

Therefore, portable software should not depend on the existence of the EMON facility, or on particular event codes or register formats. The EMON facility should be considered model specific, useful for tuning code on a particular model. I can almost 100% guarantee that Pentium(tm) processor EMON code will *not* run on P6.

We do *not* want anybody except a university researcher to do things like using EMON data to do processor cache affinity process scheduling (to take one possible application from an earlier, pre-Intel, life). {On the other hand, I'd like university researchers to consider doing things like that. It's a good area for research. We just don't want to freeze this feature now.}

(2) Furthermore, *anything* in MSR space is model specific, and not portable unless Intel makes great big bold letter statements to the contrary. "MSR" stands for "Model Specific Register" after all.

(3) RDMSR(MSR=10h) versus RDTSC: yes, indeed, MSR=10h is the TimeStamp Counter (TSC). However, accessing this via RDMSR and WRMSR is *not* portable.

RDTSC is the *portable*, architectural, way of accessing the timestamp counter. It's faster, and it has certain other conveniences. Please avoid using RDMSR(MSR=10h).

There is no portable way of writing the TSC. WRMSR(MSR=10h) works to a degree, but is non-portable. Moreover, arbitrary writeability is *not* guaranteed - it may not be possible to write any arbitrary bit pattern to the counter.

(4) TSC semantics: I'd also like to emphasize a few points about the TSC.

The TSC's architectural purpose is as a *timestamp* counter - a value that is guaranteed to be monotonically increasing (modulo wrap), every time it is read.
On the Pentium(tm) processor, RDTSC just happens to also be a clock count, which is useful for performance monitoring. However, that performance monitoring usage is model specific. Portable software should not depend on it being a measure of absolute time, although it will nearly always be a measure of the amount of work a processor can complete.

Hell - it's not even clear how to measure absolute time in terms of clock cycles on future processors. There are processors from other companies that are capable of continuously varying the clock, dynamically changing frequency to save power. So "CPU clocks" would be useless as a measure of absolute time.

In a particular system and implementation, where the software is written with knowledge of the system clocking strategy and the model of CPU in use, it may be acceptable to use RDTSC as a measure of absolute time. E.g. I might be willing to do that myself in a benchmarketing war. But generic software that will run on many different platforms should not do this. Usage in a DLL or shared library may be advised.

(5) Don't write TSC!

Furthermore, one of the first things an OS developer is going to do on seeing TSC is to wonder "Should TSC be a global, or should TSC be context switched so that it can be a process (or thread) virtual time?"

The answer is, emphatically, NO! TSC should not be context switched (forgetting secure OS issues for the moment).

Recall that WRMSR(MSR=10h) is not a portable way of writing the TSC. Furthermore, the ability to write arbitrary values is *not* guaranteed. Therefore, do *not* write the TSC.

Instead, if you must play games with providing global, per process, or per-thread times, do the smart thing and provide an offset that your library code can add to the raw TSC value to get the appropriate correction. Use the classic HI;LO;HI algorithm to read the two values "atomically": E.g.

    volatile global int64 AbsoluteTimeOffset;
    volatile global int64 ProcessTimeOffset;
    volatile global int64 ProcessUserTimeOffset;
    volatile global int64 ThreadTimeOffset;
    ....

    int64 ReadUserTime()
    {
        int64 off1, off2, tsc;

        off1 = ProcessUserTimeOffset;
        tsc = RDTSC();      /* an appropriately fenced asm function */
        off2 = ProcessUserTimeOffset;

        if( off1 == off2 )
            return off1 + tsc;
        else
            /* do something special to handle this case.
               E.g. retry,
               or return off2+tsc (which can only be done if there
               are conventions on permitted range of values),
               or do an OS call to make atomic, or... */
    }

This is better long-run, because you can then implement arbitrary varieties of timers.

(6) Fencing of RDTSC:

Remember that RDTSC is a timestamp counter? That guarantees that successive invocations always return different, monotonically increasing, values. I.e. it makes a statement about the ordering of RDTSC instructions.

But it doesn't make any statement at all about the ordering of RDTSC with *other* instructions.

So, e.g. if you are trying to use RDTSC to time a single instruction, as in

    a = RDTSC()
    MOV mem, eax        /* Store eax to memory */
    b = RDTSC()

It is entirely possible that the second RDTSC may execute *before* the instruction under test, e.g.

    MOV mem, eax

E.g. on the Pentium(tm) processor, writes may be buffered - so the second RDTSC may be executed before the buffered write gets done. This is a simple case. On future processors, there may be many more examples of such overlap.

If you really want to measure a particular instruction, you must insert the appropriate fencing directives. The easiest "serializing" instruction is CPUID. So, to really time an individual store, you must do:

    CPUID
    a = RDTSC()
    CPUID
    MOV mem, eax        /* Store eax to memory */
    CPUID
    b = RDTSC()
    CPUID

and then account for the time of the CPUID serializations.

(Warning: it is also possible for a system board to be built that prevents CPUID from properly serializing.
I'd discount the possibility, except that such a system will be a little bit faster, and will run nearly all, but not all, software. I.e. it's tempting. So check first, if you can.)

For Joe Programmer trying to time his code, this is overkill - you probably don't care about a few cycles of noise due to overlap, if you are RDTSC'ing, e.g., at the beginning of every function call and return. So leave out the CPUIDs in this usage model.

(7) Extensibility of EMON

Mathisen's Byte article says "An obvious extension for Intel's next CPU... would be to use all 64 bits of MSR 11h and add two more stat counters as MSR 14h and 15h".

Remember point (1) above? Well, I can't tell you anything about P6, how many counters it has, or whether P6 has EMON at all, but I feel obliged to tell you: MSR 11h does not work in anywhere near the same way on P6 as it does on the Pentium(tm) processor.

So please do not provide OS level facilities to program MSR 11h and expect exactly the same thing to work on P6.

(8) Finally, I'd like to share a very useful feature that *is* documented, but which very few people have picked up on about the Pentium(tm) EMON counters.

There are pins called something like "PM0/BP0" and "PM1/BP1" documented in the Pentium(tm) processor data sheet (the names change a bit in different revs of the document). "PM" stands for "Performance Monitoring".

E.g. these external pins can be configured to toggle when an internal event like a BTB hit occurs. They can also be configured to toggle on overflow of the counter - so you can load a value like -1000 into it, and cause the pins to wiggle after 1000 cache misses.

A summer student working for me has soldered a wire from the PM0 pin to the NMI pin of the processor (we cut the existing trace to NMI, didn't need a failsafe timer), and now we can get an interrupt every 1000 cache misses. The NMI handler reprograms the EMON counter, et voila, we have an interesting form of statistical profiling that is *not* based on time.
Very useful for tuning programs - you can see exactly where the performance problems occur.

CONCLUSION:

Please bear the above in mind if using the facilities Terje Mathisen reverse engineered and documented in Byte. I'm afraid that I can't tell you about anything in Appendix H [*], but at least I can try to prevent y'all getting tied up into compatibility problems on the reverse engineered stuff.

[*] Amusingly, I don't even have a copy of Appendix H at the moment. I left it out on my desk one night, and a security guard on a sweep swiped it, and hasn't given it back in over a week. :-(

----------

Oh, and by the way: at the end of his article Terje says: "I hope that Intel makes official information available to all programmers and that such useful features are incorporated into other architectures such as Alpha, PowerPC, and SPARC."

Hearsay and postings on the net make me 100% confident that various DEC Alpha and HP PA implementations have performance counters and high res timers. I'm not so sure about PowerPC, but I do note that IBM's Power2 architecture has a great big wishlist of performance counters, discussed (but not to the extent you could use them) in an article in the second volume of RS6000 papers.

I.e. I'm pretty sure that a whole slew of other chips have them, but they are undocumented in exactly the same way Intel's EMON is. For probably the very same reasons. I'm not defending this, just pointing it out.

--
Andy "Krazy" Glew, glew@ichips.intel.com, Intel,
M/S JF1-19, 5200 NE Elam Young Pkwy,
Hillsboro, Oregon 97124-6497.
Place URGENT in email subject line for mail filter prioritization.
DISCLAIMER: private posting, not representative of employer.

--
FREE unix, gcc, tcp/ip, X, open-look, netaudio, tcl/tk, MIME, midi, sound at freebsd.cdrom.com:/pub/FreeBSD
Amancio Hasty, Consultant | Home: (415) 495-3046 | e-mail hasty@netcom.com | ftp-site depository of all my work: | sunvis.rtpnc.epa.gov:/pub/386bsd/X
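
Editor's sketch for point (8): the overflow-driven sampling described there can be modelled entirely in software, which may help in seeing why the resulting profile is weighted by events rather than by time. The following is only a simulation under stated assumptions. The names sampler_t, sampler_init, sampler_event and NREGIONS are hypothetical, nothing here touches the real EMON registers or the PM0/NMI wiring, and a code "region" index stands in for the interrupted program counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NREGIONS 4              /* hypothetical number of code regions tracked */

/* Software stand-in for one event counter plus the NMI handler's state.
 * This is NOT the real EMON register layout. */
typedef struct {
    int64_t  count;             /* preloaded with -period; reaching 0 models overflow */
    int64_t  period;            /* events per sample, e.g. 1000 cache misses */
    uint32_t hits[NREGIONS];    /* samples attributed to each code region */
} sampler_t;

/* Arm the counter so that it overflows after `period` events. */
static void sampler_init(sampler_t *s, int64_t period)
{
    assert(period > 0);
    s->count = -period;
    s->period = period;
    for (size_t i = 0; i < NREGIONS; i++)
        s->hits[i] = 0;
}

/* Model one occurrence of the monitored event while `region` is executing.
 * On overflow, do what the post's NMI handler does: note where execution
 * was, then reprogram the counter for the next sample. */
static void sampler_event(sampler_t *s, size_t region)
{
    assert(region < NREGIONS);
    if (++s->count == 0) {
        s->hits[region]++;
        s->count = -s->period;
    }
}
```

With a period of 1000, feeding this model 3000 events from region 0 followed by 1500 events from region 1 leaves three samples in region 0 and one in region 1: the histogram follows where the events occur, regardless of how long each region ran.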