Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!harbinger.cc.monash.edu.au!news.rmit.EDU.AU!news.unimelb.EDU.AU!munnari.OZ.AU!spool.mu.edu!howland.reston.ans.net!cs.utexas.edu!venus.sun.com!nntp-hub2.barrnet.net!news.Stanford.EDU!plastique.Stanford.EDU!mgbaker
From: mgbaker@cs.Stanford.EDU (Mary G. Baker)
Newsgroups: comp.os.linux.development.system,comp.unix.bsd.386bsd.misc,comp.unix.bsd.bsdi.misc,comp.unix.bsd.netbsd.misc,comp.unix.bsd.freebsd.misc
Subject: The Lai/Baker paper, benchmarks, and the world of free UNIX
Date: 11 Apr 1996 04:00:37 GMT
Organization: Computer Science Department, Stanford University.
Lines: 488
Distribution: world
Message-ID: <4ki055$60l@Radon.Stanford.EDU>
NNTP-Posting-Host: plastique.stanford.edu
Originator: mgbaker@plastique.Stanford.EDU
Xref: euryale.cc.adfa.oz.au comp.os.linux.development.system:20936 comp.unix.bsd.386bsd.misc:508 comp.unix.bsd.bsdi.misc:3069 comp.unix.bsd.netbsd.misc:2846 comp.unix.bsd.freebsd.misc:16944

Dear Folks,

I'm writing to give you my comments and proposals concerning the recent discussion of my paper (co-authored with Kevin Lai) about benchmarking FreeBSD, Linux and Solaris on the Pentium. Please don't interpret my slowness to respond as lack of interest. I've been waiting until I've probably received all the email I'm likely to receive on the subject, so that I could be more sure of addressing as many of the relevant issues as possible. For those of you who want to look into this yourselves, the paper is available at http://mosquitonet.Stanford.EDU/mosquitonet.html.

First off, I want to thank Jordan Hubbard very much for posting his kind apology for his previous comments to the comp.unix.bsd.freebsd.misc newsgroup. This was a very gracious thing for him to do, and I include his apology at the end of this message, since others may be interested in reading it. (He didn't realize his previous comments had been cross-posted to a number of other newsgroups, so people reading only those groups will not have had the opportunity to read his followup.) I address some of his remaining criticisms and suggestions in this note. In particular, I've offered to do the remaining NFS benchmarks that we did not do before. I also want to thank all of you who forwarded his comments to me, since I would not have known about the discussion otherwise.

Additionally, I thank Frank Durda, Alan Cox, and Torbjorn Granlund for directly sending us their comments, suggestions and concerns in a very kindly manner. I address the issues they brought up as well.

The rest of this (long) note is organized as follows: First, I'll answer some of the many questions we've had regarding our purpose in doing these benchmarks, our choice of systems and system versions, and our methodology. The second section is where I address Jordan's and others' comments and criticisms about our benchmarks, our paper, and our presentation of the paper at USENIX. This second section contains a list of the mistakes, inaccuracies and omissions in our work that we so far know about. Those who've done much benchmarking work will already know that benchmarking is always an on-going process, for several reasons. First, the systems we benchmark are continually changing, so any past performance results may not apply in the future. Second, while we are all (I hope) as careful as possible to be fair, to avoid measurement/methodology mistakes, and to interpret results correctly, I still learn something about the actual process of benchmarking after every benchmarking study.
It seems there's almost always more information one should have collected or another way to collect the information one got. Finally, the last section gives a few concluding remarks I have about the whole subject of free UNIXes and such.

I know this is a long note. I'll try to construct it so that the various sections make sense in isolation. That way you can skip to whatever, if anything, interests you.

I. Why we did what we did
-----------------------------

1) Purpose of the benchmarks:

Several people have asked about the purpose of the paper. Our "hidden agenda," if we really had one, was to determine just how much trouble my research group would be in if we used a freely-distributed OS as a base upon which to build. We're working on mobile and wireless computing issues, and we want to be able to distribute any kernel and driver stuff we end up doing. It may surprise you to hear that we were warned by a number of respected researchers that we would be laughed off the block if we used a non-commercial OS for our research base. We were told that these systems are "toy systems" in many ways. Please remember, though, that this was 2 years ago when I was starting up my project, and the world has changed a lot since then. Perhaps the warning made more sense then.

The particular problem we were warned about is that other researchers would be unable to interpret our performance results, because non-commercial OSes are somehow "uncalibrated" and may not behave in well-understood ways. We were warned about Linux particularly, since it is not derived from as respected a parent system as BSD UNIX.

So we set out to see just how much trouble we'd be in and compared the systems in some ways that were of interest to us, although they are nowhere near to being exhaustive. We published our results, at a particular snapshot in time, because we got tons of queries from people wanting to know the answers. We didn't post anything about our project (as far as I remember), so this must have happened mostly by word of mouth. There seemed to be enough interest to send the paper to USENIX.

We found the performance differences between systems (on our benchmarks, our hardware, using the particular versions we used, and at that point in time) to be much less than we'd expected. For us, on the benchmarks we cared about for our particular project, the performance differences weren't enough to dictate a choice between systems. That doesn't mean they aren't significant to others or that other benchmarks wouldn't show significant differences. By the time of the presentation at USENIX, we felt we had at least some evidence to show that the free systems we looked at are definitely not toys. My research group decided it was fine to go ahead and use one of them.

2) Choice of OSes:

Several people have asked why we picked the systems we picked for benchmarking -- both which OSes and which versions of those OSes. As we carefully explain in the paper, we had the following constraints:

a) The system must be easily available (over the net or off of a CDROM) for installation.

b) It must be cheap (under $100 in the summer of 1995). I hadn't much research funding at the time, so this really was a consideration. (My group is better off now, fortunately.)
c) The system had to run on our hardware: an Intel Pentium P54C-100MHz system with an Intel Plato motherboard, 32 megabytes of main memory, two 2-gigabyte disks (one a Quantum Empire 2100 SCSI disk, the other an HP 3725 SCSI disk), a standard 10-Megabit/second Ethernet card (3Com Etherlink III 3c509), and an NCR 53c810 PCI SCSI controller with no on-board caches.

d) The system had to have a sufficiently large user community that we could locate it easily.

This left us with Linux, FreeBSD and Solaris. Several other obvious choices didn't run on our SCSI controller at that time.

3) Why we picked the versions of the OSes we picked:

The decision for which we have received the most flak is our choice of versions of the systems to benchmark. To be entirely fair, we felt that we had to choose the most recent major release that was commonly available for regular users at our cut-off date of October 31, 1995. This meant we did not test unreleased, beta, or development versions. We used Linux 1.2.8 from the Slackware Distribution, FreeBSD version 2.0.5R, and Solaris version 2.4. This meant we did not use the beta version of Solaris 2.5, which upset quite a few people, since a lot of work was put into improving that version over the previous one. We also upset some Linux users, since a great many users run the 1.3.X development versions rather than the "official stable" versions. However, it didn't seem fair to go with a development version for one system and not another.

4) Our methodology:

We did not have source code available for all three systems (the Solaris source code was too expensive for us), so we used what some call the "black box" approach to benchmarking. We usually attempted to explain curious results through external testing and further measurement rather than through investigations of kernel code, profiling, or the use of funny counters on the Pentium processor.

We measured elapsed wall-clock time for the benchmarks, rather than system time, etc. This was simply because, to us, in our environment, we really only cared about how long the benchmarks took. I should point out that there are lots of arguments against this approach, although I'm aware of arguments against most of the other approaches that I know about as well.

In most cases, we ran the systems single-user. As John Dyson points out, this totally ignores scalability issues. Since some of the worst problems with systems only show up under conditions of load, this means we're missing a lot of the most useful information. For our context switch benchmarks, at least, we did measure benchmark time for increasing numbers of ready processes. This is how we uncovered the very interesting behavior of Solaris 2.4 when the number of ready processes increases above 32 and again above 64. (There are other issues concerning our context switch benchmark, though, that I'll mention in the section below; a rough sketch of the general measurement idea appears at the end of this section.)

In our own selfish outlook on the world, however, we're hoping not to run our systems under much load. We'll see over time if we're so lucky. For those of you choosing systems to run for web servers and such, our lack of information about behavior under load is understandably disappointing.
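For readers who want a concrete picture of the style of measurement involved, here is a minimal sketch of a pipe-based context switch timing loop. This is not our actual harness -- the names and the iteration count are purely illustrative -- but it shows the general approach: two processes bounce a one-byte token across a pair of pipes, and elapsed wall-clock time is what gets reported.

    /*
     * Minimal sketch of a pipe-based context switch measurement.
     * This is NOT the actual Lai/Baker benchmark code; names and the
     * iteration count are illustrative.  Two processes bounce a one-byte
     * token across a pair of pipes, so each round trip costs (at least)
     * two context switches plus two pipe read/write pairs.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/time.h>
    #include <sys/wait.h>

    #define ITERATIONS 10000

    static double seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void)
    {
        int p2c[2], c2p[2];        /* parent-to-child and child-to-parent pipes */
        char token = 'x';
        double start, elapsed;
        int i;

        if (pipe(p2c) < 0 || pipe(c2p) < 0) {
            perror("pipe");
            exit(1);
        }

        if (fork() == 0) {          /* child: echo each token back */
            for (i = 0; i < ITERATIONS; i++) {
                read(p2c[0], &token, 1);
                write(c2p[1], &token, 1);
            }
            exit(0);
        }

        start = seconds();          /* parent: send token, wait for echo */
        for (i = 0; i < ITERATIONS; i++) {
            write(p2c[1], &token, 1);
            read(c2p[0], &token, 1);
        }
        elapsed = seconds() - start;
        wait(NULL);

        /* The pipe operations themselves are timed separately (see
           section II below) and subtracted from this figure. */
        printf("%.2f microseconds per round trip\n",
               elapsed * 1e6 / ITERATIONS);
        return 0;
    }

Extending this idea to larger numbers of ready processes (for example, by chaining processes in a ring of pipes) is what exposes scheduler behavior such as the Solaris jumps at 32 and 64 processes mentioned above.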
II. Response to comments, criticisms, etc.
-------------------------------------------

1) Failure to benchmark against the full range of servers:

Jordan's major technical criticism, if I don't misinterpret him, is that while we tested Linux, FreeBSD and Solaris clients running the same benchmark against a Linux 1.2.8 file server, we did not test their interoperability against all three systems running as servers. I think this is a very valid point. Therefore, I offered to do the remaining benchmarks. He also proposes this possibility in his followup note.

This brings up the issue again of which versions of the OSes to use, etc. I don't think it makes sense to use the same ones we used before, since at this point they're so old. FreeBSD has had major releases since then, and so has Solaris. My guess is that the "X" numbers in the 1.3.X versions of Linux are getting so large that they may do another non-development release pretty soon. (The performance of the system has certainly changed substantially.) At that point, I think it would be fair to everyone to do at least the NFS benchmarks and probably some of the others as well. Should we do this sooner? I don't know. Maybe. I welcome suggestions. Perhaps if we only post our results to the net, it wouldn't be considered unfair to use a development version of Linux?

2) Failure to let FreeBSD, Linux, and Solaris developers examine our measurement methodology before publication:

We actually explicitly avoided this. We somehow thought it would be easier to prove we had no bias for one system over another if we didn't consult anyone in the actual development groups. It's not that we thought anyone in one of the development groups would lobby us for something unfair, but that we thought the outside world might see this as being inappropriate. Instead, we consulted a bunch of other people (whom we acknowledge in the paper).

*However*, I've decided Jordan is right. If we do this again, we'll just make it available to all of the development groups and hope they take an equal interest so that it's fairly reviewed by all. (Besides, we *did* talk to some of the Solaris developers to figure out what was going on with the jump in context switch times at 32 and 64 processes. We didn't consult them about benchmark methodology, since that was against our rules, but we did ask them if they knew of any reasons for the behavior we discovered.)

3) Dismissing the importance of the benchmarks:

One of Jordan's other comments to me on the phone was that our presentation of the paper at USENIX angered him because we presented the benchmarks and then said that they didn't matter and that we'd use some other metric for choosing between systems. This isn't exactly how we meant this to come across, and I apologize for how it sounded. What we meant to say was that the performance differences on the benchmarks we most care about for our environment (and on our hardware at the time we ran them and using the versions we used) weren't great enough, for us, to dictate which system we should choose. That's why we ended up using some very subjective and impossible-to-calibrate non-metrics for picking a system to run.

4) Our claims about "support" for the different systems:

Our comparisons of "support" for the different systems upset Jordan greatly, because he points out that most people assume the word "support" means "tech support." I bet he's right, and I'm sorry I didn't think of that. I certainly didn't mean to imply anything about anyone providing tech support for FreeBSD. We really didn't look at that at all. Tech support was one of the things we were interested in, but not in the usual sense.
What we meant by this was a list of several items, although most of it boils down to "size of contributing user community and on-line resources." For example, we considered on-line repositories of lots of free software for the systems to be important to us. In our paper we list both FreeBSD and Linux as doing very well this way. If our presentation somehow came across as indicating otherwise, my sincere apologies. People make new contributions to both systems all the time, so the relative differences may well have changed since our examination of this.

At the time we did our benchmarks, there seemed to be more hardware drivers available for Linux, with FreeBSD being next, and Solaris last. Since Solaris is a commercial system and doesn't give its source code away for free, it's hard to blame them for this situation, but it was important to us. We needed a system with a high likelihood that new bizarre hardware would be supported in a timely manner. I can't say whether this is a good or bad way to choose a system, but it mattered to us.

The other thing we (Kevin) did, which I'd pretty much forgotten about when I talked with Jordan, was to scan the newsgroups for answers to our installation problems. The idea was that if it seemed to be a problem that others might have, somebody would already have asked about it on the newsgroup if there were a large enough user community. We didn't have much luck that way with Solaris. Perhaps this indicates there isn't as large a Solaris x86 user community, but maybe it's because they offer commercial support contracts and so people don't use the newsgroup as much for this sort of thing. With Linux we found our questions and their answers on the newsgroup. This was usually true for FreeBSD as well. The only exception was for our troubles partitioning the disk under FreeBSD. (There seemed to be lots of people with this problem, given the number of postings, but we were unable to understand the solution from the postings. I assume somebody responded individually to each of the people posting this problem; it's just that this didn't help for the timid amongst us who tend to read rather than write to the newsgroups. Or maybe the answer is really there and we just failed to locate or interpret it correctly.)

Although in our phone conversation Jordan suggested a similar experiment for testing levels of support, I told him that I personally doubt this is a solid metric for making such comparisons. I feel that it depends too much on what sort of problem you encounter with which system. We therefore did not report any of this in our paper! We concentrated instead on availability of drivers, free compilers, etc.

5) Errors regarding hardware supported by FreeBSD:

Thanks to Frank Durda for pointing this out. We incorrectly stated that version 2.0.5R of FreeBSD did not support our Panasonic/Creative Labs CD-ROM drive. This is untrue for 2.0.5R. It was true for a much older version (1.22) that we used. I'm ashamed to say that we assumed it was still true of 2.0.5R, since the documentation for the older version stated:

"FreeBSD does NOT support drives connected to a Sound Blaster or non-SCSI SONY or Panasonic drives. A general rule of thumb when selecting a CDROM drive for FreeBSD use is to buy a very standard SCSI model; they cost more, but deliver very solid performance in return. Do not be fooled by very cheap drives that, in turn, deliver VERY LOW performance! As always, you get what you pay for."
We interpreted this to mean that FreeBSD felt our drive was inferior and therefore didn't (wouldn't) support it. We should have double-checked with the new version!

6) Errors regarding the Linux TCP implementation:

Thanks to Alan Cox for pointing this out. We stated that the Linux TCP implementation uses a 1-frame window across the loopback interface. We said this merely because our traces of some of the network stuff looked like this was the case. Alan pointed out the error and showed us some other reasons for the behavior we observed.

7) Context switch benchmark problems:

There are problems with our context switch benchmarks, in that they measure not just context switch time but other activities as well. Our results include the overhead of the pipe operations used, as well as some cache behavior. They also don't say anything about behavior with processes of different sizes, etc. We mention some of these problems in the paper, but it can still be misleading. We attempted to find some other, better ways to measure context switching but came across problems for each of them in the different systems. (We do measure pipe overhead separately, so it can be subtracted out. In fact, the graph we presented at USENIX differs from the one in the paper, since we subtracted out the pipe overhead for the presentation.)
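For the curious, the pipe-overhead baseline can be measured with a loop like the sketch below. Again, this is illustrative only, not our actual code: a single process writes a byte into a pipe and immediately reads it back, so the data passes through the same pipe code paths but no context switch is required. Subtracting this per-iteration time from the two-process measurement gives a better, though still imperfect, estimate of the switch cost.

    /*
     * Illustrative baseline for the pipe overhead mentioned above (not
     * our actual measurement code).  A single process writes a byte into
     * a pipe and immediately reads it back, exercising the pipe
     * read/write paths without any context switch.  The per-iteration
     * time here is what gets subtracted out.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/time.h>

    #define ITERATIONS 10000

    int main(void)
    {
        int fd[2];
        char token = 'x';
        struct timeval t0, t1;
        double elapsed;
        int i;

        if (pipe(fd) < 0) {
            perror("pipe");
            exit(1);
        }

        gettimeofday(&t0, NULL);
        for (i = 0; i < ITERATIONS; i++) {
            write(fd[1], &token, 1);   /* byte is buffered in the pipe... */
            read(fd[0], &token, 1);    /* ...so this read never blocks */
        }
        gettimeofday(&t1, NULL);

        elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        printf("%.2f microseconds of pipe overhead per round trip\n",
               elapsed * 1e6 / ITERATIONS);
        return 0;
    }

Even with this subtraction, the cache effects mentioned above remain: bouncing between two processes touches two sets of working data, while the single-process baseline does not.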
8) Memory performance measurement problems:

Torbjorn Granlund has pointed out some interesting things about the cache interface on the Pentium. We're still looking into that, so I can't say anything useful about it yet.

Section III: Conclusions
--------------------------

I have just a couple of things to say in conclusion. Before all this I hadn't read many of the relevant newsgroups other than to scan for articles on particular subjects. I was struck in the last couple of days by the amount of back and forth between users of the different systems. Maybe that's inevitable, and competition is very good for systems, but the tenor of some of the discussions seemed potentially destructive.

One of the reasons (in my humble opinion) that a lot of the world is running Windows is a missed opportunity on the part of commercial UNIX vendors. A while back they had a chance to get together and push the system into all sorts of places it hadn't been before. Instead, they ended up fragmenting and arguing and charging high prices for a system that some folks just wanted to run on a personal computer. I think market share for everyone was reduced because of this.

It's the analogy of pie slices. You can argue over the size of your respective pie slices, or you can increase the size of the pie for everyone. The latter is what I'd love to see. I think the free UNIXes have an opportunity to increase the size of the pie, for themselves and for commercial vendors. Maybe it's the last chance...

And part of the issue is looking at what characteristics are important in a system for much of the world. Performance is important, or we wouldn't have done our benchmarks, but it's not the only thing. Sometimes those of us in academic environments (and others as well) get too wrapped up in performance issues, since they're so interesting to look at. But maybe it would be better if many of us considered other characteristics, such as packaging and documentation, not merely important, but truly as sexy as performance.

Please send any flames to me directly. I admit to being a bit of a wimp in dealing with on-line flaming -- it's a character flaw, but it's probably not changing real soon.

Thanks very much for your consideration,

Mary
Stanford University
mgbaker@cs.stanford.edu
http://plastique.stanford.edu/~mgbaker

**************************************************************************

Jordan's followup note:

Return-Path: jkh@FreeBSD.org
Received: from Sunburn.Stanford.EDU (Sunburn.Stanford.EDU [171.64.67.178]) by plastique.Stanford.EDU (8.7.3/8.7.3) with ESMTP id PAA05442 for <mgbaker@plastique.stanford.edu>; Mon, 8 Apr 1996 15:57:49 -0700 (PDT)
Received: from time.cdrom.com (time.cdrom.com [204.216.27.226]) by Sunburn.Stanford.EDU (8.7.1/8.7.1) with ESMTP id PAA14146 for <mgbaker@cs.stanford.edu>; Mon, 8 Apr 1996 15:57:47 -0700 (PDT)
Posted-Date: Mon, 8 Apr 1996 15:57:47 -0700 (PDT)
Received: from time.cdrom.com (localhost [127.0.0.1]) by time.cdrom.com (8.7.5/8.6.9) with SMTP id PAA13139 for <mgbaker@cs.stanford.edu>; Mon, 8 Apr 1996 15:57:30 -0700 (PDT)
Sender: jkh@time.cdrom.com
Message-ID: <316999D7.167EB0E7@FreeBSD.org>
Date: Mon, 08 Apr 1996 15:57:27 -0700
From: "Jordan K. Hubbard" <jkh@FreeBSD.org>
Organization: Walnut Creek CDROM
X-Mailer: Mozilla 3.0b2 (X11; I; FreeBSD 2.2-CURRENT i386)
MIME-Version: 1.0
Newsgroups: comp.unix.bsd.freebsd.misc
CC: mgbaker@cs.stanford.edu
Subject: My recent comments on the Baker/Lai paper at USENIX
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Some of you may recall a recent message of mine where I called the benchmarking paper presented at the '96 USENIX "a travesty", "utter garbage" and used other highly emotionally-charged labels of that nature in denouncing what I felt to be an unfair presentation.

Well, Mary Baker called me on the phone today and was, quite understandably, most upset with my characterization of her paper and the fact that I did not even see fit to present my issues with it to her personally, resulting in her having to hear about this 3rd-hand.

While it must be said that I still have some issues with the testing matrix they used (most specifically for NFS client/server testing) and do not feel that the paper met *my* highly perfectionist standards for those who would dare to tread the lava flow that is public systems benchmarking, I do fully understand and acknowledge that both the words I used and the forum I used to express them (this newsgroup) were highly inappropriate and merit a full apology to Mary and her group. I can scarcely mutter about scientific integrity from one side of my mouth while issuing angry and highly emotional statements from the other, especially in a forum that gives the respondent little fair opportunity to respond.

Mary, my apologies to you and your group. I was totally out of line.

Those who know me also know that I've spent the last 3 years of my life working very hard on FreeBSD and, as the Public Relations guy, struggling to close the significant PR gap we're faced with. When I see something which I feel to be unfair to FreeBSD, I react rather strongly and sometimes excessively. I've no excuse for this, but it nonetheless sometimes occurs.

This last USENIX was one of many highs and lows for us, the highs coming from contacts established or renewed and the lows coming from presentations like Larry McVoy's lmbench talk, which started as a reasonable presentation of lmbench itself and turned into a full pulpit-pounding propaganda piece for Linux, using comparisons that weren't even *fair*, much less accurate.
I did confront Larry over this one personally, and only mention it at all because it immediately followed the Baker/Lai talk and is no doubt significantly responsible for my coming away from that two-hour session with a temporary hatred for anyone even mentioning the word "benchmark" - a degree of vitriol which no doubt unfairly biased me more heavily against the B/L talk than I would have been otherwise. Overall, USENIX was quite a bit more about highs than lows, and it's really only this session that stood out on the low side for me.

As I told Mary, I do hope that future benchmarking attempts will at least pay us (in the free software world overall) the courtesy of a contact before the measurement runs, not to give us a chance to unfairly bias the results in any way (which would only reflect badly on us) but simply to review the testing methodologies used and make comments where we feel certain changes might be made to improve the objectivity and fairness of the results.

I do think that benchmarking is important and that many types of useful "real world" data can be derived from it - our very own John Dyson puts significant amounts of time and effort into running benchmarks with Linux and other operating systems so as to "keep FreeBSD honest" in our own performance, looking for areas where they've made improvements which we have not. People like Mary Baker perform a very useful service in attempting to do the same thing for the academic world, and I definitely do not want my initially harsh words to discourage her and her students from doing further objective analysis of free and commercial operating systems - quite the contrary. It's a dirty, thankless job, but somebody has to do it.

If she and her students would care to run another comparative benchmark suite using the more advanced technologies that have since been released by both the Linux and *BSD groups, endeavouring also to expand the base of hardware that's being tested (since PCs are a notoriously mixed bag when it comes to issues like this), I would be pleased to extend every cooperation from the FreeBSD Project.

---
- Jordan Hubbard
  President, FreeBSD Project