Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!harbinger.cc.monash.edu.au!nntp.coast.net!zombie.ncsc.mil!news.duke.edu!eff!news.umbc.edu!cs.umd.edu!not-for-mail
From: torek@elf.bsdi.com (Chris Torek)
Newsgroups: comp.unix.bsd.bsdi.misc,comp.unix.advocacy
Subject: Re: multiple httpds vs threads vs ... (was BSDI Vs. NT...)
Date: 27 Dec 1995 12:28:54 -0800
Organization: Berkeley Software Design, Inc.
Lines: 131
Message-ID: <4bsaa6$8l8@elf.bsdi.com>
References: <taxfree.3.00C439A1@primenet.com> <4bmsjp$7lv@elf.bsdi.com> <4bnsb8$1qe@Mercury.mcs.com> <bakulDK7u6M.LrM@netcom.com> <4bri4a$q2r@noao.edu> <4brlt5$8un@sungy.germany.sun.com>
Reply-To: torek@bsdi.com
NNTP-Posting-Host: elf.bsdi.com
Xref: euryale.cc.adfa.oz.au comp.unix.bsd.bsdi.misc:1861 comp.unix.advocacy:12684

I had not intended to get into a long discussion here, but it was
Christmas and I allowed myself some extra net-time... :-)

In article <4bmsjp$7lv@elf.bsdi.com> I wrote:
>It seems almost too obvious for words that, all else being equivalent,
>threads will be lighter-weight than processes ...

In article <4bnsb8$1qe@Mercury.mcs.com> les@MCS.COM (Leslie Mikesell) asks:
>But is this still true if you are doing i/o?  That is, don't you
>have to go from kernel to application address space on every
>packet anyway?

In typical machines, crossing a protection domain (kernel <-> user)
is cheaper than changing an address space (userA <-> userB).  That
is, if you have kernel threads, system calls made from different
threads within one process can be switched faster than system calls
from different processes.

The actual cost difference is, as I said, machine-dependent, and
usually moderate.  For a while the major part of the cost was cache
and TLB flushing (or, more precisely, recovery therefrom), and was
scaling (worsening) with processor speed, so many modern CPUs have
multiple `contexts' and label cache entries (lines and TLB slots)
with a `context' or `process' or `address-space' ID.  This makes a
thread-and-address-space switch about the same work as a thread
switch alone, except that the cache size is, in effect, divided by
the number of active address spaces.  (These machines also take an
extra hit when you exceed the number of IDs, whether it be 8, 16,
64, 1024, 4096, or what-have-you.)

These differences do indeed tend to get overwhelmed by other
factors, at least on `balanced' architectures.  The Pentium is not
particularly well balanced, however---it is quite fast and needs
good cache and TLB performance, but lacks context identifiers.
Relatively speaking, switches are more expensive on a Pentium than,
well, pretty much anything else modern.

Leslie Mikesell also notes that:
>Somewhere you've got to sort the packets out and decide which code
>can run.  If this can be done better than the unix scheduler can
>do it, why not fix that instead?

This is not quite the right question, since the scheduler is solving
a more general problem, but it does bring up the point that if you
have threads, you still have to schedule the threads, and this is
just about the same job as scheduling processes.  (In many systems,
thread scheduling actually gets tied in with complicated `ganging'
schemes designed to avoid livelock and starvation problems, so that
the final system, with kernel threads, is much slower than the
original process-only system ever was.  However, we can, for this
discussion at least, wave our hands and claim that this is merely an
implementation detail.  One can inflict the same sludge on a
process-based system that shares memory via mapping; its association
with threads is mainly historical.  Fundamentally, the problem of
scheduling something to run, whether it is a thread or a process, is
pretty much the same, no matter where you move it.)

I also suggested the use of select() or poll(), and he asks:
>Doesn't that just double your syscalls/second?

This one is a bit complicated.  The short answer is `no'; in fact,
because poll/select asks about a set of pending requests, the
overhead actually goes *down* as the load increases.  Note that a
system call costs about the same as an in-process context switch
through the kernel, i.e., a kernel thread switch---most of the work
is crossing the protection domain (e.g., saving and loading a few
registers, then changing the CPU's access mode from `user' to
`kernel').  So, in the light-load case, yes, it does, but if there
are N clients on one server, the polling system uses N+1 system
calls to service them, while the threaded system uses N system calls
and N switches.  If you get lucky, the switches combine with the
system calls---they can be done together---and you have N+1 versus
N.  If the scheduler makes bad decisions, the polling system tends
to outperform the threaded system, which then makes up to 2N kernel
crossings.  In either case the difference tends to be marginal
anyway.
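For concreteness, here is a minimal sketch of that sort of polling
loop (nobody's production server: the listening socket `lsock' is
assumed to be set up elsewhere, the names are made up for
illustration, and error handling is mostly omitted):

/*
 * One select() call collects readiness for every client at once;
 * each ready descriptor is then serviced directly---accept new
 * connections on `lsock', otherwise read and echo the data back.
 */
#include <sys/types.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <unistd.h>

void
serve(int lsock)
{
    fd_set active, ready;
    int fd, maxfd = lsock;
    char buf[8192];
    ssize_t n;

    FD_ZERO(&active);
    FD_SET(lsock, &active);
    for (;;) {
        ready = active;
        if (select(maxfd + 1, &ready, NULL, NULL, NULL) < 0)
            continue;                   /* EINTR and the like */
        for (fd = 0; fd <= maxfd; fd++) {
            if (!FD_ISSET(fd, &ready))
                continue;
            if (fd == lsock) {          /* new client */
                int c = accept(lsock, NULL, NULL);

                if (c >= 0) {
                    FD_SET(c, &active);
                    if (c > maxfd)
                        maxfd = c;
                }
            } else if ((n = read(fd, buf, sizeof buf)) <= 0) {
                FD_CLR(fd, &active);    /* EOF or error: drop it */
                close(fd);
            } else {
                (void)write(fd, buf, n); /* `service' the request */
            }
        }
    }
}

The `N+1' accounting above is just this: one select() to find out
who is ready, then the per-client I/O calls, with no thread or
process switches in between.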
I wrote:
>[Using] poll() or select() ... a process can implement its own
>`threading' (except that these threads can be even lighter-weight
>than kernel threads).

In article <bakulDK7u6M.LrM@netcom.com> bakul@netcom.com (Bakul Shah) adds:
>Yet another approach is to use the `callback' model of programming.

Actually, that was what I had in mind when I wrote the parenthetical
comment.  However, a subroutine call `context switch' (you can see
how slippery the word `context' can be :-) ) is isomorphic to a
thread switch, at least in the abstract.  In practice, calls tend to
be more efficient, if only because they happen so much more often
that the machines tend to be designed to *make* them efficient.

He also points out that:
>select()/poll() ... do not scale well ... and [ensuring] _fair_ service
>is provided is not easy.

These are both true, although select/poll scales better than multiple
threads or processes (passing those 1000 descriptors through one
call is `more efficient' than 1000 things each passing one descriptor
through one call, since the call overhead gets amortized; note that
we are assuming the programmer implements his select/poll calls
efficiently for the system).  The fairness issue falls right back
into the scheduling described above: use processes---and get *them*
right in the kernel---and this all goes away.  (Say, does anyone
else hear Rob Pike echoing in the background? :-) )

(Incidentally, it sure is a lot easier to sit back and say things
like `ideally, this costs about the same as that' or `these should
be more efficient than those' than it is to go out and measure them
and try to explain or prove why the measurements do or do not
conform to the theories.... :-) )

And the only BSD/OS-specific note:
>I was also surprised to find that NT's select() implementation can
>handle many more fds than BSDi ...

As Rich Stevens and Casper Dik point out in <4bri4a$q2r@noao.edu>
and <4brlt5$8un@sungy.Germany.Sun.COM> respectively, applications
can define FD_SETSIZE themselves (or you can define it in the
Makefile, e.g., add -DFD_SETSIZE=1024 to CFLAGS).  The program must
also be run with a higher `openfiles' limit than the default of 64.
Once you go beyond 256 active descriptors, however, you will trip
across a kernel bug in select().  This has been fixed for 2.1.
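In code, the two adjustments look roughly like this (a sketch only,
not a tested program; raising the limit from the shell---csh's
`limit openfiles', for instance---has the same run-time effect as
the setrlimit() call below):

/*
 * FD_SETSIZE must be defined before the system headers are pulled
 * in (or put -DFD_SETSIZE=1024 in CFLAGS); the descriptor limit
 * itself is RLIMIT_NOFILE, which is what the shell's `openfiles'
 * limit controls.
 */
#define FD_SETSIZE 1024         /* before any #include */

#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>

int
main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) < 0) {
        perror("getrlimit");
        return (1);
    }
    if (rl.rlim_cur < FD_SETSIZE) {
        rl.rlim_cur = rl.rlim_max < FD_SETSIZE ? rl.rlim_max : FD_SETSIZE;
        if (setrlimit(RLIMIT_NOFILE, &rl) < 0) {
            perror("setrlimit");
            return (1);
        }
    }
    printf("descriptor limit is now %ld\n", (long)rl.rlim_cur);
    return (0);
}

Defining FD_SETSIZE only makes the fd_set bitmaps in the program
itself bigger; the kernel still has to be willing to hand out that
many descriptors, which is what the rlimit is for.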
-- 
In-Real-Life: Chris Torek, Berkeley Software Design Inc
El Cerrito, CA     Domain: torek@bsdi.com     +1 510 234 3167