Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!harbinger.cc.monash.edu.au!nntp.coast.net!zombie.ncsc.mil!news.duke.edu!eff!news.umbc.edu!cs.umd.edu!not-for-mail
From: torek@elf.bsdi.com (Chris Torek)
Newsgroups: comp.unix.bsd.bsdi.misc,comp.unix.advocacy
Subject: Re: multiple httpds vs threads vs ... (was BSDI Vs. NT...)
Date: 27 Dec 1995 12:28:54 -0800
Organization: Berkeley Software Design, Inc.
Lines: 131
Message-ID: <4bsaa6$8l8@elf.bsdi.com>
References: <taxfree.3.00C439A1@primenet.com> <4bmsjp$7lv@elf.bsdi.com> <4bnsb8$1qe@Mercury.mcs.com> <bakulDK7u6M.LrM@netcom.com> <4bri4a$q2r@noao.edu> <4brlt5$8un@sungy.germany.sun.com>
Reply-To: torek@bsdi.com
NNTP-Posting-Host: elf.bsdi.com
Xref: euryale.cc.adfa.oz.au comp.unix.bsd.bsdi.misc:1861 comp.unix.advocacy:12684

I had not intended to get into a long discussion here, but it was
Christmas and I allowed myself some extra net-time... :-)

In article <4bmsjp$7lv@elf.bsdi.com> I wrote:
>It seems almost too obvious for words that, all else being equivalent,
>threads will be lighter-weight than processes ...

In article <4bnsb8$1qe@Mercury.mcs.com> les@MCS.COM (Leslie Mikesell) asks:
>But is this still true if you are doing i/o?  That is, don't you
>have to go from kernel to application address space on every
>packet anyway?

In typical machines, crossing a protection domain (kernel <-> user)
is cheaper than changing an address space (userA <-> userB).  That
is, if you have kernel threads, system calls made from different
threads within one process can be switched faster than system calls
from different processes.

The actual cost difference is, as I said, machine-dependent, and
usually moderate.  For a while the major part of the cost was cache
and TLB flushing (or, more precisely, recovery therefrom), and was
scaling (worsening) with processor speed, so many modern CPUs have
multiple `contexts' and label cache entries (lines and TLB slots)
with a `context' or `process' or `address-space' ID.  This makes a
thread-and-address-space switch about the same work as a thread
switch alone, except that the cache size is, in effect, divided by
the number of active address spaces.  (These machines also take an
extra hit when you exceed the number of IDs, whether it be 8, 16,
64, 1024, 4096, or what-have-you.)

These differences do indeed tend to get overwhelmed by other
factors, at least on `balanced' architectures.  The Pentium is not
particularly well balanced, however---it is quite fast and needs
good cache and TLB performance, but lacks context identifiers.
Relatively speaking, switches are more expensive on a Pentium than,
well, pretty much anything else modern.

Leslie Mikesell also notes that:
>Somewhere you've got to sort the packets out and decide which code
>can run.  If this can be done better than the unix scheduler can
>do it, why not fix that instead?

This is not quite the right question, since the scheduler is solving
a more general problem, but it does bring up the point that if you
have threads, you still have to schedule the threads, and this is
just about the same job as scheduling processes.  (In many systems,
thread scheduling actually gets tied in with complicated `ganging'
schemes designed to avoid livelock and starvation problems, so that
the final system, with kernel threads, is much slower than the
original process-only system ever was.  However, we can, for this
discussion at least, wave our hands and claim that this is merely an
implementation detail.  One can inflict the same sludge on a
process-based system that shares memory via mapping; its association
with threads is mainly historical.  Fundamentally, the problem of
scheduling something to run, whether it is a thread or a process, is
pretty much the same, no matter where you move it.)

I also suggested the use of select() or poll(), and he asks:
>Doesn't that just double your syscalls/second?

This one is a bit complicated.  The short answer is `no'; in fact,
because poll/select asks about a set of pending requests, the
overhead actually goes *down* as the load increases.  Note that a
system call costs about the same as an in-process context switch
through the kernel, i.e., a kernel thread switch---most of the work
is crossing the protection domain (e.g., saving and loading a few
registers, then changing the CPU's access mode from `user' to
`kernel').  So, in the light-load case, yes, it does, but if there
are N clients on one server, the polling system uses N+1 system
calls to service them, while the threaded system uses N system calls
and N switches.  If you get lucky, the switches combine with the
system calls---they can be done together---and you have N+1 versus
N.  If the scheduler makes bad decisions, the polling system tends
to outperform the threaded system, which then makes up to 2N kernel
crossings.  In either case the difference tends to be marginal
anyway.
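For concreteness, here is a minimal sketch of that sort of polling
loop (nobody's production server: the listening socket `lsock' is
assumed to be set up elsewhere, the names are made up for
illustration, and error handling is mostly omitted):

/*
 * One select() call collects readiness for every client at once;
 * each ready descriptor is then serviced directly---accept new
 * connections on `lsock', otherwise read and echo the data back.
 */
#include <sys/types.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <unistd.h>

void
serve(int lsock)
{
    fd_set active, ready;
    int fd, maxfd = lsock;
    char buf[8192];
    ssize_t n;

    FD_ZERO(&active);
    FD_SET(lsock, &active);
    for (;;) {
        ready = active;
        if (select(maxfd + 1, &ready, NULL, NULL, NULL) < 0)
            continue;                   /* EINTR and the like */
        for (fd = 0; fd <= maxfd; fd++) {
            if (!FD_ISSET(fd, &ready))
                continue;
            if (fd == lsock) {          /* new client */
                int c = accept(lsock, NULL, NULL);

                if (c >= 0) {
                    FD_SET(c, &active);
                    if (c > maxfd)
                        maxfd = c;
                }
            } else if ((n = read(fd, buf, sizeof buf)) <= 0) {
                FD_CLR(fd, &active);    /* EOF or error: drop it */
                close(fd);
            } else {
                (void)write(fd, buf, n); /* `service' the request */
            }
        }
    }
}

The `N+1' accounting above is just this: one select() to find out
who is ready, then the per-client I/O calls, with no thread or
process switches in between.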
I wrote:
>[Using] poll() or select() ... a process can implement its own
>`threading' (except that these threads can be even lighter-weight
>than kernel threads).

In article <bakulDK7u6M.LrM@netcom.com> bakul@netcom.com (Bakul Shah) adds:
>Yet another approach is to use the `callback' model of programming.

Actually, that was what I had in mind when I wrote the parenthetical
comment.  However, a subroutine call `context switch' (you can see
how slippery the word `context' can be :-) ) is isomorphic to a
thread switch, at least in the abstract.  In practice, calls tend to
be more efficient, if only because they happen so much more often
that the machines tend to be designed to *make* them efficient.

He also points out that:
>select()/poll() ... do not scale well ... and [ensuring] _fair_ service
>is provided is not easy.

These are both true, although select/poll scales better than multiple
threads or processes (passing those 1000 descriptors through one
call is `more efficient' than 1000 things each passing one descriptor
through one call, since the call overhead gets amortized; note that
we are assuming the programmer implements his select/poll calls
efficiently for the system).  The fairness issue falls right back
into the scheduling described above: use processes---and get *them*
right in the kernel---and this all goes away.  (Say, does anyone
else hear Rob Pike echoing in the background? :-) )

(Incidentally, it sure is a lot easier to sit back and say things
like `ideally, this costs about the same as that' or `these should
be more efficient than those' than it is to go out and measure them
and try to explain or prove why the measurements do or do not
conform to the theories.... :-) )

And the only BSD/OS-specific note:
>I was also surprised to find that NT's select() implementation can
>handle many more fds than BSDi ...

As Rich Stevens and Casper Dik point out in <4bri4a$q2r@noao.edu>
and <4brlt5$8un@sungy.Germany.Sun.COM> respectively, applications
can define FD_SETSIZE themselves (or you can define it in the
Makefile, e.g., add -DFD_SETSIZE=1024 to CFLAGS).  The program must
also be run with a higher `openfiles' limit than the default of 64.
Once you go beyond 256 active descriptors, however, you will trip
across a kernel bug in select().  This has been fixed for 2.1.
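In code, the two adjustments look roughly like this (a sketch only,
not a tested program; raising the limit from the shell---csh's
`limit openfiles', for instance---has the same run-time effect as
the setrlimit() call below):

/*
 * FD_SETSIZE must be defined before the system headers are pulled
 * in (or put -DFD_SETSIZE=1024 in CFLAGS); the descriptor limit
 * itself is RLIMIT_NOFILE, which is what the shell's `openfiles'
 * limit controls.
 */
#define FD_SETSIZE 1024         /* before any #include */

#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>

int
main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) < 0) {
        perror("getrlimit");
        return (1);
    }
    if (rl.rlim_cur < FD_SETSIZE) {
        rl.rlim_cur = rl.rlim_max < FD_SETSIZE ? rl.rlim_max : FD_SETSIZE;
        if (setrlimit(RLIMIT_NOFILE, &rl) < 0) {
            perror("setrlimit");
            return (1);
        }
    }
    printf("descriptor limit is now %ld\n", (long)rl.rlim_cur);
    return (0);
}

Defining FD_SETSIZE only makes the fd_set bitmaps in the program
itself bigger; the kernel still has to be willing to hand out that
many descriptors, which is what the rlimit is for.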
-- 
In-Real-Life: Chris Torek, Berkeley Software Design Inc
El Cerrito, CA     Domain: torek@bsdi.com     +1 510 234 3167