Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!harbinger.cc.monash.edu.au!news.mira.net.au!news.vbc.net!samba.rahul.net!rahul.net!a2i!bug.rahul.net!rahul.net!a2i!in-news.erinet.com!en.com!uunet!in2.uu.net!news.artisoft.com!usenet
From: Terry Lambert <terry@lambert.org>
Newsgroups: comp.unix.bsd.freebsd.misc
Subject: Re: Symmetric Multi-Processing
Date: Sat, 27 Apr 1996 15:15:21 -0700
Organization: Me
Lines: 197
Message-ID: <31829C79.40223F23@lambert.org>
References: <3180D16D.41C6@wcom.com> <4lr4q9$788@agate.berkeley.edu> <31813FF4.527A0A7@lambert.org> <4ltb6l$j6k@hops.entertain.com>
NNTP-Posting-Host: hecate.artisoft.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Mailer: Mozilla 2.01 (X11; I; Linux 1.1.76 i486)

Darryl Watson wrote:
]
] In article <31813FF4.527A0A7@lambert.org>, Terry Lambert
] <terry@lambert.org> says:
]
] [snip]
]
] >Starting with 2.0.5, the patches in pub on freefall.cdrom.com
] >will give you low grain SMP, like Linux has recently released.
] >
] >
] >] Sorry if you were only looking for a *BSD solution. Linux is
] >] the only commonly used free-UNIX available on PCs which has
] >] SMP capability.
] >
] >Wrong. BSD beat Linux to SMP by more than a year.
] >
] >BSD beat Linux to a unified VM/buffer cache by more than a
] >year as well (and still counting).
]
] Terry (or anyone): In the case of FreeBSD, what is 'low-grain'
] symmetric multiprocessing?

Granularity refers to the level of allowable kernel reentrancy.  A
low granularity implementation guards kernel reentrancy with a mutex
to prevent more than one processor from being in the kernel at a
time.  Both the current Linux and BSD SMP implementations are "low
grain".

My personal machine is running "median grain".  This means that
multiple processors can be in the kernel in certain case-restricted
circumstances.
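For concreteness, the "low grain" guard amounts to a single "giant"
lock taken on every kernel entry.  Here is a minimal sketch of the
idea; the names and the spin-lock primitive are invented for
illustration and are not the actual symbols from the Vogel patches:

```c
/*
 * Sketch of "low grain" SMP: one "giant" lock serializes kernel
 * entry, so at most one processor is in the kernel at a time.
 * Illustrative only -- not the real FreeBSD or Linux code.
 */
static volatile int giant_locked;   /* 1 while some CPU is in-kernel */
static int cpus_in_kernel;          /* guarded by the giant lock */

void kernel_entry(void)             /* trap/syscall prologue */
{
    while (__sync_lock_test_and_set(&giant_locked, 1))
        ;                           /* spin until the kernel is free */
    cpus_in_kernel++;               /* can never exceed 1 */
}

void kernel_exit(void)              /* return-to-user epilogue */
{
    cpus_in_kernel--;
    __sync_lock_release(&giant_locked);
}
```

Raising the granularity means shrinking the region this one lock
covers, subsystem by subsystem, until separate locks guard separate
resources.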
Basically, my system call trap code, some standard system calls, and
the FFS and several other file systems on my machine are multiply
reentrant.  The interrupt code and the exception code are not.  I am
using single entry/exit and case-by-case pushdown to increase the
granularity on a per system call basis.  Eventually, the entire
kernel will be reentrant.

] Does it mean that (I hope I hope) each time a process is
] created, it is assigned the most available CPU?  Or does it
] mean something like, once you login, you are assigned a CPU,
] all your processes use that same CPU, etc.?

Neither.  That is a scheduler issue, and isn't really related to
granularity; it's related to symmetry and scalability.

In the current (Jack Vogel) scheduler model, processors are
assignable anonymous resources.  This means that if a process is in
the ready-to-run state, it will be assigned the first free processor
available.  This has advantages and disadvantages, and I expect the
model will change.  I haven't really looked at changing it, and
scheduling is really someone else's department.  I'm mostly an FS
geek.  8-).

The main advantage is that, with per processor page allocation pools
and the use of a hierarchical locking system (which will be
necessary for any effectively scalable model, anyway), an anonymous
resource model scales without a practical processor limit.  This
assumes locking of the root node in IX (intention exclusive) mode,
so that processors avoid synchronizing on mutex data unless they are
affecting each other's resources.  The use of hierarchical locks
allows computation of transitive closure over the directed graph
which describes the lockable unique system resources and their
subresources.  It means deadlock avoidance, rather than detection.

The main disadvantage is that there is a significant loss of cache
locality; in point of fact, there wants to be load calculation per
processor for per processor ready-to-run queues, and use of the
queues to ensure "preferential" loading of code.
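Both policies can be sketched in a few lines.  NCPU, the arrays, and
all function names below are invented for illustration; the real
scheduler works on run queues of processes, not integers:

```c
/*
 * Sketch of CPU assignment policies.  assign_cpu() is the
 * "anonymous resource" model: first free processor wins.
 * assign_cpu_preferred() adds the per-processor-queue
 * "preferential" loading that restores cache locality.
 */
#define NCPU 4

static int cpu_busy[NCPU];          /* 1 = running something */

/* Anonymous model: return the first idle CPU, or -1 if none. */
int assign_cpu(void)
{
    for (int i = 0; i < NCPU; i++)
        if (!cpu_busy[i]) {
            cpu_busy[i] = 1;
            return i;
        }
    return -1;                      /* stay on the run queue */
}

void release_cpu(int i) { cpu_busy[i] = 0; }

/* Preferential model: try the cache-warm CPU first. */
int assign_cpu_preferred(int last)
{
    if (last >= 0 && last < NCPU && !cpu_busy[last]) {
        cpu_busy[last] = 1;
        return last;                /* caches still warm here */
    }
    return assign_cpu();            /* fall back to anonymous */
}
```

The preferred path is what buys back the cache locality the
anonymous model gives away.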
That is, if you ran on processor 'A' last time, you will be more
likely to run on processor 'A' next time.  I expect the best results
would be achieved from a fuzzy logic system to determine
queue-to-queue transition.  After all, for kernel threads, there
should be a "preference" to not run on the same processor as other
members of kernel threads in the same "thread group" (process).

There is no "processor assignment at login", which would be a
rather bogus idea.  8-).

] Is there a FAQ about the SMP capabilities of FreeBSD?

No, there isn't.

] Is SMP capability embedded in FreeBSD 2.1.0-R?  Or is it a
] FreeBSD-Current thingie?

It's a "non-integrated tree thingie".  There are separately
applicable patches for 28 Oct 1994, and another set of recent
patches by Poul-Henning Kamp, some of which are mine, based on
updates to the Vogel patches to -current's initialization and other
architecture changes.

] Secondly, what is a 'Unified VM/buffer', as opposed to any
] other swapping scheme?  I have seen several references to
] this as a feature of the OS.

It's an advocacy counter-argument.  8-).  My natural reaction to an
advocacy article is a surgical counter-strike.

It means that buffered I/O is done as a result of demand paging
instead of being done with a separate I/O subsystem.  Basically, we
don't need to call bmap on every I/O, and there are no cache
coherency problems when doing I/O to mmap'ed files.

It also means that neither subsystem hits artificially imposed
resource limits (ie: most resource contention between the VM and
I/O subsystems is completely resolved).  For instance, the SVR4 and
Solaris "SLAB allocator" based VM systems have resource problems
with LRU limitation when you mmap a large number of files and rotor
through their pages quickly.
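As a toy illustration of the coherency point (all structures and
names here are invented for the sketch, not FreeBSD's actual VM
code), a unified cache amounts to the read() path and the mmap()
path resolving (vnode, offset) through one shared page lookup:

```c
/*
 * Toy unified page cache: read() and mmap() both land on the
 * same page, so there is nothing to keep coherent.  No eviction
 * and no locking in this sketch.
 */
#include <string.h>

#define NPAGES 8
#define PGSIZE 16

struct page {
    int  vnode;                     /* which file */
    long offset;                    /* which page of it */
    int  valid;
    char data[PGSIZE];
};

static struct page cache[NPAGES];

/* One lookup path shared by the VM and I/O systems. */
static struct page *page_lookup(int vnode, long offset)
{
    for (int i = 0; i < NPAGES; i++)
        if (cache[i].valid && cache[i].vnode == vnode &&
            cache[i].offset == offset)
            return &cache[i];
    for (int i = 0; i < NPAGES; i++)        /* "fault" it in */
        if (!cache[i].valid) {
            cache[i].vnode = vnode;
            cache[i].offset = offset;
            cache[i].valid = 1;
            memset(cache[i].data, 0, PGSIZE);
            return &cache[i];
        }
    return 0;                       /* toy cache is full */
}

/* "mmap" path: the caller stores directly into the page. */
char *file_map(int vnode, long offset)
{
    return page_lookup(vnode, offset)->data;
}

/* "read()" path: copies out of the very same page. */
void file_read(int vnode, long offset, char *buf, int len)
{
    memcpy(buf, page_lookup(vnode, offset)->data, len);
}
```

A split design has to keep two copies of each block synchronized;
and, as above, the SVR4/Solaris SLAB-based VM piles LRU starvation
on top of that when many files are mmap'ed and cycled through.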
You can see this by running a large link (ld in SVR4 uses mmap on
the object and library files and directly references the symbol
space), and watching the resource contention as the VM system
thrashes the LRU and so thrashes the buffer cache.

In FreeBSD, it's a relatively simple change to add per vnode working
set quotas because of the unified VM.  If you want to try this,
remember that when a vnode hits its working set limit, you want to
LRU within the vnode page list itself... pages forced from the vnode
go to the front of the system LRU list (ie: it's just a modified
insertion algorithm).

The SVR4/Solaris solution for this problem is to run programs with
critical user response requirements in a new scheduling class
(UnixWare, for instance, runs the X Server in a "fixed" scheduling
class, which ensures a minimum amount of time; since the resource
starvation is in pages, not CPU time, this is only partially
effective in resolving the "move mouse, wiggle cursor" problem).

John Dyson and David Greenman are the main VM gurus for FreeBSD;
they have a lot more detail.  I expect that BSD will go to a
modified zone/slab allocation technique in pursuit of variable
persistence objects (short, medium, and long persistence kernel
objects want to be allocated in separate regions to avoid kernel
memory fragmentation) and SMP/kernel multithreading.

For SMP, zone allocation is superior to slab allocation in ensuring
processor locality and thus scalability to a large number of
processors; it's what Sequent uses.  The "8 processor concurrency
limit" frequently quoted for Intel machines is actually a limit
based on resource synchronization.  If you divide the resources
into "common" and "per processor" zones, then you do not need to
globally synchronize for anything but interzone access.
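A sketch of that zone split (sizes and names invented for the
sketch; a real zone allocator manages pages and objects, not ints,
and the interzone path would take a lock):

```c
/*
 * "Common" vs "per processor" allocation zones: each CPU
 * allocates from its own zone with no global synchronization,
 * and only the common-zone fallback is interzone access.
 */
#define NCPU      2
#define ZONE_SIZE 4

struct zone {
    int objs[ZONE_SIZE];
    int nfree;
};

static struct zone percpu[NCPU];    /* no cross-CPU contention */
static struct zone common;          /* interzone: global lock here */

int zone_alloc(int cpu)
{
    struct zone *z = &percpu[cpu];
    if (z->nfree > 0)
        return z->objs[--z->nfree]; /* lock-free fast path */
    /* interzone access: the only globally synchronized path
     * (lock elided in this single-threaded sketch) */
    if (common.nfree > 0)
        return common.objs[--common.nfree];
    return -1;                      /* out of objects */
}

void zone_free(int cpu, int obj)
{
    struct zone *z = &percpu[cpu];
    if (z->nfree < ZONE_SIZE)
        z->objs[z->nfree++] = obj;  /* stays processor-local */
    else if (common.nfree < ZONE_SIZE)
        common.objs[common.nfree++] = obj;  /* overflow to common */
}
```

With the fast path local to each processor, adding processors adds
zones rather than adding contention on one global pool.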
Even then, you can limit contention to the processors affected, if
you use a hierarchically arranged lock manager with "self" and
"other" locks per processor (again, the hierarchy plays an
important role in the ability to compute transitive closure across
a dual zone lock state).  Of course, you'd want to allocate on the
basis of slabs within any given zone.

For a multithreaded UP kernel, the hierarchy locks can be
converted, at compile time, into test-and-set semaphores, so there
is no additional code for SMP above UP with multithreading, if
correctly implemented.  A UP kernel without multithreading *could*
be built by nulling out the calls (this assumes single entry/single
exit is used to ensure coherent lock state in both compilation
cases).  But it should be noted that multithreading yielded a 160%
performance improvement in UFS in SVR4, even after the mutex
overhead, in the UP kernel case.  Kernel reentrancy/multithreading
is desirable even in the absence of multiple processors.

Anyway, that's more than enough for now.  8-).

Regards,

                                        Terry Lambert
                                        terry@lambert.org
---
Any opinions in this posting are my own and not those of my
present or previous employers.