Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!harbinger.cc.monash.edu.au!news.mira.net.au!news.vbc.net!samba.rahul.net!rahul.net!a2i!bug.rahul.net!rahul.net!a2i!in-news.erinet.com!en.com!uunet!in2.uu.net!news.artisoft.com!usenet
From: Terry Lambert <terry@lambert.org>
Newsgroups: comp.unix.bsd.freebsd.misc
Subject: Re: Symmetric Multi-Processing
Date: Sat, 27 Apr 1996 15:15:21 -0700
Organization: Me
Lines: 197
Message-ID: <31829C79.40223F23@lambert.org>
References: <3180D16D.41C6@wcom.com> <4lr4q9$788@agate.berkeley.edu> <31813FF4.527A0A7@lambert.org> <4ltb6l$j6k@hops.entertain.com>
NNTP-Posting-Host: hecate.artisoft.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Mailer: Mozilla 2.01 (X11; I; Linux 1.1.76 i486)

Darryl Watson wrote:
]
] In article <31813FF4.527A0A7@lambert.org>, Terry Lambert
] <terry@lambert.org> says:
]
] [snip]
]
] >Starting with 2.0.5, the patches in pub on freefall.cdrom.com
] >will give you low grain SMP, like Linux has recently released.
] >
] >
] >] Sorry if you were only looking for a *BSD solution. Linux is
] >] the only commonly used free-UNIX available on PCs which has
] >] SMP capability.
] >
] >Wrong. BSD beat Linux to SMP by more than a year.
] >
] >BSD beat Linux to a unified VM/buffer cache by more than a
] >year as well (and still counting).
]
] Terry (or anyone): In the case of FreeBSD, what is 'low-grain'
] symmetric multiprocessing?

Granularity refers to the level of allowable kernel reentrancy.  A
low granularity implementation guards kernel reentrancy with a mutex
to prevent more than one processor from being in the kernel at a
time.  Both the current Linux and BSD SMP implementations are "low
grain".

My personal machine is running "median grain".  This means that
multiple processors can be in the kernel in certain case-restricted
circumstances.
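For concreteness, the "low grain" guard amounts to a single "giant"
lock taken on every kernel entry.  Here is a minimal sketch of the
idea; the names and the spin-lock primitive are invented for
illustration and are not the actual symbols from the Vogel patches:

```c
/*
 * Sketch of "low grain" SMP: one "giant" lock serializes kernel
 * entry, so at most one processor is in the kernel at a time.
 * Illustrative only -- not the real FreeBSD or Linux code.
 */
static volatile int giant_locked;   /* 1 while some CPU is in-kernel */
static int cpus_in_kernel;          /* guarded by the giant lock */

void kernel_entry(void)             /* trap/syscall prologue */
{
    while (__sync_lock_test_and_set(&giant_locked, 1))
        ;                           /* spin until the kernel is free */
    cpus_in_kernel++;               /* can never exceed 1 */
}

void kernel_exit(void)              /* return-to-user epilogue */
{
    cpus_in_kernel--;
    __sync_lock_release(&giant_locked);
}
```

Raising the granularity means shrinking the region this one lock
covers, subsystem by subsystem, until separate locks guard separate
resources.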
Basically, my system call trap code, some standard system calls, and
the FFS and several other file systems on my machine are multiply
reentrant.  The interrupt code and the exception code are not.  I am
using single entry/exit and case-by-case pushdown to increase the
granularity on a per system call basis.  Eventually, the entire
kernel will be reentrant.

] Does it mean that (I hope I hope) each time a process is
] created, it is assigned the most available CPU?  Or does it
] mean something like, once you login, you are assigned a CPU,
] all your processes use that same CPU, etc.?

Neither.  That is a scheduler issue, and isn't really related to
granularity; it's related to symmetry and scalability.

In the current (Jack Vogel) scheduler model, processors are
assignable anonymous resources.  This means that if a process is in
the ready-to-run state, it will be assigned the first free processor
available.  This has advantages and disadvantages, and I expect the
model will change.  I haven't really looked at changing it, and
scheduling is really someone else's department.  I'm mostly an FS
geek.  8-).

The main advantage is that, with per processor page allocation pools
and the use of a hierarchical locking system (which will be
necessary for any effectively scalable model, anyway), an anonymous
resource model scales without a practical processor limit.  This
assumes locking of the root node in IX (intention exclusive) mode,
so that processors avoid synchronizing on mutex data unless they are
affecting each other's resources.  The use of hierarchical locks
allows computation of transitive closure over the directed graph
which describes the lockable unique system resources and their
subresources.  It means deadlock avoidance, rather than detection.

The main disadvantage is that there is a significant loss of cache
locality; in point of fact, there wants to be load calculation per
processor for per processor ready-to-run queues, and use of the
queues to ensure "preferential" loading of code.
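Both policies can be sketched in a few lines.  NCPU, the arrays, and
all function names below are invented for illustration; the real
scheduler works on run queues of processes, not integers:

```c
/*
 * Sketch of CPU assignment policies.  assign_cpu() is the
 * "anonymous resource" model: first free processor wins.
 * assign_cpu_preferred() adds the per-processor-queue
 * "preferential" loading that restores cache locality.
 */
#define NCPU 4

static int cpu_busy[NCPU];          /* 1 = running something */

/* Anonymous model: return the first idle CPU, or -1 if none. */
int assign_cpu(void)
{
    for (int i = 0; i < NCPU; i++)
        if (!cpu_busy[i]) {
            cpu_busy[i] = 1;
            return i;
        }
    return -1;                      /* stay on the run queue */
}

void release_cpu(int i) { cpu_busy[i] = 0; }

/* Preferential model: try the cache-warm CPU first. */
int assign_cpu_preferred(int last)
{
    if (last >= 0 && last < NCPU && !cpu_busy[last]) {
        cpu_busy[last] = 1;
        return last;                /* caches still warm here */
    }
    return assign_cpu();            /* fall back to anonymous */
}
```

The preferred path is what buys back the cache locality the
anonymous model gives away.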
That is, if you ran on processor 'A' last time, you will be more
likely to run on processor 'A' next time.  I expect the best results
would be achieved from a fuzzy logic system to determine
queue-to-queue transition.  After all, for kernel threads, there
should be a "preference" to not run on the same processor as other
members of kernel threads in the same "thread group" (process).

There is no "processor assignment at login", which would be a
rather bogus idea.  8-).

] Is there a FAQ about the SMP capabilities of FreeBSD?

No, there isn't.

] Is SMP capability embedded in FreeBSD 2.1.0-R?  Or is it a
] FreeBSD-Current thingie?

It's a "non-integrated tree thingie".  There are separately
applicable patches for 28 Oct 1994, and another set of recent
patches by Poul-Henning Kamp, some of which are mine, based on
updates to the Vogel patches to -current's initialization and other
architecture changes.

] Secondly, what is a 'Unified VM/buffer', as opposed to any
] other swapping scheme?  I have seen several references to
] this as a feature of the OS.

It's an advocacy counter-argument.  8-).  My natural reaction to an
advocacy article is a surgical counter-strike.

It means that buffered I/O is done as a result of demand paging
instead of being done with a separate I/O subsystem.  Basically, we
don't need to call bmap on every I/O, and there are no cache
coherency problems when doing I/O to mmap'ed files.

It also means that neither subsystem hits artificially imposed
resource limits (ie: most resource contention between the VM and
I/O subsystems is completely resolved).  For instance, the SVR4 and
Solaris "SLAB allocator" based VM systems have resource problems
with LRU limitation when you mmap a large number of files and rotor
through their pages quickly.
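As a toy illustration of the coherency point (all structures and
names here are invented for the sketch, not FreeBSD's actual VM
code), a unified cache amounts to the read() path and the mmap()
path resolving (vnode, offset) through one shared page lookup:

```c
/*
 * Toy unified page cache: read() and mmap() both land on the
 * same page, so there is nothing to keep coherent.  No eviction
 * and no locking in this sketch.
 */
#include <string.h>

#define NPAGES 8
#define PGSIZE 16

struct page {
    int  vnode;                     /* which file */
    long offset;                    /* which page of it */
    int  valid;
    char data[PGSIZE];
};

static struct page cache[NPAGES];

/* One lookup path shared by the VM and I/O systems. */
static struct page *page_lookup(int vnode, long offset)
{
    for (int i = 0; i < NPAGES; i++)
        if (cache[i].valid && cache[i].vnode == vnode &&
            cache[i].offset == offset)
            return &cache[i];
    for (int i = 0; i < NPAGES; i++)        /* "fault" it in */
        if (!cache[i].valid) {
            cache[i].vnode = vnode;
            cache[i].offset = offset;
            cache[i].valid = 1;
            memset(cache[i].data, 0, PGSIZE);
            return &cache[i];
        }
    return 0;                       /* toy cache is full */
}

/* "mmap" path: the caller stores directly into the page. */
char *file_map(int vnode, long offset)
{
    return page_lookup(vnode, offset)->data;
}

/* "read()" path: copies out of the very same page. */
void file_read(int vnode, long offset, char *buf, int len)
{
    memcpy(buf, page_lookup(vnode, offset)->data, len);
}
```

A split design has to keep two copies of each block synchronized;
and, as above, the SVR4/Solaris SLAB-based VM piles LRU starvation
on top of that when many files are mmap'ed and cycled through.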
You can see this by running a large link (ld in SVR4 uses mmap on
the object and library files and directly references the symbol
space), and watching the resource contention as the VM system
thrashes the LRU and so thrashes the buffer cache.

In FreeBSD, it's a relatively simple change to add per vnode working
set quotas because of the unified VM.  If you want to try this,
remember that when a vnode hits its working set limit, you want to
LRU within the vnode page list itself... pages forced from the vnode
go to the front of the system LRU list (ie: it's just a modified
insertion algorithm).

The SVR4/Solaris solution for this problem is to run programs with
critical user response requirements in a new scheduling class
(UnixWare, for instance, runs the X Server in a "fixed" scheduling
class, which ensures a minimum amount of time; since the resource
starvation is in pages, not CPU time, this is only partially
effective in resolving the "move mouse, wiggle cursor" problem).

John Dyson and David Greenman are the main VM gurus for FreeBSD;
they have a lot more detail.  I expect that BSD will go to a
modified zone/slab allocation technique in pursuit of variable
persistence objects (short, medium, and long persistence kernel
objects want to be allocated in separate regions to avoid kernel
memory fragmentation) and SMP/kernel multithreading.

For SMP, zone allocation is superior to slab allocation in ensuring
processor locality and thus scalability to a large number of
processors; it's what Sequent uses.  The "8 processor concurrency
limit" frequently quoted for Intel machines is actually a limit
based on resource synchronization.  If you divide the resources
into "common" and "per processor" zones, then you do not need to
globally synchronize for anything but interzone access.
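A sketch of that zone split (sizes and names invented for the
sketch; a real zone allocator manages pages and objects, not ints,
and the interzone path would take a lock):

```c
/*
 * "Common" vs "per processor" allocation zones: each CPU
 * allocates from its own zone with no global synchronization,
 * and only the common-zone fallback is interzone access.
 */
#define NCPU      2
#define ZONE_SIZE 4

struct zone {
    int objs[ZONE_SIZE];
    int nfree;
};

static struct zone percpu[NCPU];    /* no cross-CPU contention */
static struct zone common;          /* interzone: global lock here */

int zone_alloc(int cpu)
{
    struct zone *z = &percpu[cpu];
    if (z->nfree > 0)
        return z->objs[--z->nfree]; /* lock-free fast path */
    /* interzone access: the only globally synchronized path
     * (lock elided in this single-threaded sketch) */
    if (common.nfree > 0)
        return common.objs[--common.nfree];
    return -1;                      /* out of objects */
}

void zone_free(int cpu, int obj)
{
    struct zone *z = &percpu[cpu];
    if (z->nfree < ZONE_SIZE)
        z->objs[z->nfree++] = obj;  /* stays processor-local */
    else if (common.nfree < ZONE_SIZE)
        common.objs[common.nfree++] = obj;  /* overflow to common */
}
```

With the fast path local to each processor, adding processors adds
zones rather than adding contention on one global pool.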
Even then, you can limit contention to the processors affected, if
you use a hierarchically arranged lock manager with "self" and
"other" locks per processor (again, the hierarchy plays an
important role in the ability to compute transitive closure across
a dual zone lock state).  Of course, you'd want to allocate on the
basis of slabs within any given zone.

For a multithreaded UP kernel, the hierarchy locks can be
converted, at compile time, into test-and-set semaphores, so there
is no additional code for SMP above UP with multithreading, if
correctly implemented.  A UP kernel without multithreading *could*
be built by nulling out the calls (this assumes single entry/single
exit is used to ensure coherent lock state in both compilation
cases).  But it should be noted that multithreading yielded a 160%
performance improvement in UFS in SVR4, even after the mutex
overhead, in the UP kernel case.  Kernel reentrancy/multithreading
is desirable even in the absence of multiple processors.

Anyway, that's more than enough for now.  8-).

Regards,

                                        Terry Lambert
                                        terry@lambert.org
---
Any opinions in this posting are my own and not those of my
present or previous employers.