*BSD News Article 12672

Path: sserve!newshost.anu.edu.au!munnari.oz.au!spool.mu.edu!howland.reston.ans.net!bogus.sura.net!udel!sbcs.sunysb.edu!stark.UUCP!gene
From: gene@stark.uucp (Gene Stark)
Newsgroups: comp.os.386bsd.bugs
Subject: Excessive Interrupt Latencies
Date: 15 Mar 93 11:57:56
Organization: Gene Stark's home system
Lines: 49
Distribution: world
Message-ID: <GENE.93Mar15115756@stark.stark.uucp>
NNTP-Posting-Host: stark.uucp

I have been trying to get some insight into the *real* cause of the
"com: silo overflow" errors.  By hacking in some instrumentation using
the recently posted high-precision "microtime" routine, I have convinced
myself that the problem is the latency between the time the com hardware
requests an interrupt and the time control reaches the comintr routine.
This latency is often as much as 400us (on my 486DX/33) and can be as
long as 1.5ms or more.  In addition, the system sometimes seems to get
into a state where latencies over 1ms are the norm, rather than the
exception.
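
To put those latencies in perspective, here is a rough sketch of the
arithmetic.  The line speeds and the amount of receiver slack are
assumptions for illustration only; the figures above do not depend on
them.  It assumes an unbuffered UART (8250/16450 style) that can absorb
about one more character after it requests the interrupt, and 10 bits on
the wire per character (8N1).  The function names are made up for this
sketch.

```c
/* Time to receive one character, in microseconds.
 * bits_per_char counts start + data + parity + stop bits (10 for 8N1). */
double char_time_us(long baud, int bits_per_char)
{
    return 1e6 * (double)bits_per_char / (double)baud;
}

/* Longest interrupt latency the receiver tolerates before a character is
 * lost, ASSUMING it can hold slack_chars further characters after it has
 * requested the interrupt (1 for an unbuffered UART). */
double latency_budget_us(long baud, int bits_per_char, int slack_chars)
{
    return (double)slack_chars * char_time_us(baud, bits_per_char);
}

/* Nonzero if the given latency exceeds the budget, i.e. an overrun
 * would be expected under the assumptions above. */
int would_overflow(double latency_us, long baud, int bits_per_char,
                   int slack_chars)
{
    return latency_us > latency_budget_us(baud, bits_per_char, slack_chars);
}
```

Under these assumptions one character takes roughly 1042us at 9600 baud
and roughly 260us at 38400 baud, so a 1.5ms latency would lose data even
at 9600, and a 400us latency would lose data at 38400.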

I hacked up a version of the com driver in which the comintr routine
runs at splhigh, performing minimal service and queueing work to run later
at spltty.  This code dramatically decreases, but does not completely
eliminate, silo overflow errors.  I conclude that some portion of the system
is occasionally running for periods over 1ms with interrupts masked.
This seems excessive, and I have been trying to track down where this might
be occurring.
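
The general shape of such a split handler can be sketched as below.
This is not the modified driver itself, only a minimal illustration of
the technique: a single-producer, single-consumer ring buffer, where the
fast half does nothing but store the character and the deferred half
drains the queue later.  All names (rxq, rxq_put, rxq_drain, RXQ_SIZE)
are hypothetical, and the priority levels appear only in comments.

```c
#include <stddef.h>

#define RXQ_SIZE 256    /* hypothetical size; must be a power of two */

struct rxq {
    volatile unsigned head;     /* written only by the fast half */
    volatile unsigned tail;     /* written only by the deferred half */
    unsigned char buf[RXQ_SIZE];
    unsigned long dropped;      /* characters lost because the queue was full */
};

/* Fast half: would run at the highest priority.  Stores one character
 * and returns at once.  Returns 1 on success, 0 if the queue is full. */
int rxq_put(struct rxq *q, unsigned char c)
{
    unsigned next = (q->head + 1) & (RXQ_SIZE - 1);

    if (next == q->tail) {      /* full: one slot is always left empty */
        q->dropped++;
        return 0;
    }
    q->buf[q->head] = c;
    q->head = next;
    return 1;
}

/* Deferred half: would run later at the lower tty priority.  Copies up
 * to max queued characters into out and returns how many were copied. */
size_t rxq_drain(struct rxq *q, unsigned char *out, size_t max)
{
    size_t n = 0;

    while (n < max && q->tail != q->head) {
        out[n++] = q->buf[q->tail];
        q->tail = (q->tail + 1) & (RXQ_SIZE - 1);
    }
    return n;
}
```

Because each index has exactly one writer, the two halves need no lock
between them on a uniprocessor.  The queue only moves the problem,
though: if something holds interrupts masked for longer than the
hardware can buffer, characters are lost before rxq_put is ever reached.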

One problem in trying to figure out what is going on is that it is very
difficult to track priority levels through the code in locore.s.
I have a sneaking suspicion that under certain circumstances control
leaves the context switcher and reaches a user process in the system
while still at splhigh, when it shouldn't be.  This would cause a long
stretch of system code to be executed with interrupts masked, producing
the observed latencies.

In trying to understand what is happening, I came across the following code
in locore.s (at about line 1302, at the end of "swtch"):

	movl	%ecx,_curproc		# into next process
	movl	%edx,_curpcb

	/* pushl	PCB_IML(%edx)
	call	_splx
	popl	%eax*/

	movl	%edx,%eax		# return (1);
	ret

The thing that concerns me is the commented-out code for restoring the
priority level from the pcb.  It looks to me as though, when swtch is
called at splclock() (see, for example, the end of "cpu_exit" in
vm_machdep.c), control could return to a user process in the system at
splclock, instead of at whatever priority it ought to be running at.
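
To make the concern concrete, here is a toy model in C.  It is not the
kernel code and makes no claim about what the rest of the system does.
It only shows that if the switch saves the outgoing level but never
restores the incoming one, the next process inherits whatever level the
caller was at.  The names (toy_pcb, toy_cpl, toy_splx, toy_swtch) and
the small-integer levels are inventions for this sketch, and saving the
level on the way out is itself an assumption of the model.

```c
/* Hypothetical priority levels; the real ones are not small integers. */
enum { TOY_IPL_NONE = 0, TOY_IPL_TTY = 1, TOY_IPL_CLOCK = 2, TOY_IPL_HIGH = 3 };

struct toy_pcb {
    int saved_ipl;      /* stands in for the PCB_IML field */
};

int toy_cpl = TOY_IPL_NONE;     /* current priority level of the model */

/* Set the current level and return the previous one. */
int toy_splx(int ipl)
{
    int old = toy_cpl;

    toy_cpl = ipl;
    return old;
}

/* Switch from one process to another.  With restore == 0 this models the
 * switch with the restore commented out; with restore != 0 it models the
 * switch with the pushl/call/popl sequence enabled. */
void toy_swtch(struct toy_pcb *from, struct toy_pcb *to, int restore)
{
    from->saved_ipl = toy_cpl;          /* assumed: level saved on the way out */
    if (restore)
        toy_splx(to->saved_ipl);        /* the step that is commented out */
}
```

In this model, calling toy_swtch at TOY_IPL_CLOCK with restore == 0
leaves toy_cpl at TOY_IPL_CLOCK in the incoming process, while
restore != 0 brings it back to the incoming process's saved level.
Whether some other path in the real kernel lowers the level before
control gets back to the process is exactly what I cannot tell.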

Can anyone shed some light on this?

							- Gene Stark
--
							stark@cs.sunysb.edu