Path: sserve!newshost.anu.edu.au!munnari.oz.au!news.Hawaii.Edu!ames!agate!howland.reston.ans.net!xlink.net!math.fu-berlin.de!fub!unlisys!max.IN-Berlin.DE!not-for-mail
From: berry@moritz.IN-Berlin.DE (Stefan Behrens)
Newsgroups: comp.os.386bsd.questions
Subject: Re: drive light on and locked -- w/ patch
Date: 21 Jul 1993 00:19:45 +0200
Organization: Private Site in Berlin, Germany
Lines: 309
Message-ID: <22hr2f$hm@moritz.in-berlin.de>
References: <22gd69$6j3@hrd769.brooks.af.mil>
NNTP-Posting-Host: moritz.in-berlin.de
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Summary: patch for 386bsd0.1 pk0.2.4 that detects and clears wd-ctlr lockups

In article <22gd69$6j3@hrd769.brooks.af.mil> burgess@hrd769.brooks.af.mil (Dave Burgess) writes:
>There was some discussion about three weeks ago about the system hang
>where the hard drive locks up with the drive light lit. Someone,
>whose name I have regretfully forgotten, posted either a description of
>a change that (s)he had made to the system that reset the drive.

I cannot remember seeing a patch for it myself, nor can egrep find one.
But I have also had the problem with the locked IDE controller ever
since I installed 386bsd one year ago.

It's easy to recognize a locked driver, and it's easy to recover from
it. But I still don't know why this happens: why the driver loses the
interrupt from the controller, or maybe why the controller doesn't
generate the expected interrupt.

I wrote some code to catch the problem. For every block read or write
request, it checks whether the controller answers in time. This is easy
to do for IDE controllers and adds little overhead.

>In case I dreamed it (which has happened more than once), I propose a
>simple solution (sweets from the sweet :-).
>
>On a disk {read,select} start up an alarm that times out in three
>seconds. On a successful operation from the disk, clear the alarm.
>When the alarm expires, that means that no disk activity has succeeded
>in the last three seconds, which would seem to me to be a good indicator
>that the drive/controller has seized up again.

This really is what my code does. But it isn't necessary to start a
timeout for every request, since IDE controllers are so stupid that
they can only handle one request at a time. So my code does the
following:

 o in general -- for every attached drive, a function is called every
   two seconds that checks for an in-progress request which the
   controller has not answered

 o in detail:
   - on a per-drive basis, a function is called periodically and a
     timeout counter is maintained
   - when a read/write request is started, the counter is set to two
   - when the expected interrupt comes in, the counter is cleared to
     zero
   - in the timeout function, when the counter is > 0, it's decremented
       o if it's decremented the first time (so it's one then) nothing
         is done
       o if it's decremented the second time (and zero then) the
         controller is timed out...

 o ...when it's timed out:
   - status/debug info is printed
   - the failure is logged
   - the controller is reset in order to put it in a known state
   - the request is restarted in a `sector by sector' way; this means
     multi-sector transfers are split up

I used code similar to this in a 386bsd0.1 pk0.2.2 environment with the
Barsoom wd-driver (used in NetBSD too) and Bruce's intr/npx/com stuff
for months. This new patch (and first public posting) is against
386bsd0.1 pk0.2.4. It should be a two-line change to use it for NetBSD,
but I didn't try it.

The code for the detection of the problem is very well tested and very
old. It won't hurt systems that don't show the problem. The code for
solving the problem and for restarting the request is newer.

I have two machines up which run that code: one with only one IDE
drive, and one with two IDE drives and some SCSI devices. The first and
newer computer never had problems with locked wd controllers.
But the code doesn't hurt either. The second machine, which is a
server, used to fail very often because of this. Now it detects the
lock, resets the drive and restarts the request. The problem is finally
solved for me.

>Note: I am using the current-sources from sun-lamp using sup.

NetBSD uses the Barsoom wd-driver. It should be easy to change the
following patch for this driver.

>The drive does not fail like this with the 0.8
>released kernel, or from the sources that came out with the original 0.8
>release.

It happens in ``cooperation'' with some other drivers, e.g. with the
com driver or with the we-ethernet driver (for me).

Maybe someone who knows more about IDE controllers can comment on this:
in the situation where the controller gets locked, the status is

 - inb(wdc+wd_status) --> WDCS_READY|WDCS_SEEKCMPLT|WDCS_DRQ
   which means the drive is ready, the seek has completed and the data
   request bit is set
 - inb(wdc+wd_error) --> is 0
 - the request is a multi-sector block read request
 - the action that helps for me is to restart the request with
   `du->dk_skip = 0;' and `du->dk_flags |= DKFL_SINGLE;'
 - without forcing the redo in single-sector steps, the restart didn't
   succeed

Ok, after all that boring talk, here's the patch against 386bsd0.1
pk0.2.4. Try it; it won't hurt, but it will save many system lockups!

*** /tmp/,RCSt1000483	Tue Jul 20 23:45:22 1993
--- wd.c	Tue Jul 20 23:33:01 1993
***************
*** 56,61 ****
--- 56,63 ----
   * 17 May 93	Rodney W. Grimes	Fixed all 1000000 to use WDCTIMEOUT,
   *					and increased to 1000000*10 for new
   *					intr-0.1 code.
+  * 15 Jul 93	Stefan Behrens		Added real timeout code to catch
+  *					hanging controller
   */
  
  /* TODO:peel out buffer at low ipl, speed improvement */
***************
*** 150,155 ****
--- 152,174 ----
  int wddebug;
  #endif
  
+ /*
+  * counter for lost int detection.
+  * Three values are used:
+  *	2 -- this is the initial value when the timeout is armed
+  *	1 -- timed out once, give it one more chance
+  *	0 -- timeout not armed, idle
+  * The per-drive-counter is an overkill here, for wd-controller a
+  * per-controller-counter would be enough. But this way it doesn't
+  * add new restrictions to the driver, and it's simple.
+  */
+ int wdtimeout_counter[_NWD];
+ 
+ /*
+  * during recovery of timed out requests this counter is used
+  */
+ int wdtimeout_retry[_NWD];
+ 
  struct isa_driver wddriver = {
  	wdprobe, wdattach, "wd",
  };
***************
*** 160,165 ****
--- 179,185 ----
  int wdcontrol(struct buf *);
  int wdsetctlr(dev_t, struct disk *);
  int wdgetctlr(int, struct disk *);
+ int wdtimeout(caddr_t);
  
  /*
   * Probe for controller.
***************
*** 227,232 ****
--- 247,257 ----
  		du->dk_port = dvp->id_iobase;
  	}
  
+ 	wdtimeout_retry[unit] =
+ 	wdtimeout_counter[unit] = 0;	/* not armed yet */
+ 	wdtimeout((caddr_t) unit);	/* initially set timeout */
+ 
+ 
  	/* print out description of drive, suppressing multiple blanks*/
  	if(wdgetctlr(unit, du) == 0) {
  		int i, blank;
***************
*** 402,408 ****
  			(bp->b_flags & B_READ) ? "read" : "write",
  			bp->b_bcount, blknum);
  	else
! 		printf(" %d)%x", du->dk_skip, inb(wdc+wd_altsts));
  #endif
  	addr = (int) bp->b_un.b_addr;
  	if (du->dk_skip == 0)
--- 427,433 ----
  			(bp->b_flags & B_READ) ? "read" : "write",
  			bp->b_bcount, blknum);
  	else
! 		printf(" %d)%x", du->dk_skip, inb(du->dk_port+wd_altsts));
  #endif
  	addr = (int) bp->b_un.b_addr;
  	if (du->dk_skip == 0)
***************
*** 537,543 ****
  	}
  
  	/* if this is a read operation, just go away until it's done.	*/
! 	if (bp->b_flags & B_READ) return;
  
  	/* ready to send data?	*/
  	timeout = 0;
--- 562,573 ----
  	}
  
  	/* if this is a read operation, just go away until it's done.	*/
! 	if (bp->b_flags & B_READ) {
! 		wdtimeout_counter[unit] = 2;	/* arm timeout counter */
! 		if (wdtimeout_retry[unit])
! 			printf("wd.c: retry block read\n");
! 		return;
! 	}
  
  	/* ready to send data?	*/
  	timeout = 0;
***************
*** 561,566 ****
--- 591,599 ----
  	outsw (wdc+wd_data, addr+du->dk_skip * DEV_BSIZE,
  		DEV_BSIZE/sizeof(short));
  	du->dk_bc -= DEV_BSIZE;
+ 	wdtimeout_counter[unit] = 2;	/* arm timeout counter */
+ 	if (wdtimeout_retry[unit])
+ 		printf("wd.c: retry block write\n");	/* never seen this */
  }
  
  /* Interrupt routine for the controller.  Acknowledge the interrupt, check for
***************
*** 588,593 ****
--- 621,630 ----
  	du = wddrives[wdunit(bp->b_dev)];
  	wdc = du->dk_port;
  
+ 	wdtimeout_counter[wdunit(bp->b_dev)] = 0;	/* unarm counter */
+ 	wdtimeout_retry[wdunit(bp->b_dev)] = 0;		/* start from zero */
+ 
+ 
  #ifdef WDDEBUG
  	printf("I ");
  #endif
***************
*** 1349,1352 ****
--- 1386,1463 ----
  	}
  	return(0);
  }
+ 
+ /*
+  * called periodically every two seconds for each attached drive.
+  * check if the drive didn't answer in time.
+  */
+ int
+ wdtimeout(caddr_t arg)
+ 		/* arg is # of unit */
+ {
+ 	int x = splbio();	/* XXX kept all the time */
+ 	register int unit = (int) arg;
+ 
+ 	if (wdtimeout_counter[unit]) {	/* armed? */
+ 		if (--wdtimeout_counter[unit] == 0) {	/* timed out? */
+ 			struct disk *du = wddrives[unit];
+ 			int wdc = du->dk_port;
+ 
+ 			wdtimeout_retry[unit]++;
+ 
+ 			/* log failure */
+ 			printf("wd.c: wd%d timed out, retry #%d\n",
+ 				unit, wdtimeout_retry[unit]);
+ 
+ 			/* reset ctlr and redo request */
+ 			switch (wdtimeout_retry[unit]) {
+ 			case 1:	/* for me one retry is enough :-) */
+ 			case 2:
+ 			case 3:
+ 				/*
+ 				 * print some status info
+ 				 *
+ 				 * The values I see for my system are:
+ 				 *	status=58, error=0
+ 				 * That means status is:
+ 				 *	WDCS_READY	Selected drive is ready
+ 				 *	WDCS_SEEKCMPLT	Seek complete
+ 				 *	WDCS_DRQ	Data request bit.
+ 				 * Does anyone know a reason for the timeout?
+ 				 */
+ 				printf("wd.c: wd%d status %x, error %x\n",
+ 					unit, inb(wdc+wd_status),
+ 					inb(wdc+wd_error));
+ 				/*
+ 				 * reset the device, give it a known state
+ 				 */
+ 				outb(wdc+wd_ctlr, (WDCTL_RST|WDCTL_IDS));
+ 				DELAY(1000);
+ 				outb(wdc+wd_ctlr, WDCTL_IDS);
+ 				DELAY(1000);
+ 				(void) inb(wdc+wd_error);	/* XXX! */
+ 				outb(wdc+wd_ctlr, WDCTL_4BIT);
+ 				/*
+ 				 * we'll redo the xfer sector by sector.
+ 				 * This is the trick that helps here!
+ 				 */
+ 				du->dk_skip = 0;	/* start at #0 again */
+ 				du->dk_flags |= DKFL_SINGLE;	/* slow down */
+ 				break;
+ 			default:	/* give up -- never happened for me */
+ 				panic("cannot solve problem with hanging wd ctrl");
+ 				break;
+ 			}
+ 
+ 			/* restart request */
+ 			wdstart();
+ 		}
+ 	}
+ 
+ 	/* plan next timeout */
+ 	timeout(wdtimeout, unit, 200);
+ 	splx(x);
+ 	return (0);
+ }
+ 
+ #endif
-- 
Stefan (berry@max.IN-Berlin.DE)
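
For anyone who wants to look at the counter logic apart from the
driver, the two-stage countdown described in the post can be sketched
in plain userspace C. This is only an illustration under assumptions:
the names struct watchdog, watchdog_arm, watchdog_ack and watchdog_tick
are hypothetical and do not appear in the patch, and the sketch leaves
out everything hardware-related (splbio, the controller reset, the
restart via wdstart).

```c
#include <stdbool.h>

/* Hypothetical stand-in for one drive's wdtimeout_counter/wdtimeout_retry pair. */
struct watchdog {
    int counter;   /* 2 = armed, 1 = one period elapsed, 0 = idle */
    int retries;   /* consecutive timeouts since the last interrupt */
};

/* Call when a read/write request is handed to the controller. */
void watchdog_arm(struct watchdog *w)
{
    w->counter = 2;
}

/* Call from the interrupt handler: the controller answered in time. */
void watchdog_ack(struct watchdog *w)
{
    w->counter = 0;
    w->retries = 0;
}

/*
 * Call once per period (two seconds in the post).
 * Returns true only on the decrement that reaches zero, i.e. when an
 * armed request has seen two periodic calls without an interrupt.
 */
bool watchdog_tick(struct watchdog *w)
{
    if (w->counter == 0)
        return false;          /* idle, nothing outstanding */
    if (--w->counter != 0)
        return false;          /* first period: give it one more chance */
    w->retries++;              /* second period: report a timeout */
    return true;
}
```

Because the periodic call is not synchronized with the start of a
request, the first decrement may come anywhere from almost immediately
to one full period after arming. With a two-second period, a hang would
therefore be noticed between two and four seconds after the request was
issued, which is presumably why the counter starts at two rather than
one.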