Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!harbinger.cc.monash.edu.au!nntp.coast.net!news.kei.com!newsfeed.internetmci.com!inet-nntp-gw-1.us.oracle.com!news.caldera.com!news.cc.utah.edu!park.uvsc.edu!usenet
From: Terry Lambert <terry@lambert.org>
Newsgroups: comp.unix.bsd.freebsd.misc,comp.os.linux.development.system
Subject: Re: The better (more suitable)Unix?? FreeBSD or Linux
Date: 14 Feb 1996 21:45:40 GMT
Organization: Utah Valley State College, Orem, Utah
Lines: 84
Message-ID: <4ftl64$fjs@park.uvsc.edu>
References: <4er9hp$5ng@orb.direct.ca> <311250C2.2781E494@public.uni-hamburg.de> <strenDM7Gr4.Cn2@netcom.com> <DMD8rr.oIB@isil.lloke.dna.fi> <4f9skh$2og@dyson.iquest.net> <DMI5Mt.768@pe1chl.ampr.org> <4fophn$ahl@park.uvsc.edu> <DMrCtI.3KC@pe1chl.ampr.org>
NNTP-Posting-Host: hecate.artisoft.com
Xref: euryale.cc.adfa.oz.au comp.unix.bsd.freebsd.misc:13869 comp.os.linux.development.system:17432

rob@pe1chl.ampr.org (Rob Janssen) wrote:
] >You are confusing file system structure errors with file system
] >content errors.
]
] >File system content errors are the responsibility of an application,
] >unless you go to a log structured file system with user accessable
] >transaction tracking interfaces into the log to ensure implied
] >state across multi-file applications is also consistent.
]
] No, I am referring to the situation where an application has written
] data to a file, the system crashes, and then the file contains other
] (garbage) data after restart.  While fsck reports no errors.

This is a situation which is possible only through administrative
error and acceptance of non-default actions in the process of a
fsck.  The block which was not written should not have been present
in the bitmap, and thus the file should not have referenced the
block unless the administrator overrode the default in a manual fsck.

] I once spent quite some time tracking down why UUCP was hanging.  The
] system had crashed at the moment uux had created a lockfile.
] A file with 10 bytes of binary garbage existed on the disk after the
] restart.
] This is clearly an indication of this problem.  The file would not be
] 10 bytes if the application hadn't done the correct write (and probably
] even the close), yet the data was not the expected ASCII PID.
] What made this one nasty, is that the UUCP programs read the file, do an
] atoi() on it, and then use kill() to check if this PID is existing to
] know if the lockfile is valid or stale.  This failed to work because the
] atoi returned zero, UUCP (Taylor) did a kill(-1,0) which of course
] succeeded and thus the lockfile was assumed to be valid and never
] removed.

The call would have been "kill( 0, 0)" if atoi returned 0, which would,
of course, have returned 0 (success).

Bogus lockfile data is the reason modern implementations use a 4 byte
binary integer instead of an ASCII representation of the number.

At the very least, a well-written program would check for the special
cases and ensure the PID was > 1 to avoid kill() side-effects.  At
worst, it would have done an isdigit() on the least significant digit
in the lockfile read buffer.

That aside, a system crash could only result in truncated file
contents if the proper administrative recovery options were taken:
for a 10 byte file, it was either an immediate file (the 10 bytes
were in the inode and so were valid) or it was a direct block (the
block pointer is required to point to a buffer in the block allocation
bitmap).  For a newly allocated block, the block allocation bitmap is
required to show the block as unallocated *in the on disk copy*, which
was written using a synchronous write and so is known to be on disk,
before the allocation is allowable.  Therefore the administrator must
have set the "allocated" bit during the fsck (overriding the default
action), rather than causing the inode to have the block revoked.
Then the administrator would have had to ignore the warning message
that the file had no blocks allocated to it, letting the atoi() in the
program retrieve a zero-filled page, and issue an additional override
to keep the file.

You were bitten by administrator error, in combination with several
application errors, not a "sync vs. async" error.

PS: What happened to the commands in your /etc/rc to remove lock files
following a reboot?

PPS: Why was the application violating the UUCP locking protocol by
not ignoring the lockfile if it was older than 90 minutes anyway?


                                        Terry Lambert
                                        terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.
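
The application-side checks mentioned in the post (reject a lockfile
whose contents are not a plausible PID, never hand a PID <= 1 to
kill(), and ignore a lock older than 90 minutes) could be sketched
roughly as below.  This is only an illustration under stated
assumptions, not code from Taylor UUCP or any other UUCP
implementation: the names parse_lock_pid(), lock_holder_alive() and
lock_too_old() are hypothetical, and the lockfile is assumed to hold
an ASCII PID, optionally space padded and newline terminated.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper: parse an ASCII PID out of a lockfile buffer.
 * Returns the PID, or -1 if the buffer is not a plausible PID
 * (binary garbage, zero-filled data, 0, 1, or a negative number). */
static long parse_lock_pid(const char *buf, size_t len)
{
    char tmp[32];
    char *end;
    long val;
    size_t i = 0, n, used;

    while (i < len && buf[i] == ' ')      /* skip space padding */
        i++;
    n = len - i;
    if (n == 0 || n >= sizeof tmp)
        return -1;
    if (!isdigit((unsigned char)buf[i]))  /* must start with a digit */
        return -1;

    memcpy(tmp, buf + i, n);
    tmp[n] = '\0';

    errno = 0;
    val = strtol(tmp, &end, 10);
    if (errno != 0)
        return -1;

    /* Everything after the digits must be a space or newline; this
     * also rejects embedded NUL bytes followed by garbage. */
    for (used = (size_t)(end - tmp); used < n; used++)
        if (tmp[used] != ' ' && tmp[used] != '\n')
            return -1;

    if (val <= 1 || val > INT_MAX)        /* avoid kill(0,..), kill(1,..) */
        return -1;
    return val;
}

/* Hypothetical helper: 1 if a process with this PID exists, else 0. */
static int lock_holder_alive(long pid)
{
    if (pid <= 1)
        return 0;
    if (kill((pid_t)pid, 0) == 0)
        return 1;
    return errno == EPERM;  /* exists, but we may not signal it */
}

/* Hypothetical helper: 1 if the lock is older than 90 minutes. */
static int lock_too_old(time_t mtime, time_t now)
{
    return difftime(now, mtime) > 90.0 * 60.0;
}
```

A caller would treat the lock as stale when parse_lock_pid() returns
-1, when lock_holder_alive() returns 0, or when lock_too_old() returns
1 for the file's modification time.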