*BSD News Article 61525

Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!harbinger.cc.monash.edu.au!nntp.coast.net!news.kei.com!newsfeed.internetmci.com!inet-nntp-gw-1.us.oracle.com!news.caldera.com!news.cc.utah.edu!park.uvsc.edu!usenet
From: Terry Lambert <terry@lambert.org>
Newsgroups: comp.unix.bsd.freebsd.misc,comp.os.linux.development.system
Subject: Re: The better (more suitable)Unix?? FreeBSD or Linux
Date: 14 Feb 1996 21:45:40 GMT
Organization: Utah Valley State College, Orem, Utah
Lines: 84
Message-ID: <4ftl64$fjs@park.uvsc.edu>
References: <4er9hp$5ng@orb.direct.ca> <311250C2.2781E494@public.uni-hamburg.de> <strenDM7Gr4.Cn2@netcom.com> <DMD8rr.oIB@isil.lloke.dna.fi> <4f9skh$2og@dyson.iquest.net> <DMI5Mt.768@pe1chl.ampr.org> <4fophn$ahl@park.uvsc.edu> <DMrCtI.3KC@pe1chl.ampr.org>
NNTP-Posting-Host: hecate.artisoft.com
Xref: euryale.cc.adfa.oz.au comp.unix.bsd.freebsd.misc:13869 comp.os.linux.development.system:17432

rob@pe1chl.ampr.org (Rob Janssen) wrote:
] >You are confusing file system structure errors with file system
] >content errors.
] 
] >File system content errors are the responsibility of an application,
] >unless you go to a log structured file system with user accessable
] >transaction tracking interfaces into the log to ensure implied
] >state across multi-file applications is also consistent.
] 
] No, I am referring to the situation where an application has written
] data to a file, the system crashes, and then the file contains other
] (garbage) data after restart.  While fsck reports no errors.

This is a situation which is possible only through administrative
error and acceptance of non-default actions in the process of a
fsck.

The block which was not written should not have been present in
the bitmap, and thus the file should not have referenced the block
unless the administrator overrode the default in a manual fsck.


] I once spent quite some time tracking down why UUCP was hanging.  The
] system had crashed at the moment uux had created a lockfile.  A file
] with 10 bytes of binary garbage existed on the disk after the restart.
] This is clearly an indication of this problem.  The file would not be
] 10 bytes if the application hadn't done the correct write (and probably
] even the close), yet the data was not the expected ASCII PID.
] What made this one nasty, is that the UUCP programs read the file, do an
] atoi() on it, and then use kill() to check if this PID is existing to
] know if the lockfile is valid or stale.  This failed to work because the
] atoi returned zero, UUCP (Taylor) did a kill(-1,0) which of course
] succeeded and thus the lockfile was assumed to be valid and never
] removed.

The call would have been "kill( 0, 0)" if atoi returned 0, which
would, of course, have returned 0 (success).
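
To make the failure concrete, here is a minimal sketch of the kind of
check being described. The function name naive_lock_is_valid() is
hypothetical and this is not the Taylor UUCP source; it only assumes
the lock file contents have been read into a NUL-terminated buffer.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdlib.h>

/* Hypothetical reconstruction of the naive check: convert whatever is
 * in the lock file with atoi() and probe the result with signal 0. */
int naive_lock_is_valid(const char *contents)
{
    int pid;

    assert(contents != NULL);
    pid = atoi(contents);        /* non-numeric garbage converts to 0 */

    /* kill(0, 0) probes the caller's own process group, which always
     * contains the caller, so the probe succeeds and the lock is
     * wrongly treated as held by a live process. */
    return kill(pid, 0) == 0;
}
```

With a buffer of binary garbage, atoi() yields 0, the probe succeeds,
and the stale lock is never removed.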

Bogus lockfile data is the reason modern implementations use a
4 byte binary integer instead of an ASCII representation of the
number.  At the very least, a well-written program would check
for the special cases and ensure the PID was > 1 to avoid kill()
side-effects.  At worst, it would have done an isdigit() on the
least significant digit in the lockfile read buffer.
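
A sketch of the defensive version, assuming an ASCII lock file whose
contents have already been read into a NUL-terminated buffer. The
names parse_lock_pid() and lock_holder_alive() are made up for
illustration, and strtol() is used in place of atoi() so that garbage
can be told apart from a real zero.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: return the PID named by the buffer, or -1 if
 * the contents are not a plain decimal number greater than 1. */
pid_t parse_lock_pid(const char *buf)
{
    char *end;
    long val;

    assert(buf != NULL);
    while (isspace((unsigned char)*buf))
        buf++;
    if (!isdigit((unsigned char)*buf))
        return -1;                      /* garbage, empty, or signed */

    errno = 0;
    val = strtol(buf, &end, 10);
    if (errno == ERANGE || val > INT_MAX)
        return -1;                      /* too large to be a PID */
    while (isspace((unsigned char)*end))
        end++;
    if (*end != '\0')
        return -1;                      /* trailing junk after digits */
    if (val <= 1)
        return -1;                      /* 0 and 1 are special to kill() */
    return (pid_t)val;
}

/* Hypothetical helper: 1 if the process exists, 0 otherwise. */
int lock_holder_alive(pid_t pid)
{
    if (pid <= 1)
        return 0;
    if (kill(pid, 0) == 0)
        return 1;
    return errno == EPERM;              /* exists, but owned by someone else */
}
```

A caller would treat a -1 from parse_lock_pid() as a stale lock and
remove the file instead of probing with kill().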


That aside, a system crash could only result in truncated file
contents if the proper administrative recovery options were
taken: for a 10 byte file, it was either an immediate file
(the 10 bytes were in the inode and so were valid) or it was
a direct block (the block pointer is required to point to a
buffer in the block allocation bitmap).

For a newly allocated block, the block allocation bitmap is
required to show the block as unallocated *in the on disk copy*,
which was written using a synchronous write and so is known to
be on disk, before the allocation is allowable.

Therefore the administrator must have set the "allocated" bit
during the fsck (overriding the default action), rather than
causing the inode to have the block revoked.

Then the administrator would have had to ignore the warning
message that the file had no blocks allocated to it, letting
the atoi() in the program retrieve a zero-filled page, and issue
an additional override to keep the file.

You were bitten by administrator error, in combination with
several application errors, not a "sync vs. async" error.


PS:	What happened to the commands in your /etc/rc to remove
	lock files following a reboot?

PPS:	Why was the application violating the UUCP locking
	protocol by not ignoring the lockfile if it was older
	than 90 minutes anyway?
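
For what it is worth, the age check is only a few lines. This is a
sketch, not code from any UUCP package: the 90 minute limit is the
figure quoted above, the function names are hypothetical, and the
file's modification time is assumed to be the right timestamp to
compare against.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/stat.h>
#include <time.h>

#define LOCK_MAX_AGE_SECS (90 * 60)     /* 90 minutes, per the text above */

/* Hypothetical helper: 1 if a lock last modified at mtime is older
 * than the limit at time now, 0 otherwise. */
int lock_age_is_stale(time_t mtime, time_t now)
{
    return difftime(now, mtime) > LOCK_MAX_AGE_SECS;
}

/* Hypothetical helper: 1 if the lock file is stale, 0 if it is still
 * fresh, -1 if it cannot be examined (for example, it does not exist). */
int lock_file_is_stale(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    return lock_age_is_stale(st.st_mtime, time(NULL));
}
```

A lock that tests stale this way can be ignored regardless of what
its contents say, which sidesteps the garbage PID problem entirely.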


                                        Terry Lambert
                                        terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.