Working around a flaky LMFS disk
Date: Wed, 21 Sep 88 13:25 EDT
From: Reti@riverside.scrc.symbolics.com
Date: Tue, 20 Sep 88 11:30 EDT
From: barmar@Think.COM (Barry Margolin)
The LMFS free list is an internal file of bits; if it got errors, you wouldn't
be able to bring up the LMFS. Is this the case?
What I meant was that it includes free blocks, i.e. blocks that are ON
the free list. I haven't seen any errors that I could attribute to the
freelist bitmap itself having been trashed.
The bitmap is redundant information, in the sense that you could walk over
all the files in the partition and reconstruct it. This is exactly what
the salvager does.
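
The reconstruction idea can be sketched roughly as follows. This is Python rather than Lisp, and the in-memory layout (a dictionary mapping each file to the block numbers it occupies) is a made-up stand-in, not the real LMFS structures or the actual salvager code.

```python
def rebuild_free_bitmap(total_blocks, files):
    """Return a list of booleans, True meaning the block is free.

    `files` maps a file name to the list of block numbers it occupies.
    Both the mapping and the block numbering are hypothetical.
    """
    free = [True] * total_blocks
    for name, blocks in files.items():
        for b in blocks:
            if not 0 <= b < total_blocks:
                raise ValueError(f"{name}: block {b} is outside the partition")
            if not free[b]:
                raise ValueError(f"{name}: block {b} is claimed twice")
            free[b] = False  # any block owned by a file is not free
    return free


# Hypothetical partition of 8 blocks holding two files.
files = {"a.text": [0, 1, 4], "b.lisp": [2, 6]}
bitmap = rebuild_free_bitmap(8, files)
free_blocks = [i for i, is_free in enumerate(bitmap) if is_free]
print(free_blocks)
```

Every block that no file claims ends up marked free, which is why a damaged bitmap by itself loses no information.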
LMFS is not very good at dealing with hard disk errors in its
partitions. The only tool available is LMFS:FIX-FILE, and it is only
prepared to deal with simple problems like an ECC error. It drops into
the debugger when handed one of the affected files.
What kind of error does it get? I'd expect different sorts of errors depending
upon whether or not the header of the file was in the bad section of the
disk. I think I remember FIX-FILE dealing correctly with a search error
once in the past.
It appears to die when trying to CLOSE the file. I think this may have
to do with the fact that another process had the file open at the time.
It tries to do a :FORCE-OUTPUT on the stream and loses when that tries
to write to a bad block. I believe I sent a backtrace of this to
Customer-Reports.
As I understand it, it is not possible to use SI:FIX-FEP-FILE on an LMFS
partition if you want LMFS to be able to use it afterward. LMFS
maintains linear offsets into the partition files, so splicing out a bad
block would cause many of these offsets to be incorrect.
Yes; however, you could replace the block with a newly-allocated block of
zeros instead of splicing it out, and that would keep the addresses in sync.
At least some in-house versions of the tool let you do that; I don't know
whether the shipped version in whichever release you are running does.
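
A toy illustration of why zeroing keeps the addresses in sync while splicing does not. It is written in Python with a partition modelled as a plain list of blocks; the block size, contents, and function names are all invented for the example and say nothing about how FIX-FEP-FILE is actually implemented.

```python
BLOCK_SIZE = 4  # hypothetical, tiny so the example is readable


def splice_out(blocks, bad):
    """Drop the bad block; every later block shifts down by one."""
    return blocks[:bad] + blocks[bad + 1:]


def zero_in_place(blocks, bad):
    """Replace the bad block with zeros; later blocks keep their offsets."""
    out = list(blocks)
    out[bad] = bytes(BLOCK_SIZE)
    return out


blocks = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
stored_offset = 3  # an offset recorded elsewhere that points at b"DDDD"
bad = 1

zeroed = zero_in_place(blocks, bad)
spliced = splice_out(blocks, bad)

print(zeroed[stored_offset])         # still the block the offset meant
print(len(spliced) > stored_offset)  # the old offset now runs off the end
```

After splicing, any stored offset past the bad block points one block too far, which is the mismatch described above.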
Is that what the "Zero" response in FIX-FEP-FILE does? I've always
assumed that it tried to Zero the block itself. In that case it sounds
reasonable. However, I suspect that some of the FEP's free pages are
also on the bad surface, and FIX-FEP-FILE doesn't check a free page
before splicing it in.
Has anyone got any suggestions before I reinitialize the entire LMFS and
start a reload?
Yes; I'd write a function to copy as many blocks of the LMFS file as
possible to another known-good disk (zeroing the blocks that it can't read
from the bad disk), then change the FSPT to point to the copy instead of the
original disk. Upon reinitializing LMFS afterwards, you should have a
situation where the LMFS is consistent and doesn't get hardware disk errors,
but will have these 'mysterious' sections of zeros. LMFS:FIX-FILE ought to be
able to deal with these.
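
The shape of such a copy function might look like the sketch below. It is Python, not Lisp, and the `read_block` and `write_block` callables are hypothetical stand-ins for whatever disk I/O primitives are really available; a read failure is assumed to surface as an `OSError`.

```python
def copy_with_zero_fill(read_block, write_block, n_blocks, block_size):
    """Copy blocks 0..n_blocks-1, writing zeros for any unreadable block.

    Returns the list of block numbers that could not be read, so the
    affected files can be tracked down afterwards.
    """
    zeros = bytes(block_size)
    unreadable = []
    for i in range(n_blocks):
        try:
            data = read_block(i)
        except OSError:
            data = zeros  # keep the block in place so offsets still match
            unreadable.append(i)
        write_block(i, data)
    return unreadable


# Simulated bad disk: block 2 always fails to read.
source = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
target = {}


def read_block(i):
    if i == 2:
        raise OSError("simulated hard read error")
    return source[i]


def write_block(i, data):
    target[i] = data


bad_blocks = copy_with_zero_fill(read_block, write_block, len(source), 4)
print(bad_blocks)
```

Keeping the list of unreadable block numbers makes it possible to work out later which files contain the zeroed sections.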
I'll give some of these things a try.
barmar