
Working around a flaky LMFS disk



    Date: Wed, 21 Sep 88 13:25 EDT
    From: Reti@riverside.scrc.symbolics.com

	Date: Tue, 20 Sep 88 11:30 EDT
	From: barmar@Think.COM (Barry Margolin)

    The LMFS free list is an internal file of bits; if it got errors, you wouldn't
    be able to bring up the LMFS.  Is this the case?

What I meant was that it includes free blocks, i.e. blocks that are ON
the free list.  I haven't seen any errors that I could attribute to the
free list bitmap itself having been trashed.

    The bitmap is redundant information, in the sense that you could walk over
    all the files in the partition and reconstruct it.  This is exactly what
    the salvager does.
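
To make the reconstruction idea concrete, here is a minimal sketch in Python rather than Lisp. It is not the salvager's actual code: `rebuild_free_bitmap` and the `files` mapping (file name to list of block numbers) are hypothetical stand-ins for walking the real file headers.

```python
def rebuild_free_bitmap(total_blocks, files):
    """Return a list of booleans, one per block; True means the block is free.

    `files` is a hypothetical mapping of file name -> list of block numbers
    the file occupies, standing in for a walk over every file in the partition.
    """
    free = [True] * total_blocks
    for name, blocks in files.items():
        for block in blocks:
            if not 0 <= block < total_blocks:
                raise ValueError(f"{name}: block {block} is outside the partition")
            if not free[block]:
                raise ValueError(f"{name}: block {block} is claimed twice")
            free[block] = False  # every block owned by a file is in use
    return free


bitmap = rebuild_free_bitmap(6, {"a": [0, 2], "b": [5]})
```

Any block not claimed by some file ends up marked free, which is why the stored bitmap carries no information that the files themselves do not.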

	LMFS is not very good at dealing with hard disk errors in its
	partitions.  The only tool available is LMFS:FIX-FILE, and it is only
	prepared to deal with simple problems like an ECC error.  It drops into
	the debugger when handed one of the affected files.

    What kind of error does it get?  I'd expect different sorts of errors depending
    upon whether or not the header of the file was in the bad section of the
    disk.  I think I remember FIX-FILE dealing correctly with a search error
    once in the past.

It appears to die when trying to CLOSE the file.  I think this may have
to do with the fact that another process had the file open at the time.
It tries to do a :FORCE-OUTPUT on the stream and loses when that tries
to write to a bad block.  I believe I sent a backtrace of this to
Customer-Reports.

	As I understand it, it is not possible to use SI:FIX-FEP-FILE on an LMFS
	partition if you want LMFS to be able to use it afterward.  LMFS
	maintains linear offsets into the partition files, so splicing out a bad
	block would cause many of these offsets to be incorrect.

    Yes; however, you could replace the block with a newly-allocated block of
    zeros instead of splicing it out, and that would keep the addresses in sync.
    At least some in-house versions of the tool let you do that; I don't know
    whether the shipped version in whichever release you are running does.
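
The difference between splicing and zeroing can be shown with a toy model, sketched here in Python. The block size, block contents, and function names are all invented for illustration and have nothing to do with the real FEP file layout.

```python
BLOCK_SIZE = 4  # tiny block size, purely for illustration


def splice_out(blocks, bad):
    """Drop the bad block; every later block moves down by one position."""
    return blocks[:bad] + blocks[bad + 1:]


def zero_in_place(blocks, bad):
    """Substitute a block of zeros; every other block keeps its position."""
    return blocks[:bad] + [bytes(BLOCK_SIZE)] + blocks[bad + 1:]


blocks = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
offset_of_d = 3  # a stored linear offset that points at the "DDDD" block

spliced = splice_out(blocks, 1)
zeroed = zero_in_place(blocks, 1)
```

After zeroing, `zeroed[offset_of_d]` is still `b"DDDD"`. After splicing, the same data sits at index 2 and the stored offset 3 no longer refers to any block, which is the problem with linear offsets described above.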

Is that what the "Zero" response in FIX-FEP-FILE does?  I've always
assumed that it tried to Zero the block itself.  In that case it sounds
reasonable.  However, I suspect that some of the FEP's free pages are
also on the bad surface, and FIX-FEP-FILE doesn't check a page before
splicing it in.

	Has anyone got any suggestions before I reinitialize the entire LMFS and
	start a reload?

    Yes; I'd write a function to copy as many blocks of the lmfs file as
    possible to another known-good disk (zeroing the blocks which it can't read
    from the bad disk), then change the FSPT to point to the copy instead of the
    original disk.  Upon reinitializing LMFS afterwards, you should have a 
    situation where the lmfs is consistent, doesn't get hardware disk errors but
    will have these 'mysterious' sections of zeros.  LMFS:FIX-FILE ought to be
    able to deal with these.
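
The shape of such a copy function might look like the following sketch, again in Python rather than Lisp. The `read_block` and `write_block` callables and the dictionary-backed "disks" are assumptions made so the example runs on its own; a real version would use the machine's disk I/O and would signal read failures in its own way.

```python
def copy_partition(read_block, write_block, n_blocks, block_size):
    """Copy n_blocks blocks, writing zeros for any block that cannot be read.

    Returns the list of block numbers that had to be zeroed, so the
    affected files can be tracked down afterwards.
    """
    zeroed = []
    for i in range(n_blocks):
        try:
            data = read_block(i)
        except OSError:  # assumed here to be how a hard read error shows up
            data = bytes(block_size)
            zeroed.append(i)
        write_block(i, data)  # block i always lands at position i
    return zeroed


# Simulated disks: None marks a block on the bad surface.
source = {0: b"aaaa", 1: None, 2: b"cccc"}
target = {}


def read_block(i):
    if source[i] is None:
        raise OSError(f"unrecoverable read error on block {i}")
    return source[i]


def write_block(i, data):
    target[i] = data


zeroed_blocks = copy_partition(read_block, write_block, n_blocks=3, block_size=4)
```

Because every block is written at the same index it was read from, the offsets inside the partition stay valid, and the returned list says where the sections of zeros will be.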

I'll give some of these things a try.

                                                barmar