
Working around a flaky LMFS disk

    Date: Tue, 20 Sep 88 11:30 EDT
    From: barmar@Think.COM (Barry Margolin)

    Our main LMFS file server has developed a problem with one of its disks,
    and Symbolics Customer Support has so far been unable to provide me with
    any suggestions better than a full reload from backups.  I figured that
    before I undertook this drastic measure, I'd see if anyone else has
    dealt with this kind of problem before.

    Configuration: Symbolics 3650 with five external CDC-515 disks
    containing LMFS partitions, totalling about 2.3 gigabytes.

    Symptom: any attempt to access unit 2, any cylinder, surface 17 results
    in a %DISK-ERROR-SEARCH error.
Sounds like a busted head.  Are you sure you can't have it fixed?  There
have been cases of people sending sealed disk enclosures (DEs) back to the
factory, where they are unsealed in a clean environment, fixed, resealed
and shipped back.  (I don't know the procedure or how much it costs, but
it has been done in the past.)

    Symbolics disk addresses are allocated in cylinder-major order, which
    means that 24 adjacent addresses out of every 576 are affected, i.e. the
    bad spots are scattered throughout the LMFS partition on that disk,
    including the LMFS free list.
The LMFS free list is an internal file of bits; if it got errors, you wouldn't
be able to bring up the LMFS.  Is this the case?
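
To make the quoted arithmetic concrete, here is a small Python sketch of
which linear disk addresses land on the bad surface.  The geometry is an
assumption read off the "24 out of every 576" figure (24 blocks per track,
hence 576 / 24 = 24 surfaces per cylinder, both numbered from 0); the
function names are made up for illustration.  Offsets within the LMFS
partition would also have to be adjusted by wherever the partition starts
on the disk.

```python
# Assumed geometry, inferred from "24 adjacent addresses out of every 576".
BLOCKS_PER_TRACK = 24
BLOCKS_PER_CYLINDER = 576
SURFACES = BLOCKS_PER_CYLINDER // BLOCKS_PER_TRACK  # 24 surfaces per cylinder


def surface_of(address):
    """Surface holding a linear disk address under cylinder-major allocation."""
    return (address % BLOCKS_PER_CYLINDER) // BLOCKS_PER_TRACK


def is_affected(address, bad_surface=17):
    """True if the address falls on the bad surface."""
    return surface_of(address) == bad_surface


def bad_addresses(cylinder, bad_surface=17):
    """The run of adjacent addresses on the bad surface of one cylinder."""
    start = cylinder * BLOCKS_PER_CYLINDER + bad_surface * BLOCKS_PER_TRACK
    return range(start, start + BLOCKS_PER_TRACK)
```

Under those assumptions the bad run in each cylinder is the 24 addresses
starting at offset 17 * 24 = 408 within that cylinder.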

The bitmap is redundant information, in the sense that you could walk over
all the files in the partition and reconstruct it.  This is exactly what
the salvager does.
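
As a rough illustration of why the bitmap is redundant, here is a toy
Python model of rebuilding it from the files themselves.  This is not the
salvager's actual algorithm or data layout; the block-list representation
and the function name are hypothetical.

```python
def rebuild_free_bitmap(total_blocks, files):
    """Rebuild a free-block bitmap by walking every file's block list.

    files maps a file name to the block numbers that file occupies
    (a hypothetical stand-in for walking the real file headers).
    Returns a list of booleans; True means the block is free.
    """
    free = [True] * total_blocks
    for name, blocks in files.items():
        for block in blocks:
            if not 0 <= block < total_blocks:
                raise ValueError(f"{name}: block {block} is outside the partition")
            if not free[block]:
                raise ValueError(f"{name}: block {block} is claimed twice")
            free[block] = False
    return free
```

Anything no file claims comes out free, so a damaged bitmap costs a walk
over the partition rather than any data.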

    LMFS is not very good at dealing with hard disk errors in its
    partitions.  The only tool available is LMFS:FIX-FILE, and it is only
    prepared to deal with simple problems like an ECC error.  It drops into
    the debugger when handed one of the affected files.
What kind of error does it get?  I'd expect different sorts of errors depending
upon whether or not the header of the file was in the bad section of the
disk.  I think I remember FIX-FILE dealing correctly with a search error
once in the past.

    As I understand it, it is not possible to use SI:FIX-FEP-FILE on a LMFS
    partition if you want LMFS to be able to use it afterward.  LMFS
    maintains linear offsets into the partition files, so splicing out a bad
    block would cause many of these offsets to be incorrect.
Yes; however, you could replace the block with a newly-allocated block of
zeros instead of splicing it out, and that would keep the addresses in sync.
At least some in-house versions of the tool let you do that; I don't know
whether the shipped version in whichever release you are running does.
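
The difference between the two repairs can be seen in a toy Python model,
where the partition is just a list of blocks and a linear offset is a list
index.  The block size and function names are invented for the example;
this is not how SI:FIX-FEP-FILE is implemented.

```python
BLOCK_SIZE = 4  # toy block size in bytes, purely illustrative


def splice_out(partition, bad):
    """Drop the bad block: every later block moves down by one offset."""
    return partition[:bad] + partition[bad + 1:]


def replace_with_zeros(partition, bad):
    """Swap in a fresh block of zeros: every other block keeps its offset."""
    return partition[:bad] + [bytes(BLOCK_SIZE)] + partition[bad + 1:]
```

With four blocks and block 1 bad, splicing leaves three blocks and moves
the old block 3 to offset 2, while zero replacement leaves it at offset 3.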

    Has anyone got any suggestions before I reinitialize the entire LMFS and
    start a reload?
Yes; I'd write a function to copy as many blocks of the LMFS file as
possible to another known-good disk (zeroing the blocks which it can't
read from the bad disk), then change the FSPT to point to the copy instead
of the original disk.  After reinitializing LMFS, you should have a file
system that is consistent and no longer gets hardware disk errors, but that
has these 'mysterious' sections of zeros.  LMFS:FIX-FILE ought to be
able to deal with these.
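
The shape of that copy function, sketched in Python rather than Lisp.  The
read_block and write_block callables, the DiskReadError exception, and the
block size are all hypothetical stand-ins for whatever disk primitives are
actually available; only the copy-or-zero logic is the point.

```python
BLOCK_SIZE = 1024  # hypothetical block size in bytes


class DiskReadError(Exception):
    """Stand-in for a hard read error such as a search error."""


def copy_with_zero_fill(read_block, write_block, n_blocks, block_size=BLOCK_SIZE):
    """Copy blocks 0..n_blocks-1, writing zeros where the source is unreadable.

    read_block(i) returns the bytes of block i or raises DiskReadError.
    write_block(i, data) stores data at the same block number on the target.
    Returns the list of block numbers that had to be zeroed.
    """
    unreadable = []
    for i in range(n_blocks):
        try:
            data = read_block(i)
        except DiskReadError:
            data = bytes(block_size)  # keep the offset, lose the contents
            unreadable.append(i)
        write_block(i, data)
    return unreadable
```

Keeping the list of zeroed blocks tells you which files are worth handing
to LMFS:FIX-FILE afterwards.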