
LMFS emergency!



    Date: Mon, 12 Aug 91 09:39:31 EDT
    From: bart@nynexst.com (Bart Burns)

    The LMFS for our file server started weirding out and finally bit the dust.

    First I started getting messages like:

	       File damaged, check words do not match
		  Expected: 12526520651  1664   ...
		  Actual:        0        0    ... 0
Were the actuals always 0?  If so, there are three possibilities: someone zeroed your disk, your
hardware broke in such a way that it returns 0's for all data read, or someone wrote a FEP file
with zeros in it over blocks shared by your LMFS partition (see below).

If the actuals were sometimes non-zero, their contents would have given a clue
as to what might be going wrong.
		  .
		  .
		  .
	       Cannot read record #o74132 of partition 4
	       (fep1:>lmfs4.file.1)

    All of these messages were from files in the same partition (fep1:>lmfs4.file.1),
    so I ran the salvager (over the entire LMFS so it had a chance to fix what it could),
    but the salvager output (1,300 blocks' worth) indicated to me that it wasn't able to do much.
Had someone recently created a large FEP file on that partition?  Had (SI:VERIFY-FEP-FILESYSTEM 1)
been run recently?  If so, what did it say?

    The next day I had to shut down all our machines because the AC went down.
What bad luck; LMFS makes a lot of checks when starting up to ensure that everything's copasetic,
so with a broken disk you often have trouble starting LMFS, but can usually limp along once it's
been started...

    When I rebooted the file server (and tried to access lmfs) it
    replied with a LMFS-BAD-PARTITION-LABEL error 

		  Error: Bad label version on file partition fep0:>lmfs1.file.1

		  (from stack frame of LMFS:ensure-lmfs-up)

	    please note: a DIFFERENT partition (and a different disk drive) is gonzo here!
This is one of the checks I was talking about; the partition label had garbage in it.


    1. What can I do, short of reinitializing LMFS with new partitions (losing the old ones) and
       hoping my latest complete backup tape (Exabyte 1.5 gigabyte) will restore?
First verify that the disk hardware is actually working correctly and that the FEPFS isn't corrupted.
I've seen errors similar to yours after a system crash when FEP blocks actually in use by a file are
marked as free, so writing a new FEP file will scribble over the old one.  If the old one happened
to be a LMFS partition, you're pretty much out of luck (i.e. the data that was there has been overwritten).
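
The failure mode described above, where two FEP files end up claiming the same disk blocks, can be illustrated with a small sketch. This is generic Python, not Genera code, and it does not claim to show how SI:VERIFY-FEP-FILESYSTEM works internally. The extent table (file name mapped to a list of (first_block, block_count) pairs) and the function name `find_shared_blocks` are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical layout table: file name -> list of (first_block, block_count) extents.
Extents = Dict[str, List[Tuple[int, int]]]


def find_shared_blocks(files: Extents) -> List[Tuple[str, str, int, int]]:
    """Return (file_a, file_b, first_shared, last_shared) for every pair of
    extents, belonging to different files, that cover the same blocks."""
    spans = []
    for name, extents in files.items():
        for start, count in extents:
            if count > 0:
                spans.append((start, start + count - 1, name))
    spans.sort()  # order by first block so overlaps are adjacent in the scan

    overlaps = []
    for i, (start_a, end_a, name_a) in enumerate(spans):
        for start_b, end_b, name_b in spans[i + 1:]:
            if start_b > end_a:
                break  # sorted by start: no later span can overlap span a
            if name_a != name_b:
                overlaps.append((name_a, name_b, start_b, min(end_a, end_b)))
    return overlaps
```

With a made-up table in which a newly written file was allocated blocks 140 to 159 while an older file still occupies 100 to 149, the function reports blocks 140 to 149 as shared, which is the range that writing the new file would have overwritten.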

    2. Does this sound like a disk hardware failure?
It might.  If the FEPFS checks out, and if you can consistently read the same data from the same place
on disk (and read back what you wrote), then it probably isn't hardware.
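
The "consistently read the same data from the same place" test can be sketched as follows. This is a minimal Python illustration under stated assumptions, not a Genera diagnostic: `read_block` is a hypothetical callable that returns the raw bytes of a given block number, and `inconsistent_blocks` is a name invented here. The write-and-read-back half of the test is left out because it is destructive.

```python
import hashlib
from typing import Callable, Iterable, List


def inconsistent_blocks(
    read_block: Callable[[int], bytes],  # hypothetical raw-block reader
    block_numbers: Iterable[int],
    passes: int = 3,
) -> List[int]:
    """Read each block `passes` times and return the block numbers whose
    contents differed between reads."""
    bad = []
    for n in block_numbers:
        # Hash each read; more than one distinct digest means the data changed.
        digests = {hashlib.sha256(read_block(n)).digest() for _ in range(passes)}
        if len(digests) > 1:
            bad.append(n)
    return bad
```

An empty result only says that repeated reads agree. A disk that consistently returns the wrong data (for example, all zeros) would still pass, so this complements rather than replaces a file system verification.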

Did you grow any of your LMFS partitions in the recent past?  There are persistent reports from the
field that this sometimes causes the FEPFS to be messed up in the way I described above.  Unfortunately,
I've never been able to reproduce this (not from lack of trying).  Again, SI:VERIFY-FEP-FILESYSTEM will
tell you if it is messed up.

    help!


    thanks,
    Bart.