
LMFS emergency!



   Date: Mon, 12 Aug 91 09:39:31 EDT
   From: Bart Burns <bart@nynexst.com>

   The LMFS for our file server started weirding out and finally bit the dust.

   First I started getting messages like:

	      File damaged, check words do not match
		 Expected: 12526520651  1664   ...
		 Actual:        0        0    ... 0
		 .
		 .
		 .
	      Cannot read record #o74132 of partition 4
	      (fep1:>lmfs4.file.1)


   All of these messages were from files in the same partition (fep1:>lmfs4.file.1),
   so I ran the salvager (over the entire lmfs so it had a chance to fix what it could),
   but the salvager output (1,300 blocks worth) indicated to me that it wasn't able to do muc


Other people, I'm sure, will tell you methods to recover from this.
The salvager will not fix these, though.

   When I rebooted the file server (and tried to access lmfs) it
   replied with a LMFS-BAD-PARTITION-LABEL error 

		 Error: Bad label version on file partition fep0:>lmfs1.file.1

		 (from stack frame of LMFS:ensure-lmfs-up)

	   please note: a DIFFERENT partition (and a different disk drive) is gonzo here!
 
I ran into one of these before! It ultimately was a bent pin on the
disk cable header on the disk. ECC with a bent pin is very interesting
to debug. Anyway, the lesson I learned was to make sure your h/w, like
cabling, is ok first, then do the s/w recovery.

Fortunately, partition info is pretty robust. The lmfs code does some
checking for legit partition(s) when it brings up the lmfs. You want to
find out where in the process it is losing by instrumenting the
bring-up code. This will tell you what is bad in the partition header.
Now, there is code in the system to make a new one. You want to make a
new one with attribute values from the old. The chaining-in of the new
partition was not a problem, as I recall. Not all the info in the
header is ever used, so you can fake some of the entries.

One thing is critical, and if you have problems in it you have to be
careful: the free-records bit array on disk. If this is messed up, you
may have to build a new one. You can edit and browse the lmfs file
structures with a built-in lmfs editor called DDT. Read the code for
lmfs:fix-file as well. You have to know whether 1 means occupied or
vice versa. I forget, of course. The strategy is to make a new array
(on disk) which says that *ALL* of the records are used. Then you can
recover the unused blocks with the salvager after the lmfs comes up.

It took me a solid week to bring up our multi-gig file server a couple
of years ago. This length of time was due to the fact that I was
fighting that bent pin, with ECC trying its best to recover, which led
to very nondeterministic behavior.
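
To make the "mark everything used" strategy concrete, here is a rough
sketch in Python. It is not LMFS code and does not reflect the real
on-disk layout: the function names, the byte packing, and the
low-bit-first ordering are all assumptions for illustration. Since I do
not remember whether 1 means occupied, the polarity is a parameter.

```python
def all_used_bitmap(record_count: int, used_bit: int) -> bytes:
    """Build a bit array in which every record is marked as used.

    Hypothetical layout: one bit per record, packed 8 per byte.
    used_bit is the bit value (0 or 1) that means "occupied"; check
    the real convention before trusting either choice.
    """
    if used_bit not in (0, 1):
        raise ValueError("used_bit must be 0 or 1")
    if record_count < 0:
        raise ValueError("record_count must be non-negative")
    nbytes = (record_count + 7) // 8          # round up to whole bytes
    fill = 0xFF if used_bit == 1 else 0x00    # every bit set to "used"
    # Padding bits past record_count are also "used", which is the
    # conservative choice: nothing is ever handed out from them.
    return bytes([fill]) * nbytes


def is_used(bitmap: bytes, record: int, used_bit: int) -> bool:
    """Report whether a record is marked used (low bit first, assumed)."""
    byte_index, bit_index = divmod(record, 8)
    return ((bitmap[byte_index] >> bit_index) & 1) == used_bit


if __name__ == "__main__":
    bitmap = all_used_bitmap(10, used_bit=1)
    print(all(is_used(bitmap, r, 1) for r in range(10)))
```

The point of starting from "all used" is that a wrong guess only costs
free space until the salvager reclaims it, whereas a record wrongly
marked free can be overwritten.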

Best of luck,
Albert Boulanger
aboulanger@bbn.com