
LMFS blues



We've been having LOTS of problems with our 3650 file server, which has
four eagles on it.  I thought others might be interested in the latest
circus.

Tuesday last week we did an incremental dump.  At the end of the first
tape, the dumper went into the debugger trying to set the dump dates.
This has been a common occurrence ever since we converted our file
server to a 3650.  I did my usual thing of forcing it to return from the
appropriate stack frames, and we went on and finished the dump with a
second tape.  Strangely, it didn't complain about setting dates on the
second tape, which it usually does.  A few hours later, a few ECC errors
suddenly showed up, then some "checkwords do not match" problems, and in
the process of trying to fix these problems, we also started to see
"attempt to deposit bogus address to freemap: 0".  So I brought the
system down, rebooted (using a non-server net address), and ran
SI:VERIFY-FEP-FILESYSTEM on all the disks.  On FEP2 it complained that a
block was allocated to both LMFS.FILE and BAD-BLOCKS.FEP.  Sadly, it
didn't offer to correct the situation.  Believing the block was truly
bad, I used a locally consed utility to splice the block out of
LMFS.FILE and replace it with a newly allocated and zeroed block.  LMFS
can cope with one trashed block, right?

Wrong.  Trying to bring LMFS up just put you in the debugger with a
"checkwords do not match" error and no proceed option, and insufficient
frames left on the stack to figure out where it failed.  Run the
salvager, right?  Wrong.  To run the salvager you have to "bring LMFS
up", and LMFS wouldn't come up.  It's evening; I leave it down and go
home.  Next morning I probe around further and discover that it is dying
in LMFS:LOAD-FREEMAP.  Geez, a freemap ought to be designed to be thrown
away and rebuilt from scratch, right?  This seems like what salvagers
ought to be all about.  I call up Software Support and explain what I
know of the situation.  Surely they have some way of blowing away the
freemap and reconstructing it.  Wrong.  They have no immediate
suggestions.  They request a stack trace.  I send them the stack trace,
and spend part of the day browsing LMFS and salvager sources on another
file server, trying to understand and make educated guesses.  In the
late afternoon, via a local contact at SCRC, I get the tip that an
internal piece of LMFS:FIX-FILE might help.  I hack
LMFS:FIX-FILE-SCAN-FILE to take an FD instead of a stream, and from
previous debugging know how to get the FD for the freemap, and give it a
try.  It tells me "header damaged, further recovery is unlikely".  By
evening I'm wondering why Software Support hasn't called back, and call
them.  "They're in a meeting", I'm told.  An hour later they still
haven't called, and I go home.

Next morning I call Software Support again.  "No one in yet", I'm told.
Grumpf.  I've got lots of users twiddling their thumbs.  Time to start
smashing bits.  Fortunately, I have a spare eagle connected (of course,
it's only a "spare" because when I try to really use it for an LMFS
partition I get tons of ECC errors; still haven't gotten any word from
Symbolics about that), so I first copied the entire LMFS.FILE over for
safe-keeping.  I then tried smashing the offending block with the
correct checkwords.  This gets me farther, but lots of magic-number
checks elsewhere still drop it into the debugger.  I probe and hack
around for a while.  I discover that LMFS:BITSALV-PART can be invoked in
"offline" mode, and do so on the FEP2 partition.  It starts
reconstructing the freemap, and I think I'm home free.  An hour later it
prints "Writing the new freemap" and promptly goes into the debugger
with a checkword error on a read.  Software Support finally calls back,
to say they still haven't received the stack trace, and so haven't been
doing anything, and will I send them another one?  I send them two, and
write off getting any help from them.

I beat on a contact at SCRC to beat on someone, and one of their hackers
logs in over the net for a while.  An hour or two later, he hasn't
gotten any farther, and has to go off to a meeting.  Finally, I wise up.
The partition on FEP3 is the same size as the one on FEP2.  I find its
freemap, and verify that it has the same checkwords as expected for
FEP2.  I smash the first four FEP blocks from FEP3's freemap into FEP2's
freemap, and try to bring LMFS up again.  It gets farther, but dies now
with a checkword error in the "data" of the freemap.  I go back and
invoke my hacked LMFS:FIX-FILE-SCAN-FILE on the freemap fd, and it
"corrects" the error by zeroing things.  I try to bring LMFS up again,
and it prints out two complaints about checksum errors, but comes up.  I
start up the salvager (full record scan and top-down tree-walk, but
without orphan repatriation because that has consistently dropped into
the debugger ever since the LMFS disaster a few weeks before), and go
home.  Friday morning, all is well, no further damage has been found, no
files have been lost.  In my mailbox, I find a message from Software
Support saying they don't think it is possible to recover from the
error, that I will have to destroy the partition and reload from tape.

On Sunday, I come in and discover that the machine has crashed sometime
Saturday.  Someone has reported a file with checkword problems.  I start
up another full Check Records.  Part way through it drops into the FEP
with "page fault on unallocated VMA".  I reboot, and start another Check
Records.  It finds six damaged files.

And the beat(ing) goes on ...