Fatal Disk Error urgency [1]



    Date: Wed, 6 Jun 90 12:03 PDT
    From: Doug Evans <DE@phoenix.sch.symbolics.com>

	Date: Wed, 6 Jun 90 13:33 EDT 
	From: Dodds@YUKON.SCRC.Symbolics.COM (Douglas Dodds)

	    Date: Wed, 6 Jun 90 00:25 EDT
	    From: RWK@FUJI.ILA.Dialnet.Symbolics.COM (Robert W. Kerns)
    Hi Bob.

Hi, Doug!  Glad to see (for Symbolics' sake, at least) you're
still there.

By the way:  I highly recommend doing (si:verify-fep-filesystem <unit>)
before doing *ANYTHING* else.  If the fep filesystem is inconsistent,
any repair actions may cause more damage than they repair.  Sometimes an
error will result in multiply allocated blocks in the FEP filesystem.
Such problems must be fixed before anything else can proceed.  (I have
been able to fix such things by hand on a number of occasions.  However,
there are no automatic tools to do this, and it is very easy to destroy
the information needed to put things back to rights.)

	    [ . . . . ]

	    I don't know if SI:FIX-FEP-BLOCK performs the proper actions to
	    minimize damage to the file in which the block appears, but hopefully
	    the documentation will enlighten you.  In general, world-load files which
	    get a bad block should be replaced, LMFS files should have a block of zeros
	    substituted, 

    As the other Doug says below, never use these functions on a LMFS
    partition unless you intend to throw the entire LMFS away.

This is somewhat of an overstatement, although I should have
elaborated more.  (I hoped that the missing documentation I
referred him to would have enough caveats...)

My real intent was to prevent people from just splicing out the
bad block, which will turn the LMFS file into so much chopped
liver.  (I can fix such mistakes by hand, but LMFS won't understand
AT ALL!)

But there are definitely times that replacing the block with a
new block of zeros is the correct (in fact, ONLY) strategy.

LMFS:FIX-FILE does not do the same task that FS:FIX-FEP-BLOCK
does.  It does not test the block to see if it is in fact a bad
block, as opposed to just having bad data; that's what
FS:FIX-FEP-BLOCK's read/write test does.  (The *VAST* majority
of ECC errors are just damaged data, not a section of the disk
which no longer works reliably.)  So if the block is really bad,
LMFS:FIX-FILE will not solve the problem; in fact, it will leave
the problem around waiting to bite you again.

So what's my recommendation?  For ECC problems in LMFS partitions:

0)  Call SI:VERIFY-FEP-FILESYSTEM on all drives.  If it shows problems,
get help!

1) If the LMFS does not come up at all, you're in serious
trouble.  LMFS:FIX-FILE won't help you.  Use SI:FIX-FEP-BLOCK.

If its read/write test finds that the data was bad but the block
is good, try bringing up the LMFS again.  If you're very lucky,
you may be able to bring it up and do a backup.

If the block is physically bad, choose the COPY option, and try
bringing up the LMFS again.  Immediately do a backup, if you
succeed in bringing it up.

1a)  If you have a recent backup, consider deleting the LMFS
file, and starting over.

1b)  If you do not have a recent backup, or if it will take too long
to reload your huge filesystem, consider calling an expert and
dropping a couple thousand bucks his way for fees and expenses.

2) If the LMFS comes up OK, the problem is not too serious.
LMFS was able to read all critical data structures.  First, do a
LMFS salvage.  This will identify what files have problems.  It also
will repatriate any orphans and fix minor filesystem problems.  If
you have lots of ECC problems, you probably have broken hardware,
and should get it fixed before proceeding.  (If you have lots of other
kinds of problems, it's probably software, or non-disk hardware).

[If the salvager dies because of ECC errors, you'll have to repeatedly
cycle through this procedure once per ECC error until you solve them
all.  I don't remember exactly what errors the salvager will catch and
continue with].

At this point, you're in a quandary.  Do you run LMFS:FIX-FILE, and
possibly lose track of intermittently bad blocks, or do you run
FS:FIX-FEP-XXXX and possibly replace bad data and bad ECC with bad
data and GOOD ECC (making bad data look like good data)?

I think Symbolics' software screws up here, because you really
want to LMFS:FIX-FILE first, and then run a tool to read/write
test ALL the blocks of the LMFS partitions and verify their
reliability, replacing any bad ones, either with a COPY (if the
ECC was good) or with ZEROs (if the ECC (and hence the data) was
bad).  (FS:FIX-FEP-FILE only read/write tests blocks with ECC
errors.  If there's a marginal block, LMFS:FIX-FILE may hide the
problem from FS:FIX-FEP-FILE).
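
The replacement rule in the paragraph above can be written down as a
tiny decision function.  This is only a sketch in Python, not
anything that runs on the machine: the function name, its two
arguments, and the "leave block in place" label are illustrative
assumptions, while COPY and ZERO are the options named in the text.

```python
def replacement_action(block_tests_bad, ecc_was_good):
    """Pick what to do with one block after a read/write test.

    Hypothetical helper: it only encodes the rule described in the text.
    """
    if not block_tests_bad:
        # The block itself works reliably; only its data may be damaged.
        return "leave block in place"
    # A physically bad block is replaced.  Copy the old contents only if
    # the ECC says they can be trusted; otherwise substitute zeros.
    return "COPY" if ecc_was_good else "ZERO"
```

The point of the sketch is that ZERO is chosen only when the block is
bad and its data is already known to be untrustworthy.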

Anyway, my recommendation at this point is to first carefully
record the disk addresses of all errors.

It's unlikely that the blocks at these addresses are actually bad,
but since LMFS:FIX-FILE won't check, we'll keep our eyes open.
Because actual bad blocks are very rare, we'll start out with
LMFS:FIX-FILE, not FS:FIX-FEP-FILE.

Once you have recorded the disk addresses, use LMFS:FIX-FILE to
blast the file(s).  If it's a directory, the files in that directory
will probably become orphans, so you'll have to recover them using
the salvager.  (If LMFS:FIX-FILE says it can't write, you'll have to
use FS:FIX-FEP-BLOCK to replace the block with a block of zeros, and
come back and retry LMFS:FIX-FILE).

Once you've fixed all the files, run the salvager again.  Make
sure you've gotten all the files.

Then run FS:FIX-FEP-FILE.  Any problem you find should be in
unallocated blocks, since the salvager should have detected any
other problems.  If the block is physically bad, replacing it
with a block of zeros will not harm the LMFS, since that's
what's in free blocks anyway.

If you have to replace a block, warm boot and run the salvager
again to be certain.

If later ECC problems recur at the same address(es), the next
time don't run LMFS:FIX-FILE, but instead run FS:FIX-FEP-BLOCK
and choose the zero option if it thinks the block is bad.  This
will give the containing file a connection error, which the
salvager will find, and LMFS:FIX-FILE will fix.  (If it's a
directory, this will orphan files, which a salvage will
recover).
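
The ordering of the steps above is easy to get wrong, so here is a
minimal sketch of it in Python.  It executes nothing; it just returns
the names of the tools in the order suggested above.  The function
name, its arguments, and the `seen_addresses` set are illustrative
assumptions, and the strings are descriptions rather than exact
command syntax.

```python
def lmfs_ecc_plan(fep_filesystem_ok, lmfs_comes_up, address, seen_addresses):
    """Return the ordered steps suggested above for one ECC error.

    Hypothetical helper: nothing is run, the strings only name the tools.
    """
    steps = ["SI:VERIFY-FEP-FILESYSTEM on all drives"]
    if not fep_filesystem_ok:
        # Step 0: an inconsistent FEP filesystem must be fixed first.
        return steps + ["get help"]
    if not lmfs_comes_up:
        # Step 1: LMFS:FIX-FILE cannot help if the LMFS will not come up.
        return steps + ["SI:FIX-FEP-BLOCK",
                        "try bringing up the LMFS",
                        "back up immediately"]
    if address in seen_addresses:
        # Recurrence at a recorded address: suspect a physically bad block.
        return steps + ["FS:FIX-FEP-BLOCK (zero option if the block tests bad)",
                        "LMFS salvage",
                        "LMFS:FIX-FILE"]
    # Step 2, first occurrence: record the address before repairing.
    seen_addresses.add(address)
    return steps + ["LMFS salvage",
                    "LMFS:FIX-FILE",
                    "LMFS salvage",
                    "FS:FIX-FEP-FILE"]
```

Calling it twice with the same address and the same set shows the
switch from the LMFS:FIX-FILE-first path to the FS:FIX-FEP-BLOCK path
on a recurrence.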

    Due to minor bugginess in past releases, I usually recommend that if a
    hard ECC error is found, one should use the SPLICE option, delete and
    expunge the file, and recreate it.  This ensures that the bad block is
    indeed removed from use and is not allocated to both the original file and
    the bad blocks file.

    In all cases, run the function SI:VERIFY-FEP-FILESYSTEM to make sure
    everything is clean.

I strongly suggest running it both after and *BEFORE*
undertaking repairs.

	The documentation states, and I agree, that for safety, you should never
	use SI:FIX-FEP-BLOCK or SI:FIX-FEP-FILE on LMFS partitions.  Instead,
	use LMFS:FIX-FILE, which gives you the right pathname-based handles on
	the file, and limits the options to those that are safe for the
	integrity of LMFS partitions.

Alas, it limits your options to ones which do not address the
type of problem I was discussing using SI:FIX-FEP-FILE for,
namely blocks which are really and truly bad.

Let me point out again that blocks which are truly bad are
extremely rare; the vast majority of ECC errors are damaged
*DATA* not damaged disk surface.

However, LMFS:FIX-FILE will successfully fix the problems
introduced by replacing a block of some random file with zeros.
(It will show up as a "checkwords" error).

By the way, I've referred to ECC errors above.  There's another
related type of error called SEARCH errors.  These indicate that
the disk block could not be found because the formatting information
was bad.  It's possible for these to happen intermittently due to
read errors, but if they persist in a particular place on disk, they
can be treated as physically bad blocks.  (They may or may not involve
a physically damaged area of the disk.  They could be just damaged
formatting DATA, in which case a reformat will fix them, but that'll
kill all of your data, too, so you're usually better off just treating
them as if they were physically bad).
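
If you keep a log of the disk addresses at which SEARCH errors were
reported, telling persistent ones from intermittent ones is a simple
count.  The sketch below is Python, purely for illustration; the
function name and the `min_repeats` threshold are assumptions (the
text above gives no specific number of repeats).

```python
from collections import Counter

def persistent_search_errors(error_addresses, min_repeats=2):
    """Return addresses that reported SEARCH errors at least min_repeats times.

    Hypothetical helper; min_repeats is an assumed threshold, not a
    documented value.
    """
    counts = Counter(error_addresses)
    return sorted(addr for addr, n in counts.items() if n >= min_repeats)
```

Addresses returned by such a check are the candidates to treat as
physically bad; one-off addresses are more likely transient read
errors.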

Also, none of this applies to the embedded systems, where things are
more complicated.  Usually on these systems, ECC errors would be
treated as a Macintosh or Unix problem, and dealt with at that level.