
Fatal Disk Error urgency [1]



    Date: Tue, 5 Jun 90 17:36 EDT
    From: Paul Pangaro <pan@athena.pangaro.dialnet.symbolics.com>
    We copied a band (incremental color world from Symbolics tape) from tape
    to FEP. We then tried to copy the band, and got a hard error:

    Error: Attempt to read from CHAOS Connection for BAND-TRANSFER to A,
        which was closed by ATHENA.
        Reason given was "%DISK-ERROR-ECC during a %DCW-READ32 on unit 0., cyl 577., surf 5., sec 9.,
          Fatal ECC error,
          18. pending transfers associated with this disk event aborted."

    (Both the bad band and the good band booted OK, though we don't do much
    exercising of the color stuff at the moment.)

    This is on an Eagle drive that has performed flawlessly since 1983.

    Just to be as naive as possible, and so in hopes of learning the maximum
    amount, can you tell me:

    - Should I re-try copying the band to the same place, and see if I
    continue to get an error in the same place?

Yes, this is an appropriate strategy.  There is no evidence for
anything more than a single write error (albeit larger than can
be corrected via ECC).  If it happens again at the same spot,
then we can conclude that that disk block has become defective.

But see below for a more thorough test you might want to try first.
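
To make the "same place" comparison concrete, here is a minimal sketch
(in Python rather than Lisp, purely for illustration).  It pulls the
unit, cylinder, surface and sector out of two error messages and checks
whether they name the same disk address.  The function names are made
up, and the message layout is assumed to match the one quoted above.

```python
import re

# Matches the location part of the quoted message, e.g.
# "unit 0., cyl 577., surf 5., sec 9."  (the trailing dots mark decimal numbers).
LOCATION = re.compile(
    r"unit (\d+)\., cyl (\d+)\., surf (\d+)\., sec (\d+)\."
)


def error_location(message):
    """Return (unit, cyl, surf, sec) from an error message, or None if absent."""
    match = LOCATION.search(message)
    if match is None:
        return None
    return tuple(int(field) for field in match.groups())


def same_spot(first_message, second_message):
    """True if both messages carry a location and the locations are identical."""
    first = error_location(first_message)
    return first is not None and first == error_location(second_message)
```

If the second attempt fails at a different address, or not at all, that
points away from a single defective block.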

    - If so (or if not), should I somehow ensure that area of disk is never
    used again? 

If rewriting the block continues to result in bad ECC, then yes.

		How do I do that?

Unfortunately, Symbolics doesn't seem to distribute the source for
SI:FIX-FEP-FILE and SI:FIX-FEP-BLOCK, and I don't seem to have access
to *ANY* Rel-8 documentation (online or paper), so I can't tell you
what they are documented as doing, but disassembly suggests that
SI:FIX-FEP-BLOCK does exactly what you want, including testing and
analysis, and prompting you for the suitable recovery strategies.

But I suggest reading its documentation first.  (I couldn't find any
documentation in the 7.2 set, although I know SI:FIX-FEP-FILE existed
then, and I think it was even documented.  I think SI:FIX-FEP-BLOCK
may be new to 8.0).

I don't know if SI:FIX-FEP-BLOCK performs the proper actions to
minimize damage to the file in which the block appears, but hopefully
the documentation will enlighten you.  In general, world-load files which
get a bad block should be replaced, LMFS files should have a block of zeros
substituted, and paging files can just have the block removed (while the
file is not in use!)
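
The rule of thumb above amounts to a small lookup table.  A sketch in
Python, for illustration only: the keys are labels I made up, and none
of this reflects what SI:FIX-FEP-BLOCK actually does.

```python
# Suggested recovery per kind of file, taken from the paragraph above.
# The key names are hypothetical labels, not FEP file types.
RECOVERY = {
    "world-load": "replace the whole file",
    "lmfs": "substitute a block of zeros",
    "paging": "remove the block (while the file is not in use)",
}


def recovery_action(file_kind):
    """Look up the suggested recovery for a file that contains a bad block."""
    try:
        return RECOVERY[file_kind]
    except KeyError:
        raise ValueError(f"no rule of thumb for file kind {file_kind!r}")
```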

    - Should I expect further deterioration of this disk?

I wouldn't worry too much, unless it happens again in a short
time.  Surface defects do appear, and can become worse over
time.  Eventually they become severe enough that the system's
error correction can't fix the problem invisibly, and you get a
visible problem like this.

I'd worry a little bit if the test reveals no surface defect;
that would indicate your disk's (or IO-board or paddle's) write
electronics blew it without detecting anything wrong.  But while
rare, even that may be due to something such as a power glitch.

And while disks are pretty reliable on an absolute scale, one
error in seven years of hard use may even be within statistical
norms given the specs!

    Do they have "useful lives" or last forever?

They have moving parts.  Eventually they get old and die, if
nothing happens to murder them first.  While I think seven years
is longer than I've ever used a single disk drive, nothing in
your situation suggests bearing failure or other age-related problem,
except possibly minor surface damage or defect growth.

In fact, one failure in seven years is a damned fine record, and
I'd say you're probably better off hanging onto this known-reliable
drive than substituting one of unknown quality.

    - Are there other questions I should be worried about?

Yes:  Are your backups frequent enough?  While one error
isn't enough to indicate catastrophic failure of your disk,
there's NEVER any certainty that it isn't going to die totally
on the next disk access.

    Many thanks.

    Trying to remain calm,
    PANgaro

Stay calm, but alert.  Keep records of all hardware-related
failures.  Once may be a fluke or random occurrence, but if a
pattern appears, you'll have to take action.  (This applies
to other kinds of hardware, too!)
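
One way to keep such records so that a pattern is easy to spot is
sketched below in Python.  The log layout (a list of date and disk
address pairs) and the function names are assumptions for illustration,
and what counts as "a short time" is left to the caller.

```python
from collections import Counter


def repeated_locations(failures):
    """Return the disk addresses that appear more than once in the log.

    `failures` is a list of (date, location) pairs, where location is a
    (unit, cyl, surf, sec) tuple.  This layout is an assumption.
    """
    counts = Counter(location for _, location in failures)
    return sorted(loc for loc, n in counts.items() if n > 1)


def failures_within(failures, days):
    """Return pairs of consecutive failures that are at most `days` apart."""
    ordered = sorted(failures)  # sorts by date first, then by location
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if (later[0] - earlier[0]).days <= days
    ]
```

A repeated address suggests a defective block; failures clustered in
time suggest something more general is going wrong.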