Fatal Disk Error urgency 
Date: Wed, 6 Jun 90 00:25 EDT
From: RWK@FUJI.ILA.Dialnet.Symbolics.COM (Robert W. Kerns)
Date: Tue, 5 Jun 90 17:36 EDT
From: Paul Pangaro <email@example.com>
We copied a band (incremental color world from Symbolics tape) from tape
to FEP. We then tried to copy the band, and got a hard error:
Error: Attempt to read from CHAOS Connection for BAND-TRANSFER to A,
which was closed by ATHENA.
Reason given was "%DISK-ERROR-ECC during a %DCW-READ32 on unit 0., cyl 577., surf 5., sec 9.,
Fatal ECC error,
18. pending transfers associated with this disk event aborted."
(Both the bad band and the good band booted OK, though we don't do much
exercising of the color stuff at the moment.)
This is on an Eagle drive that has performed flawlessly since 1983.
Just to be as naive as possible, and so in hopes of learning the maximum
amount, can you tell me:
- If so (or if not), should I somehow insure that area of disk is never
If rewriting the block continues to result in bad ECC, then yes.
How do I do that?
Unfortunately, Symbolics doesn't seem to distribute the source for
SI:FIX-FEP-FILE and SI:FIX-FEP-BLOCK
They are in the optional sources. I'm surprised ILA doesn't have these.
, and I don't seem to have access
to *ANY* Rel-8 documentation (online or paper), so I can't tell you
what they are documented as doing, but disassembly suggests that
SI:FIX-FEP-BLOCK does exactly what you want, including testing and
analysis, and prompting you for the suitable recovery strategies.
But I suggest reading its documentation first. (I couldn't find any
documentation in the 7.2 set, although I know SI:FIX-FEP-FILE existed
then, and I think it was even documented. I think SI:FIX-FEP-BLOCK
may be new to 8.0).
I think in 7.2 they were documented in release notes, as I don't think
the appropriate manual was updated for 7.2 and SI:FIX-FEP-* weren't in
7.0. In 8.0 they're in the Site Operations manual.
I don't know if SI:FIX-FEP-BLOCK performs the proper actions to
minimize damage to the file in which the block appears, but hopefully
the documentation will enlighten you. In general, world-load files which
get a bad block should be replaced, LMFS files should have a block of zeros
substituted, and paging files can just have the block removed (while the
file is not in use!)
Well, the documentation isn't incredibly enlightening (so what else is
new? This is not meant as a denigration of Symbolics; my comment
applies to most vendors' administrative documentation).
- Should I expect further deterioration of this disk?
I wouldn't worry too much, unless it happens again in a short
time. Surface defects do appear, and can become worse over
time. Eventually they become severe enough that the system's
error correction can't fix the problem invisibly, and you get a
visible problem like this.
I'd worry a little bit if the test reveals no surface defect;
that would indicate your disk's (or IO-board or paddle's) write
electronics blew it without detecting anything wrong, but while
rare, even that may be due to something such as a power glitch.
And while disks are pretty reliable on an absolute scale, one
error in seven years of hard use may even be within statistical
norms given the specs!
Do they have "useful lives" or last forever?
They have moving parts. Eventually they get old and die, if
nothing happens to murder them first. While I think seven years
is longer than I've ever used a single disk drive, nothing in
your situation suggests bearing failure or another age-related problem,
except possibly minor surface damage or defect growth.
In fact, one failure in seven years is a damned fine record, and
I'd say you're probably better off hanging onto this known-reliable
drive than substituting one of unknown quality.
Eagles are documented to have an MTBF of 10,000 hours, or a little less
than a failure a year. So you're about due for another five errors :-)
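
The arithmetic behind that quip can be checked with a rough sketch. It
assumes the drive has been powered on around the clock for the seven years
since 1983 and that failures arrive at a constant rate of one per MTBF
interval; neither assumption is stated in the thread, and the function name
expected_failures is made up for illustration.

```python
# Rough check of the MTBF arithmetic quoted above.
# Assumptions (not from the thread): 24-hour operation and a constant
# failure rate of one failure per MTBF interval.

HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap days


def expected_failures(mtbf_hours, years):
    """Expected number of failures over `years` of continuous operation."""
    return years * HOURS_PER_YEAR / mtbf_hours


per_year = expected_failures(10_000, 1)    # about 0.88 failures per year
in_service = expected_failures(10_000, 7)  # about 6.1 failures in 7 years
still_due = in_service - 1                 # one error has been seen so far

print(f"expected per year:   {per_year:.3f}")
print(f"expected in 7 years: {in_service:.3f}")
print(f"still 'due':         {still_due:.1f}")
```

Under those assumptions the figures line up with the message: a bit under
one failure a year, roughly six over seven years, and so about five still
"due" after the single error observed.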