Fatal Disk Error urgency [1]

    Date: Wed, 6 Jun 90 00:25 EDT
    From: RWK@FUJI.ILA.Dialnet.Symbolics.COM (Robert W. Kerns)

	Date: Tue, 5 Jun 90 17:36 EDT
	From: Paul Pangaro <pan@athena.pangaro.dialnet.symbolics.com>
	We copied a band (incremental color world from Symbolics tape) from tape
	to FEP. We then tried to copy the band, and got a hard error:

	Error: Attempt to read from CHAOS Connection for BAND-TRANSFER to A,
		which was closed by ATHENA.
	       Reason given was "%DISK-ERROR-ECC during a %DCW-READ32 on unit 0., cyl 577., surf 5., sec 9.,
		 Fatal ECC error,
		 18. pending transfers associated with this disk event aborted."

	(Both the bad band and the good band booted OK, though we don't do much
	exercising of the color stuff at the moment.)

	This is on an Eagle drive that has performed flawlessly since 1983.

	Just to be as naive as possible, and so in hopes of learning the maximum
	amount, can you tell me:

	- If so (or if not), should I somehow ensure that area of disk is never
	used again? 

    If rewriting the block continues to result in bad ECC, then yes.

		    How do I do that?

    Unfortunately, Symbolics doesn't seem to distribute the source for

They are in the optional sources.  I'm surprised ILA doesn't have these.

					, and I don't seem to have access
    to *ANY* Rel-8 documentation (online or paper), so I can't tell you
    what they are documented as doing, but disassembly suggests that
    SI:FIX-FEP-BLOCK does exactly what you want, including testing and
    analysis, and prompting you for the suitable recovery strategies.

    But I suggest reading its documentation first.  (I couldn't find any
    documentation in the 7.2 set, although I know SI:FIX-FEP-FILE existed
    then, and I think it was even documented.  I think SI:FIX-FEP-BLOCK
    may be new to 8.0).

I think in 7.2 they were documented in release notes, as I don't think
the appropriate manual was updated for 7.2 and SI:FIX-FEP-* weren't in
7.0.  In 8.0 they're in the Site Operations manual.

    I don't know if SI:FIX-FEP-BLOCK performs the proper actions to
    minimize damage to the file in which the block appears, but hopefully
    the documentation will enlighten you.  In general, world-load files which
    get a bad block should be replaced, LMFS files should have a block of zeros
    substituted, and paging files can just have the block removed (while the
    file is not in use!)
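
    The rule of thumb above can be written down as a small lookup.  This is
    only a sketch of the policy as stated in this message, not of what
    SI:FIX-FEP-BLOCK does; the names FileKind and recovery_action are
    hypothetical and are not part of any Symbolics software.

```python
# Hypothetical names throughout; this is not a Symbolics API.
# It only encodes the rule of thumb from the message above.
from enum import Enum


class FileKind(Enum):
    WORLD_LOAD = "world-load"
    LMFS = "lmfs"
    PAGING = "paging"


def recovery_action(kind: FileKind, in_use: bool = False) -> str:
    """Return the rule-of-thumb action for a file containing a bad block."""
    if kind is FileKind.WORLD_LOAD:
        return "replace the whole file"
    if kind is FileKind.LMFS:
        return "substitute a block of zeros"
    if kind is FileKind.PAGING:
        # The block may only be removed while the paging file is idle.
        if in_use:
            return "wait until the file is not in use, then remove the block"
        return "remove the block"
    raise ValueError(f"unknown file kind: {kind!r}")


print(recovery_action(FileKind.WORLD_LOAD))
```

    In the case reported here the bad block is in a world-load band, so the
    sketch points at replacing the band rather than patching it.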

Well, the documentation isn't incredibly enlightening (so what else is
new?).  This is not meant as a denigration of Symbolics -- my comment
applies to most vendors' administrative documentation.

	- Should I expect further deterioration of this disk?

    I wouldn't worry too much, unless it happens again in a short
    time.  Surface defects do appear, and can become worse over
    time.  Eventually they become severe enough that the system's
    error correction can't fix the problem invisibly, and you get a
    visible problem like this.

    I'd worry a little bit if the test reveals no surface defect;
    that would indicate your disk's (or IO-board or paddle's) write
    electronics blew it without detecting anything wrong, but while
    rare, even that may be due to something such as a power glitch.

    And while disks are pretty reliable on an absolute scale, one
    error in seven years of hard use may even be within statistical
    norms given the specs!

	Do they have "useful lives" or last forever?

    They have moving parts.  Eventually they get old and die, if
    nothing happens to murder them first.  While I think seven years
    is longer than I've ever used a single disk drive, nothing in
    your situation suggests bearing failure or other age-related problem,
    except possibly minor surface damage or defect growth.

    In fact, one failure in seven years is a damned fine record, and
    I'd say you're probably better off hanging onto this known-reliable
    drive than substituting one of unknown quality.

Eagles are documented to have an MTBF of 10,000 hours, or a little less
than a failure a year.  So you're about due for another five errors :-)
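
As a rough check of that arithmetic, here is a minimal sketch.  It assumes
the drive runs 24 hours a day and that failures arrive at a constant
average rate; both are assumptions of mine, not something stated in the
Eagle documentation.  The function name is made up for illustration.

```python
# Rough check of the MTBF arithmetic above.
# Assumptions (not from the drive's documentation): the drive runs
# 24 hours a day, and failures arrive at a constant average rate.

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap days


def expected_failures(mtbf_hours: float, years: float) -> float:
    """Expected number of failures over `years` of continuous operation."""
    return years * HOURS_PER_YEAR / mtbf_hours


mtbf_hours = 10_000      # MTBF figure quoted above
years_in_service = 7     # 1983 to 1990
observed = 1             # the single ECC error reported

per_year = expected_failures(mtbf_hours, 1)
total = expected_failures(mtbf_hours, years_in_service)

print(f"failures per year: {per_year:.2f}")                   # 0.88
print(f"expected in {years_in_service} years: {total:.1f}")   # 6.1
print(f"still 'due': {total - observed:.1f}")                 # 5.1
```

Under those assumptions the expected count over seven years is about six,
which is where the "another five" comes from.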