
Disk ECC errors



    Date: Fri, 29 Jul 88 12:53 PDT
    From: TYSON@warbucks.ai.sri.com (Mabry Tyson)

    I'm having a discussion with CUSTOMER-REPORTS regarding disk ECC errors.
    The person there claims I should see them very rarely.  I would
    appreciate hearing what experiences others have had with them.  (Replies
    to me and I will summarize to the list if appropriate.)  Please give me
    some idea of the mix of machines/disks you have and if there have been
    any changes in the ECC error rate recently (eg, since 7.2).

    My perception is that the historical rate at which we see disk ECC
    errors (usually in the paging area, occasionally in the LMFS (ie,
    sectors that are being rewritten to disk)) is about 0.5/year/machine.
    This works out to something on the order of once a month for us.  The
    errors are spread across all of our machines (mostly 3600s with IFUs,
    essentially all disks are SMDs (Fujitsu Owls and Eagles plus some
    CDCs)).  I've been seeing these sorts of things for the 5 years that
    we've had Symbolics machines.  There once was a problem (as I recall)
    where the disk might get trashed when, say, the network was unterminated
    or there was some other problem on the net.  (Our local net has about 35
    Symbolics, 15 Suns, plus another 10 hosts.)
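
As a rough check on the arithmetic above, here is a minimal sketch that turns
a per-machine error rate into a site-wide rate.  The rate of 0.5 errors per
machine per year is from the message; using the "about 35 Symbolics" on the
local net as the machine count is an assumption, and the function names are
made up for illustration.

```python
# Rough site-wide ECC error rate from a per-machine rate.
# RATE is quoted in the message; MACHINES is an assumption taken from the
# "about 35 Symbolics" on the local net and may not match the real fleet.
RATE = 0.5        # errors per machine per year
MACHINES = 35     # assumed number of machines contributing errors


def site_errors_per_month(rate_per_machine_year, machines):
    """Expected errors per month across the whole site."""
    return rate_per_machine_year * machines / 12.0


def machines_for_one_error_per_month(rate_per_machine_year):
    """Fleet size at which the site sees one error per month on average."""
    return 12.0 / rate_per_machine_year


if __name__ == "__main__":
    print("errors/month at site:", round(site_errors_per_month(RATE, MACHINES), 2))
    print("machines for 1 error/month:", machines_for_one_error_per_month(RATE))
```

Under these assumptions the site would see roughly one and a half errors a
month, and exactly one a month would correspond to about 24 machines, so
"on the order of once a month" is consistent with the quoted rate.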

    Recently we have seen a rash of problems that involved trashing worlds
    (ie, disk sectors that weren't being written at the time) and that caused
    disk search errors (ie, it presumably dribbled on the header block).
    I'm trying to determine whether these are coincidences or not.

L machines have a problem with the network microtask hogging the machine.  As
a result, when the ethernet gets jammed, disk writes in progress tend to get
garbled.  That means that if someone adds or removes a transceiver while an
L machine (3600, 3640, 3645, 3670, or 3675) is writing to the disk, the block
might be written incorrectly.  G machines don't have this problem, so we use
3650s for our file servers.

This doesn't sound like your problem, though.  It sounds to me as if your
problem is your disks getting old and just plain wearing out.  We are seeing
some of this here, mainly with our 2284s.  Not many of our 2351As have failed
yet, though.  Is there a disk age pattern?