
Disk ECC errors

I'm having a discussion with CUSTOMER-REPORTS regarding disk ECC errors.
The person there claims I should see them very rarely.  I would
appreciate hearing what experiences others have had with them.  (Reply
to me and I will summarize to the list if appropriate.)  Please give me
some idea of the mix of machines/disks you have and whether there have
been any changes in the ECC error rate recently (e.g., since 7.2).

My perception is that the historical rate at which we see disk ECC
errors (usually in the paging area, occasionally in the LMFS; i.e., in
sectors that are being rewritten to disk) is about 0.5/year/machine.
This works out to something on the order of once a month for us.  The
errors are spread across all of our machines (mostly 3600s with IFUs;
essentially all disks are SMDs: Fujitsu Owls and Eagles plus some
CDCs).  I've been seeing these sorts of things for the 5 years that
we've had Symbolics machines.  There once was a problem (as I recall)
where the disk might get trashed when, say, the network was unterminated
or there was some other problem on the net.  (Our local net has about 35
Symbolics, 15 Suns, plus another 10 hosts.)
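
As a rough check on the "once a month" figure, here is a minimal sketch
of the arithmetic.  It assumes that all of the roughly 35 Symbolics
machines on the local net contribute at the historical rate of
0.5/year/machine; that fleet size is an assumption taken from the host
count above, and the function name is made up for illustration.

```python
# Rough site-wide ECC error rate from the per-machine figure.
# ASSUMPTION: all ~35 Symbolics hosts on the net count toward the total.
ERRORS_PER_MACHINE_PER_YEAR = 0.5   # historical rate quoted in the message
SYMBOLICS_MACHINES = 35             # assumed fleet size (hosts on the local net)


def site_errors_per_month(rate_per_machine_year, machines):
    """Convert a per-machine yearly rate into a site-wide monthly rate."""
    return rate_per_machine_year * machines / 12.0


per_month = site_errors_per_month(ERRORS_PER_MACHINE_PER_YEAR, SYMBOLICS_MACHINES)
print(f"about {per_month:.2f} ECC errors per month across the site")
```

Under that assumption the site-wide rate comes to a little over one
error a month, which is consistent with "on the order of once a month".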

Recently we have seen a rash of problems that involved trashed worlds
(i.e., disk sectors that weren't being written at the time) and that
caused disk search errors (i.e., something presumably dribbled on the
header block).
I'm trying to determine whether these are coincidences or not.