
FYI: Disk overruns.



Date: Wed, 13 Sep 89 10:16 PDT
From: Robert D. Pfeiffer <RDP@ALAN.LAAC-AI.Dialnet.Symbolics.COM>
Subject: FYI: Disk overruns.
To: SLUG@ALAN.LAAC-AI.Dialnet.Symbolics.COM
Message-ID: <19890913171608.3.RDP@ALAN.LAAC-AI.Dialnet.Symbolics.COM>

This mail is simply to alert other users to a hardware problem which we
experienced and found hard to troubleshoot.  Now that we think we've
gotten it resolved, I wanted to send a brief synopsis to this list.

We started experiencing a problem with our primary server, Alan Turing,
a few months ago.  ALAN is our file server, mail server, namespace
server, etc. -- pretty much everything except for our print spooler.
ALAN is a 3675 running 7.2.  The symptom of the problem was the
occurrence of "disk overrun" errors.  When the problem occurred, the
RESUME option always worked (the problem was transient) except if it
happened to a virtual memory access (which would FEP you but the
Continue command would start you going again) or in the mailer (which
would force you to halt and restart the mailer).  We attempted to get
the problem resolved, and the "conventional wisdom" was that L-machines
(also called "OBS machines") were susceptible to problems on the
network: bad data or heavy traffic would cause the network microtask to
out-compete the disk microtask for time slices, and disk overruns would
be just the symptom to expect.

We got our hands on a product called "The Sniffer" which is used to
monitor and log errors on an Ethernet and tried it out.  Sure enough,
there were lots of bad packets whizzing by.  Unfortunately, these bad
packets were almost always fragments without an originating address.
After some further probing, we decided that finding the culprit looked
like a hard problem (we've got lots of equipment on our
Ethernet).  Since the problem occurred infrequently and the workaround
was typically to hit the RESUME key, we decided to live with it.

Recently things got much worse.  Over the course of a few days the
number of disk overruns shot up dramatically, bringing us to a virtual
standstill.  Also, some of the (now numerous) overruns would occur
during disk writes, leaving bad data (disk ECC errors) on the disk, which
had to be repaired by hand (using LMFS:FIX-FILE).  We got "The Sniffer" back on
the cable and things looked worse than ever.  We got Symbolics field
service in again and went to it.  Now, finally, the conclusion...

The problem turned out to be in the Eagle disk drive controller board
set.  By replacing this board set, the disk overruns stopped occurring
and the bad Ethernet packets also subsided (they still aren't zero but
my rough calculation says that they've dropped about two orders of
magnitude).  The thing that made this problem hard is that all along we
thought we were looking for a network problem which was causing a disk
problem.  As it turns out (and I'd love to hear theories on how this
could happen), apparently disk problems were causing network problems.


A final footnote is that we were very satisfied with the effort provided
by Symbolics field service to keep working at this until the solution
was found.  Although we hate to be down for any amount of time, we were
very satisfied that the problem was troubleshot and resolved quickly
and efficiently.