
Re: Bad blocks


Don't know if you heard the final disposition of this problem, so in
case you haven't, here's pretty much the full story from start to
finish.  I've decided to copy SLUG because the situation was
sufficiently strange that somebody out there might be able to offer
some plausible explanation.

The system involved was a 3650 with two EMD368 drives on it.  It
started getting intermittent disk errors (mostly seek/search errors)
on both drives a few weeks ago.  By the time they got frequent enough
to attract our attention, we were apparently dealing with corrupted
worlds and paging files.  We ran the disk test code in KDO-8-0-FEP-CODE 
(yes, it says 7.1 in the text but we're running it under 8.1)
and I noticed that the exact same cylinders were failing on BOTH
disks, and furthermore that only 3 or 4 particular cylinders were
affected, and only the first couple of surfaces (0 & 1) of each.

The cylinders in question were #o1577, #o1677, #o1767, and #o1773,
which share an interesting pattern: each has only a single 0-bit in
its binary representation.  (Well, actually 2 if you count the
high-order bit, which is only set for cylinders between 1024 and 1217.)
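
For anyone who wants to check the bit patterns, here is a minimal
sketch in Python.  The 10-bit and 11-bit field widths are an
assumption inferred from the cylinder numbers above (1217 needs 11
bits); it says nothing about which cable wire carries which bit.
The helper name zero_bits is made up for illustration.

```python
# Failing cylinder numbers from the report, written in octal.
FAILING = [0o1577, 0o1677, 0o1767, 0o1773]


def zero_bits(value, width):
    """Return positions (0 = least significant) of the 0-bits
    in a field that is `width` bits wide (width is an assumption)."""
    return [bit for bit in range(width) if not (value >> bit) & 1]


for cyl in FAILING:
    # Show octal, decimal, and an 11-bit binary rendering of each cylinder.
    print(f"#o{cyl:o} = {cyl}.  binary {cyl:011b}  "
          f"zeros (10-bit): {zero_bits(cyl, 10)}  "
          f"zeros (11-bit): {zero_bits(cyl, 11)}")
```

Each cylinder has exactly one 0-bit in the low 10 bits, at a
different position each time, plus the high-order bit when the
address is treated as 11 bits wide.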

Which brings us to another interesting symptom, namely that any
attempt to access a cylinder > 1023. on either drive would cause a
device check, which our testing code couldn't handle, so it would
simply abort.

I immediately decided that the disks had to be ok, since the
coincidence of the same cylinders failing on both disks was too great.
Changing the FEP/I-O board and paddle card didn't do any good.
This pretty much left only the cables, although one possibility
was that one of the drives was adding/dropping one of the unit
select bits and answering requests intended for the other one.

We then decided to physically disconnect the second disk and see how a
single disk behaved.  When we ran the tests, drive 0 passed with
flying colors.  When we reconnected the second disk, we replaced the
wide ribbon cable, and now both disks passed.  So circumstantial
evidence points to the cable, although it could have, in theory, been
just a loose connection.

The last job was to prune the bad-blocks files and remove all
of the entries that had been unnecessarily added by the testing.
Doing this without reformatting the drives was fun...

Any ideas what sort of error could really cause the symptoms
we saw?