
Re: FYI: Disk overruns.



Date: Mon, 18 Sep 89 10:22 PDT
From: Robert D. Pfeiffer <RDP@ALAN.LAAC-AI.Dialnet.Symbolics.COM>
Subject: Re: FYI: Disk overruns.
To: SLUG@ALAN.LAAC-AI.Dialnet.Symbolics.COM
In-Reply-To: <19890913222008.2.TYSON@ELCAPITAN.AI.SRI.COM>
Supersedes: <19890914162425.5.RDP@ALAN.LAAC-AI.Dialnet.Symbolics.COM>
Comments: Retransmission of failed mail.
Message-ID: <19890918172224.7.RDP@ALAN.LAAC-AI.Dialnet.Symbolics.COM>

[I guess this message never made it out.  It bounced back to me after three
days.  Oh, networks!  Anyway, it gives me a chance to add a
not-entirely-amusing postscript.]

    Date: Wed, 13 Sep 89 15:20 PDT
    From: TYSON@Warbucks.AI.SRI.COM (Mabry Tyson)

    We have occasional problems such as you described.  Despite what you
    say, it certainly sounds suspicious that you are right about what you think
    was the solution.  

Yes, it does.

			Unless you have a bad I/O paddle card, I rather doubt
    that anything going on with the disk will affect the net.  

That's what I would have said a few days ago.  I wonder if anyone can
say definitively that this is impossible?

								Of course,
    maybe the CE changed something else on your CPU (I/O, I/O paddle?) at
    or around the same time as the change to the disk.

We tried swapping the I/O board and paddle card months ago, when the
problem first surfaced, with no noticeable effect (i.e. the problem
persisted).  At that time, however, it was hard to tell because the
frequency of the symptoms was much lower.

    When we've had such problems it usually turned out to be bad I/O cards or bad
    transceivers (or transceiver connections).  I even have had one where the
    transceiver cable was partially pulled loose from the bulkhead.  Also, these
    machines seem to have occasional problems that are solved by reseating the
    boards.  (I.e., what may have seemed like irrelevant activity may actually have
    been your solution.)

It's possible.  On the other hand, we had already tried tightening
connections, reseating boards, etc. with no improvement.  We had been
fiddling with it for two full days without making any impact.  We
started out believing it was a problem along the lines you describe
(which is what prompted me to send mail, now that it doesn't seem that
way).

When the CE finally fixed it, he did the following:

Brought the machine down
Swapped the disk controller board set
Brought the machine up

Both symptoms of the problem (disk overruns and fragmented packets) were
gone.  As far as I know, he did nothing else on this attempt.  I wasn't
exactly looking over his shoulder, but I was never more than about twenty
feet away and he was telling me what he was doing each time he tried
another fix.

    When our CE believes he has a suspect board, we usually put it back in to
    see if the problems come back.  I'm curious if that was done with the
    disk boards in your case?  

No.  This would have been a good sanity check which, unfortunately, we
didn't think to try.  We were so happy to be back on the air and we had
a site full of users waiting to get things done.  If we had it to do
over again, I'd follow your suggestion.

				Without that, my theory of why the disk problems
    were causing network problems is "They weren't!".  (Of course, the disk
    problems could have been caused by bad boards on the disk or they could have
    been caused by the net problems which in turn had a different cause.)

I understand your skepticism.

    I presume you looked at the net statistics in the Peek window to see if that
    host (versus other hosts) had more or less bad packets.

I didn't really check this.  I'm aware of these meters but don't
particularly trust them.  I got the feeling from talking with Symbolics
folks that nobody is quite sure that the meters you see in Peek are
telling you anything useful.  I made this inference on the basis of
analysis like, "Well, we looked at the meters on one of our own servers
and they seem to look about like this."  Not very rigorous.  This,
coupled with the fact that the documentation seems pretty sketchy, does
not give me a warm feeling.  "The Sniffer", on the other hand, is a
product designed solely to troubleshoot a network.  It has lots of
handy features and is easy to use.

Do you believe the meters in Peek are useful and accurate?

    Did your CE do the recommended/required adjustments to the disk cards after
    putting them in?  It's rather obscure in the Eagle manual but in section
    14.2 (Adjustment of Servo Circuit), it says "After changing the DE or ...,
    only "Dynamic Adjustment" is needed.."

Hmmm, I think I'd better check into this.  Thanks.

[And now, the postscript:  The problem has resurfaced!  We ran for a
week with no noticeable problems.  I decided it was time to do some disk
hygiene and so was using the Check Records command from Level 3 of
FSMaint.  It was running along fine for quite a while when all of a
sudden -- yes, you guessed it -- lots and lots of disk overruns!  Oh
well, back to the drawing board.  I guess this will finally give us the
impetus to migrate to a 3650 as our primary file server.]