
Re: Network diagnostics



Barmar, are you aware of SYS:NETWORK;CHAOS-PERF.LISP?  It has some
performance tests that are useful.   (Under separate cover, I'll send
you some code that I have that sits on top of this.  The Chaos testing
stuff is a lot better than my feeble attempts at doing ICMP echoes.)  If
your question is regarding sending vs receiving, this should help.

As I recall, it is possible to put the interface into promiscuous mode, but I
don't know of any code that uses it.

Most of the problems on our net are intermittent and very hard to track
down.  I've put a TDR on the net, but the net is relatively clean.  The cause
tends to be a transceiver going bad (we still have some old level 1 vampires
from the Dandelion days) or the I/O on a computer having problems (e.g., the
I/O board on Lispms, but the same type of problem has occurred on Suns).

We have done some of our own wiring of transceiver cables and, once, the
person who did it didn't pay attention to the way the pairs were
supposed to go (i.e., they weren't paired).  We then had a machine that
could talk to some hosts but not to others.

Some of our most recent problems occurred when someone was running Prolog
on a Sun.  The machine wound up thrashing and was using 40% of the bandwidth
of the net.  As soon as there was any more load (e.g., if we sprayed from one
Sun to another), the number of collisions and bad packets shot up.  The initial
symptoms of this problem were Suns saying their transceivers were jammed.
I also run a background process on my lispm that watches various system
parameters, including the number of collisions and bad packets (I'm on a 3600,
so the statistics include all the packets, not just those to my machine).  That
started complaining.  Then the problem would go away when we tried to track it
down.  It happened just after noon one day, then again after 5PM, then again
the next day at similar times.  That's when I realized it was probably someone
doing something rather than a hardware failure.
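
For anyone who wants to try the same kind of watcher, here is a rough sketch
of the idea in Python rather than Lisp.  It is not the code I run: the
`Sample` record, the `check_interval` function, and both threshold values are
made up for illustration, and how you read the counters off your own interface
is left out entirely.  It only shows the comparison of two successive counter
readings.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of cumulative interface counters (hypothetical layout)."""
    packets: int      # total packets seen so far
    collisions: int   # total collisions so far
    bad_packets: int  # total bad packets so far


def check_interval(prev, cur, max_collision_rate=0.05, max_bad_rate=0.01):
    """Return complaint strings for the interval between two samples.

    The default thresholds are arbitrary assumptions; tune them to what is
    normal on your own net.
    """
    packets = cur.packets - prev.packets
    if packets <= 0:
        # No traffic, or the counters were reset: nothing to judge.
        return []

    complaints = []
    collision_rate = (cur.collisions - prev.collisions) / packets
    bad_rate = (cur.bad_packets - prev.bad_packets) / packets
    if collision_rate > max_collision_rate:
        complaints.append(f"collision rate {collision_rate:.1%} is high")
    if bad_rate > max_bad_rate:
        complaints.append(f"bad packet rate {bad_rate:.1%} is high")
    return complaints
```

Logging the time of each complaint is what makes a pattern like "just after
noon, then again after 5PM" visible.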

Our usual way of tracking down intermittent hardware problems is to bisect the
net repeatedly.  (Hmmm... Some of our multiport transceivers have LOOPBACK
switches.  That's probably a good way to isolate them.)  However, some of the
problems tend to disappear when the net is broken and then reassembled (with
a section missing).
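
The bisection itself is just a binary search over the segments.  A minimal
sketch in Python, assuming exactly one bad segment and a fault that reproduces
reliably (which, as noted above, is not always the case): `bisect_fault` and
the `shows_fault` probe are hypothetical names, and the probe stands in for
whatever test you run on the isolated part of the net.

```python
def bisect_fault(segments, shows_fault):
    """Narrow a fault down to one segment by repeated halving.

    segments    -- ordered list of segment names along the cable
    shows_fault -- hypothetical probe: given a list of segments left
                   connected, returns True if the fault still appears
    Assumes a single faulty segment and a reproducible fault.
    """
    if not segments:
        raise ValueError("no segments to test")

    candidates = list(segments)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        if shows_fault(first_half):
            candidates = first_half        # fault is in the first half
        else:
            candidates = candidates[mid:]  # otherwise it is in the rest
    return candidates[0]
```

With n segments this takes about log2(n) tests, but each test means breaking
and reassembling the net, which is exactly what can make the problem vanish.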