Dialnet timing problems (was: "Modems")



    Date: Fri, 2 Nov 90 10:36-0000
    From: p2@porter.asl.dialnet.symbolics.com (Peter Paine)

	Date: Thu, 1 Nov 90 14:26 PST
	From: JFK@BOLD-EAGLE.varian.dialnet.symbolics.com (Joe F. Karnicky)

        [...]

	(2) Frequently, my system gets into a state where it is not possible for
	me to successfully probe Riverside.   My modem goes off hook, dials,
	Riverside answers, and I see a carrier detect.  However, the carrier is
	then immediately lost.   This seems to be a Riverside-specific problem
	as I can successfully dial up other (non-Symbolics) computers.
	I'm currently working with Symbolics software support on sorting out
	what's going on.
	(any suggestions?)

	Regards,
	Joe

    I have been having quite a war with what I take to be the same problem
    as yours. In the short term, I wrote a fix that actually keeps the
    mailer operational. Hoping that this is of use - and not too hideous.

    ;;; -*- Mode: LISP; Syntax: Common-lisp; Package: USER; Base: 10 -*-

    ;;; Pole Dialnet at intervals to check whether it is jammed, if so restart it.
    ;;; To do: post notification into S&F mail Log window

    [...]

(That's "Poll" by the way.)

This sounds suspiciously like a fundamental dialnet bug which
Symbolics has known about for about three years.  Unfortunately, my
actual patch for this bug is offline, but I could attempt to retrieve it
if some dialnet-sufferers would like to try it out.  There are a number
of other known timing problems in the dialnet code, BTW.

Date: Mon, 11 Jan 88 16:47 EST
In Symbolics 3640 Genera 7.1, IP-TCP 52.16, 7-1-Patches 1.34,
Hu-Kwa 12.2, microcode 3640-MIC 396, FEP 127, FEP0:>v127-lisp.flod(55),
FEP0:>v127-loaders.flod(55), FEP0:>v127-info.flod(55),
FEP0:>v127-debug.flod(34), FEP0:>v127-tests.flod(55),
Machine serial number 5146,
Debug ERROR-REPORTER frames. (from J:>Mly>Debugger-Patch.lisp.12), on Bullwinkle J. Moose:

SYS:DIALNET;STREAM.LISP contains the following comment:

   This means that both ends wanted to request service at the same time.  In this case,
   the user waits a random amount of time of around 1 second and then returns to the IDLE
   state.  From there, it will either get the other ends request or manage to send its
   request first.

However, absolutely no attempt is made to implement the above behaviour!!

In fact, what happens in such a situation is that the interface is set
to state :IDLE (by the receiver process).  This means that
(FLAVOR:METHOD :OPEN-STREAM DIAL:DIALNET-INTERFACE), running in the user
process, will notice that STATE is no longer :REQUEST-SENT, but is not
:OPEN, and will then signal an error saying that the connection was
refused (with a :REASON of NIL).  This is completely broken!
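
To make the failure concrete, here is a minimal model of the check just
described.  It is written in Python purely for illustration and is not the
Symbolics source; the names State, ConnectionRefused and
naive_open_stream_check are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    REQUEST_SENT = auto()
    OPEN = auto()


class ConnectionRefused(Exception):
    """Hypothetical stand-in for the 'connection refused' error."""

    def __init__(self, reason):
        super().__init__(f"connection refused (reason={reason!r})")
        self.reason = reason


def naive_open_stream_check(state):
    """Model of the user-process check: anything that is neither
    REQUEST_SENT nor OPEN is reported as a refusal."""
    if state is State.REQUEST_SENT:
        return "waiting"
    if state is State.OPEN:
        return "open"
    # IDLE lands here too, even when the receiver process only set it
    # because both ends requested service at the same time.
    raise ConnectionRefused(reason=None)


# After a request collision the receiver process has set the state to IDLE.
try:
    naive_open_stream_check(State.IDLE)
    collision_result = "no error"
except ConnectionRefused as err:
    collision_result = f"refused, reason={err.reason!r}"
```

In this model the remote end never refused anything, yet the collision
case is indistinguishable from a real rejection.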

I believe that the below patch is a (deadlock-free) way of fixing this
problem.  It has the correct behaviour of making :OPEN-STREAM wait until
its attempt to connect is really accepted or rejected by the remote end,
rather than stupidly, incorrectly and misleadingly claiming to have been
rejected.  The patch doesn't rely on the remote machine running the same
patch.
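
The intended behaviour can be sketched as follows.  This is not the patch
itself, only a Python illustration of the idea under stated assumptions:
the receiver process is modelled as a sequence of events, the back-off
range of 0.5 to 1.5 seconds is a guess at "around 1 second", and
max_attempts is a hypothetical limit that the description above does not
mention.

```python
import random


class ConnectionRefused(Exception):
    """Raised only when the remote end explicitly rejects the request."""


def open_stream(remote_events, sleep, rng=None, max_attempts=5):
    """Wait for a real answer to a connection request.

    remote_events yields "accept", "reject" or "collision" as the
    (modelled) receiver process observes them.  Returns the number of
    requests that had to be sent before the remote end accepted.
    """
    rng = rng or random.Random()
    attempts = 1  # the first request has already been sent
    for event in remote_events:
        if event == "accept":
            return attempts
        if event == "reject":
            raise ConnectionRefused("remote end rejected the request")
        if event == "collision":
            # Both ends requested service at once: back off, then retry
            # instead of reporting a refusal.
            if attempts >= max_attempts:
                raise TimeoutError("gave up after repeated collisions")
            sleep(rng.uniform(0.5, 1.5))
            attempts += 1
    raise TimeoutError("no answer from the remote end")


# One collision followed by an acceptance; delays are recorded, not slept.
delays = []
attempts = open_stream(["collision", "accept"], sleep=delays.append,
                       rng=random.Random(0))
```

Passing sleep in as a parameter keeps the sketch testable; a real
implementation would block the user process for the chosen interval.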

It is very easy to reproduce this lossage:  Just open two successive
dialnet connections to another host which is waiting to open a dialnet
connection to you.  We saw this all the time as a result of the following:

1  Invoke MAIL-PROBE service on remote host.
2  Immediately following this, invoke SMTP service on remote host.
2a Remote host invokes SMTP service on us as soon as it sees that
   it has a connection to us.

Steps 2 and 2a happen `simultaneously,' and both ends end up claiming
that their connection request was rejected (and, unfortunately for our
mail service, claiming that the other host has gone down -- even though
it still has a live (idle) connection to it...)

[... patch deleted ...]