Dialnet timing problems (was: "Modems")



    Date: Tue, 6 Nov 1990 12:51 EST
    From: Mly@ai.mit.edu (Richard Mlynarik)

	Date: Fri, 2 Nov 90 10:36-0000
	From: p2@porter.asl.dialnet.symbolics.com (Peter Paine)

	    Date: Thu, 1 Nov 90 14:26 PST
	    From: JFK@BOLD-EAGLE.varian.dialnet.symbolics.com (Joe F. Karnicky)

	    [...]

	    (2) Frequently, my system gets into a state where it is not possible for
	    me to successfully probe Riverside.   My modem goes off hook, dials,
	    Riverside answers, and I see a carrier detect.  However, the carrier is
	    then immediately lost.   This seems to be a Riverside-specific problem
	    as I can successfully dial up other (non-Symbolics) computers.
	    I'm currently working with Symbolics software support on sorting out
	    what's going on.
	    (any suggestions?)

	    Regards,
	    Joe

As I pointed out to JFK a while ago, this particular problem was a two-day
incident in which Riverside's serial substrate was setting the serial chip
to a bogus mode, resulting in complete garbage out the serial port.  Since
this is in software not yet even released to the field, no one else is likely
to have seen this bug, and they won't, because it'll be fixed before the
code Riverside is running even makes it out into the field.

The fact that this was phrased as if this is a chronic problem is interesting,
because JFK told me that it was not.

I must also point out that all these messages flying around on SLUG about
Dialnet are mostly useless to other people who cannot read minds, since
vanishingly few of them report:
       that have floated around the network in the last half-decade or so.

	I have been having quite a war with what I take to be the same problem
	as yours. In the short term, I wrote a fix that actually keeps the
	mailer operational. Hoping that this is of use - and not too hideous.

	;;; -*- Mode: LISP; Syntax: Common-lisp; Package: USER; Base: 10 -*-

	;;; Pole Dialnet at intervals to check whether it is jammed, if so restart it.
	;;; To do: post notification into S&F mail Log window

	[...]

    (That's "Poll" by the way.)

    This sounds suspiciously like a fundamental dialnet bug which
    Symbolics has known about for about three years.  Unfortunately, my
    actual patch for this bug is offline, but I could attempt to retrieve it
    if some dialnet-sufferers would like to try it out.  There are a number
    of other known timing problems in the dialnet code, BTW.

I fixed many of the timing holes in Dialnet for 8.0.1 (*not* 8.0).  I'd
have to check if your nondeterminism patch, or something like it, was
incorporated; I'll do that later.

Certainly, people running 7.1 (as the below herald indicates) should not
expect *any* fixes to their implementations short of the mercy of
strangers, and should instead upgrade to at least 8.0.1 if they care
about the robustness of their Dialnet.

As I've said before, however, some timing races remain, and will continue
to remain, because there's far more payoff in scrapping Dialnet and replacing
it with a modern, interoperable protocol, such as TCP/IP over PPP, than there
is in continuing to poke at an implementation that is most of a decade old,
lacks a written specification other than the code, and interoperates only
with other lisp machines.  I do not know when you might see Dialnet replaced
by PPP, but it's quite likely that you'll see that before any *major* improvements
in Dialnet's reliability after 8.0.1.

I would certainly *not* install the patch below (which dates from 7.1) in
any 8.0.1 release.  While I haven't checked it carefully, you're bound to
break something, since the patch below is based on code that has had
extensive changes made to it since then.

    Date: Mon, 11 Jan 88 16:47 EST
    In Symbolics 3640 Genera 7.1, IP-TCP 52.16, 7-1-Patches 1.34,
    Hu-Kwa 12.2, microcode 3640-MIC 396, FEP 127, FEP0:>v127-lisp.flod(55),
    FEP0:>v127-loaders.flod(55), FEP0:>v127-info.flod(55),
    FEP0:>v127-debug.flod(34), FEP0:>v127-tests.flod(55),
    Machine serial number 5146,
    Debug ERROR-REPORTER frames. (from J:>Mly>Debugger-Patch.lisp.12), on Bullwinkle J. Moose:

    SYS:DIALNET;STREAM.LISP contains the following comment:

       This means that both ends wanted to request service at the same time.  In this case,
       the user waits a random amount of time of around 1 second and then returns to the IDLE
       state.  From there, it will either get the other ends request or manage to send its
       request first.

    However, absolutely *no* attempt is made to implement the above behaviour!!

    In fact, what happens in such a situation is that the interface is set
    to state :IDLE (by the receiver process).  This means that
    (FLAVOR:METHOD :OPEN-STREAM DIAL:DIALNET-INTERFACE), running in the user
    process, will notice that STATE is no longer :REQUEST-SENT, but is not
    :OPEN, and will then signal an error saying that the connection was
    refused (with a :REASON of NIL).  This is completely broken!
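
To make the failure concrete, here is a toy model of the collision just
described.  It is written in Python purely for illustration and is not
taken from SYS:DIALNET;STREAM.LISP: the state names mirror the keywords
in the text, but the function names and return strings are made up.

```python
# Toy model of the simultaneous-request collision; NOT the Dialnet source.
# State names follow the text; all function names here are hypothetical.

def receiver_sees_request(state):
    """Receiver process: a remote request arriving while our own request
    is outstanding drops the interface back to IDLE."""
    return "IDLE" if state == "REQUEST-SENT" else state


def naive_open_stream_check(state):
    """User process: any state that is neither REQUEST-SENT nor OPEN is
    reported as a refusal, which is the misreading at issue."""
    if state == "OPEN":
        return "open"
    if state == "REQUEST-SENT":
        return "still waiting"
    return "refused (reason: None)"


def simulate_collision():
    # Both ends have sent a request and have not yet heard back.
    local, remote = "REQUEST-SENT", "REQUEST-SENT"
    # Each receiver now sees the other end's request.
    local = receiver_sees_request(local)
    remote = receiver_sees_request(remote)
    return naive_open_stream_check(local), naive_open_stream_check(remote)


print(simulate_collision())
```

In this model both ends come out "refused" even though neither end
actually rejected anything.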

    I believe that the below patch is a (deadlock-free) way of fixing this
    problem.  It has the correct behaviour of making :OPEN-STREAM wait until
    its attempt to connect is really accepted or rejected by the remote end,
    rather than stupidly, incorrectly and misleadingly claiming to have been
    rejected.  The patch doesn't rely on the remote machine running the same
    patch.
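
For readers without the patch to hand, the intended behaviour (wait for
a real accept or reject, and treat a collision as "back off about a
second and try again") can be sketched roughly as below.  This is an
illustrative Python model of what the STREAM.LISP comment describes, not
a transcription of the patch; the function name, the outcome strings,
the 0.5 to 1.5 second range, and the attempt limit are all assumptions.

```python
import random
import time


class ConnectionRefused(Exception):
    """Raised only when the remote end really rejects the request."""


def open_stream(send_request, wait_for_outcome, max_attempts=5,
                sleep=time.sleep, rng=random.random):
    """Return the number of attempts it took to get the stream accepted.

    send_request and wait_for_outcome are hypothetical callbacks standing
    in for the interface; wait_for_outcome must return "accepted",
    "rejected" or "collision".
    """
    for attempt in range(1, max_attempts + 1):
        send_request()
        outcome = wait_for_outcome()
        if outcome == "accepted":
            return attempt
        if outcome == "rejected":
            raise ConnectionRefused("remote end rejected the request")
        if outcome != "collision":
            raise ValueError("unexpected outcome: %r" % (outcome,))
        # Both ends asked at once: wait roughly one second (assumed range
        # 0.5 to 1.5 s) so that one side gets its request in first.
        sleep(0.5 + rng())
    raise TimeoutError("no accept or reject after %d attempts" % max_attempts)


if __name__ == "__main__":
    # One collision followed by an accept; sleeping is stubbed out.
    outcomes = iter(["collision", "accepted"])
    print(open_stream(lambda: None, lambda: next(outcomes),
                      sleep=lambda seconds: None))
```

The point of the sketch is only that a collision is never surfaced to
the caller as a refusal; how the real code detects the collision is not
modelled here.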

    It is very easy to reproduce this lossage:  Just open two successive
    dialnet connections to another host which is waiting to open a dialnet
    connection to you.  We saw this all the time as a result of the following:

    1  Invoke MAIL-PROBE service on remote host.
    2  Immediately following this, invoke SMTP service on remote host.
    2a Remote host invokes SMTP service on us as soon as it sees that
       it has a connection to us.

    Steps 2 and 2a happen `simultaneously,' and both ends end up claiming
    that their connection request was rejected (and, unfortunately for our
    mail service, claiming that the other host has gone down -- even though
    it still has a live (idle) connection to it...)

    [... patch deleted ...]