Dialnet timing problems (was: "Modems")

To: Mly@ai.mit.edu (Richard Mlynarik)
Subject: Dialnet timing problems (was: "Modems")
From: Foner@YUKON.SCRC.Symbolics.COM (Leonard N. Foner)
Date: Fri, 9 Nov 1990 20:28:00 -0500
Cc: Foner@YUKON.SCRC.Symbolics.COM, SLUG@Warbucks.AI.SRI.COM
In-reply-to: <19901106175159.1.MLY@PSYCHOMAC.AI.MIT.EDU>

Date: Tue, 6 Nov 1990 12:51 EST
From: Mly@ai.mit.edu (Richard Mlynarik)

Date: Fri, 2 Nov 90 10:36-0000
From: p2@porter.asl.dialnet.symbolics.com (Peter Paine)

Date: Thu, 1 Nov 90 14:26 PST
From: JFK@BOLD-EAGLE.varian.dialnet.symbolics.com (Joe F. Karnicky)

[...]

(2) Frequently, my system gets into a state where it is not possible for
me to successfully probe Riverside. My modem goes off hook, dials,
Riverside answers, and I see a carrier detect. However, the carrier is
then immediately lost. This seems to be a Riverside-specific problem
as I can successfully dial up other (non-Symbolics) computers.
I'm currently working with Symbolics software support on sorting out
what's going on.
(any suggestions?)

Regards,
Joe

As I pointed out to JFK a while ago, this particular problem was a two-day
incident in which Riverside's serial substrate was setting the serial chip
to a bogus mode, resulting in complete garbage out the serial port. Since
this is in software not yet even released to the field, no one else is likely
to have seen this bug, and they won't, because it'll be fixed before the
code Riverside is running even makes it out into the field.

The fact that this was phrased as if this is a chronic problem is interesting,
because JFK told me that it was not.

I must also point out that all these messages flying around on SLUG about
Dialnet are mostly useless to other people who cannot read minds, since
vanishingly few of them report:
that have floated around the network in the last half-decade or so.

I have been having quite a war with what I take to be the same problem
as yours. In the short term, I wrote a fix that actually keeps the
mailer operational. Hoping that this is of use - and not too hideous.

;;; -*- Mode: LISP; Syntax: Common-lisp; Package: USER; Base: 10 -*-

;;; Pole Dialnet at intervals to check whether it is jammed, if so restart it.
;;; To do: post notification into S&F mail Log window

[...]

(That's "Poll" by the way.)

This sounds suspiciously like a fundamental dialnet bug which
Symbolics has known about for about three years. Unfortunately, my
actual patch for this bug is offline, but I could attempt to retrieve it
if some dialnet-sufferers would like to try it out. There are a number
of other known timing problems in the dialnet code, BTW.

I fixed many of the timing holes in Dialnet for 8.0.1 (*not* 8.0). I'd
have to check if your nondeterminism patch, or something like it, was
incorporated; I'll do that later.

Certainly, people running 7.1 (as the below herald indicates) should not
expect *any* fixes to their implementations short of the mercy of
strangers, and should instead upgrade to at least 8.0.1 if they care
about the robustness of their Dialnet.

As I've said before, however, some timing races remain, and will continue
to remain, because there's far more payoff in scrapping Dialnet and replacing
it with a modern, interoperable protocol, such as TCP/IP over PPP, than there
is in continuing to poke at an implementation that is most of a decade old,
lacks a written specification other than the code, and interoperates only
with other lisp machines. I do not know when you might see Dialnet replaced
by PPP, but it's quite likely that you'll see that before any *major* improvements
in Dialnet's reliability after 8.0.1.

I would certainly *not* install the patch below (which dates from 7.1) in
any 8.0.1 release. While I haven't checked it carefully, you're bound to
break something, since the patch below is based on code that has had
extensive changes made to it since then.

Date: Mon, 11 Jan 88 16:47 EST
In Symbolics 3640 Genera 7.1, IP-TCP 52.16, 7-1-Patches 1.34,
Hu-Kwa 12.2, microcode 3640-MIC 396, FEP 127, FEP0:>v127-lisp.flod(55),
FEP0:>v127-loaders.flod(55), FEP0:>v127-info.flod(55),
FEP0:>v127-debug.flod(34), FEP0:>v127-tests.flod(55),
Machine serial number 5146,
Debug ERROR-REPORTER frames. (from J:>Mly>Debugger-Patch.lisp.12), on Bullwinkle J. Moose:

SYS:DIALNET;STREAM.LISP contains the following comment:

This means that both ends wanted to request service at the same time. In this case,
the user waits a random amount of time of around 1 second and then returns to the IDLE
state. From there, it will either get the other ends request or manage to send its
request first.

However, absolutely *no* attempt is made to implement the above behaviour!!
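
For illustration only, here is a minimal sketch of the behaviour the
comment describes, written in Python rather than Lisp. It is not the
Dialnet code: every name in it is hypothetical, and the size of the
random spread is an assumption, since the comment says only "around 1
second".

```python
import random

# All names here are hypothetical; this models the quoted comment,
# not the actual SYS:DIALNET;STREAM.LISP code.
BASE_DELAY = 1.0  # "around 1 second", per the comment
JITTER = 0.5      # assumed spread; the comment gives no figure


def collision_delay(rng):
    """Return a randomized wait, in seconds, centred on BASE_DELAY."""
    return BASE_DELAY + rng.uniform(-JITTER, JITTER)


def resolve_collision(rng_a, rng_b):
    """Decide which end gets its request in first after a collision.

    Each end waits its own random delay and then returns to IDLE. The end
    whose delay expires first sends its request; the other end sees that
    request once it is back in IDLE instead of sending its own. Returns
    "A" or "B", or None if both drew the same delay and would simply
    collide again.
    """
    delay_a = collision_delay(rng_a)
    delay_b = collision_delay(rng_b)
    if delay_a == delay_b:
        return None
    return "A" if delay_a < delay_b else "B"


# Independent random sources, as two separate machines would have.
winner = resolve_collision(random.Random(1), random.Random(2))
```

The point of the randomness is that two ends drawing from independent
sources almost never pick the same delay, so one of them reliably gets
its request in first on the retry.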

In fact, what happens in such a situation is that the interface is set
to state :IDLE (by the receiver process). This means that
(FLAVOR:METHOD :OPEN-STREAM DIAL:DIALNET-INTERFACE), running in the user
process, will notice that STATE is no longer :REQUEST-SENT, but is not
:OPEN, and will then signal an error saying that the connection was
refused (with a :REASON of NIL). This is completely broken!

I believe that the below patch is a (deadlock-free) way of fixing this
problem. It has the correct behaviour of making :OPEN-STREAM wait until
its attempt to connect is really accepted or rejected by the remote end,
rather than stupidly, incorrectly and misleadingly claiming to have been
rejected. The patch doesn't rely on the remote machine running the same
patch.
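
To make the distinction concrete, here is a rough Python model (not the
deleted patch, and not Lisp Machine code) of the two ways the user
process can interpret the interface state. The state names follow the
text above, except REJECTED, which is a hypothetical stand-in for an
explicit refusal from the remote end; the function names are likewise
invented for the sketch.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    REQUEST_SENT = "request-sent"
    OPEN = "open"
    REJECTED = "rejected"  # hypothetical: explicit refusal by the remote end


class ConnectionRefused(Exception):
    pass


def open_stream_buggy(observed_states):
    """Model of the criticised behaviour: any state that is neither
    REQUEST_SENT nor OPEN is reported as a refusal."""
    for state in observed_states:
        if state is State.REQUEST_SENT:
            continue  # still waiting for an answer
        if state is State.OPEN:
            return "open"
        raise ConnectionRefused("reason: None")
    raise TimeoutError("no answer from remote end")


def open_stream_patient(observed_states):
    """Model of the intended behaviour: a collision-induced IDLE is not a
    refusal, so keep waiting until the remote end really answers."""
    collisions = 0
    for state in observed_states:
        if state is State.OPEN:
            return ("open", collisions)
        if state is State.REJECTED:
            raise ConnectionRefused("rejected by remote end")
        if state is State.IDLE:
            collisions += 1  # real code would re-send its request here
    raise TimeoutError("no answer from remote end")
```

Given the observed sequence REQUEST_SENT, IDLE, REQUEST_SENT, OPEN, the
first function raises ConnectionRefused as soon as it sees IDLE, while
the second counts one collision and returns ("open", 1).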

It is very easy to reproduce this lossage: Just open two successive
dialnet connections to another host which is waiting to open a dialnet
connection to you. We saw this all the time as a result of the following:

1 Invoke MAIL-PROBE service on remote host.
2 Immediately following this, invoke SMTP service on remote host.
2a Remote host invokes SMTP service on us as soon as it sees that
it has a connection to us.

Steps 2 and 2a happen `simultaneously,' and both ends end up claiming
that their connection request was rejected (and, unfortunately for our
mail service, claiming that the other host has gone down -- even though
it still has a live (idle) connection to it...)

[... patch deleted ...]