
Re: SYS:SITE; A novice administrator's question



    Date: Fri, 2 Aug 91  10:03:22 EDT
    From: "John R. Delaney" <delaney@xn.ll.mit.edu>

    Ghod, you know some obscure things. I hate to think of the circumstances
    that forced you to learn them!

It's worse than you think!

I guess I should tell the story; it's a good cautionary tale, about
how NOT to release software, and what not to do to the people working
on the last stages of a release (like documentation and QA). (Then I'll
stop, since there's been enough mail on this topic.)

I should note at the outset that Symbolics isn't the only company
with this sort of problem, and that the Symbolics of back then was a much
younger, less experienced company, with a smaller customer base.
Still, it was pretty unpleasant at the time.

A very long time ago (System 4.x) I was in charge of the group that
actually prepared the Symbolics system software for release.  Since
earlier releases had been very hard to install, and plagued with
lots of installation problems, I decided we should test them on the
various configurations.  Lo and behold, we found it wouldn't install
on Unix or VMS, due to logical pathname translation problems.  Unix
only allowed 14 characters, and VMS wasn't much longer, and didn't
allow anything but alphanumerics.  (This was a long time ago; but if
you buy your Unix from a telephone company, it's still true!)

Anyway, it seems that nobody who was writing the software had bothered
to test that it could be done.

So, with every minute critical, I had to rescue the situation.  (I spent
a lot of my time at Symbolics rescuing other people's messes).

So I wrote a quick hack to fix the problem with as little impact on
the system as possible.  Specifically, I wrote a wrapper around
:TRANSLATED-PATHNAME to catch any translation errors and look in a
table of exceptions, and I wrote a little tool to help me fill out this
exceptions table with new names I made up as I went along.  Presto, a
release.  The nice thing about this is that it didn't affect anything
about the rest of the system, so we could feel relatively safe about
this last-minute change.  It took me a couple of anxious days of no sleep
to implement and adequately test, while everyone breathed down my neck
wondering why the release was being held up.
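
To make the idea concrete, here is a minimal sketch in Python (not the
original Lisp, which isn't shown in this message).  It assumes a strict
translator that signals an error when a name won't fit, plus a wrapper that
catches the error and consults a hand-filled exceptions table.  Every name
and table entry in it is hypothetical; only the 14-character limit comes
from the story above.

```python
# Hypothetical sketch; names and table entries are illustrative only.
class TranslationError(Exception):
    """Raised when a logical name cannot be mapped onto the host."""


MAX_NAME_LENGTH = 14  # the old Unix filename limit mentioned above

# Hand-maintained table of exceptions: logical name -> made-up short name.
EXCEPTIONS = {
    "PATHNAME-TRANSLATION": "pathname-trans",
}


def translate(name):
    """Strict translation: lowercase the name, fail if it will not fit."""
    physical = name.lower()
    if len(physical) > MAX_NAME_LENGTH:
        raise TranslationError(name)
    return physical


def translate_with_exceptions(name):
    """Wrapper: try the normal translation, fall back to the table on error."""
    try:
        return translate(name)
    except TranslationError:
        if name in EXCEPTIONS:
            return EXCEPTIONS[name]
        raise  # not in the table yet: surface the error so it gets added
```

The wrapper leaves the underlying translator untouched, which is the
property that made the change feel safe at the last minute.  It also only
works as long as the translator keeps signalling errors.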

We shipped it, and it worked.

Along comes Release 5.0.  I went to update the table of exceptions, and
what did I find, but that the mechanism didn't work anymore!  At first,
I thought someone had removed it, but on further investigation, I found
that someone had changed the pathname wild-card translation code to not
signal errors anymore, and instead just do something half-assed.

Now, I considered this to be a serious mistake, because the error checking
was important; like for detecting errors, you know.  But at that late date,
I felt it was too dangerous to change it, since it could affect LOTS of things;
there would be no way to be sure I wasn't breaking something else.

(Needless to say, I was pissed.  I had made enough noise the first time
that I felt this was SERIOUSLY negligent.  I may have been one of the
maintainers of the pathname system, but that doesn't mean I should magically
add new functionality when people change the rules without telling me).

So this time I took a different tack.  Instead of depending on catching
the errors and correcting for them, I decided to instead build a new
mechanism, that could specify the transformations wanted.  Also, I wanted
to be able to specify heuristics.  One set of heuristics worked well for
Unix fonts, and another for VMS pathnames, and so on.  By specifying rules
to handle most cases, I could reduce the set of exceptions down to a manageable
size.  Even so, the number of cases was too large to check each time you
translated, so I had to implement a hairy kind of pattern-matching hash table
lookup for the rules.
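
A much simplified Python sketch of that shape follows.  It is an
illustration under stated assumptions, not the actual implementation:
exact-name exceptions are found with one hash lookup, and only then are the
per-host wildcard rules tried in order.  The rule patterns, the font
heuristic, and the table entries are all hypothetical.

```python
from fnmatch import fnmatchcase


def shorten(name, limit=14):
    """Lowercase and truncate to the host's length limit."""
    return name.lower()[:limit]


def drop_font_suffix(name):
    """Hypothetical font heuristic: drop a trailing -FONT, then shorten."""
    return shorten(name[: -len("-FONT")])


def alnum_upper(name):
    """Hypothetical VMS heuristic: keep only alphanumerics, uppercase."""
    return "".join(ch for ch in name.upper() if ch.isalnum())


# Exact-name exceptions, hashed by (host type, logical name).
EXCEPTIONS = {
    ("unix", "PATHNAME-TRANSLATION"): "pathname-trans",
}

# Heuristic rules per host type, most specific first: (pattern, transform).
RULES = {
    "unix": [("*-FONT", drop_font_suffix), ("*", shorten)],
    "vms": [("*", alnum_upper)],
}


def translate(host_type, name):
    """One hash probe for exceptions, then the first matching rule wins."""
    exact = EXCEPTIONS.get((host_type, name))
    if exact is not None:
        return exact
    for pattern, transform in RULES[host_type]:
        if fnmatchcase(name, pattern):
            return transform(name)
    raise LookupError(f"no rule for {name!r} on {host_type}")
```

For example, translate("unix", "HELVETICA-BOLD-ITALIC-FONT") falls through
to the font rule, while translate("unix", "PATHNAME-TRANSLATION") never
reaches the rules at all.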

This took me an anxious week of no sleep, with people breathing down my neck
wondering why the release was being held up.  Along the way, I discovered that
one implementor (who will remain nameless) changed the way patch directory
names were generated, so that patch directories from one release would not
be seen by the other release, and vice versa.  The namings were so confusing
that I couldn't keep them straight, so I was sure customers wouldn't be able
to either.  Besides, it was too late to document what they'd have to do to be
compatible.  So I made the same rule-based mechanism handle translating patch
directory names, too, so it could search and be compatible with either naming
scheme.

This was an awful lot of new code to be written and tested in a week's time,
and it had to be solid and fast.  But it worked, and was flexible enough to
do everything that was needed.  So we crossed our fingers, and shipped it.
And waited very nervously for customers to install it.

It worked.

That was about 8 years ago or so now, and that code is still in use.
Much of the need for it has been alleviated by better Unix and VMS
filesystems, but it still handles the site directory and patch files
in addition to the usual wildcard matching.

Somewhere along the line, I wrote for the QA department (this was later,
when we HAD a QA department) an additional tool which translated all
of the logical pathnames on the distribution tapes to each kind of host,
and gave a report on the installability of the tape in that software.
(I think I wrote that because I couldn't get the graphics division to
test their tapes against actual hosts).
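
The kind of report such a tool produces can be sketched roughly as below,
in Python.  This is a hedged illustration only: it checks candidate file
names directly against per-host limits instead of running a real
translation, and the VMS length limit is a placeholder value, not a
documented one.

```python
# Hypothetical host constraints.  The Unix limit is the 14 characters
# mentioned earlier; the VMS number is a placeholder for illustration.
HOST_LIMITS = {
    "unix": {"max_length": 14, "alnum_only": False},
    "vms": {"max_length": 16, "alnum_only": True},
}


def problems(name, limits):
    """Return a list of reasons this name cannot be installed as-is."""
    found = []
    if len(name) > limits["max_length"]:
        found.append("too long")
    if limits["alnum_only"] and not name.isalnum():
        found.append("non-alphanumeric characters")
    return found


def installability_report(names, hosts=HOST_LIMITS):
    """Map each host type to the names that would fail there, with reasons."""
    report = {}
    for host, limits in hosts.items():
        failures = {}
        for name in names:
            found = problems(name, limits)
            if found:
                failures[name] = found
        report[host] = failures
    return report
```

An empty dictionary for a host type means every name on the tape would
install there; anything else is a list of names to fix before shipping.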

Symbolics did learn; installation testing was an important part of what
the Symbolics QA did once it was formed, and where possible, they did it
at various points in the product cycle, not just at the end.

Still, there are problems, and not just with Symbolics.  Have you ever
installed MS-DOS?  Macintosh System 7?  MS-DOS 5.0 is supposed to be
a huge improvement, but that speaks poorly for Microsoft's earlier efforts.
We won't discuss Unix, which also caused a lot of the problems for
Symbolics.

The lessons, of course, are not just limited to software installation,
but to all quality issues, including UI design, usability, and
documentation.  Developers, go and get your QA person and Doco and
get them involved in the project from the very first design stages.
Design for the end-user FIRST, because that has to set the goals
and constraints on the rest of the implementation.

There's lots of other lessons to be drawn, like the futility of
playing hero.  (If you get around the problem by heroic effort,
nothing changes; you'll have to play hero again).  I'll leave
the others as an exercise for the reader.

So you see, John, I didn't learn this stuff trying to deal with this
code in some nightmare.  Rather, my nightmare forced me to create
this code.