
[Kanef@CHARON.arc.nasa.gov: Symbolics and genetic databases idea]



Here's a suggestion for an interesting application of symbolic
processing (and Symbolics processors) from Bob Kanefsky. -- Chuck

--- Begin forwarded message ---
Date: Thu, 2 Aug 90 04:12 PDT
From: Bob Kanefsky <Kanef@CHARON.arc.nasa.gov>
Subject: Symbolics and genetic databases idea
To: Chucko@CHARON.arc.nasa.gov

I had this idea back in March and wrote it down, but it completely
slipped my mind until a minute ago.  It has as good a chance of helping
Symbolics as any single thing I can think of.  Maybe you can give me
your opinion of it and suggest who to pass it on to (SLUG or someone in
Symbolics).

The national and international databases of DNA and protein data are in
terrible shape.  (GenBank, for example -- which, last I heard, is
maintained under contract by Intelligenetics in Mountain View, upstairs
from the local Symbolics office.)  Not only do they have a huge backlog
which is growing larger every year, but they allow all sorts of mistakes
to creep in that any reasonable software should be able to catch.  Every
time I got data from GenBank via the biochemist Peter and I used to
collaborate with, my software on the Symbolics caught some kind of
anomaly in the data.  One time someone had used the symbol "J" in the
middle of some DNA strings, which is neither a nucleotide code nor any
conventional ambiguity code, and apparently the GenBank folks just
blindly put it in.  Another time, several sequences appeared more than
once, with the same names, and our collaborator claims that means
they were listed redundantly in GenBank -- which can really screw up
statistical conclusions like the ones he's writing a paper about.  And
the way in which the data is stored is the most trivial format you can
possibly imagine -- the nucleotides are spelled out letter by letter,
e.g. "GTCAGTCTTTGGTTGGGTAGGAGTGTGCATCCC...", even though there are only
four letters, so the bulk of the information can theoretically be
represented by two bits per letter, or even less with some kind of
Huffman encoding or with rough evolution trees.  Sometimes they support
more than one format, by having two copies of each file on disk. And
things are likely to get much worse with all the money that the Human
Genome Initiative will be pouring into the gathering of more data.
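
As a rough illustration of the checks and the two-bit representation
described above, here is a minimal Python sketch.  It is not GenBank's
or the author's software; the function names are made up for this
example, and it assumes the standard IUPAC nucleotide ambiguity codes
are the "conventional" ones meant.  The packing routine handles only
unambiguous A/C/G/T sequences.

```python
from collections import Counter

# Assumption: the four bases plus the standard IUPAC ambiguity codes.
BASES = "ACGT"
IUPAC_AMBIGUITY = "RYSWKMBDHVN"
VALID = set(BASES + IUPAC_AMBIGUITY)


def invalid_symbols(seq):
    """Return (position, symbol) pairs for characters that are not valid codes."""
    return [(i, ch) for i, ch in enumerate(seq.upper()) if ch not in VALID]


def duplicate_names(entries):
    """entries is an iterable of (name, sequence); return names listed more than once."""
    counts = Counter(name for name, _ in entries)
    return sorted(name for name, n in counts.items() if n > 1)


def pack_2bit(seq):
    """Pack an unambiguous A/C/G/T sequence into bytes, four bases per byte."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for ch in chunk:
            byte = (byte << 2) | code[ch]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_2bit(data, length):
    """Inverse of pack_2bit; length is the original number of bases."""
    seq = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            seq.append(BASES[(byte >> shift) & 3])
    return "".join(seq[:length])
```

A stray "J" is reported with its position, and a packed sequence takes
one byte per four bases, provided the original length is stored
alongside it so the padding in the last byte can be dropped.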

Most of the problems I've seen in my occasional contact with GenBank
have technical solutions.  (The backlog isn't entirely technical;
apparently it's hard to convince biologists who sequence DNA to provide
machine-readable copies of their results instead of just printing them
in a journal and making GenBank type them in, if you can believe that!)
And most of the technical solutions are easy, if you have a good
environment for developing them.  Even allowing for the fact that the
users of the databases are running archaic software which will never
change and/or are used to seeing things formatted the way they are, I
see no reason why a more modern approach couldn't be put together on a
Symbolics machine with a few months of effort.

So here's the idea:  what if there were one or more Symbolics machines
on the network acting as smart database servers for, say, GenBank?  They
could contrive to look exactly like dumb UNIX and Tops-20 file servers to
clients who were used to dealing with that type of server.  (The GenBank
server that was recently retired was on Tops-20.)  Only they would be
unusually thorough file servers, since they would appear to have the
data replicated in every conceivable method of organization; the same
DNA sequence might be listed in dozens of files, so you could find all
primate "alu" repeat sequences in their own file, and all chimpanzee DNA
sequences in their own file, and all vertebrate TPA genes in their own
file, and so on, and each gene could also be listed alone in its own
file.  Subsets of genes could continue to be listed as separate entries
even after the entire gene is known and is listed elsewhere.
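
The "same sequence in dozens of files" idea can be sketched as one
stored record plus a set of named filters.  This is only an
illustration in Python, not Statice and not anything GenBank does; the
entry names, tags, and sequences below are invented placeholders, and
the view names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    lineage: tuple      # broadest group first
    tags: frozenset     # e.g. repeat family or gene name
    sequence: str


# Placeholder records, made up for this example only.
ENTRIES = [
    Entry("HYPO-ALU-1", ("vertebrate", "primate", "human"),
          frozenset({"alu"}), "GGCCGGGCGC"),
    Entry("HYPO-ALU-2", ("vertebrate", "primate", "chimpanzee"),
          frozenset({"alu"}), "GGCTGGGCGT"),
    Entry("HYPO-TPA-1", ("vertebrate", "rodent", "mouse"),
          frozenset({"tpa"}), "ATGGATGCAA"),
]

# Each "file" is just a predicate over the single stored copy.
VIEWS = {
    "primate-alu": lambda e: "primate" in e.lineage and "alu" in e.tags,
    "chimpanzee": lambda e: "chimpanzee" in e.lineage,
    "vertebrate-tpa": lambda e: "vertebrate" in e.lineage and "tpa" in e.tags,
}


def list_virtual_file(view):
    """Names of the entries that would appear in the given virtual file."""
    keep = VIEWS[view]
    return [e.name for e in ENTRIES if keep(e)]


def files_containing(name):
    """Every virtual file an entry shows up in, plus its own one-entry file."""
    entry = next(e for e in ENTRIES if e.name == name)
    return sorted(v for v, keep in VIEWS.items() if keep(entry)) + [name]
```

Adding a new way of organizing the data means adding one predicate, not
another copy of the sequences on disk.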

Physically, the files would never exist at all; the Symbolics machine
would internally be representing the sequences in Statice in a compact
form, and generating the stream of desired data in the expected format
on demand, but following FTP protocol (or NFS or Kermit or whatever).
Even the directory hierarchy need not exist, and the server could be
very forgiving about pathnames; if someone asks for "vertebrate/human/",
it could give it to them, even though for Show Directory purposes the
pathname is really "vertebrate/primate/human/".  An email-based server
could also be provided.  And for smarter clients, a more efficient and
database-oriented server would be available.
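
A forgiving pathname lookup of the kind described above might look
roughly like the following Python sketch.  The directory list, the
function names, and the 60-column output are assumptions made for the
example; a real server would sit behind FTP or NFS and would need a
policy for ambiguous requests, which this sketch simply refuses.

```python
# Hypothetical canonical hierarchy, as Show Directory would present it.
CANONICAL_DIRS = [
    ("vertebrate",),
    ("vertebrate", "primate"),
    ("vertebrate", "primate", "human"),
    ("vertebrate", "primate", "chimpanzee"),
]


def _in_order(short, full):
    """True if every component of short occurs in full, in the same order."""
    it = iter(full)
    return all(part in it for part in short)


def resolve(path):
    """Map a possibly abbreviated path to its canonical form, or None."""
    parts = tuple(p for p in path.lower().split("/") if p)
    if not parts:
        return None
    matches = [d for d in CANONICAL_DIRS
               if d[-1] == parts[-1] and _in_order(parts, d)]
    # Only answer when exactly one canonical directory fits the request.
    return "/".join(matches[0]) + "/" if len(matches) == 1 else None


def render(sequence, width=60):
    """Generate the text of a 'file' on demand, one fixed-width line at a time."""
    for i in range(0, len(sequence), width):
        yield sequence[i:i + width]
```

With this, a request for "vertebrate/human/" resolves to
"vertebrate/primate/human/", and the file contents are produced as a
stream only when someone asks for them.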

The advantage of using Symbolics machines as a delivery vehicle here
is that no one can possibly object that the file-server isn't
running UNIX if they can talk to it just as if it were (sigh, is that
the Turing Test for the 1990's?), and if it costs a little more, no one
is likely to quibble, since the cost would be spread over many hundreds
of well-funded researchers.  And with a MacIvory in the pool, it's even
easy to read the Mac and PC floppy disks that are probably the way that
most of the data is provided by researchers who DO submit machine-readable
results.

I believe the data is public domain, so maybe this could be offered as a
private alternative service without getting official sanction until it's
obvious that everyone is subscribing to the Symbolics-based server, or
it could be put forth as a national database from the start.  I'll bet
there would be a good chance of getting Human Genome Initiative funding
if someone wanted to work on this.  The only reason I don't want to
apply for it myself is that I'm more interested in working on NASA
problems.

					--Kanef
--- End forwarded message ---