Re: reading numbers as symbols

> From: hunter@ncbi.nlm.nih.gov (Larry Hunter)
> I'm fishing for zip codes in large email files, and I would like to read
> preserving leading zeros in numbers.
> I wrote a function that looks for potnums in strings and puts || escapes
> around them.  It works, but it is inelegant and inefficient to have to read
> strings, then escape the potential numbers, then read-from-string.
> I would prefer to get the reader to treat all tokens that are potential
> numbers as symbols.  Something like setting the *read-base* to 0 :-).  
> Is there some straightforward way to do this that I missed?

The way you would do this is by creating your own readtable and using
SET-MACRO-CHARACTER to make each of the ten digits a non-terminating
macro character whose reader macro function builds a string.
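
A minimal sketch of that readtable follows.  The names
TOKEN-DELIMITER-P, READ-DIGIT-TOKEN, and *DIGIT-STRING-READTABLE* are
made up for illustration; only COPY-READTABLE, SET-MACRO-CHARACTER,
GET-MACRO-CHARACTER, PEEK-CHAR, and READ-CHAR come from Common Lisp.
It assumes a token ends at whitespace or at a terminating macro
character, and it only catches tokens that begin with a digit, so
tokens such as -5 or .5 would still be read as numbers.

```lisp
;; Hypothetical helper: true if CHAR would end a token in the
;; current readtable (whitespace or a terminating macro character).
(defun token-delimiter-p (char)
  (or (member char '(#\Space #\Tab #\Newline #\Return))
      (multiple-value-bind (fn non-terminating-p)
          (get-macro-character char)
        (and fn (not non-terminating-p)))))

;; Hypothetical reader macro function: called by READ with the digit
;; that started the token; collects the rest of the token verbatim.
(defun read-digit-token (stream char)
  (let ((chars (list char)))
    (loop for next = (peek-char nil stream nil nil t)
          while (and next (not (token-delimiter-p next)))
          do (push (read-char stream t nil t) chars))
    (coerce (nreverse chars) 'string)))

;; A copy of the standard readtable in which every digit is a
;; non-terminating macro character (fourth argument T).
(defparameter *digit-string-readtable*
  (let ((rt (copy-readtable nil)))
    (loop for c across "0123456789"
          do (set-macro-character c #'read-digit-token t rt))
    rt))

;; Usage: bind *READTABLE* around the call to the reader.
(let ((*readtable* *digit-string-readtable*))
  (read-from-string "(Boston MA 02139)"))
```

Because the digits are non-terminating, a token such as ABC123 is
still read as an ordinary symbol; only a token that starts with a
digit reaches READ-DIGIT-TOKEN.  Wrapping the result in INTERN would
give a symbol instead of a string, if that is what is wanted.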

But I think that using the Common Lisp reader in this application is
simply inappropriate, and this is a fairly common design mistake.  The
READ function is a complicated subroutine, extensible through the
readtable, that converts a stream of characters into a recursively
nested Lisp object, capable of creating many different types according
to the external syntax described in CLtL Chapter 2.  READ is modular,
and uses lower-level functions such as READ-CHAR to obtain characters.

Your application intends to scan a stream of characters that is the
printed form of a mailing list and presumably collect or
otherwise process zip codes.  Presumably this application will also
use the lower-level functions such as READ-CHAR, READ-LINE, and
friends, but nothing about this application is reminiscent of the task
performed by the READ function, which includes constructing lists,
complex numbers, conses, ratios, and pathnames.  The machinery of READ
seems more likely to get in your way than to help solve the task.
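
As a rough illustration of the lower-level approach, here is a sketch
that uses only READ-LINE and character predicates.  The function names
ZIP-CODES-IN-LINE and COLLECT-ZIP-CODES are invented for this example,
and it assumes that a "zip code" is any run of exactly five digits,
which is surely too loose for real mail but shows the shape of the
thing.  Leading zeros survive because the digits are never converted
to numbers.

```lisp
;; Hypothetical helper: return the runs of exactly five digits
;; found in LINE, as strings, in order of appearance.
(defun zip-codes-in-line (line)
  (let ((zips '())
        (start nil))                    ; index where the current run began
    (flet ((finish (end)
             ;; Close the current digit run, keeping it if it is 5 long.
             (when (and start (= (- end start) 5))
               (push (subseq line start end) zips))
             (setf start nil)))
      (loop for i from 0 below (length line)
            do (if (digit-char-p (char line i))
                   (unless start (setf start i))
                   (finish i)))
      (finish (length line)))           ; a run may end at end of line
    (nreverse zips)))

;; Hypothetical driver: scan every line of STREAM with READ-LINE.
(defun collect-zip-codes (stream)
  (loop for line = (read-line stream nil nil)
        while line
        append (zip-codes-in-line line)))

;; Usage on a string standing in for a mail file:
(with-input-from-string (in (format nil "Mr . Foo (Acme~%Newark NJ 07102~%"))
  (collect-zip-codes in))
```

Stray dots, colons, and parentheses are just characters here, so none
of them can make the scan fail.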

To look at it another way, mailing lists are usually sloppy data, with
occasional strange or botched punctuation.  Imagine that just one
entry has a colon character somewhere in it.  Suppose some creative company
has a name with an unbalanced parenthesis.  Suppose an entry "Mr .
Foo" has an unintended space before the period.  Suppose an entry has
a "#" in it, such as an apartment number, and perhaps even a "#.".
All these things will likely cause the reader to signal an error, because
these syntax elements have meaning to READ that they don't have in a
mailing list.
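
A small experiment makes the point.  TRY-READ below is a hypothetical
helper, not a standard function: it calls READ repeatedly on a string
and returns :OK if everything read cleanly, or the condition object if
the reader signaled an error.  The exact condition types vary between
implementations, so none are shown here.  *READ-EVAL* is bound to NIL
so that a stray "#." cannot evaluate anything.

```lisp
;; Hypothetical helper: read every object in STRING with the
;; standard reader; return :OK or the error condition signaled.
(defun try-read (string)
  (handler-case
      (let ((*read-eval* nil))
        (with-input-from-string (in string)
          (loop with eof = (list nil)   ; unique end-of-file marker
                until (eq (read in nil eof) eof))
          :ok))
    (error (c) c)))

;; Entries of the kind described above:
(try-read "Mr Foo 02139")     ; reads, but 02139 comes back as a number
(try-read "Mr . Foo")         ; lone dot token
(try-read "Acme (Widgets")    ; unbalanced parenthesis
(try-read "Apt # 4")          ; "#" followed by whitespace
```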