Re: Parsing records, substrings, efficiency

In article <owens.700331287@tartarus.uchicago.edu> owens@tartarus.uchicago.edu (Christopher Owens) writes:
>I'm parsing records from an external text file into an internal
>structure. The file is record-oriented with fixed column positions for
>the fields comprising the record.  I'm assuming the right approach is
>to get one line at a time using read-line, then to use subseq to chop
>each line into fields for further processing.

Actually, most of the string and sequence functions accept :START and :END
arguments, so it would probably be more efficient to keep everything in the
original string and just pass these indices around.  But when you finally
store a field somewhere, it's probably better to extract the subsequence.
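
As a rough sketch of what passing indices around can look like, the
function below compares one field against a key in place.  The name
FIELD-EQUAL-P, the blank-padding convention, and the column numbers in
the usage note are assumptions made up for this example, not anything
from the original question.

```lisp
;; Hypothetical helper: compare a blank-padded field against KEY
;; without extracting the field with SUBSEQ first.
(defun field-equal-p (record start end key)
  "True if columns START..END of RECORD, ignoring trailing blanks, equal KEY."
  ;; Index of the last non-blank character inside the field, or NIL
  ;; if the field is entirely blank.
  (let ((last (position-if-not (lambda (c) (char= c #\Space))
                               record
                               :start start :end end :from-end t)))
    (and last
         ;; STRING= takes :START1/:END1 bounds on its first argument.
         (string= record key :start1 start :end1 (1+ last)))))
```

With a record whose first ten columns hold a name, (field-equal-p
record 0 10 "SMITH") would test the name in place.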

>Some of the string fields are parsed into something else before being
>inserted in the internal structure, i.e. by parse-integer, or by some
>function that looks up an object given its name.  

These are good candidates for the above approach of using indexes.  I.e.
instead of (parse-integer (subseq input-record 20 25)) do (parse-integer
input-record :start 20 :end 25).
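
A minimal sketch of a numeric-field reader along these lines follows.
INTEGER-FIELD is a hypothetical name, and the choice of :JUNK-ALLOWED
is an assumption about how blank fields should be treated; without it,
PARSE-INTEGER signals an error on a field with no digits.

```lisp
;; Hypothetical helper: read an integer straight out of the record.
(defun integer-field (record start end)
  "Parse columns START..END of RECORD as an integer, or return NIL."
  ;; PARSE-INTEGER skips surrounding whitespace on its own.
  ;; With :JUNK-ALLOWED T a field with no digits yields NIL instead of
  ;; an error, but trailing junk such as "12ab" is also accepted as 12.
  ;; VALUES drops the second return value (the stopping index).
  (values (parse-integer record :start start :end end :junk-allowed t)))
```

If malformed fields should be reported rather than silently accepted,
drop :JUNK-ALLOWED and handle the error instead.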

>						   Other of the fields
>(about half the original record) are simply stored in the structure as
>strings, exactly as they appear in the original record.
>For these latter, unmodified string fields, I could store either
> (subseq input-record 20 25)
>or I could store
> (copy-seq (subseq input-record 20 25))
>The latter obviously allocates more storage, but it leaves the entire
>original string as garbage, to be collected in one piece.  The former
>doesn't allocate any more storage, but it leaves the storage allocated
>to the original string containing alternating segments of garbage and
>non-garbage.  I'm not sure what the GC will do with this, since I
>don't know how strings are relocated.

You're wrong about the storage allocated by the first version.  See the
second sentence of CLtL's description of SUBSEQ: "SUBSEQ *always* allocates
a new sequence for a result; it never shares storage with an old sequence."
So there is no "alternating segments of garbage and non-garbage"; the
entire string will still become garbage at once.
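
One way to see the difference is to modify the result and check
whether the original changes.  The sketch below contrasts SUBSEQ with a
displaced array, which is the construct that actually does share
storage.  COPIED-FIELD and SHARED-FIELD are hypothetical names used
only for this illustration.

```lisp
;; SUBSEQ: the result is a fresh string, independent of RECORD.
(defun copied-field (record start end)
  (subseq record start end))

;; Displaced array: the result is a window onto RECORD's storage, so
;; RECORD stays alive for as long as the window does.
(defun shared-field (record start end)
  (make-array (- end start)
              :element-type (array-element-type record)
              :displaced-to record
              :displaced-index-offset start))
```

Storing into the string returned by COPIED-FIELD leaves RECORD alone;
storing into the one returned by SHARED-FIELD changes RECORD as well.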

In fact, the two forms are semantically equivalent, since the second copy
that COPY-SEQ makes is redundant (it's a copy of a freshly-allocated string
to which no other references exist), but few compilers will actually
optimize out the call.
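
Putting the pieces together, a record parser might look roughly like
this.  The ENTRY structure, its slot names, and the column layout
(name in columns 0-9, an integer code in 10-14, the rest a city) are
all invented for the example; substitute the real layout.

```lisp
;; Hypothetical record layout, for illustration only.
(defstruct entry name code city)

(defun parse-entry (line)
  "Build an ENTRY from one fixed-column LINE."
  (make-entry
   ;; Stored as strings: SUBSEQ alone already yields fresh copies.
   :name (subseq line 0 10)
   :city (subseq line 15)
   ;; Parsed in place: no intermediate substring is allocated.
   :code (parse-integer line :start 10 :end 15)))
```

Once every field has been extracted this way, nothing in the ENTRY
refers to LINE any more, so the whole line can be collected.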

Barry Margolin
System Manager, Thinking Machines Corp.

barmar@think.com          {uunet,harvard}!think!barmar