[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: LispM's crummy I/O performance (was RE: LispM Market Share)

    Date: Wed, 17 Jan 90 11:08:56 -0500
    From: magerman@linc.cis.upenn.edu
	   My data set is a sequence of 1 million pairs of integers.  The
    first in each pair ranges from 0 to 32767, and the second ranges from 0
    to 127.  So, a section of the data might be:
	    3000 12 4562 50 18002 61 456 12 . . .
    (Actually, each of these values corresponds to a word/part-of-speech
    pair in a corpus of text, the Brown Corpus.  I have just mapped the
    words and parts-of-speech into integers so that the LispM could process
    them in my lifetime.)

I was about to say this might be consing/paging time, but I decided to put it
to the test first.  I wrote a file with 100,000 such pairs (I don't have the
disk space for a real trial) and then read it in.  Rather than making a
single big list I just put the numbers into the file at top-level, thinking
to avoid the consing of the big list.  READing 2e5 numbers distributed as
above took about 400 seconds, so I have to assume that doing the same for 2e6
numbers would take over an hour.  [As an experiment I redid this using a
2e5-element list, and a single READ took 10% less time, although twice as
many page faults, so my guess that CONSing was a big part of the problem is
probably wrong anyway]

You might try writing the file out as a binary file (not a BIN file).  I did
this with the same kind of data as above (100,000 pairs of integers) using
the technique in the second example below, and reading that file took under
30 seconds, giving a performance improvement of more than a factor of 12; if
it scales up, you should be able to read your file in about five minutes (you
might want to do something fancy with the paging behaviour of your program to
improve the array-storage time).

Here are a couple of file-writing functions which produce binary files; the
reading functions should be easy to derive.  CLtL is silent on the exact
format of binary files, so an (UNSIGNED-BYTE 16) file might not be portable
between lisp implementations, and the file produced by the second function is
25% smaller than the first one, but the first function is arguably simpler to
understand then the second.

[By the way, the second binary file was 66% smaller on the disk than the
equivalent character file, which you might also find desirable.]

(defun write-my-data (my-data path)   ; my-data is 1e6x2 array
  (with-open-file (data-file path :direction :output :element-type '(unsigned-byte 16)) 
    (dotimes (i (array-dimension my-data 0))
      (write-byte (aref my-data i 0) data-file)
      (write-byte (aref my-data i 1) data-file))))

(defun write-my-data (my-data path)
  (with-open-file (data-file path :direction :output :element-type '(unsigned-byte 8))
    (dotimes (i (array-dimension my-data 0))
      (let* ((b (aref my-data i 0))
	     (b1 (ldb (byte 8 0) b))
	     (b2 (ldb (byte 8 8) b))
	     (b3 (aref my-data i 1)))
	(write-byte b1 data-file)
	(write-byte b2 data-file)
	(write-byte b3 data-file)))))