[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: hash table access time



    To frequent hashers:

    We've just done some timings to compare hash-table
    access speed with assoc-list access speed,
    on a Symbolics using a Genera 7.1 system.

    I was surprised that it was this high,
    since hash-tables are supposed to be designed to optimize access time.
    Does anyone know what makes hashing this slow?
    Is it simply because the numerical hashing operations
    are much slower than list access operations on the lisp machine hardware?

    Does anyone know of any tricks to make hash-table access faster?

I also fell into the "hash table trap" myself about six months ago.
It is very tempting to use hash tables as a panacea without fully
understanding all of the gotchas involved. From what I can tell
although some of the information I am about to describe is
documented, much of it is not. This necessitated my disassembling
some system code to try to figure out how things work as these
sources are not available. Also, it has been six months since
I did this and I am not sure I remember all of the issues involved.
Well anyway, here goes...

First of all, I use a lot of EQ hash tables. It turns out that these,
or any other hash table which hashes on addresses can run into two
potential problems. First, after every GC the tables must be rehashed.
Second, GC flips are inhibited during hash tables accesses. In the first
case, if you have many, many small hash tables as I once did, you lose
real big as you spend all of your time rehashing with even a small
amount of consing and the ephemeral GC. The second problem causes a more
subtle performance bug. It turns out that the way flip inhibiting is
implemented, the GC process does not know when flipping becomes uninhibited.
So if it tries to flip and can't because some process inhibited flips it
waits 30 seconds and tries again. If it fails again it waits 60 seconds
and tries again. Each time it fails it doubles the wait time. So if you
do a lot of consing and a lot of table access then you run a greater chance
of having a flip attempt collide more frequently with a table access which
means that the flipping will get inhibited for more time which gives you
time to cons more and potentially run out of space. Now I don't know whether
EQL hash tables and EQUAL hash tables run into the same problems for leaf
level symbols.

It turns out that I was using EQ hash tables primarily to hash on flavor
objects. One solution for this case is to have each object contain its own
"serial number" to be used as a hash value. The idea is that a global
counter is kept which is used to assign a unique fixnum serial# to each
instance as it is created. Then hash tables are created which have
a user specified :test and :hash-function. Here is some sample code.

(defvar *serial#* 0 "The global serial# counter")

(defflavor serial#-mixin (serial#) ())

(defmethod (make-instance serial#-mixin :after) (&rest ignore)
 (setf serial# (incf *serial#*)))

(defmethod (serial#-hash serial#-mixin) ()
 (values serial# 0))

(defun =serial#s (x y) (= (serial#-hash x) (serial#-hash y)))

Now to use the above code.

(defflavor foo (bar) (serial#-mixin))

(setf baz (make-hash-table :test #'=serial#s :hash-function #'serial#-hash))

(setf quux (make-instance 'foo))

(setf (gethash quux foo) 'blah)

(print (gethash quux foo))

One problem with all of this code is that it puts a generic function call to
serial#-hash inside every hash table access. Actually, it calls serial#-hash
at least three times, first to compute the hash, and then two more times to
check whether it found the right entry through =serial#s. What I do to avoid
this is a bit unsafe but works a lot faster. I use defstruct instead of
flavors and apply a convention whereby the serial# slot is always at a fixed
offset in all my instances. This way I can avoid the dispatching to code which
is identical.

One thing for which I could find no documentation is the protocol of the
:hash-function for hash tables. It turns out that it should return two
values, the first being a hash value of its argument, and the second
being a "magic" indication of how the hash value depends on the object.
I don't remember all of the possible values but 0 means that the hash
does not depend on the address of the object so that rehash is not necessary
on GC. Perhaps someone from Symbolics can enlighten us on the exact
protocol.

In general, this points out two potential optimizations for flavors (and CLOS
as well). First, when a generic function has only one method (which is the
case when you want to be able to access the instance variables of the first
argument of the method as lexical variables) the dispatch can be optimized
away. Second, when several several different flavors all have a given
instance variable in the same offset position and thus the code generated
for the accessor functions for that instance variable (both read and write)
is therefore identical and when no other flavors have the same instance
variable at a different offset, the dispatch can be optimized away. Clearly,
this optimization trades off some error detection on calling generic functions
on objects which they aren't supposed to handle. This is a tradeoff I would like
to be able to make. It is for this reason that I use defstruct now instead of
flavors as my measurements show an overall speedup of at least 50%. Note that
with defstruct I do not lose firewall protection as defstruct uses arrays which
still have bounds checking.

Returning to the subject of hash tables, one thing which I find usefull
is to be able to define "multi-dimensional" hash tables. Clearly, one
can always package up several hash "indicies" into a list but this has
two disadvantages. First, every access entails consing which is wasted.
Second, I have to create a new :test and :hash-function for every tuple
of index types used to access tables. What I propose is a modification
to the Common Lisp standard which allows a :tests keyword to make-hash-table
which specifies a list of tests, one for each axis of the table. Then
gethash should be altered to take its arguments in the opposite order,
which is more in line with aref and, take optional additional keys:

gethash: (key table &optional default) is changed to
hashref: (table &rest keys)

I don't know how to handle the &optional default.

        Jeffrey Mark Siskind
        (please respond to Qobi@XX.LCS.MIT.EDU, I will be returning
        soon to MIT)
-------