Issue: HASH-TABLE-STABILITY (version 1)
- To: cl-cleanup@sail.stanford.edu
- Subject: Issue: HASH-TABLE-STABILITY (version 1)
- From: Jon L White <jonl@lucid.com>
- Date: Mon, 14 Nov 88 23:14:43 PST
Issue: HASH-TABLE-STABILITY
References: CLtL, p.282 "Hash on Lisp object"
The Art of Computer Programming, vol 3, Section 6.4 "Hashing"
Category: CLARIFICATION
Edit history: Version 1, 11-Nov-88, by JonL
Problem description:
The performance, and to some degree the semantics, of hash tables
depends not only on the kind of table as specified by the :test
argument to MAKE-HASH-TABLE, but also on the underlying techniques
of key transformation (into an integer) and of "collision" resolution.
CLtL is not specific enough to encompass current, desirable practice.
People tend to be confused as to what "Hash on EQ" means, both in terms
of semantics and expected performance. Many will often suggest using
SXHASH as the key-transformation function for EQ/EQL type tables, in
order to circumvent the well-known GC-related problem with those tables.
[See, for example, the message from Barry Margolin to the common-lisp
mailing list dated "12 Sep 88 11:05 EDT"; it is reproduced at the end of
the Discussion section below.] Unfortunately, this suggestion violates
the commonly perceived notion of what "Hash on EQ" means, even though
CLtL nowhere explicitly would rule it out. CLtL is not precise enough
as to what is expected of these types of tables, and certainly the
phrase "commonly perceived notion" is not precise enough.
A similar ambiguity can arise as to what "Hash on EQUAL" means; CLtL
p.285 only indirectly implies that SXHASH should be used as the
key-transformation function for EQUAL type tables. [See below for
definition of "key-transformation".]
The term "Hashing on Lisp objects" has come to be called "Hash on EQ",
and "Hashing on Tree Structure" is called "Hash on EQUAL"; see CLtL
p.282, which describes the differences between hash-table kinds as
being merely which function they use as the equivalence predicate
(the :TEST function argument to MAKE-HASH-TABLE.) However, the term
"Hash Table" carries a strong connotation about how such a table is
implemented; in particular, for sufficiently large tables, some technique
for "collision resolution" must be done. (See Knuth vol 3, p507-8).
Since CLtL merely focuses on the :test function, people -- implementors
as well as end-users -- tend to be confused as to how these techniques
play a central part in the notion of "hash tables"; furthermore, CLtL is
silent about what actions must preserve the stability of these
"collision chains" (i.e., the ability of the table to "find" previously
entered keys).
Proposal (HASH-TABLE-STABILITY:KEY-TRANSFORM-RESTRICTIONS)
Summary of proposals:
-- Clarify that by "hashing", we mean more than simple linear search.
-- Generalize the following requirement from CLtL p.285:
(EQUAL X Y) implies (EQUAL (SXHASH X) (SXHASH Y))
and clarify that this requirement exactly prescribes how sensitive
hash tables can be to user-initiated data modifications.
-- Characterize just what key-transformation functions may be used for
EQ, EQL, EQUAL and EQUALP hash tables.
First, some terminology:
"Key-Transformation". This is a term used by Knuth to describe the
homomorphic transformation of a hash-table key argument into an integer.
[See the index for "The Art of Computer Programming", vol 3; especially
see Section 6.4 and page 512.] It also can refer to the transformation
into a set of two or more integers (which is not really a distinct
notion considering Goedel numbering). From this integer, the pattern
of table indices to use when searching for that key will be completely
characterized. Knuth also uses a related term "hash function" to mean
"the address where the search begins"; that term is much too subject to
confusion, and will not be used herein.
"Collision Chain". This term is limited in use by Knuth to just one
particular method of collision resolution; herein it will be used to
describe the sequence of probes specified for a given candidate entry
by a particular key-transformation; i.e. a virtual "chain" of table
indices (or address) to be examined. Two different objects which "hash"
to the same table address are "in conflict", and their respective
collision chains may or may not be equal.
"Expected number of (re-)probes". A particular algorithm for key-
transformation and collision-chain specification could be analyzed to
show a graph of the "Expected" number of calls on the :test function,
as a function of the fullness of the table (number of entries divided
by table size). "Expected" has a technical, mathematical meaning here:
it basically means "average", so one must be careful not to get carried
away with particular counterexamples of "badly distributed" data. A
"probe" is one comparison of the argument with the key of an entry in
the table, using the test function.
"%UNIQUE-NO". The implementation of Lisp data, encoded into machine-
specific data and addresses, is not part of the portable specification
of CL; but we are aware that every implementation _must_ do some such
embedding. Thus we will use %UNIQUE-NO to name a one-to-one function
which maps any Lisp object into a Lisp integer. Normally, this will
just be the machine address of where the object is stored, possibly with
some data-type tag bits added. But for non-stored, or "immediate" data,
it doesn't matter what %UNIQUE-NO returns as long as its bijective nature
is maintained. The following equivalence is a defining characterization:
(eq x y) <--> (= (%unique-no x) (%unique-no y))
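The defining equivalence can be restated as a predicate over arbitrary pairs
of objects (a sketch only; %UNIQUE-NO itself cannot be written portably, and
the predicate's name is hypothetical):

```lisp
;; Sketch: object identity and code equality must coincide for every
;; pair of objects.  %UNIQUE-NO is the hypothetical bijection above.
(defun unique-no-consistent-p (x y)
  (if (eq x y)
      (= (%unique-no x) (%unique-no y))     ; same object, same code
      (/= (%unique-no x) (%unique-no y))))  ; distinct objects, distinct codes
```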
Now for the actual proposals.
1. Clarify that the term "hash table" implies the use of some techniques
designed to make the expected number of probes be bounded by a small
constant rather than growing linearly with the number of entries in the
table [for "small constant", one could also accept "log of the number of
entries"]. Although nothing in CLtL explicitly prohibits it, very few
people would accept simple linear searching as any kind of hash table.
For example, the following definition is counter to our understanding:
(defun gethash (x ht &optional default)
  (let ((index (position x (hash-table-key-vector ht)
                         :test (hash-table-test ht))))
    (if index
        (values (svref (hash-table-value-vector ht) index)
                T)
        (values default
                NIL))))
Such a simple definition may be functionally useful when the total
number of entries is small (e.g., a couple dozen or so); but the
"Expected" number of probes grows linearly with the number of entries.
As a consequence of this requirement, the collision chain for a given item
in a given table will likely not cover the whole table; otherwise, if
every such chain covered a substantial fraction of the table, then the
search time would be linear in the size of the table. Thus it should
be noted that even though an item is entered in a hash table, it
typically _cannot_ be found by searching through the wrong collision
chain.
2. Specify that for any equivalence relation <eqv> used as a hash table
:test function, there must be a corresponding key-transformation function
<sxh> used in that hash table such that the following implication is true
for all X and Y:
(<eqv> x y) --> (= (<sxh> x) (<sxh> y))
This could be said to mean that a key-transformation function must be "not
more discriminating" than the equivalence function it is associated with;
i.e. as a numerical function, it must not create more equivalence classes
of data than the associated equivalence function does.
This requirement resembles that placed upon SXHASH [CLtL, p285], and
from it, one may deduce that SXHASH is an acceptable key-transformation
function for EQUAL type hash tables. Note well, however, that there
are many functions satisfying this property for EQUAL. Hence the
key-transformation for EQUAL tables:
(1) need not be constant over all CL implementations;
(2) need not be constant over all instances of EQUAL hash-tables in
a given implementation;
(3) need not be constant even over all entry counts for a particular
hash table in a given implementation.
Note also that this requirement -- "not more discriminating" -- rules
out the use of key-transformations which "notice" data modifications
that are not likewise "noticed" by the test function. Since user-
initiated data modifications might conceivably affect either the
equivalence relation of a hash-table (the :test function) or the
associated key-transformation function, we want to ensure that the
ability of the table to "find" a previously entered key is related only
to the ability of the :test function to identify equivalent copies of
the key.
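This implication can be spot-checked mechanically over sample data; the
following helper is purely illustrative (its name is hypothetical, not part
of the proposal):

```lisp
;; Sketch: verify (<eqv> x y) --> (= (<sxh> x) (<sxh> y)) over a finite
;; sample of candidate keys.  Returns NIL on the first violation found.
(defun key-transform-ok-p (eqv sxh samples)
  (dolist (x samples t)
    (dolist (y samples)
      (when (and (funcall eqv x y)
                 (/= (funcall sxh x) (funcall sxh y)))
        (return-from key-transform-ok-p nil)))))

;; Per CLtL p.285, SXHASH must pass this check with EQUAL on any sample:
;;   (key-transform-ok-p #'equal #'sxhash (list '(a b) (list 'a 'b) "x" 17))
```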
3. Clarify that %UNIQUE-NO is acceptable as a key-transformation for an
EQ type table, but that it is not suitable for EQUAL or EQUALP tables.
Clarify also that most SXHASH implementations are _not_ suitable for EQ
or EQL type tables.
Numerous implementations have some function like %UNIQUE-NO called
either %POINTER or POINTER-TO-FIXNUM; they are generally acceptable for
EQ type tables. But one must be careful to note that similar, unrelated
functions could also be used; in particular, many "unique identification"
schemes have been employed where the integer is cached with the object by
some means other than the bits of its address (e.g. a "hidden" component
inside the object). Of course, any %UNIQUE-NO defined as above would not
be acceptable for EQUAL or EQUALP tables; two EQUAL but non-EQ cons cells
must have different %UNIQUE-NO values, violating the general rule stated
in item 2 above.
A trivial variant on %UNIQUE-NO is acceptable for EQL tables:
(if (numberp x) (sxhash x) (%unique-no x))
By itself, %UNIQUE-NO would not be acceptable since it would be too
"discriminating" on numbers.
Many persons have noted that the definition:
(defun sxhash (x) 5)   ; for any constant integer value in place of "5"
meets the CLtL criterion for SXHASH. In fact, such a constant function
may be quite useful for hash-tables with entry counts below a specified
minimum. But of course it is not really suitable in general since it
would put every entry into the same collision chain; that would cause
the expected re-probe cost to be linear in the number of entries, which
violates item 1 above.
On the other hand, an SXHASH function suitable for use as the key
transformation in an EQUAL type table is _not_ acceptable for use
with an EQ or EQL table. Every implementation the proposer has queried
returns different SXHASH values for the lists (A) and (B). Thus consider
the example of hashing a list (A) into an EQ type table, and observe what
happens after altering the (first) element of this list to be B. Let
    x = the list before modification
    y = the list after modification
Now clearly (EQ X Y) is true, so we would obviously like a GETHASH call
after the modification has been done to find the same cons cell that
had been entered before the update. If SXHASH were used as the key-
transform, then the collision chain selected _after_ the alteration would
be different from the one selected beforehand. Since the two different
collision chains cannot be guaranteed to intersect, then in at least
some circumstances, GETHASH on X would find the entry, but GETHASH on Y
would fail to find it. See also the examples section.
Although SXHASH is not very tightly defined in CLtL, one must be careful
not to make assumptions about whether or not it is acceptable for use in
EQUALP tables. In order to get a reasonable amount of randomization in
the collision chains, a key-transformation function for EQUAL tables
ought to be "more discriminating" than any minimal function acceptable
for EQUALP tables [because EQUAL partitions the object world up into
many more equivalence classes than does EQUALP].
In item 2 above, there are listed three areas where key-transformation
functions may differ: when going from one vendor to another (or from one
release by the same vendor to another), when going from one hash-table
to another of the same type, and when increasing or decreasing the entry
count of the table. To this list we can add another more general rule on
key-transformations.
(4) [they] need not be constant even over a particular "core image"
saving and restoration, or over a "memory reorganization" such as
a garbage collection.
Of course, if a change is made at some point in time in the key-
transformation algorithm being used for a particular table, then that
table should be "re-hashed" to ensure the continuity of its entries.
As has been noted before, many implementations use algorithms for EQ
type tables which change after any data is relocated; that is why
re-hashing may be required after a "GC".
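Such re-hashing might be triggered lazily, as in the following sketch (all
names here -- *GC-EPOCH*, HASH-TABLE-EPOCH, REHASH-ALL-ENTRIES -- are
hypothetical, not proposed additions to the language):

```lisp
;; Sketch: an EQ table whose codes derive from addresses records the
;; "epoch" current when it was last hashed; *GC-EPOCH* is assumed to be
;; incremented by every memory reorganization.
(defun maybe-rehash (ht)
  (unless (= (hash-table-epoch ht) *gc-epoch*)  ; addresses may have moved
    (rehash-all-entries ht)                     ; recompute every chain
    (setf (hash-table-epoch ht) *gc-epoch*)))

;; GETHASH, REMHASH, and (SETF GETHASH) would each call MAYBE-REHASH
;; before consulting the table.
```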
Examples:
It is not surprising that in the following example, the value Y
cannot be found in the table after it has been altered by the first
SETF, even though it could be found before the alteration.
(setq ht (make-hash-table :test 'equal)) ==> #<Hash-Table>
(setq x '(A (B) (C D))
y (copy-tree x)) ==> (A (B) (C D))
(and (equal x y) (not (eq x y))) ==> T
(setf (gethash x ht) T) ==> T
(setf (car (second y)) 'E) ==> E
(gethash x ht) ==> T
(gethash y ht) ==> NIL
After all, the :test function will not be able to identify the
altered key with the one originally entered, because at the
time GETHASH is called:
(equal x y) ==> NIL
However, the circumstances under which the following can fail are
not at all obvious:
(setq ht (make-hash-table :test 'equal)) ==> #<Hash-Table>
(setq x '(A #(B) (C D))
y (copy-tree x)) ==> (A #(B) (C D))
(and (equal x y) (not (eq x y))) ==> T
(setf (gethash x ht) T) ==> T
(setf (aref (second y) 0) 'E) ==> E
(gethash x ht) ==> T
(gethash y ht) ==> ?
Note however that:
(equal x y) ==> T
If the key-transformation function used in this hash table failed to obey
the "not more discriminating" constraint imposed by item 2 above, it
might be tempted to descend into the vector #(B) in order to randomize
the keys a bit more; but EQUAL on pointer vectors is defined to be EQ.
Thus X and Y, while being EQUAL, might fall into different collision
chains, and hence not be identified as the same key.
On the other hand, EQ/EQL type tables should be impervious to the
updates in the above examples:
(setq ht (make-hash-table)) ==> #<Hash-Table>
(setq y (setq x (copy-tree '(A (B) (C D))))) ==> (A (B) (C D))
(setf (gethash x ht) T) ==> T
(gethash x ht) ==> T
(setf (car (second y)) 'E) ==> E
(gethash y ht) ==> T
Thus x and y are "EQ, but not EQUAL" [which only makes sense when
they refer to the same object at different points in time]; however,
the EQ/EQL-type table is not affected by this.
Rationale:
The performance expectations about hash-tables, and consequent
implementational constraints, need to be formalized.
Current practice:
Every implementation that the proposer has tried *seems* to satisfy
these constraints.
Cost to Implementors:
None.
Cost to Users:
None.
Cost of non-adoption:
Continuing confusion as to what is stable in EQ/EQL tables, and what
is stable in EQUAL tables. Possible confusion when it comes to
implementing EQUALP tables.
Performance impact:
N.A.
Benefits:
See Cost of non-adoption.
Esthetics:
The proposal more closely relates the term "Hash Table" to the
classic use of it in "The Art of Computer Programming", vol 3.
Discussion:
One of the attractions to Common Lisp is that many common techniques are
a required part of the language; C programmers who continue to re-invent
hashing techniques over and over have praised CL in particular for hash
tables. After all, it is much more likely that efficient, correctly
coded algorithms will be provided by the system supplier than that every
code writer will understand and correctly apply the information found
in Knuth's "The Art of Computer Programming", vol 3.
The requirement that the expected number of reprobes be bounded by a
"small constant" should not be taken to extreme. In particular, a
simple trade-off of space for time can assure some compliance with it.
For example, a data set of size N could be partitioned into N/20
subsets; as long as the partitioning function does a fairly good job
of balancing the number of elements in each partition class, and as
long as the partition function can be quickly calculated, then one
could say that the expected number of probes would be bounded by "about
twenty or so". The generally understood meanings of the :rehash-size
and :rehash-threshold components of hash-tables may be biased towards
an "open-addressing" implementation; but "bucketizing" implementations
are not arbitrarily ruled out. This proposal is in no way intended to
rule out "bucketizing" implementations of hash tables.
Here's an example of how one might analyze the problems relating GC
and EQ/EQL type tables:
Date: Mon, 12 Sep 88 11:05 EDT
From: Barry Margolin <barmar@Think.COM>
Subject: MAKE-HASH-TABLE :TEST arg
To: "Steve Bacher (Batchman)" <SEB1525@draper.com>
Cc: common-lisp@sail.stanford.edu
. . .
Various aspects of the behavior of a hash table are dependent upon the
TEST argument. An EQUAL hash table need not be rehashed after a copying
GC. The hash function is generally dependent upon the test function;
for an EQUAL hash table it would be SXHASH, while for an EQ hash table
it would probably be a simple hash on the address.
I suppose you COULD use SXHASH for all hash tables, since EQ objects are
necessarily EQUAL, and you COULD rehash ALL hash tables. Or you could
implement hash tables without actually hashing (e.g. implement them as
alists). But if performance is an issue (which it generally is when you
use a hash table), you'll probably want to do things dependent on the
test function.
barmar
This suggestion is not prohibited by CLtL, although it violates the
commonly accepted understanding of what "Hash on EQ" means.