[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Issue: HASH-TABLE-STABILITY (version 1)



Issue:      HASH-TABLE-STABILITY

References:    CLtL, p.282 "Hash on Lisp object"
	       The Art of Computer Programming, vol 3, Section 6.4 "Hashing"

Category:     CLARIFICATION

Edit history: Version 1, 11-Nov-88, by JonL 


Problem description:

The performance, and to some degree the semantics, of hash tables
depends not only on the kind of table as specified by the :test
argument to MAKE-HASH-TABLE, but also on the underlying techniques
of key transformation (into an integer) and of "collision" resolution. 
CLtL is not specific enough to encompass current, desirable practice.

People tend to be confused as to what "Hash on EQ" means, both in terms
of semantics and expected performance.  Many will often suggest using
SXHASH as the key-transformation function for EQ/EQL type tables, in
order to circumvent the well-known GC-related problem with those tables.
[See, for example, the message from Barry Margolin to the common-lisp
mailing list dated "12 Sep 88 11:05 EDT"; it is reproduced at the end of
the Discussion section below.]  Unfortunately, this suggestion violates
the commonly perceived notion of what "Hash on EQ" means, even though
CLtL nowhere explicitly would rule it out.  CLtL is not precise enough
as to what is expected of these types of tables, and certainly the
phrase "commonly perceived notion" is not precise enough.

A similar ambiguity can arise as to what "Hash on EQUAL" means; CLtL
p.285 only indirectly implies that SXHASH should be used as the
key-transformation function for EQUAL type tables.  [See below for
definition of "key-transformation".]

The term "Hashing on Lisp objects" has come to be called "Hash on EQ", 
and "Hashing on Tree Structure" is called "Hash on EQUAL"; see CLtL
p.282 , which describes the differences between hash-table kinds as 
being merely which function they use as the equivalence predicate
(the :TEST function argument to MAKE-HASH-TABLE.)  However, the term 
"Hash Table" carries a strong connotation about how such a table is 
implemented; in particular, for sufficiently large tables, some technique
for "collision resolution" must be done.  (See Knuth vol 3, p507-8).  
Since CLtL merely focuses on the :test function, people -- implementors 
as well as end-users -- tend to be confused as to how these techniques 
play a central part in the notion of "hash tables; furthermore, CLtL is
silent about what actions must preserve the stabililty of these 
"collision chains"  (i.e., the ability of the table to "find" previously 
entered keys).



Proposal (HASH-TABLE-STABILITY:KEY-TRANSFORM-RESTRICTIONS)

Summary of proposals:

-- Clarify that by "hashing", we mean more than simple linear search.
-- Generalize the following requirement from CLtL p.285:
           (EQUAL X Y)  implies  (EQUAL (SXHASH X) (SXHASH Y)) 
    and clarify that this requirement exactly prescribes how sensitive 
    hash tables can be to user-initiated data modifications.
-- Characterize just what key-transformation functions may be used for 
    EQ, EQL, EQUAL and EQUALP hash tables.

First, some terminology:

"Key-Transformation".  This is a term used by Knuth to describe the
homomorphic transformation of a hash-table key argument into an integer.
[See the index for "The Art of Computer Programming", vol 3; especially
see Section 6.4 and page 512.]  It also can refer to the transformation
into a set of two or more integers (which is not really a distinct
notion considering Goedel numbering).  From this integer, the pattern 
of table indices to use when searching for that key will be completely
characterized.  Knuth also uses a related term "hash function" to mean
"the address where the search begins"; that term is much too subject to
confusion, and will not be used herein.

"Collision Chain".  This term is limited in use by Knuth to just one
particular method of collision resolution; herein it will be used to
describe the sequence of probes specified for a given candidate entry 
by a particular key-transformation; i.e. a virtual "chain" of table 
indices (or address) to be examined.  Two different objects which "hash" 
to the same table address are "in conflict", and their respective 
collision chains may or may not be equal.

"Expected number of (re-)probes".  A particular algorithm for key-
transformation and collision-chain specification could be analyzed to
show a graph of the "Expected" number of calls on the :test function, 
as a function of the fullness of the table (number of entries divided 
by table size).  "Expected" has a technical, mathmatical meaning here:
it basically means "average", so one must be careful not to get carried
away with particular counter examples of "badly distributed" data.  A
"probe" is one comparison of the argument with the key of an entry in
the table, using the test function.

"%UNIQUE-NO".  The implementation of Lisp data, encoded into machine-
specific data and addresses, is not part of the portable specification
of CL; but we are aware that every implemetation _must_ do some such
embedding.  Thus we will use %UNIQUE-NO to name a one-to-one function
which maps any Lisp object into a Lisp integer.  Normally, this will
just be the machine address of where the object is stored, possibly with
some data-type tag bits added.  But for non-stored, or "immediate" data,
it doesn't matter what %UNIQUE-NO returns as long as its bijective nature
is maintained.  The following equivalence is a defining characterization:
    (eq x y)  <-->  (= (%unique-no x) (%unique-no y))


Now for the actual proposals.


1. Clarify that the term "hash table" implies the use of some techniques
 designed to make the expected number of probes be bounded by a small
 constant rather than growing linearly with the number of entries in the
 table [for "small constant", one could also accept "log of the number of
 entries"].  Although nothing in CLtL explicitly prohibits it, very few 
 people would accept simple linear searching as any kind of hash table.
 For example, the following definition is counter to our understanding:
    (defun gethash (x ht &optional default)
      (let ((index (position x (hash-table-key-vector ht) 
			    :test (hash-table-test ht))))
	(if index
	    (values (svref (hash-table-value-vector ht) index)
		    T)
	    (values default 
		    NIL)))) 
 Such a simple definition may be functionally useful when the total
 number of entries is small (e.g., a couple dozen or so); but the 
 "Expected" number of probes grows linearly with the number of entries.  

 As a consequence of this requirement, the collision chain for a give item
 in a given table will likely not cover the whole table; otherwise, if
 every such chain covered a substantial fraction of the table, then the
 behaviour time would be linear in the size of the table.  Thus it should
 be noted that even though an item is entered in a hash table, it
 typically _cannot_ be found by searching through the wrong collision
 chain.


2. Specify that for any equivalence relation <eqv> used as a hash table 
 :test function, there must be a corresponding key-transformation function 
 <sxh> used in that hash table such that the following implication is true
 for all X and Y:
	 (<eqv> x y) -->  (= (<sxh> x) (<sxh> y))
 This could be said to mean that a key-transformation function must be "not
 more discriminating" than the equivalence function it is associated with;
 i.e. as a numerical function, it must not create more equivalence classes 
 of data than the associated equivalence function does.  

 This requirement resembles that placed upon SXHASH [CLtL, p285], and
 from it, one may deduce that SXHASH is a acceptable key-transformation
 function for EQUAL type hash tables.  Note well, however, that there 
 are many many functions satisfying this property for EQUAL.  Hence 
 key-transformation for EQUAL tables:
   (1) need not be constant over all CL implementations;
   (2) need not be constant over all instances of EQUAL hash-tables in 
       a given implementation;
   (3) need not be constant even over all entry counts for a particular 
       hash table in a given implementation.

 Note also that this requirement -- "not more discriminating" -- rules
 out the use of key-transformations which "notice" data modifications
 that are not likewise "noticed" by the test function.  Since user-
 initiated data modifications might conceivably affect either the
 equivalence relation of a hash-table (the :test function) or the
 associated key-transformation function, we want to ensure that the
 ability of the table to "find" a previously entered key is related only
 to the ability of the :test function to identify equivalent copies of
 the key.

3. Clarify that %UNIQUE-NO is acceptable as a key-transformation for an
 EQ type table, but that it is not suitable for EQUAL or EQUALP tables.
 Clarify also that most SXHASH implementations are _not_ suitable for EQ 
 or EQL type tables.

 Numerous implementations have some function like %UNIQUE-NO called 
 either %POINTER or POINTER-TO-FIXNUM; they are generally acceptable for 
 EQ type tables.  But one must be careful to note that similar, unrelated
 functions could also be used; in particular, many "unique identification" 
 schemes have been employed where the integer is cached with the object by
 some means other than the bits of its address (e.g. a "hidden" component 
 inside the object). Of course, any %UNIQUE-NO defined as above would not 
 be acceptable for EQUAL or EQUALP tables; two EQUAL but non-EQ cons cells
 must have different %UNIQUE-NO values, violating the general rule stated
 in item 2 above.

 A trivial variant on %UNIQUE-NO is acceptable for EQL tables:
     (if (numberp x) (sxhash x) (%unique-no x))
 By itself, %UNIQUE-NO would not be acceptable since it would be too
 "discriminating" on numbers.

 Many persons have noted that the definition:
     (defun sxhash (x) 5)	    ;for any random integer value of "5"
 meets the CLtL criterion for SXHASH.  In fact, such a constant function
 may be quite useful for hash-tables with entry counts below a specified
 mininum.  But of course it is not really suitable in general since it 
 would put every entry into the same collision chain; that would cause 
 the expected re-probe cost to be linear in the number of entries, which
 violates item 1 above.

 On the other hand, an SXHASH function suitable for use as the key
 transformation in an EQUAL type table is _not_ acceptable for use
 with an EQ or EQL table.  Every implementation the proposer has queried 
 returns different values for the lists (A) and (B).  Thus consider the
 example of hashing a list (A) into an EQ type table, and observe what 
 happens after altering the (first) element of this list to be B.  Let
     x = the list before modification
     y = the list after modificaton
 now clearly (EQ X Y) is true, so we would obviously like a GETHASH call
 after the modification has been done to find the same cons cell that 
 had  been entered before the update.  If SXHASH were used as the key-
 transform, then the collision chain selected _after_ the alteration would 
 be different from the one selected beforehand.   Since the two different 
 collision chains can not be  guaranteed to intersect, then in at least 
 some circumstances, GETHASH on X would find the entry, but GETHASH on Y 
 would fail to find it.  See also the examples section.

 Although SXHASH is not very tightly defined in CLtL, one must be careful
 not to make assumptions about whether or not it is acceptable for use in
 EQUALP tables.  In order to get a reasonable amount of randomization in
 the collision chains, a key-transformation function for EQUAL tables
 ought to be "more discriminating" than any minimal function acceptable 
 for EQUALP tables [because EQUAL partitions the object world up into 
 many more equivalence classes than does EQUALP].


In item 2 above, there are listed three areas where key-transformation
functions may differ: when going from one vendor to another (or from one
release by the same vendor to another), when going from one hash-table
to another of the same type, and when increasing or decreasing the entry
count of the table.  To this list we can add another more general rule on
key-transformations.

  (4) [they] need not be constant even over a particular "core image" 
      saving and restoration, or over a "memory reorganization" such as 
      a garbage collection.

Of course, if a change is made at some point in time in the key-
transformation algorithm being used for a particular table, then that 
table should be "re-hashed" to ensure the continuity of its entries.  
As has been noted before, many implementations use algorithms for EQ 
type tables which change after any data is relocated; that is why 
re-hashing may be required after a "GC".



Examples:

It is not surprising that in the following example, the value Y
cannot be found in the table after it has been altered by the first
SETF, even though it could be found before the alteration. 
   (setq ht (make-hash-table :test 'equal))     ==>  #<Hash-Table>
   (setq x '(A (B) (C D))
	 y (copy-tree x))                       ==>  (A (B) (C D))
   (and (equal x y) (not (eq x y)))             ==>  T
   (setf (gethash x ht) T)                      ==>  T
   (setf (car (second y)) 'E)                   ==>  E
   (gethash x ht)                               ==>  T
   (gethash y ht)                               ==>  NIL
After all, the :test function will not be able to identify the
altered key with with the one originally entered, because at the
time gethash is called:
   (equal x y)                               ==>  NIL

However, the circumstances under which the following can fail are
not at all obvious:
   (setq ht (make-hash-table :test 'equal))      ==>  #<Hash-Table>
   (setq x '(A #(B) (C D))
	 y (copy-tree x))                        ==>  (A #(B) (C D))
   (and (equal x y) (not (eq x y)))              ==>  T
   (setf (gethash x ht) T)                       ==>  T
   (setf (aref (second y)) 'E)                   ==>  E
   (gethash x ht)                                ==>  T
   (gethash y ht)                                ==>  ?
Note however that:
   (equal x y)                                   ==> T 
If the key-transformation function used in this hashtable failed to obey
the "not more discriminating" contraint imposed by item 2 above, it
might be tempted to descend into the vector #(B) in order to randomize
the keys a bit more; but EQUAL on pointer vectors is defined to be EQ.
Thus X and Y, while being EQUAL, might fall into different collision
chains, and hence not be identified as the same key.


On the other hand, EQ/EQL type tables should be impervious to the
updates in the above examples:
   (setq ht (make-hash-table))                    ==>  #<Hash-Table>
   (setq y (setq x (copy-tree '(A (B) (C D)))))   ==>  (A (B) (C D))
   (setf (gethash x ht) T)                        ==>  T
   (gethash x ht)                                 ==>  T
   (setf (car (second y)) 'E)                     ==>  E
   (gethash y ht)                                 ==>  T
Thus x and y are "EQ, but not EQUAL" [which only makes sense when
they refer to the same object at different points in time]; however,
the EQ/EQL-type table is not affected by this.



Rationale:

The performance expectations about hash-tables, and consequent 
implementational constraints, need to be formalized.

Current practice:

Every implementation that the proposer has tried *seems* to satisfy
these constraints.

Cost to Implementors:

None.

Cost to Users:

None.

Cost of non-adoption:

Continuing confusion as to what is stable in EQ/EQL tables, and what
is stable in EQUAL tables.  Possible confusion when it comes to
implementing EQUALP tables.

Performance impact:

N.A.

Benefits:

See Cost of non-adoption.

Esthetics:

The proposal more closely relates the term "Hash Table" to the
classic use of it in  "The Art of Computer Programming", vol 3.


Discussion:

One of the attractions to Common Lisp is that many common techniques are
a required part of the language; C programmers who continue to re-invent
hasing techniques over and over have praised CL in particular for hash
tables.  After all, it is much more likely that efficient, correctly
coded algorithms will be provided by the system supplier than that every
code writer will understand and correctly apply the information found
in Knuth's "The Art of Computer Programming", vol 3.

The requirement that the expected number of reprobes be bounded by a 
"small constant" should not be taken to extreme.  In particular, a
simple trade-off of space for time can assure some compliance with it.
For example, a data set of size N could be partitioned into N/20
subsets; as long as the partitioning function does a fairly good job
of balancing the number of elements in each partition class, and as
long as the partition function can be quickly calculated, then the one
could say that the expected number of probes would be bounded by "about
twenty or so".  The generally understood meanings of the :rehash-size
and :rehash-threshold components of hash-tables may be biased towards 
an "open-addressing" implementation; but "bucketizing" implementations
are not arbitrarily ruled out.  This proposal is in no way intended to
rule out "bucketizing" implementations of hash tables.


Here's an example of how one might analyze the problems relating GC 
and EQ/EQL type tables:

  Date: Mon, 12 Sep 88 11:05 EDT
  From: Barry Margolin <barmar@Think.COM>
  Subject: MAKE-HASH-TABLE :TEST arg
  To: "Steve Bacher (Batchman)" <SEB1525@draper.com>
  Cc: common-lisp@sail.stanford.edu

  . . . 
  Various aspects of the behavior of a hash table are dependent upon the
  TEST argument.  An EQUAL hash table need not be rehashed after a copying
  GC.  The hash function is generally dependent upon the test function;
  for an EQUAL hash table it would be SXHASH, while for an EQ hash table
  it would probably be a simple hash on the address.

  I suppose you COULD use SXHASH for all hash tables, since EQ objects are
  necessarily EQUAL, and you COULD rehash ALL hash tables.  Or you could
  implement hash tables without actually hashing (e.g. implement them as
  alists).  But if performance is an issue (which it generally is when you
  use a hash table), you'll probably want to do things dependent on the
  test function.

						  barmar

This suggestion is not prohibited by CLtL, although it violates the
commonly accepted understanding of what "Hash on EQ" means.