[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

*To*: cl-cleanup@sail.stanford.edu*Subject*: Issue: HASH-TABLE-STABILITY (version 1)*From*: Jon L White <jonl@lucid.com>*Date*: Mon, 14 Nov 88 23:14:43 PST

Issue: HASH-TABLE-STABILITY References: CLtL, p.282 "Hash on Lisp object" The Art of Computer Programming, vol 3, Section 6.4 "Hashing" Category: CLARIFICATION Edit history: Version 1, 11-Nov-88, by JonL Problem description: The performance, and to some degree the semantics, of hash tables depends not only on the kind of table as specified by the :test argument to MAKE-HASH-TABLE, but also on the underlying techniques of key transformation (into an integer) and of "collision" resolution. CLtL is not specific enough to encompass current, desirable practice. People tend to be confused as to what "Hash on EQ" means, both in terms of semantics and expected performance. Many will often suggest using SXHASH as the key-transformation function for EQ/EQL type tables, in order to circumvent the well-known GC-related problem with those tables. [See, for example, the message from Barry Margolin to the common-lisp mailing list dated "12 Sep 88 11:05 EDT"; it is reproduced at the end of the Discussion section below.] Unfortunately, this suggestion violates the commonly perceived notion of what "Hash on EQ" means, even though CLtL nowhere explicitly would rule it out. CLtL is not precise enough as to what is expected of these types of tables, and certainly the phrase "commonly perceived notion" is not precise enough. A similar ambiguity can arise as to what "Hash on EQUAL" means; CLtL p.285 only indirectly implies that SXHASH should be used as the key-transformation function for EQUAL type tables. [See below for definition of "key-transformation".] The term "Hashing on Lisp objects" has come to be called "Hash on EQ", and "Hashing on Tree Structure" is called "Hash on EQUAL"; see CLtL p.282 , which describes the differences between hash-table kinds as being merely which function they use as the equivalence predicate (the :TEST function argument to MAKE-HASH-TABLE.) However, the term "Hash Table" carries a strong connotation about how such a table is implemented; in particular, for sufficiently large tables, some technique for "collision resolution" must be done. (See Knuth vol 3, p507-8). Since CLtL merely focuses on the :test function, people -- implementors as well as end-users -- tend to be confused as to how these techniques play a central part in the notion of "hash tables; furthermore, CLtL is silent about what actions must preserve the stabililty of these "collision chains" (i.e., the ability of the table to "find" previously entered keys). Proposal (HASH-TABLE-STABILITY:KEY-TRANSFORM-RESTRICTIONS) Summary of proposals: -- Clarify that by "hashing", we mean more than simple linear search. -- Generalize the following requirement from CLtL p.285: (EQUAL X Y) implies (EQUAL (SXHASH X) (SXHASH Y)) and clarify that this requirement exactly prescribes how sensitive hash tables can be to user-initiated data modifications. -- Characterize just what key-transformation functions may be used for EQ, EQL, EQUAL and EQUALP hash tables. First, some terminology: "Key-Transformation". This is a term used by Knuth to describe the homomorphic transformation of a hash-table key argument into an integer. [See the index for "The Art of Computer Programming", vol 3; especially see Section 6.4 and page 512.] It also can refer to the transformation into a set of two or more integers (which is not really a distinct notion considering Goedel numbering). From this integer, the pattern of table indices to use when searching for that key will be completely characterized. Knuth also uses a related term "hash function" to mean "the address where the search begins"; that term is much too subject to confusion, and will not be used herein. "Collision Chain". This term is limited in use by Knuth to just one particular method of collision resolution; herein it will be used to describe the sequence of probes specified for a given candidate entry by a particular key-transformation; i.e. a virtual "chain" of table indices (or address) to be examined. Two different objects which "hash" to the same table address are "in conflict", and their respective collision chains may or may not be equal. "Expected number of (re-)probes". A particular algorithm for key- transformation and collision-chain specification could be analyzed to show a graph of the "Expected" number of calls on the :test function, as a function of the fullness of the table (number of entries divided by table size). "Expected" has a technical, mathmatical meaning here: it basically means "average", so one must be careful not to get carried away with particular counter examples of "badly distributed" data. A "probe" is one comparison of the argument with the key of an entry in the table, using the test function. "%UNIQUE-NO". The implementation of Lisp data, encoded into machine- specific data and addresses, is not part of the portable specification of CL; but we are aware that every implemetation _must_ do some such embedding. Thus we will use %UNIQUE-NO to name a one-to-one function which maps any Lisp object into a Lisp integer. Normally, this will just be the machine address of where the object is stored, possibly with some data-type tag bits added. But for non-stored, or "immediate" data, it doesn't matter what %UNIQUE-NO returns as long as its bijective nature is maintained. The following equivalence is a defining characterization: (eq x y) <--> (= (%unique-no x) (%unique-no y)) Now for the actual proposals. 1. Clarify that the term "hash table" implies the use of some techniques designed to make the expected number of probes be bounded by a small constant rather than growing linearly with the number of entries in the table [for "small constant", one could also accept "log of the number of entries"]. Although nothing in CLtL explicitly prohibits it, very few people would accept simple linear searching as any kind of hash table. For example, the following definition is counter to our understanding: (defun gethash (x ht &optional default) (let ((index (position x (hash-table-key-vector ht) :test (hash-table-test ht)))) (if index (values (svref (hash-table-value-vector ht) index) T) (values default NIL)))) Such a simple definition may be functionally useful when the total number of entries is small (e.g., a couple dozen or so); but the "Expected" number of probes grows linearly with the number of entries. As a consequence of this requirement, the collision chain for a give item in a given table will likely not cover the whole table; otherwise, if every such chain covered a substantial fraction of the table, then the behaviour time would be linear in the size of the table. Thus it should be noted that even though an item is entered in a hash table, it typically _cannot_ be found by searching through the wrong collision chain. 2. Specify that for any equivalence relation <eqv> used as a hash table :test function, there must be a corresponding key-transformation function <sxh> used in that hash table such that the following implication is true for all X and Y: (<eqv> x y) --> (= (<sxh> x) (<sxh> y)) This could be said to mean that a key-transformation function must be "not more discriminating" than the equivalence function it is associated with; i.e. as a numerical function, it must not create more equivalence classes of data than the associated equivalence function does. This requirement resembles that placed upon SXHASH [CLtL, p285], and from it, one may deduce that SXHASH is a acceptable key-transformation function for EQUAL type hash tables. Note well, however, that there are many many functions satisfying this property for EQUAL. Hence key-transformation for EQUAL tables: (1) need not be constant over all CL implementations; (2) need not be constant over all instances of EQUAL hash-tables in a given implementation; (3) need not be constant even over all entry counts for a particular hash table in a given implementation. Note also that this requirement -- "not more discriminating" -- rules out the use of key-transformations which "notice" data modifications that are not likewise "noticed" by the test function. Since user- initiated data modifications might conceivably affect either the equivalence relation of a hash-table (the :test function) or the associated key-transformation function, we want to ensure that the ability of the table to "find" a previously entered key is related only to the ability of the :test function to identify equivalent copies of the key. 3. Clarify that %UNIQUE-NO is acceptable as a key-transformation for an EQ type table, but that it is not suitable for EQUAL or EQUALP tables. Clarify also that most SXHASH implementations are _not_ suitable for EQ or EQL type tables. Numerous implementations have some function like %UNIQUE-NO called either %POINTER or POINTER-TO-FIXNUM; they are generally acceptable for EQ type tables. But one must be careful to note that similar, unrelated functions could also be used; in particular, many "unique identification" schemes have been employed where the integer is cached with the object by some means other than the bits of its address (e.g. a "hidden" component inside the object). Of course, any %UNIQUE-NO defined as above would not be acceptable for EQUAL or EQUALP tables; two EQUAL but non-EQ cons cells must have different %UNIQUE-NO values, violating the general rule stated in item 2 above. A trivial variant on %UNIQUE-NO is acceptable for EQL tables: (if (numberp x) (sxhash x) (%unique-no x)) By itself, %UNIQUE-NO would not be acceptable since it would be too "discriminating" on numbers. Many persons have noted that the definition: (defun sxhash (x) 5) ;for any random integer value of "5" meets the CLtL criterion for SXHASH. In fact, such a constant function may be quite useful for hash-tables with entry counts below a specified mininum. But of course it is not really suitable in general since it would put every entry into the same collision chain; that would cause the expected re-probe cost to be linear in the number of entries, which violates item 1 above. On the other hand, an SXHASH function suitable for use as the key transformation in an EQUAL type table is _not_ acceptable for use with an EQ or EQL table. Every implementation the proposer has queried returns different values for the lists (A) and (B). Thus consider the example of hashing a list (A) into an EQ type table, and observe what happens after altering the (first) element of this list to be B. Let x = the list before modification y = the list after modificaton now clearly (EQ X Y) is true, so we would obviously like a GETHASH call after the modification has been done to find the same cons cell that had been entered before the update. If SXHASH were used as the key- transform, then the collision chain selected _after_ the alteration would be different from the one selected beforehand. Since the two different collision chains can not be guaranteed to intersect, then in at least some circumstances, GETHASH on X would find the entry, but GETHASH on Y would fail to find it. See also the examples section. Although SXHASH is not very tightly defined in CLtL, one must be careful not to make assumptions about whether or not it is acceptable for use in EQUALP tables. In order to get a reasonable amount of randomization in the collision chains, a key-transformation function for EQUAL tables ought to be "more discriminating" than any minimal function acceptable for EQUALP tables [because EQUAL partitions the object world up into many more equivalence classes than does EQUALP]. In item 2 above, there are listed three areas where key-transformation functions may differ: when going from one vendor to another (or from one release by the same vendor to another), when going from one hash-table to another of the same type, and when increasing or decreasing the entry count of the table. To this list we can add another more general rule on key-transformations. (4) [they] need not be constant even over a particular "core image" saving and restoration, or over a "memory reorganization" such as a garbage collection. Of course, if a change is made at some point in time in the key- transformation algorithm being used for a particular table, then that table should be "re-hashed" to ensure the continuity of its entries. As has been noted before, many implementations use algorithms for EQ type tables which change after any data is relocated; that is why re-hashing may be required after a "GC". Examples: It is not surprising that in the following example, the value Y cannot be found in the table after it has been altered by the first SETF, even though it could be found before the alteration. (setq ht (make-hash-table :test 'equal)) ==> #<Hash-Table> (setq x '(A (B) (C D)) y (copy-tree x)) ==> (A (B) (C D)) (and (equal x y) (not (eq x y))) ==> T (setf (gethash x ht) T) ==> T (setf (car (second y)) 'E) ==> E (gethash x ht) ==> T (gethash y ht) ==> NIL After all, the :test function will not be able to identify the altered key with with the one originally entered, because at the time gethash is called: (equal x y) ==> NIL However, the circumstances under which the following can fail are not at all obvious: (setq ht (make-hash-table :test 'equal)) ==> #<Hash-Table> (setq x '(A #(B) (C D)) y (copy-tree x)) ==> (A #(B) (C D)) (and (equal x y) (not (eq x y))) ==> T (setf (gethash x ht) T) ==> T (setf (aref (second y)) 'E) ==> E (gethash x ht) ==> T (gethash y ht) ==> ? Note however that: (equal x y) ==> T If the key-transformation function used in this hashtable failed to obey the "not more discriminating" contraint imposed by item 2 above, it might be tempted to descend into the vector #(B) in order to randomize the keys a bit more; but EQUAL on pointer vectors is defined to be EQ. Thus X and Y, while being EQUAL, might fall into different collision chains, and hence not be identified as the same key. On the other hand, EQ/EQL type tables should be impervious to the updates in the above examples: (setq ht (make-hash-table)) ==> #<Hash-Table> (setq y (setq x (copy-tree '(A (B) (C D))))) ==> (A (B) (C D)) (setf (gethash x ht) T) ==> T (gethash x ht) ==> T (setf (car (second y)) 'E) ==> E (gethash y ht) ==> T Thus x and y are "EQ, but not EQUAL" [which only makes sense when they refer to the same object at different points in time]; however, the EQ/EQL-type table is not affected by this. Rationale: The performance expectations about hash-tables, and consequent implementational constraints, need to be formalized. Current practice: Every implementation that the proposer has tried *seems* to satisfy these constraints. Cost to Implementors: None. Cost to Users: None. Cost of non-adoption: Continuing confusion as to what is stable in EQ/EQL tables, and what is stable in EQUAL tables. Possible confusion when it comes to implementing EQUALP tables. Performance impact: N.A. Benefits: See Cost of non-adoption. Esthetics: The proposal more closely relates the term "Hash Table" to the classic use of it in "The Art of Computer Programming", vol 3. Discussion: One of the attractions to Common Lisp is that many common techniques are a required part of the language; C programmers who continue to re-invent hasing techniques over and over have praised CL in particular for hash tables. After all, it is much more likely that efficient, correctly coded algorithms will be provided by the system supplier than that every code writer will understand and correctly apply the information found in Knuth's "The Art of Computer Programming", vol 3. The requirement that the expected number of reprobes be bounded by a "small constant" should not be taken to extreme. In particular, a simple trade-off of space for time can assure some compliance with it. For example, a data set of size N could be partitioned into N/20 subsets; as long as the partitioning function does a fairly good job of balancing the number of elements in each partition class, and as long as the partition function can be quickly calculated, then the one could say that the expected number of probes would be bounded by "about twenty or so". The generally understood meanings of the :rehash-size and :rehash-threshold components of hash-tables may be biased towards an "open-addressing" implementation; but "bucketizing" implementations are not arbitrarily ruled out. This proposal is in no way intended to rule out "bucketizing" implementations of hash tables. Here's an example of how one might analyze the problems relating GC and EQ/EQL type tables: Date: Mon, 12 Sep 88 11:05 EDT From: Barry Margolin <barmar@Think.COM> Subject: MAKE-HASH-TABLE :TEST arg To: "Steve Bacher (Batchman)" <SEB1525@draper.com> Cc: common-lisp@sail.stanford.edu . . . Various aspects of the behavior of a hash table are dependent upon the TEST argument. An EQUAL hash table need not be rehashed after a copying GC. The hash function is generally dependent upon the test function; for an EQUAL hash table it would be SXHASH, while for an EQ hash table it would probably be a simple hash on the address. I suppose you COULD use SXHASH for all hash tables, since EQ objects are necessarily EQUAL, and you COULD rehash ALL hash tables. Or you could implement hash tables without actually hashing (e.g. implement them as alists). But if performance is an issue (which it generally is when you use a hash table), you'll probably want to do things dependent on the test function. barmar This suggestion is not prohibited by CLtL, although it violates the commonly accepted understanding of what "Hash on EQ" means.

**Follow-Ups**:**Issue: HASH-TABLE-STABILITY (version 1)***From:*David A. Moon <Moon@STONY-BROOK.SCRC.Symbolics.COM>

- Prev by Date:
**Re: Issue: STREAM-CAPABILITIES (Version 1)** - Next by Date:
**Issue: GET-MACRO-CHARACTER-READTABLE (version 1)** - Previous by thread:
**Re: Issue: SETF-PLACES (Version 1)** - Next by thread:
**Issue: HASH-TABLE-STABILITY (version 1)** - Index(es):