Hash table

From Wikipedia, the free encyclopedia

Hash table
Type: Unordered associative array
Invented: 1953

Time complexity in big O notation
Algorithm   Average    Worst case
Space       O(n)[1]    O(n)
Search      O(1)       O(n)
Insert      O(1)       O(n)
Delete      O(1)       O(n)
Figure: A small phone book as a hash table.

In computing, a hash table (hash map) is a data structure that implements an associative array abstract data type, a structure that can map keys to values. A hash table uses a hash function to compute an index, also called a hash code, into an array of buckets or slots, from which the desired value can be found. During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored.

Ideally, the hash function will assign each key to a unique bucket, but most hash table designs employ an imperfect hash function, which might cause hash collisions where the hash function generates the same index for more than one key. Such collisions are typically accommodated in some way.

In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key–value pairs, at (amortized[2]) constant average cost per operation.[3][4]

In many situations, hash tables turn out to be on average more efficient than search trees or any other table lookup structure. For this reason, they are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets.

Hashing

The advantage of using hashing is that the table address of a record can be computed directly from the key. Hashing implies a function $h$ that, when applied to a key $x$, produces a hash $h(x)$. However, since the hash could be potentially large, it must be mapped to one of the finitely many entries (or slots) of the hash table. Several methods can be used to map the hash into the table size $k$. The most common is the division method, in which modular arithmetic is used to compute the slot.[5]: 110 

This is often done in two steps: first the key is hashed, $\text{hash} = h(x)$, and then the hash is reduced to a slot index, $\text{index} = \text{hash} \bmod k$.
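The division method can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the helper name `slot_index` is an assumption made for the example, and Python's built-in `hash` stands in for the hash function.

```python
def slot_index(key, table_size):
    """Map a key to a slot with the division method (illustrative helper)."""
    hash_value = hash(key)           # step 1: hash the key to an integer
    return hash_value % table_size   # step 2: reduce the hash to a slot index


# Small non-negative integers hash to themselves in CPython.
print(slot_index(1234, 10))  # 4
```

Python's `%` returns a non-negative result for a positive modulus, so the index is valid even when `hash` returns a negative number.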

Choosing a hash function

A basic requirement is that the function should provide a uniform distribution of hash values. A non-uniform distribution increases the number of collisions and the cost of resolving them. Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically using statistical tests, e.g., a Pearson's chi-squared test for discrete uniform distributions.[6][7]

The distribution needs to be uniform only for table sizes that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of the table size, then the hash function needs to be uniform only when the size is a power of two. Here the index can be computed as some range of bits of the hash function. On the other hand, some hashing algorithms prefer to have the size be a prime number.[8] The modulus operation may provide some additional mixing; this is especially useful with a poor hash function.
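The two ways of deriving an index mentioned above can be compared in a short sketch. The function names are hypothetical, and the example assumes the hash value is already available as an integer.

```python
def index_by_mask(hash_value, table_size):
    """Select the low bits of the hash; valid only for power-of-two sizes."""
    assert table_size > 0 and table_size & (table_size - 1) == 0
    return hash_value & (table_size - 1)


def index_by_modulo(hash_value, table_size):
    """Division method; works for any positive size, including primes."""
    return hash_value % table_size


h = 0b1011_0110  # 182
print(index_by_mask(h, 16), index_by_modulo(h, 16), index_by_modulo(h, 13))  # 6 6 0
```

For a power-of-two size the mask and the modulus give the same index, but the mask ignores the high bits of the hash entirely, which is why a poorly mixed hash function benefits from a prime modulus.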

For open addressing schemes, the hash function should also avoid clustering, the mapping of two or more keys to consecutive slots. Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent. The popular multiplicative hash[3] is claimed to have particularly poor clustering behavior.[8]

Cryptographic hash functions are believed to provide good hash functions for any table size, either by modulo reduction or by bit masking[citation needed]. They may also be appropriate if there is a risk of malicious users trying to sabotage a network service by submitting requests designed to generate a large number of collisions in the server's hash tables. However, the risk of sabotage can also be avoided by cheaper methods (such as applying a secret salt to the data, or using a universal hash function). A drawback of cryptographic hashing functions is that they are often slower to compute, which means that in cases where the uniformity for any size is not necessary, a non-cryptographic hashing function might be preferable.[citation needed]
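One way to apply a secret to the hash, so that an attacker cannot predict which keys collide, is a keyed hash. The sketch below uses the keyed mode of BLAKE2b from Python's `hashlib`; the names `SECRET_KEY` and `keyed_index` are assumptions made for this example, and a production table would more likely use a faster keyed function.

```python
import hashlib
import os

SECRET_KEY = os.urandom(16)  # per-process secret (hypothetical name)


def keyed_index(key: bytes, table_size: int, secret: bytes = SECRET_KEY) -> int:
    """Bucket index derived from a keyed BLAKE2b digest (illustrative only)."""
    digest = hashlib.blake2b(key, key=secret, digest_size=8).digest()
    return int.from_bytes(digest, "big") % table_size


print(keyed_index(b"alice", 64))  # varies from run to run
```

Because the secret changes with every process, a set of keys crafted to collide under one secret is unlikely to collide under another.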

K-independent hashing offers a way to prove that a certain hash function does not have bad key sets for a given type of hash table. A number of such results are known for collision resolution schemes such as linear probing and cuckoo hashing. Since K-independence can prove that a hash function works, one can then focus on finding the fastest possible such hash function.

Perfect hash function

If all keys are known ahead of time, a perfect hash function can be used to create a perfect hash table that has no collisions.[9] If minimal perfect hashing is used, every location in the hash table can be used as well.[10]

Perfect hashing allows for constant time lookups in all cases. This is in contrast to most chaining and open addressing methods, where the time for lookup is low on average, but may be very large, O(n), for instance when all the keys hash to a few values.

Key statistics

A critical statistic for a hash table is the load factor, defined as

$\text{load factor}\ (\alpha) = \frac{n}{k}$,

where

  • $n$ is the number of entries occupied in the hash table.
  • $k$ is the number of buckets.

The performance of the hash table worsens as the load factor ($\alpha$) grows, i.e., as $\alpha$ approaches 1. Hence, it is essential to resize (or "rehash") the hash table when the load factor exceeds an ideal value. It can also be worthwhile to resize the hash table to a smaller size, which is usually done when the load factor drops below a lower threshold.[11] Generally, a load factor between 0.6 and 0.75 is acceptable.[12][5]: 110 

Collision resolution

The search algorithm that uses hashing consists of two parts. The first part is computing a hash function, which transforms the search key into an array index. The ideal case is that no two search keys hash to the same array index; however, this is not always the case, since it is theoretically impossible.[13]: 515  Hence the second part of the algorithm is collision resolution. The two common methods for collision resolution are separate chaining and open addressing.[14]: 458 

Separate chaining

Figure: Hash collision resolved by separate chaining.
Figure: Hash collision resolved by separate chaining with head records in the bucket array.

Hashing is an example of a space-time tradeoff. If memory were infinite, a single memory access using the key as an index into a (potentially huge) array would retrieve the value, which also implies that the set of possible key values is huge. On the other hand, if time were infinite, the values could be stored in the minimum possible memory and retrieved by a linear search through the array.[14]: 458  In separate chaining, a linked list of key-value pairs is built for each array index. Items that collide are chained together in a singly linked list, which can be traversed to reach the item with a given search key.[14]: 464  Collision resolution through chaining, i.e., with a linked list, is a common method of implementation. The operations involved are as follows:[15]: 258 

Chained-Hash-Insert(T, x)
  insert x at the head of linked list T[h(x.key)]
Chained-Hash-Search(T, k)
  search for an element with key k in linked list T[h(k)]
Chained-Hash-Delete(T, x)
  delete x from the linked list T[h(x.key)]
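The three chained operations can be sketched in Python as below. This is an illustrative sketch under stated assumptions: the class name `ChainedHashTable` is hypothetical, each chain is a Python list standing in for a linked list, and, unlike the pseudocode, `insert` overwrites the value when the key is already present.

```python
class ChainedHashTable:
    """Separate chaining; each bucket holds a chain of (key, value) pairs."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.insert(0, (key, value))  # new entries go to the head of the chain

    def search(self, key):
        for k, v in self._bucket(key):  # walk the chain for this bucket
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False


table = ChainedHashTable(num_buckets=2)
table.insert(1, "a")
table.insert(3, "b")    # 1 and 3 share a bucket, so they are chained
print(table.search(3))  # b
```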

If the keys of the elements are ordered, that is, comparable either numerically or lexically, it is efficient to keep each list sorted by key when inserting, which results in faster insertions and unsuccessful searches.[13]: 520-521  However, the standard linked-list method is not cache-conscious: the nodes of the list are scattered across memory, so there is little spatial locality (locality of reference) and the CPU cache is used inefficiently.[16]: 91 

Separate chaining with other structures

If the keys are ordered, it could be efficient to use "self-organizing" concepts such as a self-balancing binary search tree, through which the theoretical worst case could be brought down to $O(\log n)$, although this introduces additional complexity.[13]: 521 

In cache-conscious variants, a dynamic array, found to be more cache-friendly, is used in place of the linked list or self-balancing binary search tree usually deployed for collision resolution through separate chaining, since the contiguous allocation pattern of the array can be exploited by hardware cache prefetchers and the translation lookaside buffer, resulting in reduced access time and memory consumption.[17][18][19]

In dynamic perfect hashing, two-level hash tables are used to reduce the look-up complexity to a guaranteed $O(1)$ in the worst case. In this technique, a bucket of $j$ entries is organized as a perfect hash table with $j^2$ slots, providing constant worst-case lookup time and low amortized time for insertion.[20] A study shows array-based separate chaining to be 97% more performant than the standard linked list method under heavy load.[16]: 99 

Techniques such as using a fusion tree for each bucket also result in constant time for all operations with high probability.[21]

Open addressing

Figure: Hash collision resolved by open addressing with linear probing (interval=1). Note that "Ted Baker" has a unique hash, but nevertheless collided with "Sandra Dee", which had previously collided with "John Smith".
Figure: Graph comparing the average number of CPU cache misses required to look up elements in large hash tables (far exceeding the size of the cache) with chaining and linear probing. Linear probing performs better due to better locality of reference, though as the table gets full, its performance degrades drastically.

In another strategy, called open addressing, all entry records are stored in the bucket array itself. When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found. When searching for an entry, the buckets are scanned in the same sequence, until either the target record is found, or an unused array slot is found, which indicates that there is no such key in the table.[22] The name "open addressing" refers to the fact that the location ("address") of the item is not determined by its hash value. (This method is also called closed hashing; it should not be confused with "open hashing" or "closed addressing", which usually mean separate chaining.)

Well-known probe sequences include:

  • Linear probing, in which the interval between probes is fixed (usually 1). Since the slots are located in successive locations, linear probing could lead to better utilization of CPU cache due to locality of references.[23]
  • Quadratic probing, in which the interval between probes is increased by adding the successive outputs of a quadratic polynomial to the starting value given by the original hash computation.
  • Double hashing, in which the interval between probes is computed by a second hash function.
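The three probe sequences can be written as small Python generators. This is a sketch under stated assumptions: the function names are hypothetical, the quadratic polynomial chosen is the triangular-number sequence (i + i²)/2, and the double-hashing step is forced to be non-zero.

```python
from itertools import islice


def linear_probe(start, size, step=1):
    """Yield start, start + step, start + 2*step, ... (mod size)."""
    i = 0
    while True:
        yield (start + i * step) % size
        i += 1


def quadratic_probe(start, size):
    """Yield start plus the offsets 0, 1, 3, 6, ... given by (i + i*i) / 2."""
    i = 0
    while True:
        yield (start + (i + i * i) // 2) % size
        i += 1


def double_hash_probe(start, size, second_hash):
    """Yield slots whose interval is fixed by a second hash value."""
    step = second_hash % (size - 1) + 1  # in 1..size-1, never zero
    i = 0
    while True:
        yield (start + i * step) % size
        i += 1


print(list(islice(linear_probe(5, 8), 4)))           # [5, 6, 7, 0]
print(list(islice(quadratic_probe(5, 8), 4)))        # [5, 6, 0, 3]
print(list(islice(double_hash_probe(5, 7, 10), 4)))  # [5, 3, 1, 6]
```

The double-hashing example uses a prime table size so that every possible step is coprime to the size and the sequence can reach every slot.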

In practice, open addressing performs worse than separate chaining that uses an array of buckets for collision resolution,[16]: 93  since a longer sequence of array indices may need to be tried to find a given element when the load factor approaches 1.[11] The load factor must be kept below 1: if it reaches 1, as in a completely full table, a search miss would loop through the table indefinitely.[14]: 471  The average cost of linear probing depends on the chosen hash function's ability to distribute the keys uniformly throughout the table, since the formation of clusters increases search time.[14]: 472 
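A minimal sketch of open addressing with linear probing is shown below. The class name `LinearProbingTable` is hypothetical, deletion is omitted (it would need tombstones or re-insertion), and each probe is bounded by the table size, so a full table produces an error instead of an endless scan.

```python
_EMPTY = object()  # sentinel marking a never-used slot


class LinearProbingTable:
    def __init__(self, capacity=8):
        self.keys = [_EMPTY] * capacity
        self.values = [None] * capacity
        self.count = 0

    def _probe(self, key):
        """Yield slot indices starting at the hashed-to slot, interval 1."""
        size = len(self.keys)
        start = hash(key) % size
        for i in range(size):
            yield (start + i) % size

    def insert(self, key, value):
        for slot in self._probe(key):
            if self.keys[slot] is _EMPTY:
                self.count += 1
                self.keys[slot], self.values[slot] = key, value
                return
            if self.keys[slot] == key:   # existing key: overwrite the value
                self.values[slot] = value
                return
        raise RuntimeError("table is full; a real table would resize earlier")

    def search(self, key):
        for slot in self._probe(key):
            if self.keys[slot] is _EMPTY:  # unused slot: key cannot be present
                return None
            if self.keys[slot] == key:
                return self.values[slot]
        return None                        # scanned every slot without a hit


table = LinearProbingTable(capacity=4)
for key, value in [(0, "a"), (4, "b"), (8, "c")]:  # all three hash to slot 0
    table.insert(key, value)
print(table.keys[:3])  # [0, 4, 8]
```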

Coalesced hashing

A hybrid of chaining and open addressing, coalesced hashing links together chains of nodes within the table itself.[22] Like open addressing, it achieves space usage and (somewhat diminished) cache advantages over chaining. Like chaining, it does not exhibit clustering effects; in fact, the table can be efficiently filled to a high density. Unlike chaining, it cannot have more elements than table slots.

Cuckoo hashing

Another alternative open-addressing solution is cuckoo hashing, which ensures constant lookup and deletion time in the worst case, and constant amortized time for insertions (with low probability that the worst-case will be encountered). It uses two or more hash functions, which means any key/value pair could be in two or more locations. For lookup, the first hash function is used; if the key/value is not found, then the second hash function is used, and so on. If a collision happens during insertion, then the key is re-hashed with the second hash function to map it to another bucket. If all hash functions are used and there is still a collision, then the key it collided with is removed to make space for the new key, and the old key is re-hashed with one of the other hash functions, which maps it to another bucket. If that location also results in a collision, then the process repeats until there is no collision or the process traverses all the buckets, at which point the table is resized. By combining multiple hash functions with multiple cells per bucket, very high space utilization can be achieved.[citation needed]
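The displacement process can be sketched as follows. This is not a definitive implementation: it uses the common two-table formulation, in which a new key always goes into its slot in the first table and evicts any occupant, rather than trying every hash function before evicting. The class name, the `MAX_KICKS` bound and the way the two hash functions are derived from Python's `hash` are all assumptions made for the example.

```python
class CuckooHashTable:
    """Two-table cuckoo hashing; each key lives in one of two possible slots."""

    MAX_KICKS = 32  # assumed bound on displacements before the table grows

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.tables = [[None] * capacity, [None] * capacity]

    def _slot(self, which, key):
        # Two hash functions derived from Python's hash(); illustrative only.
        return hash((which, key)) % self.capacity

    def lookup(self, key):
        for which in (0, 1):  # at most two probes, even in the worst case
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        for which in (0, 1):  # overwrite in place if the key is already stored
            slot = self._slot(which, key)
            entry = self.tables[which][slot]
            if entry is not None and entry[0] == key:
                self.tables[which][slot] = (key, value)
                return
        entry, which = (key, value), 0
        for _ in range(self.MAX_KICKS):
            slot = self._slot(which, entry[0])
            # Place the carried entry and pick up whatever was there before.
            entry, self.tables[which][slot] = self.tables[which][slot], entry
            if entry is None:  # the slot was empty: done
                return
            which = 1 - which  # the evicted entry tries its other table
        self._grow(entry)

    def _grow(self, pending):
        """Double the capacity and re-insert everything, including `pending`."""
        old = [e for table in self.tables for e in table if e is not None]
        self.capacity *= 2
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        for k, v in old + [pending]:
            self.insert(k, v)


table = CuckooHashTable(capacity=4)
for i in range(50):
    table.insert(i, i * i)
print(table.lookup(7))  # 49
```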

Hopscotch hashing

Another alternative open-addressing solution is hopscotch hashing,[24] which combines the approaches of cuckoo hashing and linear probing, yet seems in general to avoid their limitations. In particular it works well even when the load factor grows beyond 0.9. The algorithm is well suited for implementing a resizable concurrent hash table.

The hopscotch hashing algorithm works by defining a neighborhood of buckets near the original hashed bucket, where a given entry is always found. Thus, search is limited to the number of entries in this neighborhood, which is logarithmic in the worst case, constant on average, and with proper alignment of the neighborhood typically requires one cache miss. When inserting an entry, one first attempts to add it to a bucket in the neighborhood. However, if all buckets in this neighborhood are occupied, the algorithm traverses buckets in sequence until an open slot (an unoccupied bucket) is found (as in linear probing). At that point, since the empty bucket is outside the neighborhood, items are repeatedly displaced in a sequence of hops. (This is similar to cuckoo hashing, but with the difference that in this case the empty slot is being moved into the neighborhood, instead of items being moved out with the hope of eventually finding an empty slot.) Each hop brings the open slot closer to the original neighborhood, without invalidating the neighborhood property of any of the buckets along the way. In the end, the open slot has been moved into the neighborhood, and the entry being inserted can be added to it.[citation needed]

Robin Hood hashing

A variation on double-hashing collision resolution is Robin Hood hashing.[25][26] The idea is that a new key may displace a key already inserted, if its probe count is larger than that of the key at the current position. The net effect of this is that it reduces worst case search times in the table. This is similar to ordered hash tables[27] except that the criterion for bumping a key does not depend on a direct relationship between the keys. Since both the worst case and the variation in the number of probes are reduced dramatically, an interesting variation is to probe the table starting at the expected successful probe value and then expand from that position in both directions.[28] External Robin Hood hashing is an extension of this algorithm where the table is stored in an external file and each table position corresponds to a fixed-sized page or bucket with B records.[29]
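The displacement rule can be illustrated with a short sketch. To keep the example small it assumes linear probing rather than double hashing, stores the probe count next to each entry, and refuses to fill the last free slot; the class name `RobinHoodTable` is hypothetical.

```python
class RobinHoodTable:
    def __init__(self, capacity=8):
        self.slots = [None] * capacity  # each slot: (key, value, probe_count)
        self.count = 0

    def insert(self, key, value):
        size = len(self.slots)
        if self.count >= size - 1:
            raise RuntimeError("table nearly full; a real table would resize")
        pos = hash(key) % size
        entry = (key, value, 0)
        while True:
            resident = self.slots[pos]
            if resident is None:
                self.slots[pos] = entry
                self.count += 1
                return
            if resident[0] == entry[0]:  # same key: overwrite the value
                self.slots[pos] = (entry[0], entry[1], resident[2])
                return
            if resident[2] < entry[2]:   # resident has probed less: displace it
                self.slots[pos], entry = entry, resident
            pos = (pos + 1) % size
            entry = (entry[0], entry[1], entry[2] + 1)

    def search(self, key):
        size = len(self.slots)
        pos = hash(key) % size
        distance = 0
        while True:
            resident = self.slots[pos]
            # An empty slot, or a resident that has probed less than we have,
            # means the key would have been placed here already if it existed.
            if resident is None or resident[2] < distance:
                return None
            if resident[0] == key:
                return resident[1]
            pos = (pos + 1) % size
            distance += 1


table = RobinHoodTable(capacity=8)
table.insert(0, "a")
table.insert(1, "b")
table.insert(8, "c")  # collides with 0, then displaces 1 from slot 1
print([slot[0] for slot in table.slots[:3]])  # [0, 8, 1]
```

The early exit in `search` is what shortens unsuccessful lookups: the scan stops as soon as it meets an entry that sits closer to its home slot than the searched key would.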

2-choice hashing

2-choice hashing employs two different hash functions, h1(x) and h2(x), for the hash table. Both hash functions are used to compute two table locations. When an object is inserted in the table, it is placed in the table location that contains fewer objects (with the default being the h1(x) table location if there is equality in bucket size). 2-choice hashing employs the principle of the power of two choices.[30]
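A minimal sketch of 2-choice insertion follows. The class name `TwoChoiceTable` is hypothetical, buckets are plain Python lists, and the two hash functions are derived from Python's `hash` purely for illustration.

```python
class TwoChoiceTable:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _h1(self, key):
        return hash((1, key)) % len(self.buckets)

    def _h2(self, key):
        return hash((2, key)) % len(self.buckets)

    def contains(self, key):
        return (key in self.buckets[self._h1(key)]
                or key in self.buckets[self._h2(key)])

    def insert(self, key):
        if self.contains(key):
            return
        b1 = self.buckets[self._h1(key)]
        b2 = self.buckets[self._h2(key)]
        # Place the key in the emptier bucket; ties go to the h1 bucket.
        target = b1 if len(b1) <= len(b2) else b2
        target.append(key)


table = TwoChoiceTable(num_buckets=4)
for i in range(20):
    table.insert(i)
print([len(bucket) for bucket in table.buckets])
```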

Dynamic resizing

When an insert is made such that the number of entries in a hash table exceeds the product of the load factor and the current capacity then the hash table will need to be rehashed.[31] Rehashing includes increasing the size of the underlying data structure[31] and mapping existing items to new bucket locations. In some implementations, if the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.[31]

To limit the proportion of memory wasted due to empty buckets, some implementations also shrink the size of the table—followed by a rehash—when items are deleted. From the point of space–time tradeoffs, this operation is similar to the deallocation in dynamic arrays.

Resizing by copying all entries

A common approach is to automatically trigger a complete resizing when the load factor exceeds some threshold $r_{\max}$. Then a new larger table is allocated, each entry is removed from the old table, and inserted into the new table. When all entries have been removed from the old table then the old table is returned to the free storage pool. Likewise, when the load factor falls below a second threshold $r_{\min}$, all entries are moved to a new smaller table.

For hash tables that shrink and grow frequently, the resizing downward can be skipped entirely. In this case, the table size is proportional to the maximum number of entries that ever were in the hash table at one time, rather than the current number. The disadvantage is that memory usage will be higher, and thus cache behavior may be worse. For best control, a "shrink-to-fit" operation can be provided that does this only on request.

If the table size increases or decreases by a fixed percentage at each expansion, the total cost of these resizings, amortized over all insert and delete operations, is still a constant, independent of the number of entries n and of the number m of operations performed.

For example, consider a table that was created with the minimum possible size and is doubled each time the load ratio exceeds some threshold. If m elements are inserted into that table, the total number of extra re-insertions that occur in all dynamic resizings of the table is at most m − 1. In other words, dynamic resizing roughly doubles the cost of each insert or delete operation.

Alternatives to all-at-once rehashing

Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform the resizing gradually.

Disk-based hash tables almost always use some alternative to all-at-once rehashing, since the cost of rebuilding the entire table on disk would be too high.

Incremental resizing

One alternative to enlarging the table all at once is to perform the rehashing gradually:

  • During the resize, allocate the new hash table, but keep the old table unchanged.
  • In each lookup or delete operation, check both tables.
  • Perform insertion operations only in the new table.
  • At each insertion also move r elements from the old table to the new table.
  • When all elements are removed from the old table, deallocate it.

To ensure that the old table is completely copied over before the new table itself needs to be enlarged, it is necessary to increase the size of the table by a factor of at least (r + 1)/r during resizing.
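The steps above can be sketched in Python. In this illustration Python dicts stand in for the fixed-capacity old and new tables so that only the migration logic is shown; the class name `IncrementalTable` and the parameter `moves_per_insert` (the r of the text) are assumptions, and the table grows by a factor of 2, which satisfies the (r + 1)/r bound for any r ≥ 1.

```python
class IncrementalTable:
    """Incremental resizing; dicts stand in for fixed-capacity tables."""

    def __init__(self, capacity=4, moves_per_insert=2):
        self.capacity = capacity
        self.r = moves_per_insert
        self.new = {}     # receives all insertions
        self.old = None   # table being drained, or None when not resizing

    def insert(self, key, value):
        if self.old is None and len(self.new) >= self.capacity:
            # Start a resize: keep the old table and allocate a larger one.
            self.old, self.new = self.new, {}
            self.capacity *= 2
        if self.old is not None:
            self.old.pop(key, None)        # drop any stale copy of this key
            for _ in range(min(self.r, len(self.old))):
                k, v = self.old.popitem()  # move r elements per insertion
                self.new[k] = v
            if not self.old:
                self.old = None            # old table drained: deallocate it
        self.new[key] = value

    def lookup(self, key):
        if key in self.new:
            return self.new[key]
        if self.old is not None:
            return self.old.get(key)       # during a resize, check both tables
        return None

    def delete(self, key):
        in_new = self.new.pop(key, self) is not self
        in_old = self.old is not None and self.old.pop(key, self) is not self
        return in_new or in_old


table = IncrementalTable(capacity=4, moves_per_insert=2)
for i in range(5):
    table.insert(i, i * 10)
print(len(table.old), len(table.new))  # 2 3
```

After the fifth insertion the resize is in progress: two entries are still in the old table, and lookups consult both tables until it is drained.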

Monotonic keys

If it is known that keys will be stored in monotonically increasing (or decreasing) order, then a variation of consistent hashing can be achieved.

Given some initial key $k_1$, a subsequent key $k_i$ partitions the key domain $[k_1, \infty)$ into the set $\{[k_1, k_i), [k_i, \infty)\}$. In general, repeating this process gives a finer partition $\{[k_1, k_{i_0}), [k_{i_0}, k_{i_1}), \ldots, [k_{i_{n-1}}, k_{i_n}), [k_{i_n}, \infty)\}$ for some sequence of monotonically increasing keys $(k_{i_0}, \ldots, k_{i_n})$, where $n$ is the number of refinements. The same process applies, mutatis mutandis, to monotonically decreasing keys. By assigning to each subinterval of this partition a different hash function or hash table (or both), and by refining the partition whenever the hash table is resized, this approach guarantees that any key's hash, once issued, will never change, even when the hash table is grown.

Since it is common to grow the overall number of entries by doubling, there will only be O(log(N)) subintervals to check, and binary search time for the redirection will be O(log(log(N))).

Linear hashing

Linear hashing[32] is a hash table algorithm that permits incremental hash table expansion. It is implemented using a single hash table, but with two possible lookup functions.

Hashing for distributed hash tables

Another way to decrease the cost of table resizing is to choose a hash function in such a way that the hashes of most values do not change when the table is resized. Such hash functions are prevalent in disk-based and distributed hash tables, where rehashing is prohibitively costly. The problem of designing a hash such that most values do not change when the table is resized is known as the distributed hash table problem. The four most popular approaches are rendezvous hashing, consistent hashing, the content addressable network algorithm, and Kademlia distance.
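As an illustration of a hash whose assignments mostly survive a resize, the sketch below implements rendezvous (highest-random-weight) hashing, one of the approaches named above. The node names, the helper `_score` and the use of SHA-256 are assumptions made for the example.

```python
import hashlib


def _score(node: str, key: str) -> int:
    """Deterministic pseudo-random weight for a (node, key) pair."""
    digest = hashlib.sha256(f"{node}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def owner(key: str, nodes: list[str]) -> str:
    """Rendezvous hashing: the node with the highest score owns the key."""
    return max(nodes, key=lambda node: _score(node, key))


nodes = ["node-a", "node-b", "node-c", "node-d"]
keys = [f"key-{i}" for i in range(1000)]
before = {k: owner(k, nodes) for k in keys}
after = {k: owner(k, nodes[:-1]) for k in keys}  # "node-d" leaves
moved = [k for k in keys if before[k] != after[k]]

# Only keys that lived on the removed node are reassigned.
print(all(before[k] == "node-d" for k in moved))  # True
```

Every other key keeps its owner, because removing a node cannot change which of the remaining nodes scores highest for that key.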

Performance

Speed analysis

In the simplest model, the hash function is completely unspecified and the table does not resize. With an ideal hash function, a table of size $k$ with open addressing has no collisions and holds up to $k$ elements with a single comparison for successful lookup, while a table of size $k$ with chaining and $n$ keys has the minimum $\max(0, n - k)$ collisions and $\Theta(1 + n/k)$ comparisons for lookup. With the worst possible hash function, every insertion causes a collision, and hash tables degenerate to linear search, with $\Theta(n)$ amortized comparisons per insertion and up to $n$ comparisons for a successful lookup.

Adding rehashing to this model is straightforward. As in a dynamic array, geometric resizing by a factor of $b$ implies that only $n/b^i$ keys are inserted $i$ or more times, so that the total number of insertions is bounded above by $bn/(b - 1)$, which is $\Theta(n)$. By using rehashing to maintain $n < k$, tables using both chaining and open addressing can have unlimited elements and perform successful lookup in a single comparison for the best choice of hash function.

In more realistic models, the hash function is a random variable over a probability distribution of hash functions, and performance is computed on average over the choice of hash function. When this distribution is uniform, the assumption is called "simple uniform hashing" and it can be shown that hashing with chaining requires $\Theta(1 + n/k)$ comparisons on average for an unsuccessful lookup, and hashing with open addressing requires $\Theta\left(1/(1 - n/k)\right)$.[33] Both these bounds are constant if we maintain $n/k < c$ using table resizing, where $c$ is a fixed constant less than 1.

Two factors significantly affect the latency of operations on a hash table:[34]

  • Cache misses. As the load factor increases, the search and insertion performance of hash tables can degrade considerably because the average number of cache misses rises.
  • Cost of resizing. Resizing becomes an extremely time-consuming task when hash tables grow massive.

In latency-sensitive programs, the time consumed by operations in both the average and the worst case is required to be small, stable, and even predictable. The K hash table[35] is designed for a general scenario of low-latency applications, aiming to achieve cost-stable operations on a growing, huge-sized table.

Memory utilization

Sometimes the memory requirement for a table needs to be minimized. One way to reduce memory usage in chaining methods is to eliminate some of the chaining pointers or to replace them with some form of abbreviated pointers.

Another technique was introduced by Donald Knuth[36] where it is called abbreviated keys. (Bender et al. wrote that Knuth called it quotienting.[37]) For this discussion assume that the key, or a reversibly-hashed version of that key, is an integer m from {0, 1, 2, ..., M-1} and the number of buckets is N. m is divided by N to produce a quotient q and a remainder r. The remainder r is used to select the bucket; in the bucket only the quotient q need be stored. This saves log2(N) bits per element, which can be significant in some applications.[36]
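The arithmetic behind quotienting is small enough to show directly. In this sketch the function names are hypothetical and the key is assumed to be a non-negative integer already.

```python
def split_key(m, num_buckets):
    """Return (bucket index r, stored quotient q) for an integer key m."""
    q, r = divmod(m, num_buckets)
    return r, q


def restore_key(r, q, num_buckets):
    """Rebuild the original key from the bucket index and the stored quotient."""
    return q * num_buckets + r


N = 1024                        # number of buckets
r, q = split_key(123_456, N)
print(r, q)                     # 576 120
print(restore_key(r, q, N))     # 123456
```

Only `q` is stored in bucket `r`; with N = 1024 buckets this leaves out the 10 low-order bits (log2(N)) of every key.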

Quotienting works readily with chaining hash tables, or with simple cuckoo hash tables. To apply the technique with ordinary open-addressing hash tables, John G. Cleary introduced a method[38] where two bits (a virgin bit and a change bit) are included in each bucket to allow the original bucket index (r) to be reconstructed.

In the scheme just described, log2(M/N) + 2 bits are used to store each key. It is interesting to note that the theoretical minimum storage would be log2(M/N) + 1.4427 bits where 1.4427 = log2(e).

Features

Advantages

  • The main advantage of hash tables over other table data structures is speed. This advantage is more apparent when the number of entries is large. Hash tables are particularly efficient when the maximum number of entries can be predicted, so that the bucket array can be allocated once with the optimum size and never resized.
  • If the set of key–value pairs is fixed and known ahead of time (so insertions and deletions are not allowed), one may reduce the average lookup cost by a careful choice of the hash function, bucket table size, and internal data structures. In particular, one may be able to devise a hash function that is collision-free, or even perfect. In this case the keys need not be stored in the table.

Drawbacks

  • Although operations on a hash table take constant time on average, the cost of a good hash function can be significantly higher than the inner loop of the lookup algorithm for a sequential list or search tree. Thus hash tables are not effective when the number of entries is very small. (However, in some cases the high cost of computing the hash function can be mitigated by saving the hash value together with the key.)
  • For certain string processing applications, such as spell-checking, hash tables may be less efficient than tries, finite automata, or Judy arrays. Also, if there are not too many possible keys to store—that is, if each key can be represented by a small enough number of bits—then, instead of a hash table, one may use the key directly as the index into an array of values. Note that there are no collisions in this case.
  • The entries stored in a hash table can be enumerated efficiently (at constant cost per entry), but only in some pseudo-random order. Therefore, there is no efficient way to locate an entry whose key is nearest to a given key. Listing all n entries in some specific order generally requires a separate sorting step, whose cost is proportional to log(n) per entry. In comparison, ordered search trees have lookup and insertion cost proportional to log(n), but allow finding the nearest key at about the same cost, and ordered enumeration of all entries at constant cost per entry. However, a LinkedHashMap can be made to create a hash table with a non-random sequence.[39]
  • If the keys are not stored (because the hash function is collision-free), there may be no easy way to enumerate the keys that are present in the table at any given moment.
  • Although the average cost per operation is constant and fairly small, the cost of a single operation may be quite high. In particular, if the hash table uses dynamic resizing, an insertion or deletion operation may occasionally take time proportional to the number of entries. This may be a serious drawback in real-time or interactive applications.
  • Hash tables in general exhibit poor locality of reference—that is, the data to be accessed is distributed seemingly at random in memory. Because hash tables cause access patterns that jump around, this can trigger microprocessor cache misses that cause long delays. Compact data structures such as arrays searched with linear search may be faster, if the table is relatively small and keys are compact. The optimal performance point varies from system to system.
  • Hash tables become quite inefficient when there are many collisions. While extremely uneven hash distributions are extremely unlikely to arise by chance, a malicious adversary with knowledge of the hash function may be able to supply information to a hash that creates worst-case behavior by causing excessive collisions, resulting in very poor performance, e.g., a denial of service attack.[40][41][42] In critical applications, a data structure with better worst-case guarantees can be used; however, universal hashing or a keyed hash function, both of which prevent the attacker from predicting which inputs cause worst-case behavior, may be preferable.[43][44]
    • The hash function used by the hash table in the Linux routing table cache was changed with Linux version 2.4.2 as a countermeasure against such attacks.[45] The ad hoc short-keyed hash construction was later updated to use a "jhash", and then SipHash.[46]

Uses

Associative arrays

Hash tables are commonly used to implement many types of in-memory tables. They are used to implement associative arrays (arrays whose indices are arbitrary strings or other complicated objects), especially in interpreted programming languages like Ruby, Python, and PHP.

When storing a new item into a multimap and a hash collision occurs, the multimap unconditionally stores both items.

When storing a new item into a typical associative array and a hash collision occurs, but the actual keys themselves are different, the associative array likewise stores both items. However, if the key of the new item exactly matches the key of an old item, the associative array typically erases the old item and overwrites it with the new item, so every item in the table has a unique key.

Database indexing

Hash tables may also be used as disk-based data structures and database indices (such as in dbm) although B-trees are more popular in these applications. In multi-node database systems, hash tables are commonly used to distribute rows amongst nodes, reducing network traffic for hash joins.

Caches

Hash tables can be used to implement caches, auxiliary data tables that are used to speed up the access to data that is primarily stored in slower media. In this application, hash collisions can be handled by discarding one of the two colliding entries—usually erasing the old item that is currently stored in the table and overwriting it with the new item, so every item in the table has a unique hash value.
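The overwrite-on-collision policy can be sketched as a tiny cache. The class name `DirectMappedCache` and its method names are assumptions made for this illustration.

```python
class DirectMappedCache:
    """Cache in which a colliding entry simply replaces the old one."""

    def __init__(self, num_slots=64):
        self.slots = [None] * num_slots

    def put(self, key, value):
        # No collision resolution: whatever was in the slot is discarded.
        self.slots[hash(key) % len(self.slots)] = (key, value)

    def get(self, key):
        entry = self.slots[hash(key) % len(self.slots)]
        if entry is not None and entry[0] == key:
            return entry[1]
        return None  # miss: never stored, or evicted by a colliding key


cache = DirectMappedCache(num_slots=4)
cache.put(1, "a")
cache.put(5, "b")    # 1 and 5 map to the same slot, so 1 is evicted
print(cache.get(1))  # None
print(cache.get(5))  # b
```

A miss is harmless here because the data can always be fetched again from the slower medium.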

Sets

Besides recovering the entry that has a given key, many hash table implementations can also tell whether such an entry exists or not.

Those structures can therefore be used to implement a set data structure,[47] which merely records whether a given key belongs to a specified set of keys. In this case, the structure can be simplified by eliminating all parts that have to do with the entry values. Hashing can be used to implement both static and dynamic sets.

Object representation

Several dynamic languages, such as Perl, Python, JavaScript, Lua, and Ruby, use hash tables to implement objects. In this representation, the keys are the names of the members and methods of the object, and the values are pointers to the corresponding member or method.

Unique data representation

Hash tables can be used by some programs to avoid creating multiple character strings with the same contents. For that purpose, all strings in use by the program are stored in a single string pool implemented as a hash table, which is checked whenever a new string has to be created. This technique was introduced in Lisp interpreters under the name hash consing, and can be used with many other kinds of data (expression trees in a symbolic algebra system, records in a database, files in a file system, binary decision diagrams, etc.).
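A string pool of this kind can be sketched with a Python dict, which is itself a hash table. The class name `StringPool` is hypothetical; Python also offers a built-in pool through `sys.intern`.

```python
class StringPool:
    """Keep one shared object per distinct string value."""

    def __init__(self):
        self._pool = {}

    def intern(self, s: str) -> str:
        # Return the pooled copy if an equal string exists, else pool this one.
        return self._pool.setdefault(s, s)


pool = StringPool()
a = pool.intern("".join(["hash", "table"]))
b = pool.intern("".join(["hash", "ta", "ble"]))
print(a == b, a is b)  # True True
```

Once strings are pooled, equality can be checked by comparing object identity instead of comparing contents character by character.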

Transposition table

A transposition table is a complex hash table that stores information about each section that has been searched.[48]
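The idea can be illustrated on a deliberately small game. In the Python sketch below, players alternately take one, two, or three stones from a pile, and whoever takes the last stone wins. The game, the function name can_win, and the use of a plain dict keyed by the pile size are assumptions chosen for this example; transposition tables in real game engines store richer information per position.

```python
def can_win(stones, table):
    """Return True if the player to move can force a win with `stones` left."""
    if stones in table:            # position already searched: reuse the stored result
        return table[stones]
    # The player to move wins if some move leaves the opponent in a losing position.
    # With no stones left there is no move, so the player to move has lost.
    result = any(not can_win(stones - take, table)
                 for take in (1, 2, 3) if take <= stones)
    table[stones] = result         # record the result for later lookups
    return result

table = {}
print(can_win(21, table))   # True
print(can_win(20, table))   # False
```

Without the table, the same pile sizes would be searched again each time they are reached through a different sequence of moves.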

Implementations

In programming languages

Many programming languages provide hash table functionality, either as built-in associative arrays or as standard library modules. In C++11, for example, the unordered_map class provides hash tables for keys and values of arbitrary type.

The Java programming language (including the variant which is used on Android) includes the HashSet, HashMap, LinkedHashSet, and LinkedHashMap generic collections.[49]

In PHP 5 and 7, the Zend 2 engine and the Zend 3 engine (respectively) use one of the hash functions from Daniel J. Bernstein to generate the hash values used in managing the mappings of data pointers stored in a hash table. In the PHP source code, it is labelled as DJBX33A (Daniel J. Bernstein, Times 33 with Addition).
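The "times 33 with addition" scheme can be sketched in Python as follows. The starting value 5381 is the one commonly associated with this function, and the 32-bit mask is an assumption made here to keep the result bounded; PHP's own implementation differs in details such as word width and loop unrolling, so this sketch is not expected to reproduce PHP's exact hash values.

```python
def djbx33a(data: bytes) -> int:
    """Times-33-with-addition hash over a byte string (illustrative sketch)."""
    h = 5381                                # conventional starting value
    for byte in data:
        h = (h * 33 + byte) & 0xFFFFFFFF    # multiply by 33, add the byte, keep 32 bits
    return h

print(djbx33a(b""))     # 5381
print(djbx33a(b"a"))    # 177670
```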

Python's built-in hash table implementation, in the form of the dict type, as well as Perl's hash type (%) are used internally to implement namespaces and therefore need to pay more attention to security, i.e., collision attacks. Python sets also use hashes internally, for fast lookup (though they store only keys, not values).[50] CPython 3.6+ uses an insertion-ordered variant of the hash table, implemented by splitting out the value storage into an array and having the vanilla hash table only store a set of indices.[51]
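The split layout described for CPython 3.6+ can be sketched as follows: a sparse table holds only integer positions, and a dense list holds the entries in insertion order. The class CompactDict, the linear probing scheme, the two-thirds load limit, and the omission of deletion are simplifying assumptions made for this example; CPython's actual implementation differs in all of these details.

```python
class CompactDict:
    """Insertion-ordered mapping: sparse index table plus dense entry list."""
    FREE = -1

    def __init__(self, num_slots=8):
        self.indices = [self.FREE] * num_slots   # sparse: slot -> position in entries
        self.entries = []                        # dense: (hash, key, value) in insertion order

    def _find_slot(self, key, h):
        """Linear probing: return the slot holding key, or the first free slot."""
        n = len(self.indices)
        slot = h % n
        while True:
            pos = self.indices[slot]
            if pos == self.FREE or self.entries[pos][1] == key:
                return slot
            slot = (slot + 1) % n

    def _grow(self):
        # Only the small index table is rebuilt; the entry list is untouched.
        self.indices = [self.FREE] * (len(self.indices) * 2)
        for pos, (h, key, _) in enumerate(self.entries):
            self.indices[self._find_slot(key, h)] = pos

    def __setitem__(self, key, value):
        h = hash(key)
        slot = self._find_slot(key, h)
        pos = self.indices[slot]
        if pos != self.FREE:
            self.entries[pos] = (h, key, value)   # existing key keeps its position
            return
        # Keep the index table at most two-thirds full so probing always terminates.
        if (len(self.entries) + 1) * 3 > len(self.indices) * 2:
            self._grow()
            slot = self._find_slot(key, h)
        self.indices[slot] = len(self.entries)
        self.entries.append((h, key, value))

    def __getitem__(self, key):
        pos = self.indices[self._find_slot(key, hash(key))]
        if pos == self.FREE:
            raise KeyError(key)
        return self.entries[pos][2]

    def keys(self):
        return [key for _, key, _ in self.entries]

d = CompactDict()
for name in ["b", "a", "c"]:
    d[name] = name.upper()
d["a"] = "updated"
print(d.keys())   # ['b', 'a', 'c']
```

Iteration walks the dense list, which is why insertion order is preserved regardless of where the keys hash to in the index table.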

In the .NET Framework, support for hash tables is provided via the non-generic Hashtable and generic Dictionary classes, which store key–value pairs, and the generic HashSet class, which stores only values.

Since version 2.4, Ruby's hash table has used the open addressing model.[52][53]

In Rust's standard library, the generic HashMap and HashSet structs originally used linear probing with Robin Hood bucket stealing; since Rust 1.36 they are based on hashbrown, a port of Google's SwissTable design.

ANSI Smalltalk defines the classes Set / IdentitySet and Dictionary / IdentityDictionary. All Smalltalk implementations provide additional (not yet standardized) versions of WeakSet, WeakKeyDictionary and WeakValueDictionary.

Tcl array variables are hash tables, and Tcl dictionaries are immutable values based on hashes. The functionality is also available as C library functions Tcl_InitHashTable et al. (for generic hash tables) and Tcl_NewDictObj et al. (for dictionary values). The performance has been independently benchmarked as extremely competitive.[54]

The Wolfram Language has supported hash tables since version 10. They are implemented under the name Association.

Common Lisp provides the hash-table class for efficient mappings. Despite the name, the language standard does not require implementations to use any particular hashing technique.[55]

History

The idea of hashing arose independently in different places. In January 1953, Hans Peter Luhn wrote an internal IBM memorandum that used hashing with chaining.[56] Gene Amdahl, Elaine M. McGraw, Nathaniel Rochester, and Arthur Samuel implemented a program using hashing at about the same time. Open addressing with linear probing (relatively prime stepping) is credited to Amdahl, but Ershov (in Russia) had the same idea.[56]

See also

Related data structures

There are several data structures that use hash functions but cannot be considered special cases of hash tables:

References

  1. ^ Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009). Introduction to Algorithms (3rd ed.). Massachusetts Institute of Technology. pp. 253–280. ISBN 978-0-262-03384-8.
  2. ^ Charles E. Leiserson, Amortized Algorithms, Table Doubling, Potential Method Archived August 7, 2009, at the Wayback Machine Lecture 13, course MIT 6.046J/18.410J Introduction to Algorithms—Fall 2005
  3. ^ a b Knuth, Donald (1998). The Art of Computer Programming. 3: Sorting and Searching (2nd ed.). Addison-Wesley. pp. 513–558. ISBN 978-0-201-89685-5.
  4. ^ Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). "Chapter 11: Hash Tables". Introduction to Algorithms (2nd ed.). MIT Press and McGraw-Hill. pp. 221–252. ISBN 978-0-262-53196-2.
  5. ^ a b Owolabi, Olumide (February 1, 2003). "Empirical studies of some hashing functions". Information and Software Technology. Department of Mathematics and Computer Science, University of Port Harcourt. 45 (2). doi:10.1016/S0950-5849(02)00174-X – via ScienceDirect.
  6. ^ Pearson, Karl (1900). "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling". Philosophical Magazine. Series 5. 50 (302): 157–175. doi:10.1080/14786440009463897.
  7. ^ Plackett, Robin (1983). "Karl Pearson and the Chi-Squared Test". International Statistical Review. 51 (1): 59–72. doi:10.2307/1402731. JSTOR 1402731.
  8. ^ a b Wang, Thomas (March 1997). "Prime Double Hash Table". Archived from the original on September 3, 1999. Retrieved May 10, 2015.
  9. ^ Lu, Yi; Prabhakar, Balaji; Bonomi, Flavio (2006), "Perfect Hashing for Network Applications", 2006 IEEE International Symposium on Information Theory: 2774–2778, doi:10.1109/ISIT.2006.261567
  10. ^ Belazzougui, Djamal; Botelho, Fabiano C.; Dietzfelbinger, Martin (2009), "Hash, displace, and compress" (PDF), Algorithms—ESA 2009: 17th Annual European Symposium, Copenhagen, Denmark, September 7-9, 2009, Proceedings (PDF), Lecture Notes in Computer Science, 5757, Berlin: Springer, pp. 682–693, CiteSeerX 10.1.1.568.130, doi:10.1007/978-3-642-04128-0_61, MR 2557794.
  11. ^ a b Mayers, Andrew (2008). "CS 312: Hash tables and amortized analysis". Cornell University, Department of Computer Science. Archived from the original on April 26, 2021. Retrieved October 26, 2021 – via cs.cornell.edu.
  12. ^ Maurer, W.D.; Lewis, T.G. (March 1, 1975). "Hash Table Methods". ACM Computing Surveys. Journal of the ACM. 1 (1): 14. doi:10.1145/356643.356645.
  13. ^ a b c Donald E. Knuth (April 24, 1998). The Art of Computer Programming: Volume 3: Sorting and Searching. Addison-Wesley Professional. ISBN 978-0-201-89685-5.
  14. ^ a b c d e Sedgewick, Robert; Wayne, Kevin (2011). Algorithms. 1 (4 ed.). Addison-Wesley Professional – via Princeton University, Department of Computer Science.
  15. ^ Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). "Chapter 11: Hash Tables". Introduction to Algorithms (2nd ed.). Massachusetts Institute of Technology. ISBN 978-0-262-53196-2.
  16. ^ a b c Askitis, Nikolas; Zobel, Justin (2005). "Cache-Conscious Collision Resolution in String Hash Tables". International Symposium on String Processing and Information Retrieval. Springer Science+Business Media. doi:10.1007/11575832_1. ISBN 978-3-540-29740-6.
  17. ^ Askitis, Nikolas; Sinha, Ranjan (2010). "Engineering scalable, cache and space efficient tries for strings". The VLDB Journal. 17 (5): 634. doi:10.1007/s00778-010-0183-9. ISSN 1066-8888. S2CID 432572.
  18. ^ Askitis, Nikolas; Zobel, Justin (October 2005). Cache-conscious Collision Resolution in String Hash Tables. Proceedings of the 12th International Conference, String Processing and Information Retrieval (SPIRE 2005). 3772/2005. pp. 91–102. doi:10.1007/11575832_11. ISBN 978-3-540-29740-6.
  19. ^ Askitis, Nikolas (2009). Fast and Compact Hash Tables for Integer Keys (PDF). Proceedings of the 32nd Australasian Computer Science Conference (ACSC 2009). 91. pp. 113–122. ISBN 978-1-920682-72-9. Archived from the original (PDF) on February 16, 2011. Retrieved June 13, 2010.
  20. ^ Erik Demaine, Jeff Lind. 6.897: Advanced Data Structures. MIT Computer Science and Artificial Intelligence Laboratory. Spring 2003. "Archived copy" (PDF). Archived (PDF) from the original on June 15, 2010. Retrieved June 30, 2008.
  21. ^ Willard, Dan E. (2000). "Examining computational geometry, van Emde Boas trees, and hashing from the perspective of the fusion tree". SIAM Journal on Computing. 29 (3): 1030–1049. doi:10.1137/S0097539797322425. MR 1740562.
  22. ^ a b Tenenbaum, Aaron M.; Langsam, Yedidyah; Augenstein, Moshe J. (1990). Data Structures Using C. Prentice Hall. pp. 456–461, p. 472. ISBN 978-0-13-199746-2.
  23. ^ Pagh, Rasmus; Rodler, Flemming Friche (2001). "Cuckoo Hashing". Algorithms — ESA 2001. Lecture Notes in Computer Science. 2161. pp. 121–133. CiteSeerX 10.1.1.25.4189. doi:10.1007/3-540-44676-1_10. ISBN 978-3-540-42493-2.
  24. ^ Herlihy, Maurice; Shavit, Nir; Tzafrir, Moran (2008). "Hopscotch Hashing". DISC '08: Proceedings of the 22nd international symposium on Distributed Computing. Berlin, Heidelberg: Springer-Verlag. pp. 350–364. CiteSeerX 10.1.1.296.8742.
  25. ^ Celis, Pedro (1986). Robin Hood hashing (PDF) (Technical report). Computer Science Department, University of Waterloo. CS-86-14. Archived (PDF) from the original on July 17, 2014.
  26. ^ Goossaert, Emmanuel (2013). "Robin Hood hashing". Archived from the original on March 21, 2014.
  27. ^ Amble, Ole; Knuth, Don (1974). "Ordered hash tables". Computer Journal. 17 (2): 135. doi:10.1093/comjnl/17.2.135.
  28. ^ Viola, Alfredo (October 2005). "Exact distribution of individual displacements in linear probing hashing". Transactions on Algorithms (TALG). 1 (2): 214–242. doi:10.1145/1103963.1103965. S2CID 11183559.
  29. ^ Celis, Pedro (March 1988). External Robin Hood Hashing (Technical report). Computer Science Department, Indiana University. TR246.
  30. ^ Mitzenmacher, Michael; Richa, Andréa W.; Sitaraman, Ramesh (2001). "The Power of Two Random Choices: A Survey of Techniques and Results" (PDF). Harvard University. Archived (PDF) from the original on March 25, 2015. Retrieved April 10, 2015.
  31. ^ a b c Javadoc for HashMap in Java 10 https://docs.oracle.com/javase/10/docs/api/java/util/HashMap.html
  32. ^ Litwin, Witold (1980). "Linear hashing: A new tool for file and table addressing". Proc. 6th Conference on Very Large Databases. pp. 212–223.
  33. ^ Doug Dunham. CS 4521 Lecture Notes Archived July 22, 2009, at the Wayback Machine. University of Minnesota Duluth. Theorems 11.2, 11.6. Last modified April 21, 2009.
  34. ^ Andy Ke. Inside the latency of hash table operations Last modified December 30, 2019.
  35. ^ Andy Ke. The K hash table, a design for low-latency applications Archived February 21, 2021, at the Wayback Machine Last modified December 20, 2019.
  36. ^ a b Knuth, Donald (1973). The Art of Computer Programming. 3: Sorting and Searching. Addison-Wesley. Section 6.4, exercise 13.
  37. ^ Bender, Michael A.; Farach-Colton, Martin; Johnson, Rob; Kuszmaul, Bradley C.; Medjedovic, Dzejla; Montes, Pablo; Shetty, Pradeep; Spillane, Richard P.; Zadok, Erez (June 2011). "Don't thrash: how to cache your hash on flash" (PDF). Proceedings of the 3rd USENIX conference on Hot topics in storage and file systems (HotStorage'11). Retrieved July 21, 2012.
  38. ^ Cleary, John G. (1984). "Compact Hash Tables Using Bidirectional Linear Probing". IEEE Transactions on Computers (9): 828–834. doi:10.1109/TC.1984.1676499. S2CID 195908955. Archived from the original on October 26, 2020. Retrieved February 21, 2021.
  39. ^ "LinkedHashMap (Java Platform SE 7 )". docs.oracle.com. Archived from the original on September 18, 2020. Retrieved May 1, 2020.
  40. ^ Alexander Klink and Julian Wälde's Efficient Denial of Service Attacks on Web Application Platforms Archived September 16, 2016, at the Wayback Machine, December 28, 2011, 28th Chaos Communication Congress. Berlin, Germany.
  41. ^ Mike Lennon "Hash Table Vulnerability Enables Wide-Scale DDoS Attacks" Archived September 19, 2016, at the Wayback Machine. 2011.
  42. ^ "Hardening Perl's Hash Function". November 6, 2013. Archived from the original on September 16, 2016.
  43. ^ Crosby and Wallach. Denial of Service via Algorithmic Complexity Attacks Archived March 4, 2016, at the Wayback Machine. quote: "modern universal hashing techniques can yield performance comparable to commonplace hash functions while being provably secure against these attacks." "Universal hash functions ... are ... a solution suitable for adversarial environments. ... in production systems."
  44. ^ Aumasson, Jean-Philippe; Bernstein, Daniel J.; Boßlet, Martin (November 8, 2012). Hash-flooding DoS reloaded: attacks and defenses (PDF). Application Security Forum – Western Switzerland 2012.
  45. ^ Bar-Yosef, Noa; Wool, Avishai (2007). Remote algorithmic complexity attacks against randomized hash tables Proc. International Conference on Security and Cryptography (SECRYPT) (PDF). p. 124. Archived (PDF) from the original on September 16, 2014.
  46. ^ "inet: switch IP ID generator to siphash · torvalds/linux@df45370". GitHub.
  47. ^ "Set (Java Platform SE 7 )". docs.oracle.com. Archived from the original on November 12, 2020. Retrieved May 1, 2020.
  48. ^ "Transposition Table - Chessprogramming wiki". chessprogramming.org. Archived from the original on February 14, 2021. Retrieved May 1, 2020.
  49. ^ "Lesson: Implementations (The Java™ Tutorials > Collections)". docs.oracle.com. Archived from the original on January 18, 2017. Retrieved April 27, 2018.
  50. ^ "Python: List vs Dict for look up table". stackoverflow.com. Archived from the original on December 2, 2017. Retrieved April 27, 2018.
  51. ^ Dimitris Fasarakis Hilliard. "Are dictionaries ordered in Python 3.6+?". Stack Overflow.
  52. ^ Dmitriy Vasin (June 19, 2018). "Do You Know How Hash Table Works? (Ruby Examples)". anadea.info. Retrieved July 3, 2019.
  53. ^ Jonan Scheffler (December 25, 2016). "Ruby 2.4 Released: Faster Hashes, Unified Integers and Better Rounding". heroku.com. Archived from the original on July 3, 2019. Retrieved July 3, 2019.
  54. ^ Wing, Eric. "Hash Table Shootout 2: Rise of the Interpreter Machines". LuaHashMap: An easy to use hash table library for C. PlayControl Software. Archived from the original on October 14, 2013. Retrieved October 24, 2019. Did Tcl win? In any case, these benchmarks showed that these interpreter implementations have very good hash implementations and are competitive with our reference benchmark of the STL unordered_map. Particularly in the case of Tcl and Lua, they were extremely competitive and often were within 5%-10% of unordered_map when they weren't beating it. (On 2019-10-24, the original site still has the text, but the figures appear to be broken, whereas they are intact in the archive.)
  55. ^ "CLHS:System Class HASH-TABLE". lispworks.com/documentation/HyperSpec/Front/index.htm. Archived from the original on October 22, 2019. Retrieved May 18, 2020.
  56. ^ a b Mehta, Dinesh P.; Sahni, Sartaj (October 28, 2004). Handbook of Datastructures and Applications. p. 9-15. ISBN 978-1-58488-435-4.

Further reading

External links