Double hashing is a computer programming technique used in hash tables to resolve hash collisions, cases when two different values to be searched for produce the same hash key. It is a popular collision-resolution technique in open-addressed hash tables. Double hashing is implemented in many popular libraries.
Like linear probing, it uses one hash value as a starting point and then repeatedly steps forward an interval until the desired value is located, an empty location is reached, or the entire table has been searched; but this interval is decided using a second, independent hash function (hence the name double hashing). Unlike linear probing and quadratic probing, the interval depends on the data, so that even values mapping to the same location have different bucket sequences; this minimizes repeated collisions and the effects of clustering.
Given two randomly, uniformly, and independently selected hash functions and , the ith location in the bucket sequence for value k in a hash table is: Generally, and are selected from a set of universal hash functions.
Classical applied data structure
Double hashing with open addressing is a classical data structure on a table . Let be the number of elements stored in , then 's load factor is .
Double hashing approximates uniform open address hashing. That is, start by randomly, uniformly and independently selecting two universal hash functions and to build a double hashing table .
All elements are put in by double hashing using and . Given a key , determining the -st hash location is computed by:
Let have fixed load factor . Bradford and Katehakis showed the expected number of probes for an unsuccessful search in , still using these initially chosen hash functions, is regardless of the distribution of the inputs. More precisely, these two uniformly, randomly and independently chosen hash functions are chosen from a set of universal hash functions where pair-wise independence suffices.
Previous results include: Guibas and Szemerédi showed holds for unsuccessful search for load factors . Also, Lueker and Molodowitch showed this held assuming ideal randomized functions. Schmidt and Siegel showed this with -wise independent and uniform functions (for , and suitable constant ).
Implementation details for caching
Linear probing and, to a lesser extent, quadratic probing are able to take advantage of the data cache by accessing locations that are close together. Double hashing has, on average, larger intervals and is not able to achieve this advantage.
Like all other forms of open addressing, double hashing becomes linear as the hash table approaches maximum capacity. The only solution to this is to rehash to a larger size, as with all other open addressing schemes.
On top of that, it is possible for the secondary hash function to evaluate to zero. For example, if we choose k=5 with the following function:
The resulting sequence will always remain at the initial hash value. One possible solution is to change the secondary hash function to:
This ensures that the secondary hash function will always be non zero.
- P. G. Bradford and M. Katehakis: A Probabilistic Study on Combinatorial Expanders and Hashing, SIAM Journal on Computing 2007 (37:1), 83-111. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.2647
- L. Guibas and E. Szemerédi: The Analysis of Double Hashing, Journal of Computer and System Sciences, 1978, 16, 226-274.
- G. S. Lueker and M. Molodowitch: More Analysis of Double Hashing, Combinatorica, 1993, 13(1), 83-96.
- J. P. Schmidt and A. Siegel: Double Hashing is Computable and Randomizable with Universal Hash Functions, manuscript.