Talk:Cuckoo hashing
This is the talk page for discussing improvements to the Cuckoo hashing article. This is not a forum for general discussion of the article's subject.
Rehashing
Can someone please explain how rehashing with new hash functions works? Do I have to prepare a large number of different hash functions in advance? I would have quite some trouble figuring out a good and fast third one for my own data, let alone any further ones. And how many do I need? Collisions can apparently happen quite often for pathological data; does that mean I have to code hundreds or even thousands of different hash functions?
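One common answer, sketched below: you usually don't write many separate hash functions; instead you use a single seeded family and draw fresh random seeds whenever a rehash is needed. The multiply-and-xor family below, including its constants, is purely illustrative and not taken from the article or its references.

```cpp
#include <cstdint>
#include <random>

// One seeded hash "function": a different seed gives a (practically)
// different member of the family, so a rehash just means drawing new seeds.
struct SeededHash {
    std::uint64_t seed;
    std::uint32_t operator()(std::uint64_t key, std::uint32_t table_size) const {
        std::uint64_t h = key ^ seed;
        h *= 0x9E3779B97F4A7C15ULL;   // multiply by a large odd constant
        h ^= h >> 29;                 // mix high bits into the low bits
        return static_cast<std::uint32_t>(h % table_size);
    }
};

// On a rehash, simply pick new seeds and reinsert all keys.
inline SeededHash fresh_hash(std::mt19937_64& rng) {
    return SeededHash{rng()};
}
```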
Comment on weasel words
IMO the following fragment from the page is full of weasel words:
The relevance of cuckoo hashing to practice is becoming increasingly clear (although as of 2006 it is not yet widely known outside the research community.) Given the fact that cuckoo hashing is acknowledged to have a central place in the theory of hash functions, it was discovered surprisingly late -- in 2001.
Anybody have any sources for this "acknowledgement" or "increasingly clear relevance"? — Preceding unsigned comment added by Mandyhan (talk • contribs) 13:30, 6 October 2006 (UTC)
Weasel words
Hello,
I don't know who inserted the last comment. However, it seems that cuckoo hashing variants are currently the fastest schemes for cache-resident hash tables. See:
I have added the link to the article.
Best,
Rasmus Pagh — Preceding unsigned comment added by Pagh (talk • contribs) 11:00, 9 October 2006 (UTC)
Practicality of cuckoo hashing
From my own programming experience, cuckoo hashing makes every other in-memory hashing scheme as obsolete as a Wright brothers airplane. It's fast, easy to implement, and predictable - in my own projects, I use cuckoo hashing exclusively, and many thanks to Rasmus Pagh for his great idea.
The only practical issue I've had with cuckoo hashing is that it is particularly sensitive to the choice of hash function; anything with bad mixing behaviour greatly increases the frequency of rehashes, sometimes to the point of utter unusability. But with any good hash function, cuckoo hashing works perfectly. Functions that I found to be good are Bob Jenkins' hash, the "one-at-a-time" hash, Paul Hsieh's hash (all can be found at http://burtleburtle.net/bob/hash/doobs.html ), and Thomas Wang's hash (found at http://www.concentric.net/~Ttwang/tech/inthash.htm ).
Other functions designed along similar lines would probably work just as well, while the commonly used simple additive, multiplicative and rotating hash functions definitely would not.
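For illustration, a seeded variant of the "one-at-a-time" hash mentioned above might look like the sketch below; the seed parameter is my own addition so the same code can serve as both cuckoo hash functions, and it is not part of the published function.

```cpp
#include <cstddef>
#include <cstdint>

// Seeded variant of the "one-at-a-time" style hash: the seed lets the same
// code serve as both cuckoo hash functions (and as replacements on rehash).
std::uint32_t one_at_a_time(const unsigned char* key, std::size_t len,
                            std::uint32_t seed) {
    std::uint32_t h = seed;
    for (std::size_t i = 0; i < len; ++i) {
        h += key[i];
        h += h << 10;
        h ^= h >> 6;
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}
```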
wjaguar 14:23, 14 October 2006 (UTC)
Amen. Cuckoo Hashing is the best invention since sliced bread. I'm using ternary Cuckoo Hashing to map incoming UDP datagrams (or rather their sockaddr) to sessions. In my particular setup, it is roughly 3 times as fast as the std::map I had used before, and it doesn't have nasty hiccups the way linear probing does with collisions (which I tried too). Basically, Cuckoo Hashing turned "3000 sessions are beginning to get noticeable" into "I wish the ethernet cable had bandwidth for 7500".
As the hash function, I use Bernstein's well-known old hash with a higher-order prime, adding a number at the end to make it unique, and scrambling the bits again with two shift-xors. The table rarely ever needs a single rehash up to about 90-91% load. I've been able to stress it up to 99.5% load experimentally, which still worked just fine, much to my surprise (although rehashing, and thus insert time, grows exponentially: at >99% load, a single insert takes 5-20 seconds of real time...). Lookups are of course still O(1), which is just plain awesome. —Preceding unsigned comment added by 91.35.154.68 (talk) 12:37, 15 May 2008 (UTC)
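A rough sketch of the kind of function described above; the exact prime, salt handling and shift amounts are guesses for illustration, not the poster's actual values.

```cpp
#include <cstddef>
#include <cstdint>

// Bernstein-style hash with a chosen prime multiplier and an extra "salt"
// value appended at the end, followed by two shift-xor rounds to scramble
// the bits, roughly as described in the comment above.
std::uint32_t bernstein_scrambled(const unsigned char* key, std::size_t len,
                                  std::uint32_t prime, std::uint32_t salt) {
    std::uint32_t h = 5381;
    for (std::size_t i = 0; i < len; ++i)
        h = h * prime + key[i];      // classic Bernstein step with a larger prime
    h = h * prime + salt;            // extra value to distinguish the two tables
    h ^= h >> 15;                    // first shift-xor scramble
    h ^= h << 7;                     // second shift-xor scramble
    return h;
}
```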
Bipartite graph
I came up with a variation on this idea in 1991, for a software system that needed to store a large dictionary of two-character Mandarin words in a highly compressed hash table, at a time when our target market was mostly still running MS-DOS in 640K. However, in our case we didn't compute the cell assignment on the fly. We used a three-key approach to achieve a 90% fill ratio: we computed all the keys for all the elements in a batch operation, and then used a bipartite graph matching algorithm to find a legal placement, if any exists.
I was a bit choked at the time. We had no internet or electronic resources; I had about a dozen textbooks on fundamental algorithms, and I looked through them all without finding a suitable algorithm. So I implemented a heuristic that worked fairly well up to a point: cells with no possible occupant stay empty; cells with only one possible occupant are assigned next; and as cells are stricken, so are any occupants reduced to one remaining possible cell. This shrinks the hairball quite a bit. I then created some kind of local metric for the least contested assignments, choosing the least contested assignment at each step. It was slow (hours), but I tuned it to where it would solve instances up to 80-plus percent occupancy with no back-tracking. Then a coworker, who had mostly the same books, found the graph matching algorithm in one book I didn't have. We had to use a cheesy DOS extender to handle the extra memory structures, but once we got past that hurdle, we were placing close to 30,000 elements into 2^15 buckets in under 3 seconds on our fastest machine, a 33 MHz 486.
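For readers unfamiliar with the technique: the batch placement described above amounts to finding a matching in a bipartite graph of keys and candidate cells. Below is a minimal sketch of the augmenting-path approach, not the original code; all names are illustrative.

```cpp
#include <algorithm>
#include <vector>

// Static placement as bipartite matching: each key may go to a few candidate
// cells (its hash values); an augmenting-path search assigns keys to cells.
// "candidates[k]" holds the cells produced by the hash functions for key k.
struct Matcher {
    const std::vector<std::vector<int>>& candidates;  // key -> candidate cells
    std::vector<int> cell_owner;                       // cell -> key, or -1 if free
    std::vector<char> visited;                         // per-search cell marks

    Matcher(const std::vector<std::vector<int>>& c, int num_cells)
        : candidates(c), cell_owner(num_cells, -1), visited(num_cells, 0) {}

    bool try_place(int key) {
        for (int cell : candidates[key]) {
            if (visited[cell]) continue;
            visited[cell] = 1;
            // Take the cell if it is free, or if its current owner can move elsewhere.
            if (cell_owner[cell] == -1 || try_place(cell_owner[cell])) {
                cell_owner[cell] = key;
                return true;
            }
        }
        return false;
    }

    // Returns the number of keys successfully placed.
    int place_all(int num_keys) {
        int placed = 0;
        for (int k = 0; k < num_keys; ++k) {
            std::fill(visited.begin(), visited.end(), 0);
            if (try_place(k)) ++placed;
        }
        return placed;
    }
};
```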
That was far from the only hurdle. We also incorporated a hardware security dongle into the hashing algorithm, and many of the "words" were actually stems for longer expressions. Because our hash check value was truncated almost to the point of insufficiency, the space of phantom matches grew exponentially with phrase length, but we had determined that these phantoms didn't interfere with normal use of the system. I think if someone had tried to exhaustively enumerate the dictionary, they would have got something close to 4 million four-character Chinese expressions, of which 99.99% were phantoms. This made it difficult to replace the security dongle with a look-up table, and it wasn't possible to rekey the table to a new hash function without duplicating our work on the bipartite graph matching algorithm. We analyzed the phantom space against the legitimate data and had a small 16K data structure which suppressed any conflicts between valid data and phantom data likely to occur in practice.
Because we were hashing so much, we recoded the Rainbow dongle library to run at ten times its specified rate, but it seemed to work (on all the systems we tested). Also, my analysis showed that the dongle was not producing good randomness, so we had to massage the hardware contribution to our hash values as well. I never trusted Rainbow's claim that the hardware hash algorithm could not be reverse engineered, but I didn't have time to pursue it. On short seed lengths, it showed all the statistical potency of rot13; on medium seed lengths, it improved to RANDU quality.
Our software checked for the dongle on startup, but otherwise never checked for its presence; it just incorporated whatever values the parallel port returned as part of the table lookup hash. Obviously, our Chinese input method returned nothing but garbage without the dongle, but our dongle protection was reported as cracked because someone removed the startup test without subsequently checking for correct operation of the main input algorithm.
From http://www.woodmann.com/crackz/Dongles.htm
Others use the return codes from the dongle as inputs to hash functions or as values that control execution flow, implementations of this kind of secure method are few and far between.
I guess we were among the few. Obviously, it wasn't our purpose at the time to publish any of this, though looking back I wish I had. I don't have a published paper I can cite to add the connection to bipartite graph matching to this article, which is too bad, because it's quite essential to a full understanding of this technique.
I also had an adaptive variant of this in the works to manage our font cache, as the print fonts were quite large. The nice property of that problem was that you could use the three-way version to get a high fill rate and shuffle things around for a few steps, but if this became tedious, you could just throw an entry away; it would get fetched again if needed. I also thought the memory management of the font pool sucked, so I was considering a fairly radical departure from conventional cache-management wisdom: implementing the font pool as a packed circular buffer that functioned as a FIFO rather than an LRU. If a glyph was accessed from the back 10% of the FIFO (entries soon to be discarded), there was an option to copy that entry back to the front (a fast bypass on discard/reload), which somewhat resembled an amortized coalescence. I thought the combination of an adaptive cuckoo hash with a circular font cache would have been killer, but my time there ended before I was able to test this empirically. The problem case was where the font cache was too small a working set for the document being composited, with a high churn rate; perhaps an LRU would have performed better under that condition. I was always shocked these techniques did not appear to be better known. MaxEnt 17:42, 21 September 2007 (UTC)
Improper anchor text on external links
[Three] [blog] [posts] by Michael Mitzenmacher expounding cuckoo hashing
This method of listing multiple hyperlinks is acceptable in informal contexts (discussion forums, etc.) but does not meet the W3C's recommendations for meaningful link text. From a semantic perspective, the anchors suggest that the first article is related to "three", the second to "blog" and the third to "posts". 212.121.153.12 (talk) 15:29, 7 December 2007 (UTC)
Difference between Cuckoo and Double hashing
What's the difference between this and double-hashing? The article makes it sound like using two hash functions was the main idea of the algorithm. Firefight (talk) 03:01, 14 August 2008 (UTC)
- The big difference is that once Double Hashing places an item in a certain location, the item stays there, whereas Cuckoo Hashing will move items around when there is a collision. -- Derek Ross | Talk 20:38, 17 November 2008 (UTC)
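To make the contrast concrete: a cuckoo lookup only ever probes the two positions given by the two hash functions, because insertion moves existing items instead of probing further. A sketch, with assumed names and table layout:

```cpp
#include <optional>
#include <utility>
#include <vector>

// Sketch of a cuckoo lookup: exactly two probes, because insertion maintains
// the invariant that every key sits at h1(key) in table 1 or h2(key) in table 2.
template <typename Key, typename Value, typename Hash1, typename Hash2>
std::optional<Value> cuckoo_find(const std::vector<std::pair<Key, Value>>& t1,
                                 const std::vector<std::pair<Key, Value>>& t2,
                                 Hash1 h1, Hash2 h2, const Key& key) {
    const auto& a = t1[h1(key) % t1.size()];
    if (a.first == key) return a.second;
    const auto& b = t2[h2(key) % t2.size()];
    if (b.first == key) return b.second;
    return std::nullopt;  // double hashing, by contrast, may follow a long probe sequence
}
```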
Algorithm
The explicit algorithm can be added, like the one on page 4 of: http://www.it-c.dk/people/pagh/papers/cuckoo-undergrad.pdf —Preceding unsigned comment added by 79.21.6.176 (talk) 12:33, 12 May 2009 (UTC)
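For reference, the insertion procedure in that paper looks roughly like the sketch below; the structure follows the pseudocode, but the table interface, the loop limit and the rehash helper are placeholders, not a definitive implementation.

```cpp
#include <utility>

// Sketch of cuckoo insertion, loosely following the pseudocode in the linked
// paper: keep kicking the resident item to its alternate table; if the loop
// runs too long, rebuild everything with new hash functions.
template <typename Table, typename Key>
void cuckoo_insert(Table& t, Key key) {
    if (t.lookup(key)) return;                 // already present
    for (int i = 0; i < t.max_loop(); ++i) {
        std::swap(key, t.slot1(key));          // place in table 1, evicting any resident
        if (t.is_empty_marker(key)) return;    // evicted slot was empty: done
        std::swap(key, t.slot2(key));          // place the evictee in table 2
        if (t.is_empty_marker(key)) return;
    }
    t.rehash_with_new_functions();             // too many displacements: rehash
    cuckoo_insert(t, key);                     // retry with the fresh functions
}
```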
Undue weight?
I just read the new material based on the investigation by Askitis. In my opinion this should be considerably shortened, so that it gets weight similar to the other empirical investigations of (bucketed) cuckoo hashing. Perhaps the best solution would be to have a separate article that discusses "array hash tables" in more depth? Even at the present length, a proper discussion of the space efficiency of array hash tables is missing; their space usage seems to include considerable overhead in the memory management system (it is not accurate to simply sum up the sizes of the currently allocated arrays). It is not clear to me whether it is relevant at all to compare array hash tables to highly space-efficient hash tables. —Preceding unsigned comment added by Pagh (talk • contribs) 08:18, 15 September 2009 (UTC)
UPDATE: Since no response has been made, I take this as evidence that the comparison is not relevant. I thus remove the description of the Askitis investigation. Pagh (talk) 09:38, 26 October 2009 (UTC)