Talk:MinHash



Time analysis incorrect[edit]

The time analysis section was misleading and vague, saying only that the time required to generate the signatures was linear. That is true in the size of the set, but the cost also scales with the parameter k. I have updated it to be more accurate, but the article would be improved if somebody found an actual reference (and double-checked my work!) Leopd (talk) 20:50, 5 October 2012 (UTC)[reply]
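As an illustration of the point above (not taken from the article), here is a minimal signature-computation sketch, assuming random affine hash functions; the names `make_hash` and `minhash_signature` are hypothetical. It shows why the cost is proportional to both the set size n and the number of hash functions k, i.e. O(nk) rather than O(n):

```python
import random

def make_hash(prime=2_147_483_647):
    """Draw a random affine hash h(x) = (a*x + b) mod prime."""
    a = random.randrange(1, prime)
    b = random.randrange(prime)
    return lambda x: (a * x + b) % prime

def minhash_signature(s, hashes):
    """Signature of set s: the minimum of each hash over s.
    The nested loop runs len(hashes) * len(s) hash evaluations,
    so the time is O(n * k), linear in n AND in k."""
    return [min(h(x) for x in s) for h in hashes]

random.seed(0)
hs = [make_hash() for _ in range(100)]      # k = 100 hash functions
sig = minhash_signature({1, 5, 9, 13}, hs)  # n = 4 elements
print(len(sig))  # one minimum per hash function -> 100
```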


Variance of the random variable r[edit]

The last paragraph of the section Jaccard similarity and minimum hash values defines a random variable r. As defined, it appears that this random variable has a Bernoulli distribution with parameter p = J(A,B), the Jaccard index of the two sets.

The claim is made that the variance of this random variable is always either zero or one. But for a Bernoulli distribution, the variance is p(1-p), which is never zero or one (assuming that 0 < p < 1). I will update this section to reflect this clarification. --Mbw314 (talk) 17:29, 16 May 2017 (UTC)[reply]

You misread. The claim is that the variable itself is always either zero or one. —David Eppstein (talk) 23:28, 16 May 2017 (UTC)[reply]
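To spell out the distinction the reply makes, in the notation of the article: r is an indicator variable, so each realization of r is either 0 or 1, while its distribution is Bernoulli with parameter J(A,B) and hence nonzero variance:

```latex
r = \begin{cases}
  1 & \text{if } h_{\min}(A) = h_{\min}(B),\\
  0 & \text{otherwise,}
\end{cases}
\qquad
\Pr[r = 1] = J(A,B),
\qquad
\operatorname{Var}(r) = J(A,B)\bigl(1 - J(A,B)\bigr).
```

So both statements are consistent: the variable's values are 0 or 1; its variance is p(1-p).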

Where does the name MinHash come from?[edit]

It seems this algorithm is generally known by this name now, but Broder's original 1997 article and his 2000 followup don't use that name at all. Where did the name "MinHash" come from? --Polm23 (talk) 02:50, 3 July 2017 (UTC)[reply]

Corrections[edit]

The randomness in min hashing lies not in choosing the sets A and B at random, but in the randomly drawn hash functions. I hope I have remedied this error in an understandable way!

Best regards, Petarded (talk) 10:13, 24 October 2018 (UTC)[reply]
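A small simulation (my own illustration, not from the article) of the correction above: the sets A and B stay fixed, the hash functions are drawn at random, and the fraction of draws on which the two minima agree approximates J(A,B). Random affine hashes modulo a large prime are only approximately min-wise independent, so the agreement is close to, not exactly, the Jaccard index; `min_agree_fraction` is a hypothetical name.

```python
import random

def jaccard(a, b):
    """Exact Jaccard index |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def min_agree_fraction(a, b, trials=20000, prime=2_147_483_647):
    """Empirical Pr[min-hash of A == min-hash of B] over random hash draws.
    The sets are fixed; only the hash function is random."""
    agree = 0
    for _ in range(trials):
        p = random.randrange(1, prime)
        q = random.randrange(prime)
        h = lambda x: (p * x + q) % prime
        if min(h(x) for x in a) == min(h(x) for x in b):
            agree += 1
    return agree / trials

random.seed(1)
A, B = {1, 2, 3, 4}, {3, 4, 5, 6}
print(jaccard(A, B))              # exactly 1/3
est = min_agree_fraction(A, B)    # empirical estimate, close to 1/3
```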

Error bounds[edit]

The "Variant with many hash functions" section states an O(1/√k) bound on the expected error of the computed Jaccard index. "Expected error" doesn't seem to be a well-established term, and the linked citation doesn't derive such a bound. By failing to mention a measure of confidence, the page misleadingly implies that the result will always fall within ±1/√k. https://www.cs.princeton.edu/courses/archive/fall09/cos521/Handouts/probabilityandcomputing.pdf looks like a handy resource for using Chernoff bounds to relate the number of samples/hashes to error bounds and confidence levels, for sampling with replacement (i.e. the many-hash-functions variant, but not the single-hash-function variant).
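As an illustration of the kind of confidence statement being asked for (my own sketch, not from the cited handout): with k independent hash functions the estimate is a mean of k independent 0/1 variables, so Hoeffding's inequality gives Pr[|estimate − J| ≥ ε] ≤ 2·exp(−2kε²). Solving for k yields the number of hash functions needed for a given error ε at confidence 1 − δ; the function name `hashes_needed` is hypothetical.

```python
import math

def hashes_needed(eps, delta):
    """Smallest k with 2*exp(-2*k*eps^2) <= delta (Hoeffding bound
    for a mean of k independent 0/1 samples), i.e. the number of
    hash functions for error eps with confidence 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps * eps))

print(hashes_needed(0.05, 0.05))  # k = 738 for ±0.05 error at 95% confidence
print(hashes_needed(0.10, 0.05))  # k = 185 for ±0.10 error at 95% confidence
```

Note that the bound holds with confidence 1 − δ, not with certainty, which is exactly the qualification the current article text omits.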