Talk:MinHash



Time analysis incorrect[edit]

The time analysis section was misleading and vague, saying only that the time required to generate the signatures was linear. That is true in the size of the set, but the cost also scales with the parameter k. I have updated it to be more accurate, but the article would be improved if somebody found an actual reference (and double-checked my work!) Leopd (talk) 20:50, 5 October 2012 (UTC)[reply]
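As an illustration of the point above (not taken from the article), here is a minimal signature-computation sketch, assuming random affine hash functions; the names `make_hash` and `minhash_signature` are hypothetical. It shows why the cost is proportional to both the set size n and the number of hash functions k, i.e. O(nk) rather than O(n):

```python
import random

def make_hash(prime=2_147_483_647):
    """Draw a random affine hash h(x) = (a*x + b) mod prime."""
    a = random.randrange(1, prime)
    b = random.randrange(prime)
    return lambda x: (a * x + b) % prime

def minhash_signature(s, hashes):
    """Signature of set s: the minimum of each hash over s.
    The nested loop runs len(hashes) * len(s) hash evaluations,
    so the time is O(n * k), linear in n AND in k."""
    return [min(h(x) for x in s) for h in hashes]

random.seed(0)
hs = [make_hash() for _ in range(100)]      # k = 100 hash functions
sig = minhash_signature({1, 5, 9, 13}, hs)  # n = 4 elements
print(len(sig))  # one minimum per hash function -> 100
```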


Variance of the random variable r[edit]

The last paragraph of the section Jaccard similarity and minimum hash values defines a random variable r. As defined, it appears that this random variable has a Bernoulli distribution with parameter p = J(A,B), the Jaccard index of the two sets.

The claim is made that the variance of this random variable is always either zero or one. But for a Bernoulli distribution, the variance is p(1-p), which is never zero or one (assuming that 0 < p < 1). I will update this section to reflect this clarification. --Mbw314 (talk) 17:29, 16 May 2017 (UTC)[reply]

You misread. The claim is that the variable itself is always either zero or one. —David Eppstein (talk) 23:28, 16 May 2017 (UTC)[reply]
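To spell out the distinction the reply makes, in the notation of the article: r is an indicator variable, so each realization of r is either 0 or 1, while its distribution is Bernoulli with parameter J(A,B) and hence nonzero variance:

```latex
r = \begin{cases}
  1 & \text{if } h_{\min}(A) = h_{\min}(B),\\
  0 & \text{otherwise,}
\end{cases}
\qquad
\Pr[r = 1] = J(A,B),
\qquad
\operatorname{Var}(r) = J(A,B)\bigl(1 - J(A,B)\bigr).
```

So both statements are consistent: the variable's values are 0 or 1; its variance is p(1-p).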

Where does the name MinHash come from?[edit]

It seems this algorithm is generally known by this name now, but Broder's original 1997 article and his 2000 followup don't use that name at all. Where did the name "MinHash" come from? --Polm23 (talk) 02:50, 3 July 2017 (UTC)[reply]

Corrections[edit]

The randomness in min hashing lies not in choosing the sets A and B at random, but in the randomly drawn hash functions. I hope I have remedied this error in an understandable way!

Best regards, Petarded (talk) 10:13, 24 October 2018 (UTC)[reply]
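A small simulation (my own illustration, not from the article) of the correction above: the sets A and B stay fixed, the hash functions are drawn at random, and the fraction of draws on which the two minima agree approximates J(A,B). Random affine hashes modulo a large prime are only approximately min-wise independent, so the agreement is close to, not exactly, the Jaccard index; `min_agree_fraction` is a hypothetical name.

```python
import random

def jaccard(a, b):
    """Exact Jaccard index |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def min_agree_fraction(a, b, trials=20000, prime=2_147_483_647):
    """Empirical Pr[min-hash of A == min-hash of B] over random hash draws.
    The sets are fixed; only the hash function is random."""
    agree = 0
    for _ in range(trials):
        p = random.randrange(1, prime)
        q = random.randrange(prime)
        h = lambda x: (p * x + q) % prime
        if min(h(x) for x in a) == min(h(x) for x in b):
            agree += 1
    return agree / trials

random.seed(1)
A, B = {1, 2, 3, 4}, {3, 4, 5, 6}
print(jaccard(A, B))              # exactly 1/3
est = min_agree_fraction(A, B)    # empirical estimate, close to 1/3
```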

Error bounds[edit]

The "Variant with many hash functions" section states an O(1/√k) bound on the expected error of the computed Jaccard index. "Expected error" doesn't seem to be a well-established term, and the linked citation doesn't derive such a bound. By failing to mention a measure of confidence, the page misleadingly implies that the result will always fall within ±1/√k. https://www.cs.princeton.edu/courses/archive/fall09/cos521/Handouts/probabilityandcomputing.pdf looks like a handy resource for using Chernoff bounds to relate the number of samples/hashes to error bounds and confidence levels, for sampling with replacement (i.e. the many-hash-functions variant, but not the single-hash-function variant).
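As an illustration of the kind of confidence statement being asked for (my own sketch, not from the cited handout): with k independent hash functions the estimate is a mean of k independent 0/1 variables, so Hoeffding's inequality gives Pr[|estimate − J| ≥ ε] ≤ 2·exp(−2kε²). Solving for k yields the number of hash functions needed for a given error ε at confidence 1 − δ; the function name `hashes_needed` is hypothetical.

```python
import math

def hashes_needed(eps, delta):
    """Smallest k with 2*exp(-2*k*eps^2) <= delta (Hoeffding bound
    for a mean of k independent 0/1 samples), i.e. the number of
    hash functions for error eps with confidence 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps * eps))

print(hashes_needed(0.05, 0.05))  # k = 738 for ±0.05 error at 95% confidence
print(hashes_needed(0.10, 0.05))  # k = 185 for ±0.10 error at 95% confidence
```

Note that the bound holds with confidence 1 − δ, not with certainty, which is exactly the qualification the current article text omits.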