Talk:MinHash

From Wikipedia, the free encyclopedia
Jump to: navigation, search
WikiProject Statistics (Rated C-class, Low-importance)
WikiProject icon

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

C-Class article C  This article has been rated as C-Class on the quality scale.
 Low  This article has been rated as Low-importance on the importance scale.
 
WikiProject Computer science (Rated C-class, Low-importance)
WikiProject icon This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
C-Class article C  This article has been rated as C-Class on the project's quality scale.
 Low  This article has been rated as Low-importance on the project's importance scale.
 


Time analysis incorrect[edit]

The time analysis section was misleading and vague, simply saying that the time required to generate the signatures was linear time. This is true for the size of the set, but not for the parameter k. I have updated it to be more accurate, but the article would be improved by somebody finding an actual reference (and double-checking my work!) Leopd (talk) 20:50, 5 October 2012 (UTC)


Variance of the random variable r[edit]

The last paragraph of the section Jaccard similarity and minimum hash values defines a random variable r. As defined, it appears that this random variable has a Bernoulli distribution with parameter

The claim is made that the variance of this random variable is always either zero or one. But for a Bernoulli distribution, the variance is p(1-p), which is never zero or one (assuming that 0 < p < 1). I will update this section to reflect this clarification. --Mbw314 (talk) 17:29, 16 May 2017 (UTC)

You misread. The claim is that the variable itself is always either zero or one. —David Eppstein (talk) 23:28, 16 May 2017 (UTC)

Where does the name MinHash come from?[edit]

It seems this algorithm is generally known by this name now, but Broder's original 1997 article and his 2000 followup don't use that name at all. Where did the name "MinHash" come from? --Polm23 (talk) 02:50, 3 July 2017 (UTC)