Locality-sensitive hashing
Locality-sensitive hashing (LSH) is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items). This is different from the conventional hash functions, such as those used in cryptography as in this case the goal is to maximize probability of "collision" of similar items rather than avoid collisions. [1] Note how locality-sensitive hashing, in many ways, mirrors data clustering and Nearest neighbor search.
Contents |
Definition [edit]
An LSH family
is defined for a metric space
, a threshold
and an approximation factor
. An LSH family [2] [3]
is a family of functions
[clarification needed] satisfying the following conditions for any two points
, and a function
[clarification needed] chosen uniformly at random from
:
- if
, then
(i.e.,
and
collide) with probability at least
, - if
, then
with probability at most
.
A family is interesting when
. Such a family
is called
-sensitive.
Alternatively[4] it is defined with respect to a universe of items
that have a similarity function
. An LSH scheme is a family of hash functions
coupled with a probability distribution
over the functions such that a function
chosen according to
satisfies the property that
for any
.
Applications [edit]
LSH has been applied to several problem domains including[citation needed]
- Near-duplicate detection[5] [6]
- Hierarchical clustering[7]
- Genome-wide association study[8]
- Image similarity identification
- Gene expression similarity identification
- Audio similarity identification
- Nearest neighbor search
Methods [edit]
Bit sampling for Hamming distance [edit]
One of the easiest ways to construct an LSH family is by bit sampling.[3] This approach works for the Hamming distance over d-dimensional vectors
. Here, the family
of hash functions is simply the family of all the projections of points on one of the
coordinates, i.e.,
, where
is the
th coordinate of
. A random function
from
simply selects a random bit from the input point. This family has the following parameters:
,
.
Min-wise independent permutations [edit]
Suppose
is composed of subsets of some ground set of enumerable items
and the similarity function of interest is the Jaccard index
. If
is a permutation on the indices of
, for
let
. Each possible choice of
defines a single hash function
mapping input sets to integers.
Define the function family
to be the set of all such functions and let
be the uniform distribution. Given two sets
the event that
corresponds exactly to the event that the minimizer of
lies inside
. As
was chosen uniformly at random,
and
define an LSH scheme for the Jaccard index.
Because the symmetric group on n elements has size n!, choosing a truly random permutation from the full symmetric group is infeasible for even moderately sized n. Because of this fact, there has been significant work on finding a family of permutations that is "min-wise independent" - a permutation family for which each element of the domain has equal probability of being the minimum under a randomly chosen
. It has been established that a min-wise independent family of permutations is at least of size
.[9] and that this boundary is tight[10]
Because min-wise independent families are too big for practical applications, two variant notions of min-wise independence are introduced: restricted min-wise independent permutations families, and approximate min-wise independent families. Restricted min-wise independence is the min-wise independence property restricted to certain sets of cardinality at most k.[11] Approximate min-wise independence differs from the property by at most a fixed
.[12]
Random projection [edit]
The random projection method of LSH is designed to approximate the cosine distance between vectors. The basic idea of this technique is to choose a random hyperplane (defined by a normal unit vector
) at the outset and use the hyperplane to hash input vectors.
Given an input vector
and a hyperplane defined by
, we let
. That is,
depending on which side of the hyperplane
lies.
Each possible choice of
defines a single function. Let
be the set of all such functions and let
be the uniform distribution once again. It is not difficult to prove that, for two vectors
,
, where
is the angle between
and
.
is closely related to
.
In this instance hashing produces only a single bit. Two vectors' bits match with probability proportional to the cosine of the angle between them.
Stable distributions [edit]
The hash function [13]
maps a d dimensional vector
onto a set of integers. Each hash function in the family is indexed by a choice of random
and
where
is a d dimensional vector with entries chosen independently from a stable distribution and
is a real number chosen uniformly from the range [0,r]. For a fixed
the hash function
is given by
.
Other construction methods for hash functions have been proposed to better fit the data. [14] In particular k-means hash functions are better in practice than projection-based hash functions, but without any theoretical guarantee.
LSH algorithm for nearest neighbor search [edit]
One of the main application of LSH is to provide efficient nearest neighbor search algorithms. Consider any LSH family
. The algorithm has two main parameters: the width parameter
and the number of hash tables
.
In the first step, we define a new family
of hash functions
, where each function
is obtained by concatenating
functions
from
, i.e.,
. In other words, a random hash function
is obtained by concatenating
randomly chosen hash functions from
. The algorithm then constructs
hash tables, each corresponding to a different randomly chosen hash function
.
In the preprocessing step we hash all
points from the data set
into each of the
hash tables. Given that the resulting hash tables have only
non-zero entries, one can reduce the amount of memory used per each hash table to
using standard hash functions.
Given a query point
, the algorithm iterates over the
hash functions
. For each
considered, it retrieves the data points that are hashed into the same bucket as
. The process is stopped as soon as a point within distance
from
is found.
Given the parameters
and
, the algorithm has the following performance guarantees:
- preprocessing time:
, where
is the time to evaluate a function
on an input point
; - space:
, plus the space for storing data points; - query time:
; - the algorithm succeeds in finding a point within distance
from
(if it exists) with probability at least
;
For a fixed approximation ratio
and probabilities
and
, one can set
and
, where
. Then one obtains the following performance guarantees:
- preprocessing time:
; - space:
, plus the space for storing data points; - query time:
;
See also [edit]
- Curse of dimensionality
- Feature hashing
- Fourier-related transforms
- Multilinear subspace learning
- Principal component analysis
- Singular value decomposition
- Wavelet compression
- Rolling hash
- Bloom Filter
References [edit]
- ^ A. Rajaraman and J. Ullman (2010). "Mining of Massive Datasets, Ch. 3.".
- ^ Gionis, A.; Indyk, P., Motwani, R. (1999). , "Similarity Search in High Dimensions via Hashing". Proceedings of the 25th Very Large Database (VLDB) Conference.
- ^ a b Indyk, Piotr.; Motwani, Rajeev. (1998). , "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality.". Proceedings of 30th Symposium on Theory of Computing.
- ^ Charikar, Moses S.. (2002). "Similarity Estimation Techniques from Rounding Algorithms". Proceedings of the 34th Annual ACM Symposium on Theory of Computing 2002: (ACM 1–58113–495–9/02/0005)…. doi:10.1145/509907.509965. Retrieved 2007-12-21.
- ^ Gurmeet Singh, Manku; Das Sarma, Anish (2007), "Detecting near-duplicates for web crawling", Proceedings of the 16th international conference on World Wide Web. ACM,.
- ^ Das, Abhinandan S., et al. (2007), "Google news personalization: scalable online collaborative filtering", Proceedings of the 16th international conference on World Wide Web. ACM,.
- ^ Koga, Hisashi, Tetsuo Ishibashi, and Toshinori Watanabe (2007), "Fast agglomerative hierarchical clustering algorithm using Locality-Sensitive Hashing", Knowledge and Information Systems 12.1: 25-53,.
- ^ Brinza, Dumitru, et al., "RAPID detection of gene–gene interactions in genome-wide association studies", Bioinformatics 26.22 (2010): 2856-2862.
- ^ Broder, A.Z.; Charikar, M.; Frieze, A.M.; Mitzenmacher, M. (1998). "Min-wise independent permutations". Proceedings of the thirtieth annual ACM symposium on Theory of computing: 327–336. doi:10.1145/276698.276781. Retrieved 2007-11-14.
- ^ "An optimal construction of exactly min-wise independent permutations". Technical Report COMP98-62, IEICE, 1998.
- ^ Matoušek, J.; Stojakovic, M. (2002). "On Restricted Min-Wise Independence of Permutations". Preprint. Retrieved 2007-11-14.
- ^ Saks, M.; Srinivasan, A.; Zhou, S.; Zuckerman, D. (2000). "Low discrepancy sets yield approximate min-wise independent permutation families". Information Processing Letters 73 (1-2): 29–32. doi:10.1016/S0020-0190(99)00163-5. Retrieved 2007-11-14.
- ^ Datar, M.; Immorlica, N., Indyk, P., Mirrokni, V.S. (2004). "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions". Proceedings of the Symposium on Computational Geometry.
- ^ Pauleve, L.; Jegou, H., Amsaleg, L. (2010). "Locality sensitive hashing: A comparison of hash function types and querying mechanisms". Pattern recognition Letters.
Further reading [edit]
- Samet, H. (2006) Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann. ISBN 0-12-369446-9
External links [edit]
- Alex Andoni's LSH homepage
- LSHKIT: A C++ Locality Sensitive Hashing Library
- Caltech Large Scale Image Search Toolbox: a Matlab toolbox implementing several LSH hash functions, in addition to Kd-Trees, Hierarchical K-Means, and Inverted File search algorithms.
- Simhash at Google
- Slash: A C++ LSH library, implementing Spherical LSH by Terasawa, K., Tanaka, Y
, then
(i.e.,
and
, then
, where
is the time to evaluate a function
on an input point
, plus the space for storing data points;
;
;
;
, plus the space for storing data points;
;