Nearest neighbor search
Nearest neighbor search (NNS), also known as proximity search, similarity search or closest point search, is an optimization problem for finding closest points in metric spaces. The problem is: given a set S of points in a metric space M and a query point q ∈ M, find the closest point in S to q. In many cases, M is taken to be d-dimensional Euclidean space and distance is measured by Euclidean distance or Manhattan distance.
Donald Knuth in vol. 3 of The Art of Computer Programming (1973) called it the post-office problem, referring to an application of assigning a residence to the nearest post office.
Applications
The nearest neighbor search problem arises in numerous fields of application, including:
- Pattern recognition - in particular for optical character recognition
- Statistical classification - see k-nearest neighbor algorithm
- Computer vision
- Databases - e.g. content-based image retrieval
- Coding theory - see maximum likelihood decoding
- Data compression - see MPEG-2 standard
- Recommendation systems
- Internet marketing - see contextual advertising and behavioral targeting
- DNA sequencing
- Spell checking - suggesting correct spelling
- Plagiarism detection
- Contact searching algorithms in FEA
- Similarity scores for predicting career paths of professional athletes
- Cluster analysis - assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense, usually based on Euclidean distance
Methods
Various solutions to the NNS problem have been proposed. The quality and usefulness of the algorithms are determined by the time complexity of queries as well as the space complexity of any search data structures that must be maintained. The informal observation usually referred to as the curse of dimensionality states that there is no general-purpose exact solution for NNS in high-dimensional Euclidean space using polynomial preprocessing and polylogarithmic search time.
Linear search
The simplest solution to the NNS problem is to compute the distance from the query point to every other point in the database, keeping track of the "best so far". This algorithm, sometimes referred to as the naive approach, has a running time of O(Nd), where N is the cardinality of S and d is the dimensionality of M. There are no search data structures to maintain, so linear search has no space complexity beyond the storage of the database. Surprisingly, naive search outperforms space-partitioning approaches in higher-dimensional spaces [1].
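A minimal sketch of the naive approach, assuming points are given as equal-length tuples of numbers and Euclidean distance is used:

```python
from math import dist  # Python 3.8+: Euclidean distance between two points

def linear_search_nn(database, query):
    """Return the point in `database` closest to `query` (naive O(N*d) scan)."""
    best_point, best_distance = None, float("inf")
    for point in database:
        d = dist(point, query)           # one distance evaluation per stored point
        if d < best_distance:            # keep track of the "best so far"
            best_point, best_distance = point, d
    return best_point, best_distance

# Example: the nearest stored point to (2, 2) is (3, 1).
print(linear_search_nn([(0, 0), (3, 1), (5, 5)], (2, 2)))
```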
Space partitioning
Starting in the 1970s, branch and bound methodology was applied to the problem. In the case of Euclidean space this approach is known as spatial index or spatial access methods. Several space-partitioning methods have been developed for solving the NNS problem. Perhaps the simplest is the kd-tree, which iteratively bisects the search space into two regions containing half of the points of the parent region. Queries are performed via traversal of the tree from the root to a leaf by evaluating the query point at each split. Depending on the distance specified in the query, neighboring branches that might contain hits may also need to be evaluated. For constant dimension, the average query complexity is O(log N) [2] in the case of randomly distributed points; worst-case complexity analyses have also been performed.[3] Alternatively, the R-tree data structure was designed to support nearest neighbor search in a dynamic context, as it has efficient algorithms for insertions and deletions.
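A compact sketch of this approach, assuming low-dimensional points given as tuples and Euclidean distance (the dictionary-based nodes and function names here are illustrative, not from any particular library):

```python
from math import dist

def build_kdtree(points, depth=0):
    """Recursively bisect the point set at the median of one coordinate axis."""
    if not points:
        return None
    axis = depth % len(points[0])                     # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, query, best=None):
    """Descend to a leaf, then backtrack; a sibling branch is visited only if the
    splitting plane is closer than the best distance found so far."""
    if node is None:
        return best
    if best is None or dist(query, node["point"]) < dist(query, best):
        best = node["point"]
    axis = node["axis"]
    if query[axis] < node["point"][axis]:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, query, best)
    if abs(query[axis] - node["point"][axis]) < dist(query, best):
        best = nearest(far, query, best)              # the other branch may still contain hits
    return best

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (9, 2)))                          # -> (8, 1)
```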
In the case of a general metric space, the branch-and-bound approach is known under the name of metric trees. Particular examples include the VP-tree and the Bk-tree.
Locality sensitive hashing
Locality sensitive hashing (LSH) is a technique for grouping points in space into 'buckets' based on some distance metric operating on the points. Points that are close to each other under the chosen metric are mapped to the same bucket with high probability.
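One simple LSH family, random-hyperplane hashing for angular (cosine) similarity, illustrates the idea: each point receives a bit string recording which side of k random hyperplanes it lies on, and points separated by a small angle tend to receive identical strings. A minimal sketch, with illustrative parameter values:

```python
import random

def make_hyperplane_hash(dim, num_bits, seed=0):
    """Return a hash function defined by `num_bits` random hyperplanes through the origin."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]
    def h(point):
        # One bit per hyperplane: which side of the plane does the point lie on?
        return tuple(int(sum(w * x for w, x in zip(plane, point)) >= 0) for plane in planes)
    return h

h = make_hyperplane_hash(dim=2, num_bits=4)
buckets = {}
for p in [(1.0, 1.1), (1.05, 0.95), (-3.0, -2.8), (0.9, 1.2)]:
    buckets.setdefault(h(p), []).append(p)   # nearby points often land in the same bucket
print(buckets)
```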
Nearest neighbor search in spaces with small intrinsic dimension
The cover tree has a theoretical bound that is based on the dataset's doubling constant. The bound on search time is O(c^12 log n) where c is the expansion constant of the dataset.
Vector Approximation Files
In high-dimensional spaces, tree-based indexing structures become ineffective because an increasing percentage of the nodes need to be examined anyway. To speed up linear search, a compressed version of the feature vectors stored in RAM is used to prefilter the dataset in a first pass. The final candidates are determined in a second pass, using the uncompressed data from disk for the distance calculation.[4]
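A simplified sketch of the two-stage idea (it ranks quantized approximations directly, rather than using the lower and upper distance bounds of the actual VA-file method): coarse per-dimension codes kept in memory select a small candidate set, and exact distances are then computed only for those candidates:

```python
from math import dist

def quantize(point, bits=2, lo=0.0, hi=10.0):
    """Map each coordinate into one of 2**bits cells of the range [lo, hi)."""
    levels = 2 ** bits
    return tuple(min(levels - 1, int((x - lo) / (hi - lo) * levels)) for x in point)

def va_style_search(database, query, keep=3):
    approx = [quantize(p) for p in database]          # compact in-memory approximations
    q_approx = quantize(query)
    # First pass: rank by distance between approximations (cheap).
    order = sorted(range(len(database)), key=lambda i: dist(approx[i], q_approx))
    candidates = order[:keep]
    # Second pass: exact distances on the surviving candidates only.
    return min((database[i] for i in candidates), key=lambda p: dist(p, query))

data = [(1.0, 2.0), (9.0, 9.0), (2.5, 1.5), (8.0, 1.0), (3.0, 3.0)]
print(va_style_search(data, (2.0, 2.0)))              # -> (2.5, 1.5)
```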
Variants
There are numerous variants of the NNS problem and the two most well-known are the k-nearest neighbor search and the ε-approximate nearest neighbor search.
K-nearest neighbor
k-nearest neighbor search identifies the top k nearest neighbors to the query. This technique is commonly used in predictive analytics to estimate or classify a point based on the consensus of its neighbors. k-nearest neighbor graphs are graphs in which every point is connected to its k nearest neighbors.
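A small sketch of k-nearest-neighbor search by linear scan, keeping only the k best candidates with a heap (an index structure can replace the scan for large datasets):

```python
import heapq
from math import dist

def k_nearest(database, query, k):
    """Return the k points closest to `query`, nearest first."""
    # heapq.nsmallest keeps only k candidates internally while scanning the data.
    return heapq.nsmallest(k, database, key=lambda p: dist(p, query))

data = [(0, 0), (1, 1), (2, 2), (5, 5), (6, 1)]
print(k_nearest(data, (1.2, 0.8), k=3))   # -> [(1, 1), (0, 0), (2, 2)]
```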
Approximate nearest neighbor
In some applications it may be acceptable to retrieve a "good guess" of the nearest neighbor. In those cases, we can use an algorithm which doesn't guarantee to return the actual nearest neighbor in every case, in return for improved speed or memory savings. Often such an algorithm will find the nearest neighbor in a majority of cases, but this depends strongly on the dataset being queried.
Algorithms which support the approximate nearest neighbor search include Best Bin First and Balanced Box-Decomposition Tree based search.[5]
ε-approximate nearest neighbor search is becoming an increasingly popular tool for fighting the curse of dimensionality. Here any point whose distance from the query is at most (1 + ε) times the distance to the true nearest neighbor is accepted as an answer.
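A minimal check of that (1 + ε) criterion, with hypothetical point values chosen purely for illustration:

```python
from math import dist

def satisfies_epsilon_guarantee(query, candidate, true_nearest, eps):
    """The candidate is acceptable if its distance is at most (1 + eps) times
    the distance from the query to the true nearest point."""
    return dist(query, candidate) <= (1 + eps) * dist(query, true_nearest)

# With eps = 0.1, a candidate at distance 1.05 is acceptable when the true nearest is at 1.0.
print(satisfies_epsilon_guarantee((0, 0), (1.05, 0), (1.0, 0), eps=0.1))  # True
```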
Nearest neighbor distance ratio
The nearest neighbor distance ratio does not apply a threshold to the direct distance from the query point to the candidate neighbor, but to the ratio of that distance to the distance of the second-closest neighbor. It is used in CBIR to retrieve pictures through a "query by example", using the similarity between local features. More generally, it is involved in several matching problems.
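A minimal sketch of this ratio test as used in local-feature matching: a match is accepted only if the nearest neighbor is sufficiently closer than the second-nearest one (the 0.8 threshold is a common illustrative choice, not one prescribed here):

```python
from math import dist

def passes_ratio_test(query, neighbors, ratio=0.8):
    """Accept the closest neighbor only if it beats the runner-up by the given ratio."""
    ordered = sorted(neighbors, key=lambda p: dist(p, query))
    first, second = ordered[0], ordered[1]
    return dist(query, first) < ratio * dist(query, second), first

# An ambiguous match (two almost equally close neighbors) is rejected.
print(passes_ratio_test((0, 0), [(1, 0), (0, 1.01), (5, 5)]))  # (False, (1, 0))
print(passes_ratio_test((0, 0), [(1, 0), (3, 0), (5, 5)]))     # (True, (1, 0))
```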
All nearest neighbors
For some applications (e.g. entropy estimation), we may have N data-points and wish to know which is the nearest neighbor for every one of those N points. This could of course be achieved by running a nearest-neighbor search once for every point, but an improved strategy would be an algorithm that exploits the information redundancy between these N queries to produce a more efficient search. As a simple example: when we find the distance from point X to point Y, that also tells us the distance from point Y to point X, so the same calculation can be reused in two different queries.
Given a fixed dimension, a positive semi-definite norm (thereby including every L^p norm), and n points in this space, the m nearest neighbours of every point can be found in O(mn log n) time.[6]
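A brute-force illustration of the distance-reuse idea described above (not the O(mn log n) algorithm of the cited reference): each pairwise distance is computed once and updates the answer for both endpoints:

```python
from math import dist

def all_nearest_neighbors(points):
    """Naive all-nearest-neighbors: each pairwise distance serves both points."""
    n = len(points)
    best = [(float("inf"), None)] * n
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(points[i], points[j])        # computed once ...
            if d < best[i][0]:
                best[i] = (d, j)                  # ... reused for point i
            if d < best[j][0]:
                best[j] = (d, i)                  # ... and for point j
    return [j for _, j in best]

pts = [(0, 0), (1, 0), (10, 10), (10, 9)]
print(all_nearest_neighbors(pts))                 # -> [1, 0, 3, 2]
```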
See also
- K-nearest neighbor algorithm
- Nearest-neighbor interpolation
- Content-based image retrieval
- Locality sensitive hashing
- Voronoi diagram
- Dimension reduction
- Curse of dimensionality
- Time series
- Cluster analysis
- Linear least squares
- Principal Component Analysis
- Singular value decomposition
- Fourier Analysis
- Wavelet
- Digital signal processing
- Multidimensional analysis
Notes
- ^ Weber, Schek, Blott. "A quantitative analysis and performance study for similarity search methods in high dimensional spaces" (PDF).
- ^ Andrew Moore. "An introductory tutorial on KD trees" (PDF).
- ^ Lee, D. T.; Wong, C. K. (1977). "Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees". Acta Informatica. 9 (1): 23–29. doi:10.1007/BF00263763.
- ^ Weber, Blott. "An Approximation-Based Data Structure for Similarity Search".
- ^ Arya, S.; Mount, D. M.; Netanyahu, N. S.; Silverman, R.; Wu, A. (1998). "An optimal algorithm for approximate nearest neighbor searching". Journal of the ACM. 45 (6): 891–923.
- ^ Vaidya, P. M. (1989). "An O(n log n) Algorithm for the All-Nearest-Neighbors Problem". Discrete and Computational Geometry. 4 (1): 101–115.
References
- Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. 1999. When is nearest neighbor meaningful? In Proceedings of the 7th ICDT, Jerusalem, Israel.
- Arya, S., D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions. Journal of the ACM, vol. 45, no. 6, pp. 891–923
- Zezula, P., Amato, G., Dohnal, V., and Batko, M. Similarity Search - The Metric Space Approach. Springer, 2006. ISBN 0-387-29146-6
- Chung-Min Chen and Yibei Ling - A Sampling-Based Estimator for Top-k Query. ICDE 2002: 617-627
- Samet, H. 2006. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufman. ISBN 0123694469
External links
- Nearest Neighbors and Similarity Search - a website dedicated to educational materials, software, literature, researchers, open problems and events related to NN searching. Maintained by Yury Lifshits.
- Similarity Search Wiki a collection of links, people, ideas, keywords, papers, slides, code and data sets on nearest neighbours.
- Metric Spaces Library - An open-source C-based library for metric space indexing by Karina Figueroa, Gonzalo Navarro, Edgar Chávez.
- ANN - A Library for Approximate Nearest Neighbor Searching by David M. Mount and Sunil Arya.
- STANN - A Simple Threaded Approximate Nearest Neighbor Search Library in C++ by Michael Connor and Piyush Kumar.
- MESSIF - Metric Similarity Search Implementation Framework by Michal Batko and David Novak.
- OBSearch - Similarity Search engine for Java (GPL). Implementation by Arnoldo Muller, developed during Google Summer of Code 2007.
- KNNLSB - K Nearest Neighbors Linear Scan Baseline (distributed, LGPL). Implementation by Georges Quénot (LIG-CNRS).
Further reading
- Shasha, Dennis (2004). High Performance Discovery in Time Series. Berlin: Springer. ISBN 0387008578.