# k-d tree

| KD-tree | |
| --- | --- |
| Type | Multidimensional BST |
| Invented | 1975 |
| Invented by | Jon Louis Bentley |

Time complexity in big O notation

| | Average | Worst case |
| --- | --- | --- |
| Space | O(n) | O(n) |
| Search | O(log n) | O(n) |
| Insert | O(log n) | O(n) |
| Delete | O(log n) | O(n) |

A 3-dimensional k-d tree. The first split (red) cuts the root cell (white) into two subcells, each of which is then split (green) into two subcells. Finally, each of those four is split (blue) into two subcells. Since there is no more splitting, the final eight are called leaf cells.

In computer science, a k-d tree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space. k-d trees are a useful data structure for several applications, such as searches involving a multidimensional search key (e.g. range searches and nearest neighbor searches). k-d trees are a special case of binary space partitioning trees.

## Informal description

The k-d tree is a binary tree in which every node is a k-dimensional point. Every non-leaf node can be thought of as implicitly generating a splitting hyperplane that divides the space into two parts, known as half-spaces. Points to the left of this hyperplane are represented by the left subtree of that node and points to the right of the hyperplane are represented by the right subtree. The hyperplane direction is chosen in the following way: every node in the tree is associated with one of the k dimensions, with the hyperplane perpendicular to that dimension's axis. So, for example, if for a particular split the "x" axis is chosen, all points in the subtree with a smaller "x" value than the node will appear in the left subtree and all points with a larger "x" value will be in the right subtree. In such a case, the hyperplane would be set by the x-value of the point, and its normal would be the unit x-axis.[1]

## Operations on k-d trees

### Construction

Since there are many possible ways to choose axis-aligned splitting planes, there are many different ways to construct k-d trees. The canonical method of k-d tree construction has the following constraints:[2]

• As one moves down the tree, one cycles through the axes used to select the splitting planes. (For example, in a 3-dimensional tree, the root would have an x-aligned plane, the root's children would both have y-aligned planes, the root's grandchildren would all have z-aligned planes, the root's great-grandchildren would all have x-aligned planes, the root's great-great-grandchildren would all have y-aligned planes, and so on.)
• Points are inserted by selecting the median of the points being put into the subtree, with respect to their coordinates in the axis being used to create the splitting plane. (Note the assumption that we feed the entire set of n points into the algorithm up-front.)

This method leads to a balanced k-d tree, in which each leaf node is about the same distance from the root. However, balanced trees are not necessarily optimal for all applications.

Note also that it is not required to select the median point. In that case, the result is simply that there is no guarantee that the tree will be balanced. A simple heuristic that avoids coding a complex linear-time median-finding algorithm, or using an O(n log n) sort of all n points, is to sort a fixed number of randomly selected points and use their median as the splitting plane. In practice, this technique often results in nicely balanced trees.
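As a rough illustration of the sampling heuristic described above, the following sketch picks each pivot as the median of a small random sample. The names `sample_median` and `build_sampled`, the default sample size of 16, and the nested `(point, left, right)` tuple layout are assumptions made for this example only.

```python
import random


def sample_median(points, axis, sample_size=16, rng=random):
    """Return the median, along `axis`, of a small random sample of `points`."""
    sample = rng.sample(points, min(sample_size, len(points)))
    sample.sort(key=lambda p: p[axis])
    return sample[len(sample) // 2]


def build_sampled(points, depth=0, sample_size=16, rng=random):
    """Build a k-d tree as nested (point, left, right) tuples.

    The pivot comes from a random sample, so balance is likely
    but not guaranteed.
    """
    if not points:
        return None
    axis = depth % len(points[0])
    pivot = sample_median(points, axis, sample_size, rng)
    rest = list(points)
    rest.remove(pivot)  # drop one occurrence of the pivot point
    # "less than" goes left, "greater than or equal to" goes right
    left = [p for p in rest if p[axis] < pivot[axis]]
    right = [p for p in rest if p[axis] >= pivot[axis]]
    return (pivot,
            build_sampled(left, depth + 1, sample_size, rng),
            build_sampled(right, depth + 1, sample_size, rng))
```

Each level costs one linear partition plus a constant-size sort, so no full sort of the n points is needed at any node.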

Given a list of n points, the following algorithm uses a median-finding sort to construct a balanced k-d tree containing those points.

    function kdtree (list of points pointList, int depth)
    {
        // Base case: an empty list produces an empty subtree
        if pointList is empty
            return nil;

        // Select axis based on depth so that axis cycles through all valid values
        var int axis := depth mod k;

        // Sort point list and choose median as pivot element
        select median by axis from pointList;

        // Create node and construct subtrees
        var tree_node node;
        node.location := median;
        node.leftChild := kdtree(points in pointList before median, depth+1);
        node.rightChild := kdtree(points in pointList after median, depth+1);
        return node;
    }


It is common that points "after" the median include only the ones that are strictly greater than the median. For points that lie on the median, it is possible to define a "superkey" function that compares the points in all dimensions. In some cases, it is acceptable to let points equal to the median lie on one side of the median, for example, by splitting the points into a "less than" subset and a "greater than or equal to" subset.
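One possible reading of the "superkey" idea is sketched below: points are compared on the splitting axis first, and ties are broken by the remaining coordinates taken cyclically. The function names `superkey` and `split_at_median` are hypothetical and chosen only for this illustration.

```python
def superkey(point, axis):
    """Rotate the coordinates so that `axis` is compared first."""
    return tuple(point[axis:]) + tuple(point[:axis])


def split_at_median(points, axis):
    """Return (median, before, after) using the superkey ordering."""
    ordered = sorted(points, key=lambda p: superkey(p, axis))
    mid = len(ordered) // 2
    return ordered[mid], ordered[:mid], ordered[mid + 1:]


# Three points share x = 5; the y-coordinate decides their order.
median, before, after = split_at_median([(5, 1), (5, 3), (2, 9), (5, 2)], axis=0)
print(median, before, after)  # (5, 2) [(2, 9), (5, 1)] [(5, 3)]
```

With this ordering, only identical points compare equal, so every point "before" the median is strictly smaller in superkey order.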

The above algorithm implemented in the Python programming language is as follows:

    from collections import namedtuple
    from operator import itemgetter
    from pprint import pformat


    class Node(namedtuple('Node', 'location left_child right_child')):
        def __repr__(self):
            return pformat(tuple(self))


    def kdtree(point_list, depth=0):
        if not point_list:
            return None

        k = len(point_list[0])  # assumes all points have the same dimension

        # Select axis based on depth so that axis cycles through all valid values
        axis = depth % k

        # Sort point list (in place) and choose median as pivot element
        point_list.sort(key=itemgetter(axis))
        median = len(point_list) // 2  # choose median

        # Create node and construct subtrees
        return Node(
            location=point_list[median],
            left_child=kdtree(point_list[:median], depth + 1),
            right_child=kdtree(point_list[median + 1:], depth + 1)
        )


    def main():
        """Example usage"""
        point_list = [(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)]
        tree = kdtree(point_list)
        print(tree)


    if __name__ == '__main__':
        main()

Output would be:

    ((7, 2),
     ((5, 4), ((2, 3), None, None), ((4, 7), None, None)),
     ((9, 6), ((8, 1), None, None), None))

The generated tree is shown below.

k-d tree decomposition for the point set (2,3), (5,4), (9,6), (4,7), (8,1), (7,2).

The resulting k-d tree.

This algorithm creates the invariant that for any node, all the nodes in the left subtree are on one side of a splitting plane, and all the nodes in the right subtree are on the other side. Points that lie on the splitting plane may appear on either side. The splitting plane of a node goes through the point associated with that node (referred to in the code as node.location).
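The invariant can be checked mechanically. The sketch below is one way to do so, assuming the tree is written as nested `(point, left, right)` tuples; the function name `satisfies_invariant` and the constant `TREE` are illustrative. Points lying exactly on a splitting plane are accepted on either side, as described above.

```python
import math


def satisfies_invariant(node, depth=0, lo=None, hi=None):
    """Check the splitting-plane invariant on a nested (point, left, right) tree."""
    if node is None:
        return True
    point, left, right = node
    k = len(point)
    lo = [-math.inf] * k if lo is None else lo
    hi = [math.inf] * k if hi is None else hi
    # The point must lie inside the cell defined by all ancestor planes.
    if any(not (l <= c <= h) for l, c, h in zip(lo, point, hi)):
        return False
    axis = depth % k
    left_hi = list(hi)
    left_hi[axis] = point[axis]   # left subtree: coordinate <= split value
    right_lo = list(lo)
    right_lo[axis] = point[axis]  # right subtree: coordinate >= split value
    return (satisfies_invariant(left, depth + 1, lo, left_hi)
            and satisfies_invariant(right, depth + 1, right_lo, hi))


# The tree printed by the Python example above, written as nested tuples.
TREE = ((7, 2),
        ((5, 4), ((2, 3), None, None), ((4, 7), None, None)),
        ((9, 6), ((8, 1), None, None), None))
print(satisfies_invariant(TREE))  # True
```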

An alternative tree-building algorithm builds a balanced k-d tree in O(kn log n) time by sorting n points in k dimensions independently and prior to building the k-d tree.[3][4] A suitable sorting algorithm is Heapsort, which creates a sorted array in O(n log n) time. Application of Heapsort to n points in each of k dimensions requires O(kn log n) time, and produces k sorted arrays of length n that contain references (or pointers) to the n points. These arrays are numbered from 0 to k-1. Each array represents the result of sorting the points in one of the k dimensions. For example, the elements of array 0, from first to last, reference the n points in order of increasing x-coordinate. Similarly, the elements of arrays 1, 2, and 3, from first to last, reference the n points in order of increasing y-, z- and w-coordinates, respectively.

In order to insert the first node into the k-d tree, the median element of array 0 is chosen and stored in the tree node. This median element splits array 0 into two subarrays. One subarray lies above the median element, and the other subarray lies below it. Also, the x-coordinate of the point that this median element references defines an x-aligned splitting plane that may be used to split each of the other k-1 arrays into two subarrays. The following procedure splits an array into two subarrays:

• Consider each element of the array in order from first to last.
• Test against the splitting plane the x-coordinate of the point that is referenced by the array element, and assign that element to one of two subarrays, depending on which side of the splitting plane the point lies.
• Ignore the array element that references the same point that the median element of array 0 references, because this point defines the splitting plane.

This procedure splits the arrays into two sets of subarrays while preserving the original sorted order within each subarray. These subarrays may then be used to insert nodes into the two subtrees at the next level of the tree in a recursive manner. However, if the subarrays comprise only one or two array elements, no further recursion is required because these cases may be solved trivially.
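The splitting procedure above might be sketched as follows. This is a simplified illustration, not the published algorithm: it uses lists of integer indices in place of pointers, allocates new lists instead of reusing the k arrays, and assumes that coordinates along the splitting axis are distinct. The name `split_presorted` is hypothetical.

```python
def split_presorted(points, sorted_idx, axis):
    """Split k presorted index lists around the median of sorted_idx[axis].

    sorted_idx[d] lists the indices of `points` in increasing order of
    coordinate d. Returns (median_index, below, above), where `below` and
    `above` are again lists of k index lists, each still in sorted order.
    """
    order = sorted_idx[axis]
    median = order[len(order) // 2]
    split_value = points[median][axis]
    below, above = [], []
    for idx_list in sorted_idx:
        lo, hi = [], []
        for i in idx_list:        # first to last, so sorted order is preserved
            if i == median:
                continue          # the median point defines the splitting plane
            (lo if points[i][axis] < split_value else hi).append(i)
        below.append(lo)
        above.append(hi)
    return median, below, above


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
sorted_idx = [sorted(range(len(points)), key=lambda i: points[i][d])
              for d in range(2)]
median, below, above = split_presorted(points, sorted_idx, axis=0)
print(points[median])  # (7, 2)
print(below)           # [[0, 3, 1], [0, 1, 3]]
print(above)           # [[4, 2], [4, 2]]
```

Because each list is scanned once from first to last, no re-sorting is needed at deeper levels of the recursion.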

These guidelines will simplify creation of k-d trees:

• Arrays should be split into subarrays that represent "less than" and "greater than or equal to" partitioning. This convention requires that, after choosing the median element of array 0, the element of array 0 that lies immediately below the median element be examined to ensure that this adjacent element references a point whose x-coordinate is less than and not equal to the x-coordinate of the splitting plane. If this adjacent element references a point whose x-coordinate is equal to the x-coordinate of the splitting plane, continue searching towards the beginning of array 0 until the first instance of an array element is found that references a point whose x-coordinate is less than and not equal to the x-coordinate of the splitting plane. When this array element is found, the element that lies immediately above this element is the correct choice for the median element. Apply this method of choosing the median element at each level of recursion.
• This procedure for producing subarrays guarantees that the two subarrays comprise one less array element than the array from which these subarrays were produced. This characteristic permits re-use of the k arrays at each level of recursion as follows: (1) copy array 0 into a temporary array, (2) build the subarrays that are produced from array 1 in array 0, (3) build the subarrays that are produced from array 2 in array 1, (4) continue this pattern, and build the subarrays that are produced from array k-1 in array k-2, and finally (5) copy the temporary array into array k-1. This method permutes the subarrays so that at successive levels of the k-d tree, the median element is chosen from x-, y-, z-, w-, ... sorted arrays.
• The addresses of the first and last elements of the 2k subarrays can be passed to the next level of recursion in order to designate where these subarrays lie within the k arrays. Each of the two sets of k subarrays has identical addresses for its first and last elements.

This tree-building algorithm requires at most O([k-1]n) tests of coordinates against splitting planes to build each of the log n levels of a balanced k-d tree. Hence, building the entire k-d tree requires less than O([k-1]n log n) time, which is less than the O(kn log n) time that is required to sort the n points in k dimensions prior to building the k-d tree.

### Adding elements

One adds a new point to a k-d tree in the same way as one adds an element to any other search tree. First, traverse the tree, starting from the root and moving to either the left or the right child depending on whether the point to be inserted is on the "left" or "right" side of the splitting plane. Once the traversal reaches the node under which the new point should be located, add the new point as either the left or right child of that node, again depending on which side of the node's splitting plane contains the new point.

Adding points in this manner can cause the tree to become unbalanced, leading to decreased tree performance. The rate of tree performance degradation is dependent upon the spatial distribution of tree points being added, and the number of points added in relation to the tree size. If a tree becomes too unbalanced, it may need to be re-balanced to restore the performance of queries that rely on the tree balancing, such as nearest neighbour searching.
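A minimal sketch of this insertion procedure is given below, assuming a simple mutable `Node` class and the "less than goes left, otherwise right" convention; both are choices made for this example. The `height` helper is included only to show how insertion order affects balance.

```python
class Node:
    def __init__(self, point):
        self.point = point
        self.left = None
        self.right = None


def insert(root, point, depth=0):
    """Insert `point` below `root` and return the (possibly new) root."""
    if root is None:
        return Node(point)
    axis = depth % len(point)
    if point[axis] < root.point[axis]:
        root.left = insert(root.left, point, depth + 1)
    else:
        root.right = insert(root.right, point, depth + 1)
    return root


def height(node):
    """Number of nodes on the longest root-to-leaf path."""
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


root = None
for p in [(7, 2), (5, 4), (9, 6), (2, 3), (4, 7), (8, 1)]:
    root = insert(root, p)
print(height(root))   # 3

# Points arriving in sorted order degenerate into a chain.
chain = None
for i in range(50):
    chain = insert(chain, (i, i))
print(height(chain))  # 50
```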

### Removing elements

To remove a point from an existing k-d tree, without breaking the invariant, the easiest way is to form the set of all nodes and leaves from the children of the target node, and recreate that part of the tree.
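The rebuild approach could look like the following sketch. It assumes nested `(point, left, right)` tuples, distinct points, and a median-based `build` helper like the construction algorithm given earlier; all function names are illustrative.

```python
def collect(node):
    """Return every point stored in the subtree rooted at `node`."""
    if node is None:
        return []
    point, left, right = node
    return collect(left) + [point] + collect(right)


def build(points, depth=0):
    """Median-split construction, starting the axis cycle at `depth`."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return (ordered[mid],
            build(ordered[:mid], depth + 1),
            build(ordered[mid + 1:], depth + 1))


def remove_by_rebuild(node, target, depth=0):
    """Return a tree without `target`, rebuilding the subtree below it."""
    if node is None:
        return None
    point, left, right = node
    if point == target:
        # Rebuild at the same depth so the axis cycle stays aligned.
        return build(collect(left) + collect(right), depth)
    axis = depth % len(target)
    # A point on the splitting plane may sit on either side, so search both.
    if target[axis] <= point[axis]:
        left = remove_by_rebuild(left, target, depth + 1)
    if target[axis] >= point[axis]:
        right = remove_by_rebuild(right, target, depth + 1)
    return (point, left, right)
```

The cost is dominated by rebuilding the subtree, so this is cheapest for targets near the leaves and most expensive at the root.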

Another approach is to find a replacement for the point removed.[5] First, find the node R that contains the point to be removed. For the base case where R is a leaf node, no replacement is required. For the general case, find a replacement point, say p, from the subtree rooted at R. Replace the point stored at R with p. Then, recursively remove p.

For finding a replacement point, if R discriminates on x (say) and R has a right child, find the point with the minimum x value from the subtree rooted at the right child. Otherwise, find the point with the maximum x value from the subtree rooted at the left child.
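The replacement strategy can be sketched as below. The example assumes a mutable `Node` class, the "less than goes left, otherwise right" convention, and coordinates that are distinct along each axis (ties would need extra care when the replacement comes from the left subtree). The names `find_extreme` and `delete` are hypothetical.

```python
class Node:
    def __init__(self, point, left=None, right=None):
        self.point, self.left, self.right = point, left, right


def find_extreme(node, dim, depth, pick):
    """Return the point with the smallest (pick=min) or largest (pick=max)
    coordinate along `dim` in the subtree rooted at `node`."""
    if node is None:
        return None
    axis = depth % len(node.point)
    if axis == dim:
        # Only one side can hold a more extreme value along `dim`.
        children = [node.left] if pick is min else [node.right]
    else:
        children = [node.left, node.right]
    candidates = [node.point]
    for child in children:
        found = find_extreme(child, dim, depth + 1, pick)
        if found is not None:
            candidates.append(found)
    return pick(candidates, key=lambda p: p[dim])


def delete(node, point, depth=0):
    """Remove `point` from the subtree rooted at `node`; return the new root."""
    if node is None:
        return None
    axis = depth % len(point)
    if node.point == point:
        if node.right is not None:
            repl = find_extreme(node.right, axis, depth + 1, min)
            node.point = repl
            node.right = delete(node.right, repl, depth + 1)
        elif node.left is not None:
            repl = find_extreme(node.left, axis, depth + 1, max)
            node.point = repl
            node.left = delete(node.left, repl, depth + 1)
        else:
            return None  # leaf: no replacement required
    elif point[axis] < node.point[axis]:
        node.left = delete(node.left, point, depth + 1)
    else:
        node.right = delete(node.right, point, depth + 1)
    return node
```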

### Balancing

Balancing a k-d tree requires care. Because k-d trees are sorted in multiple dimensions, the tree rotation technique cannot be used to balance them, as this may break the invariant.

Several variants of balanced k-d trees exist. They include the divided k-d tree, pseudo k-d tree, k-d B-tree, hB-tree and Bkd-tree. Many of these variants are adaptive k-d trees.

### Nearest neighbour search

Animation of NN searching with a k-d tree in two dimensions

The nearest neighbour search (NN) algorithm aims to find the point in the tree that is nearest to a given input point. This search can be done efficiently by using the tree properties to quickly eliminate large portions of the search space.

Searching for a nearest neighbour in a k-d tree proceeds as follows:

1. Starting with the root node, the algorithm moves down the tree recursively, in the same way that it would if the search point were being inserted (i.e. it goes left or right depending on whether the point is less than or greater than the current node in the split dimension).
2. Once the algorithm reaches a leaf node, it saves that node point as the "current best".
3. The algorithm unwinds the recursion of the tree, performing the following steps at each node:
   1. If the current node is closer than the current best, then it becomes the current best.
   2. The algorithm checks whether there could be any points on the other side of the splitting plane that are closer to the search point than the current best. In concept, this is done by intersecting the splitting hyperplane with a hypersphere around the search point that has a radius equal to the current nearest distance. Since the hyperplanes are all axis-aligned, this is implemented as a simple comparison to see whether the difference between the splitting coordinate of the search point and current node is less than the distance (over all coordinates) from the search point to the current best.
      1. If the hypersphere crosses the plane, there could be nearer points on the other side of the plane, so the algorithm must move down the other branch of the tree from the current node looking for closer points, following the same recursive process as the entire search.
      2. If the hypersphere doesn't intersect the splitting plane, then the algorithm continues walking up the tree, and the entire branch on the other side of that node is eliminated.
4. When the algorithm finishes this process for the root node, then the search is complete.

Generally the algorithm uses squared distances for comparison to avoid computing square roots. Additionally, it can save computation by holding the squared current best distance in a variable for comparison.
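The search steps above can be sketched as follows, using squared distances throughout. The tree is assumed to be stored as nested `(point, left, right)` tuples; the names `sq_dist`, `nearest` and `TREE` are illustrative.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, query, depth=0, best=None):
    """Return (squared_distance, point) for the stored point nearest to
    `query`, or None for an empty tree."""
    if node is None:
        return best
    point, left, right = node
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)

    # Step 1: descend on the side where the query would be inserted.
    best = nearest(near, query, depth + 1, best)

    # Step 3.1: while unwinding, test the current node.
    d = sq_dist(point, query)
    if best is None or d < best[0]:
        best = (d, point)

    # Step 3.2: visit the far side only if the hypersphere crosses the plane.
    if diff * diff < best[0]:
        best = nearest(far, query, depth + 1, best)
    return best


# The example tree for the points (2,3), (5,4), (9,6), (4,7), (8,1), (7,2).
TREE = ((7, 2),
        ((5, 4), ((2, 3), None, None), ((4, 7), None, None)),
        ((9, 6), ((8, 1), None, None), None))
print(nearest(TREE, (9, 2)))  # (2, (8, 1))
```

In this query the whole left subtree of the root is never visited, because the squared distance to the root's splitting plane (4) is not smaller than the best squared distance found so far (2).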

Finding the nearest point is an O(log N) operation in the case of randomly distributed points, although analysis in general is tricky. However an algorithm has been given that claims guaranteed O(log N) complexity.[6]

In high-dimensional spaces, the curse of dimensionality causes the algorithm to need to visit many more branches than in lower-dimensional spaces. In particular, when the number of points is only slightly higher than the number of dimensions, the algorithm is only slightly better than a linear search of all of the points.

The algorithm can be extended in several ways by simple modifications. It can provide the k nearest neighbours to a point by maintaining k current bests instead of just one. A branch is only eliminated when k points have been found and the branch cannot have points closer than any of the k current bests.
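One way to maintain several current bests is a bounded max-heap keyed on squared distance, as in the sketch below. The parameter is called `count` here to avoid a clash with the dimension k; the nested-tuple tree layout and all function names are assumptions of this example.

```python
import heapq


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _search(node, query, count, depth, heap):
    if node is None:
        return
    point, left, right = node
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    _search(near, query, count, depth + 1, heap)

    # heapq is a min-heap, so distances are negated: heap[0] is the
    # farthest of the current bests.
    d = sq_dist(point, query)
    if len(heap) < count:
        heapq.heappush(heap, (-d, point))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, point))

    # Prune the far branch only once `count` points have been found and
    # the splitting plane is no closer than the farthest of them.
    if len(heap) < count or diff * diff < -heap[0][0]:
        _search(far, query, count, depth + 1, heap)


def k_nearest(root, query, count):
    """Return up to `count` (squared_distance, point) pairs, nearest first."""
    if count < 1:
        return []
    heap = []
    _search(root, query, count, 0, heap)
    return sorted((-neg_d, point) for neg_d, point in heap)


TREE = ((7, 2),
        ((5, 4), ((2, 3), None, None), ((4, 7), None, None)),
        ((9, 6), ((8, 1), None, None), None))
print(k_nearest(TREE, (9, 2), 2))  # [(2, (8, 1)), (4, (7, 2))]
```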

It can also be converted to an approximation algorithm to run faster. For example, approximate nearest neighbour searching can be achieved by simply setting an upper bound on the number of points to examine in the tree, or by interrupting the search process based upon a real time clock (which may be more appropriate in hardware implementations). Nearest neighbour for points that are in the tree already can be achieved by not updating the refinement for nodes that give zero distance as the result. This has the downside of discarding points that are not unique, but are co-located with the original search point.

Approximate nearest neighbour is useful in real-time applications such as robotics due to the significant speed increase gained by not searching for the best point exhaustively. One of its implementations is best-bin-first search.

### Range search

Analyses of binary search trees have found that the worst-case time for range search in a k-dimensional k-d tree containing N nodes is given by the following equation.[7]

$t_{worst} = O(k \cdot N^{1-\frac{1}{k}})$
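For illustration, an axis-parallel (orthogonal) range query over a point k-d tree might be written as in the sketch below. The tree is assumed to be stored as nested `(point, left, right)` tuples, and the query box is given by its lower corner `lo` and upper corner `hi`, both inclusive; these conventions and the name `range_search` are choices made for this example.

```python
def range_search(node, lo, hi, depth=0, found=None):
    """Collect all points p with lo[d] <= p[d] <= hi[d] in every dimension d."""
    if found is None:
        found = []
    if node is None:
        return found
    point, left, right = node
    axis = depth % len(lo)
    if all(l <= c <= h for l, c, h in zip(lo, point, hi)):
        found.append(point)
    # The left subtree holds coordinates <= the split value, so it can
    # intersect the box only if the box starts at or below the split.
    if lo[axis] <= point[axis]:
        range_search(left, lo, hi, depth + 1, found)
    # Symmetrically for the right subtree.
    if point[axis] <= hi[axis]:
        range_search(right, lo, hi, depth + 1, found)
    return found


TREE = ((7, 2),
        ((5, 4), ((2, 3), None, None), ((4, 7), None, None)),
        ((9, 6), ((8, 1), None, None), None))
print(sorted(range_search(TREE, (4, 1), (8, 5))))  # [(5, 4), (7, 2), (8, 1)]
```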

## High-dimensional data

k-d trees are not suitable for efficiently finding the nearest neighbour in high-dimensional spaces. As a general rule, if the dimensionality is k, the number of points in the data, N, should satisfy $N \gg 2^k$. Otherwise, when k-d trees are used with high-dimensional data, most of the points in the tree will be evaluated and the efficiency is no better than exhaustive search,[8] and approximate nearest-neighbour methods should be used instead.

## Complexity

• Building a static k-d tree from n points takes:
  • $O(n \log^2 n)$ time if an O(n log n) sort such as Heapsort is used to compute the median at each level;
  • O(n log n) time if a complex linear-time median-finding algorithm such as the one described in Cormen et al.[9] is used;
  • O(kn log n) plus O([k-1]n log n) time if n points are sorted in each of k dimensions using an O(n log n) sort prior to building the k-d tree.
• Inserting a new point into a balanced k-d tree takes O(log n) time.
• Removing a point from a balanced k-d tree takes O(log n) time.
• Querying an axis-parallel range in a balanced k-d tree takes $O(n^{1-1/k} + m)$ time, where m is the number of the reported points, and k the dimension of the k-d tree.
• Finding 1 nearest neighbour in a balanced k-d tree with randomly distributed points takes O(log n) time on average.

## Variations

### Volumetric objects

Instead of points, a k-d tree can also contain rectangles or hyperrectangles.[10][11] Thus range search becomes the problem of returning all rectangles intersecting the search rectangle. The tree is constructed the usual way with all the rectangles at the leaves. In an orthogonal range search, the opposite coordinate is used when comparing against the median. For example, if the current level is split along $x_{high}$, we check the $x_{low}$ coordinate of the search rectangle. If the median is less than the $x_{low}$ coordinate of the search rectangle, then no rectangle in the left branch can ever intersect with the search rectangle and so can be pruned. Otherwise both branches should be traversed. See also interval tree, which is a 1-dimensional special case.

### Points only in leaves

It is also possible to define a k-d tree with points stored solely in leaves.[2] This form of k-d tree allows a variety of split mechanics other than the standard median split. The midpoint splitting rule[12] splits at the middle of the longest axis of the space being searched, regardless of the distribution of points. This guarantees that the aspect ratio will be at most 2:1, but the depth is dependent on the distribution of points. A variation, called sliding-midpoint, only splits at the middle if there are points on both sides of the split. Otherwise, it splits on the point nearest to the middle. Maneewongvatana and Mount show that this offers "good enough" performance on common data sets. Using sliding-midpoint, an approximate nearest neighbour query can be answered in $O\left(\frac{1}{\epsilon^d} \log n\right)$ time. Approximate range counting can be answered in $O\left(\log n + \left(\frac{1}{\epsilon}\right)^d\right)$ time with this method.
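A single sliding-midpoint split might be sketched as follows. This is an illustrative reading of the rule described above, not code from the cited paper: it assumes the cell is given by its corner points `lo` and `hi`, that it contains at least two points, and that a point lying on the slid plane is assigned to the otherwise empty side. The name `sliding_midpoint_split` is hypothetical.

```python
def sliding_midpoint_split(points, lo, hi):
    """Split the points of the cell [lo, hi]; return (axis, cut, left, right)."""
    # Split across the longest side of the cell.
    axis = max(range(len(lo)), key=lambda d: hi[d] - lo[d])
    cut = (lo[axis] + hi[axis]) / 2
    left = [p for p in points if p[axis] < cut]
    right = [p for p in points if p[axis] >= cut]
    if not left:
        # All points are above the midpoint: slide the plane up to the
        # nearest point, which then forms the lower side on its own.
        nearest = min(right, key=lambda p: p[axis])
        cut = nearest[axis]
        right.remove(nearest)
        left = [nearest]
    elif not right:
        # All points are below the midpoint: slide the plane down.
        nearest = max(left, key=lambda p: p[axis])
        cut = nearest[axis]
        left.remove(nearest)
        right = [nearest]
    return axis, cut, left, right


# Points on both sides of the midpoint: a plain midpoint split.
print(sliding_midpoint_split([(1, 1), (9, 3)], (0, 0), (10, 4)))
# (0, 5.0, [(1, 1)], [(9, 3)])

# All points on one side: the plane slides to x = 3.
print(sliding_midpoint_split([(1, 1), (2, 1), (3, 2)], (0, 0), (10, 4)))
# (0, 3, [(1, 1), (2, 1)], [(3, 2)])
```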

## See also

• implicit k-d tree, a k-d tree defined by an implicit splitting function rather than an explicitly-stored set of splits
• min/max k-d tree, a k-d tree that associates a minimum and maximum value with each of its nodes
• Quadtree, a space-partitioning structure that splits at the geometric midpoint rather than the median coordinate
• Octree, a higher-dimensional generalization of a quadtree
• R-tree and bounding interval hierarchy, structures for partitioning objects rather than points, with overlapping regions
• Recursive partitioning, a technique for constructing statistical decision trees that are similar to k-d trees
• Klee's measure problem, a problem of computing the area of a union of rectangles, solvable using k-d trees
• Guillotine problem, a problem of finding a k-d tree whose cells are large enough to contain a given set of rectangles

## References

1. ^ J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509-517, 1975.
2. ^ a b de Berg, Mark et al. Computational Geometry: Algorithms and Applications, 3rd Edition, pages 99-105. Springer, 2008.
3. ^ I. Wald and V. Havran. On building fast kd-trees for ray tracing, and on doing that in O(NlogN). IEEE Symposium on Interactive Ray Tracing, pp. 61-69, 2006.
4. ^
5. ^ Chandran, Sharat. Introduction to kd-trees. University of Maryland Department of Computer Science.
6. ^ Friedman, Jerome H., Bentley, Jon Louis, Finkel, Raphael Ari (Sep 1977). "An Algorithm for Finding Best Matches in Logarithmic Expected Time". ACM Trans. Math. Softw. (ACM) 3 (3): 209–226. doi:10.1145/355744.355745. ISSN 0098-3500. Retrieved 29 March 2013.
7. ^ Lee, D. T.; Wong, C. K. (1977). "Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees". Acta Informatica 9 (1): 23–29. doi:10.1007/BF00263763.
8. ^ Jacob E. Goodman, Joseph O'Rourke and Piotr Indyk (Ed.) (2004). "Chapter 39 : Nearest neighbours in high-dimensional spaces". Handbook of Discrete and Computational Geometry (2nd ed.). CRC Press.
9. ^ Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. Introduction to Algorithms. MIT Press and McGraw-Hill. Chapter 10.
10. ^ Rosenberg J. Geographical Data Structures Compared: A Study of Data Structures Supporting Region Queries. IEEE Transaction on CAD Integrated Circuits Systems 4(1):53-67
11. ^ Houthuys P. Box Sort, a multidimensional binary sorting method for rectangular boxes, used for quick range searching. The Visual Computer, 1987, 3:236-249
12. ^ S. Maneewongvatana and D. M. Mount. It's okay to be skinny, if your friends are fat. 4th Annual CGC Workshop on Computational Geometry, 1999.