Optimal binary search tree
This article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these template messages)(Learn how and when to remove this template message)
In computer science, an optimal binary search tree (Optimal BST), sometimes called a weight-balanced binary tree, is a binary search tree which provides the smallest possible search time (or expected search time) for a given sequence of accesses (or access probabilities). Optimal BSTs are generally divided into two types: static and dynamic.
In the static optimality problem, the tree cannot be modified after it has been constructed. In this case, there exists some particular layout of the nodes of the tree which provides the smallest expected search time for the given access probabilities. Various algorithms exist to construct or approximate the statically optimal tree given the information on the access probabilities of the elements.
In the dynamic optimality problem, the tree can be modified at any time, typically by permitting tree rotations. The tree is considered to have a cursor starting at the root which it can move or use to perform modifications. In this case, there exists some minimal-cost sequence of these operations which causes the cursor to visit every node in the target access sequence in order. The splay tree is conjectured to have a constant competitive ratio compared to the dynamically optimal tree in all cases, though this has not yet been proven.
In the static optimality problem as defined by Knuth, we are given a set of n ordered elements and a set of probabilities. We will denote the elements through and the probabilities through and through . is the probability of a search being done for element . For , is the probability of a search being done for an element between and , is the probability of a search being done for an element strictly less than , and is the probability of a search being done for an element strictly greater than . These probabilities cover all possible searches, and therefore add up to one.
The static optimality problem is the optimization problem of finding the binary search tree that minimizes the expected search time, given the probabilities. As the number of possible trees on a set of n elements is , which is exponential in n, brute-force search is not usually a feasible solution.
Knuth's dynamic programming algorithm
In 1971, Knuth published a relatively straightforward dynamic programming algorithm capable of constructing the statically optimal tree in only O(n2) time. Knuth's primary insight was that the static optimality problem exhibits optimal substructure; that is, if a certain tree is statically optimal for a given probability distribution, then its left and right subtrees must also be statically optimal for their appropriate subsets of the distribution.
To see this, consider what Knuth calls the "weighted path length" of a tree. The weighted path length of a tree on n elements is the sum of the lengths of all possible search paths, weighted by their respective probabilities. The tree with the minimal weighted path length is, by definition, statically optimal.
But weighted path lengths have an interesting property. Let E be the weighted path length of a binary tree, EL be the weighted path length of its left subtree, and ER be the weighted path length of its right subtree. Also let W be the sum of all the probabilities in the tree. Observe that when either subtree is attached to the root, the depth of each of its elements (and thus each of its search paths) is increased by one. Also observe that the root itself has a depth of one. This means that the difference in weighted path length between a tree and its two subtrees is exactly the sum of every single probability in the tree, leading to the following recurrence:
This recurrence leads to a natural dynamic programming solution. Let be the weighted path length of the statically optimal search tree for all values between ai and aj, let be the total weight of that tree, and let be the index of its root. The algorithm can be built using the following formulas:
- The naive implementation of this algorithm actually takes O(n3) time, but Knuth's paper includes some additional observations which can be used to produce a modified algorithm taking only O(n2) time.
Mehlhorn's approximation algorithm
While the O(n2) time taken by Knuth's algorithm is substantially better than the exponential time required for a brute-force search, it is still too slow to be practical when the number of elements in the tree is very large.
In 1975, Kurt Mehlhorn published a paper proving that a much simpler algorithm could be used to closely approximate the statically optimal tree in only time. In this algorithm, the root of the tree is chosen so as to most closely balance the total weight (by probability) of the left and right subtrees. This strategy is then applied recursively on each subtree.
That this strategy produces a good approximation can be seen intuitively by noting that the weights of the subtrees along any path form something very close to a geometrically decreasing sequence. In fact, this strategy generates a tree whose weighted path length is at most
where H is the entropy of the probability distribution. Since no optimal binary search tree can ever do better than a weighted path length of
this approximation is very close.
|Unsolved problem in computer science:
Do splay trees perform as well as any other binary search tree algorithm?
(more unsolved problems in computer science)
There are several different definitions of dynamic optimality, all of which are effectively equivalent to within a constant factor in terms of running-time. The problem was first introduced implicitly by Sleator and Tarjan in their paper on splay trees, but Demaine et al. give a very good formal statement of it.
In the dynamic optimality problem, we are given a sequence of accesses x1, ..., xm on the keys 1, ..., n. For each access, we are given a pointer to the root of our BST and can use the pointer to perform any of the following operations:
- Move the pointer to the left child of the current node.
- Move the pointer to the right child of the current node.
- Move the pointer to the parent of the current node.
- Perform a single rotation on the current node and its parent.
Our BST algorithm can perform any sequence of the above operations as long as the pointer eventually ends up on the node containing the target value xi. The time it takes a given dynamic BST algorithm to perform a sequence of accesses is equivalent to the total number of such operations performed during that sequence. Given any sequence of accesses on any set of elements, there is some BST algorithm which performs all accesses using the fewest total operations.
This model defines the fastest possible tree for a given sequence of accesses, but calculating the optimal tree in this sense therefore requires foreknowledge of exactly what the access sequence will be. If we let OPT(X) be the number of operations performed by the strictly optimal tree for an access sequence X, we can say that a tree is dynamically optimal as long as, for any X, it performs X in time O(OPT(X)) (that is, it has a constant competitive ratio).
There are several data structures conjectured to have this property, but none proven. It is an open problem whether there exists a dynamically optimal data structure in this model.
The splay tree is a data structure invented in 1985 by Daniel Sleator and Robert Tarjan which is conjectured to be dynamically optimal in the required sense. That is, a splay tree is believed to perform any sufficiently long access sequence X in time O(OPT(X)).
The tango tree is a data structure proposed in 2004 by Demaine et al. which has been proven to perform any sufficiently-long access sequence X in time . While this is not dynamically optimal, the competitive ratio of is still very small for reasonable values of n.
In 2013, John Iacono published a paper which uses the geometry of binary search trees to provide an algorithm which is dynamically optimal if any binary search tree algorithm is dynamically optimal.
- Tremblay, Jean-Paul; Cheston, Grant A. (2001). Data Structures and Software Development in an object-oriented domain. Eiffel Edition/Prentice Hall. ISBN 0-13-787946-6.
- Knuth, Donald E. (1971), "Optimum binary search trees", Acta Informatica, 1 (1): 14–25, doi:10.1007/BF00264289
- Mehlhorn, Kurt (1975), "Nearly optimal binary search trees", Acta Informatica, 5 (4): 287–295, doi:10.1007/BF00264563
- Demaine, Erik D.; Harmon, Dion; Iacono, John; Patrascu, Mihai (2004), "Dynamic Optimality — Almost", Dynamic optimality - almost, pp. 484–490, doi:10.1109/FOCS.2004.23, ISBN 0-7695-2228-9
- Sleator, Daniel; Tarjan, Robert (1985), "Self-adjusting binary search trees", Journal of the ACM, 32 (3): 652–686, doi:10.1145/3828.3835
- Iacono, John (2013), "In pursuit of the dynamic optimality conjecture", arXiv: