Disjoint-set data structure
Given a set of elements, it is often useful to break them up or partition them into a number of separate, nonoverlapping sets. A disjoint-set data structure is a data structure that keeps track of such a partitioning. A union-find algorithm is an algorithm that performs two useful operations on such a data structure:
- Find: Determine which set a particular element is in. Also useful for determining if two elements are in the same set.
- Union: Combine or merge two sets into a single set.
Because it supports these two operations, a disjoint-set data structure is sometimes called a union-find data structure or merge-find set. The other important operation, MakeSet, which makes a set containing only a given element (a singleton), is generally trivial. With these three operations, many practical partitioning problems can be solved (see the Applications section).
In order to define these operations more precisely, some way of representing the sets is needed. One common approach is to select a fixed element of each set, called its representative, to represent the set as a whole. Then, Find(x) returns the representative of the set that x belongs to, and Union takes two set representatives as its arguments.
Disjoint-set linked lists
A simple approach to creating a disjoint-set data structure is to create a linked list for each set. The element at the head of each list is chosen as its representative.
MakeSet creates a list of one element. Union appends the two lists, a constant-time operation. The drawback of this implementation is that Find requires Ω(n) or linear time.
This can be avoided by including in each linked list node a pointer to the head of the list; then Find takes constant time. However, Union now has to update each element of the list being appended to make it point to the head of the new combined list, requiring Ω(n) time.
When the length of each list is tracked, the required time can be ameliorated by always appending the smaller list to the longer. Using this weighted-union heuristic, a sequence of m MakeSet, Union, and Find operations on n elements requires O(m + nlog n) time [1]. For asymptotically faster operations, a different data structure is needed.
Disjoint-set forests
Disjoint-set forests are a data structure where each set is represented by a tree data structure, in which each node holds a reference to its parent node. They were first described by Bernard A. Galler and Michael J. Fischer in 1964[2], although their precise analysis took years.
In a disjoint-set forest, the representative of each set is the root of that set's tree. Find follows parent nodes until it reaches the root. Union combines two trees into one by attaching the root of one to the root of the other. One way of implementing these might be:
function MakeSet(x) x.parent := x function Find(x) if x.parent == x return x else return Find(x.parent) function Union(x, y) xRoot := Find(x) yRoot := Find(y) xRoot.parent := yRoot
In this naive form, this approach is no better than the linked-list approach, because the tree it creates can be highly unbalanced; however, it can be enhanced in two ways.
The first way, called union by rank, is to always attach the smaller tree to the root of the larger tree, rather than vice versa. To evaluate which tree is larger, a simple heuristic called rank is used: one-element trees have a rank of zero, and whenever two trees of the same rank r are united, the rank of the result is r+1. Just applying this technique alone yields an amortized running-time of per MakeSet, Union, or Find operation. Pseudocode for the improved MakeSet
and Union
:
function MakeSet(x) x.parent := x x.rank := 0 function Union(x, y) xRoot := Find(x) yRoot := Find(y) if xRoot.rank > yRoot.rank yRoot.parent := xRoot else if xRoot.rank < yRoot.rank xRoot.parent := yRoot else if xRoot != yRoot yRoot.parent := xRoot xRoot.rank := xRoot.rank + 1
The second improvement, called path compression, is a way of flattening the structure of the tree whenever Find is used on it. The idea is that each node visited on the way to a root node may as well be attached directly to the root node; they all share the same representative. To effect this, one traversal up to the root node is made, to find out what it is, and then another traversal is made, to update the parent reference of each visited node to point to the root node. The resulting tree is much flatter, speeding up future operations not only on these elements but on those referencing them, directly or indirectly. Here is the improved Find
:
function Find(x) if x.parent == x return x else x.parent := Find(x.parent) return x.parent
These two techniques complement each other; applied together, the amortized time per operation is only , where is the inverse of the function , and is the extremely quickly-growing Ackermann function. Since is its inverse, it's less than 5 for all remotely practical values of . Thus, the amortized running time per operation is effectively a small constant.
In fact, this can't be optimized any more: Fredman and Saks showed in 1989 that words must be accessed by any disjoint-set data structure per operation on average.[1]
Applications
Disjoint-set data structures model the partitioning of a set, for example to keep track of the components of a graph of an undirected graph. This model can then be used to determine whether two vertices belong to the same component, or whether adding an edge between them would result in a cycle.
This data structure is used by the Boost Graph Library to implement its Incremental Connected Components functionality. It is also used for implementing Kruskal's algorithm to find the minimum spanning tree of a graph.
Note that the implementation as disjoint-set forests doesn't allow deletion of edges — even without path compression or the rank heuristic, this is not as easy, although more complex schemes have been designed[3] that can deal with this type of incremental update.
History
While the ideas used in disjoint-set forests have long been familiar, Robert Tarjan was the first to prove the upper bound (and a restricted version of the lower bound) in terms of the inverse Ackermann function. Until this time the best bound on the time per operation, proven by Hopcroft and Ullman, was O(log* n), the iterated logarithm of n, another slowly-growing function (but not quite as slow as the inverse Ackermann function). Tarjan and van Leeuwen also developed one-pass Find algorithms that are more efficient in practice. The algorithm was made well-known by the popular textbook Introduction to Algorithms.[4]
In 2007, as part of the Workshop on ML, Sylvain Conchon and Jean-Christophe Filliâtre developed a persistent version of the disjoint-set forest data structure, allowing previous versions of the structure to be efficiently retained, and formalized its correctness using the proof assistant Coq.[2]
External links
- Union/Find Algorithm Visualization, An easy-to-follow visualization of the algorithm with examples.
- Compaq Research: Zeus: Animation of Union-Find Algorithms
- Java applet: A Graphical Union-Find Implementation, by Rory L. P. McGuire
- Union-Find Source-Code (C++) with documentation
- The abstract data type Union-Find, a simple C implementation by Vašek Chvátal
- Wait-free Parallel Algorithms for the Union-Find Problem, a 1994 paper by Richard J. Anderson and Heather Woll describing a parallelized version of Union-Find that never needs to block
References
- ^ Bernard A. Galler and Michael J. Fischer. An improved equivalence algorithm. Communications of the ACM, Volume 7, Issue 5 (May 1964), pages 301-303. The paper originating disjoint-set forests. ACM Digital Library
- ^ Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Chapter 21: Data structures for Disjoint Sets, pp.498–524.
- ^ Zvi Galil and Giuseppe F. Italiano. Data structures and algorithms for disjoint set union problems, ACM Computing Surveys, Volume 23, Issue 3 (September 1991), pages 319-344. ACM Digital Library
- J. A. Spirko and A. P. Hickman, Molecular-dynamics simulations of collisions of Ne with La@C82, Phys. Rev. A 57, 3674–3682 (1998).
- ^ M. Fredman and M. Saks. The cell probe complexity of dynamic data structures. Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing, pages 345–354. May 1989. "Theorem 5: Any CPROBE(log n) implementation of the set union problem requires Ω(m α(m, n)) time to execute m Find's and n−1 Union's, beginning with n singleton sets."
- ^ Sylvain Conchon and Jean-Christophe Filliâtre. A Persistent Union-Find Data Structure. In ACM SIGPLAN Workshop on ML, Freiburg, Germany, October 2007.