Rand index
The Rand index or Rand measure (named after William M. Rand) in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings. The adjusted-for-chance form of the Rand index is the adjusted Rand index.
Contents |
[edit] Rand index
[edit] Definition
Given a set of
elements
and two partitions of
to compare,
and
, the following is defined:
, the number of pairs of elements in
that are in the same set in
and in the same set in 
, the number of pairs of elements in
that are in different sets in
and in different sets in 
, the number of pairs of elements in
that are in the same set in
and in different sets in 
, the number of pairs of elements in
that are in different sets in
and in the same set in 
The Rand index,
, is:
Intuitively,
can be considered as the number of agreements between
and
and
as the number of disagreements between
and
.
[edit] Properties
The Rand index has a value between 0 and 1, with 0 indicating that the two data clusters do not agree on any pair of points and 1 indicating that the data clusters are exactly the same.
In mathematical terms, a, b, c, d are defined as follows:
, where 
, where 
, where 
, where 
for some
.
[edit] Adjusted Rand index
The adjusted Rand index is the corrected-for-chance version of the Rand index.
[edit] The contingency table
Given a set
of
data points and two groupings (e.g. clusterings) of these points, namely
and
, the overlappings between
and
can be summarized in a contingency table where
denotes the number of common objects of groups
and
:
.
| U\V | ![]() |
![]() |
![]() |
![]() |
Sums |
|---|---|---|---|---|---|
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
| Sums | ![]() |
![]() |
![]() |
![]() |
[edit] Definition
The adjusted form of the Rand Index, the Adjusted Rand Index, is
, more specifically
![ARI = \frac{ \sum_{ij} \binom{n_{ij}}{2} - [\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}] / \binom{N}{2} }{ \frac{1}{2} [\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}] - [\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}] / \binom{N}{2} }](http://upload.wikimedia.org/wikipedia/en/math/b/4/3/b43907f64617d77e95b8ec2eea36a935.png)
where
are values from the contingency table.
[edit] Properties
The maximum value of the Adjusted Rand Index is 1.
|
|
This article includes a list of references, related reading or external links, but its sources remain unclear because it lacks inline citations. Please improve this article by introducing more precise citations. (January 2011) |
[edit] References
- W. M. Rand (1971). "Objective criteria for the evaluation of clustering methods". Journal of the American Statistical Association (American Statistical Association) 66 (336): 846–850. doi:10.2307/2284239. JSTOR 2284239.
- Lawrence Hubert and Phipps Arabie (1985). "Comparing partitions". Journal of Classification 2 (1): 193–218. doi:10.1007/BF01908075.
- Nguyen Xuan Vinh, Julien Epps and James Bailey (2009). "Information Theoretic Measures for Clustering Comparison: Is a Correction for Chance Necessary?". ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning. ACM. pp. 1073–1080.PDF.
, the number of pairs of elements in
, the number of pairs of elements in
, the number of pairs of elements in
, the number of pairs of elements in 
, where 
, where 
, where 
, where 























