Rand index

From Wikipedia, the free encyclopedia
Jump to: navigation, search

The Rand index or Rand measure (named after William M. Rand) in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings. The adjusted-for-chance form of the Rand index is the adjusted Rand index.

Contents

[edit] Rand index

[edit] Definition

Given a set of n elements S = \{O_1, \ldots, O_n\} and two partitions of S to compare, X = \{x_1, \ldots, x_r\} and Y = \{y_1, \ldots, y_s\}, the following is defined:

  • a, the number of pairs of elements in S that are in the same set in X and in the same set in Y
  • b, the number of pairs of elements in S that are in different sets in X and in different sets in Y
  • c, the number of pairs of elements in S that are in the same set in X and in different sets in Y
  • d, the number of pairs of elements in S that are in different sets in X and in the same set in Y

The Rand index, R, is:

 R = \frac{a+b}{a+b+c+d} = \frac{a+b}{{n \choose 2 }}

Intuitively, a + b can be considered as the number of agreements between X and Y and c + d as the number of disagreements between X and Y.

[edit] Properties

The Rand index has a value between 0 and 1, with 0 indicating that the two data clusters do not agree on any pair of points and 1 indicating that the data clusters are exactly the same.

In mathematical terms, a, b, c, d are defined as follows:

  • a = |S^{*}|, where S^{*} = \{ (o_{i}, o_{j}) | o_{i}, o_{j} \in X_{k}, o_{i}, o_{j} \in Y_{l}\}
  • b = |S^{*}|, where S^{*} = \{ (o_{i}, o_{j}) | o_{i} \in X_{k_{1}}, o_{j} \in X_{k_{2}}, o_{i} \in Y_{l_{1}}, o_{j} \in Y_{l_{2}}\}
  • c = |S^{*}|, where S^{*} = \{ (o_{i}, o_{j}) | o_{i}, o_{j} \in X_{k}, o_{i} \in Y_{l_{1}}, o_{j} \in Y_{l_{2}}\}
  • d = |S^{*}|, where S^{*} = \{ (o_{i}, o_{j}) | o_{i} \in X_{k_{1}}, o_{j} \in X_{k_{2}}, o_{i}, o_{j} \in Y_{l}\}

for some 1 \leq i,j \leq n, i \neq j, 1 \leq k, k_{1}, k_{2} \leq r, k_{1} \neq k_{2}, 1 \leq l, l_{1},l_{2} \leq s, l_{1} \neq l_{2}.

[edit] Adjusted Rand index

The adjusted Rand index is the corrected-for-chance version of the Rand index.

[edit] The contingency table

Given a set S of N data points and two groupings (e.g. clusterings) of these points, namely U = \{ U_1, U_2, \ldots ,U_R \} and V = \{ V_1, V_2, \ldots , V_C \}, the overlappings between U and V can be summarized in a contingency table where n_{ij} denotes the number of common objects of groups U_i and V_j : n_{ij}=|U_i \cap V_j|.

U\V V_1 V_2 \ldots V_C Sums
U_1 n_{11} n_{12} \ldots n_{1C} a_1
U_2 n_{21} n_{22} \ldots n_{2C} a_2
\vdots \vdots \vdots \ddots \vdots \vdots
U_R n_{R1} n_{R2} \ldots n_{RC} a_R
Sums b_1 b_2 \ldots b_C

[edit] Definition

The adjusted form of the Rand Index, the Adjusted Rand Index, is AdjustedIndex = \frac{Index - ExpectedIndex}{MaxIndex - ExpectedIndex}, more specifically
ARI = \frac{ \sum_{ij} \binom{n_{ij}}{2} - [\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}] / \binom{N}{2} }{ \frac{1}{2} [\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}] - [\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}] / \binom{N}{2} }
where n_{ij}, a_i, b_j are values from the contingency table.

[edit] Properties

The maximum value of the Adjusted Rand Index is 1.

[edit] References

  • Lawrence Hubert and Phipps Arabie (1985). "Comparing partitions". Journal of Classification 2 (1): 193–218. doi:10.1007/BF01908075. 
  • Nguyen Xuan Vinh, Julien Epps and James Bailey (2009). "Information Theoretic Measures for Clustering Comparison: Is a Correction for Chance Necessary?". ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning. ACM. pp. 1073–1080. PDF.
Personal tools
Namespaces

Variants
Actions
Navigation
Interaction
Toolbox
Print/export
Languages