# Variation of information

In probability theory and information theory, the variation of information or shared information distance is a measure of the distance between two clusterings (partitions of elements). It is closely related to mutual information; indeed, it is a simple linear expression involving the mutual information. Unlike the mutual information, however, the variation of information is a true metric, in that it obeys the triangle inequality.[1] [2] [3]

*Information diagram illustrating the relation between information entropies, mutual information and variation of information.*

## Definition

Suppose we have two partitions ${\displaystyle X}$ and ${\displaystyle Y}$ of a set ${\displaystyle A}$ into disjoint subsets, namely ${\displaystyle X=\{X_{1},X_{2},\ldots ,X_{k}\}}$ and ${\displaystyle Y=\{Y_{1},Y_{2},\ldots ,Y_{l}\}}$. Let ${\displaystyle n=\sum _{i}|X_{i}|=\sum _{j}|Y_{j}|=|A|}$, ${\displaystyle p_{i}=|X_{i}|/n}$, ${\displaystyle q_{j}=|Y_{j}|/n}$, ${\displaystyle r_{ij}=|X_{i}\cap Y_{j}|/n}$. Then the variation of information between the two partitions is:

${\displaystyle \mathrm {VI} (X;Y)=-\sum _{i,j}r_{ij}\left[\log(r_{ij}/p_{i})+\log(r_{ij}/q_{j})\right]}$.

This is equivalent to the shared information distance between the random variables that assign to each element of ${\displaystyle A}$ the index of its containing block in ${\displaystyle X}$ and in ${\displaystyle Y}$, respectively, with respect to the uniform probability measure on ${\displaystyle A}$ defined by ${\displaystyle \mu (B):=|B|/n}$ for ${\displaystyle B\subseteq A}$.
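As a concrete illustration, the definition can be evaluated directly from the block overlaps. The following Python sketch implements the double-sum formula above; the partitions `X` and `Y` are made-up examples, not taken from the text:

```python
from math import log

def variation_of_information(X, Y):
    """VI between two partitions of the same set, each given as a
    list of disjoint sets (blocks) covering the same n elements."""
    n = sum(len(block) for block in X)
    vi = 0.0
    for x in X:
        p = len(x) / n                      # p_i = |X_i| / n
        for y in Y:
            q = len(y) / n                  # q_j = |Y_j| / n
            r = len(x & y) / n              # r_ij = |X_i ∩ Y_j| / n
            if r > 0.0:
                # each overlapping block pair contributes
                # -r_ij * [log(r_ij / p_i) + log(r_ij / q_j)]
                vi -= r * (log(r / p) + log(r / q))
    return vi

# Illustrative partitions of {0, ..., 5}
X = [{0, 1, 2}, {3, 4, 5}]
Y = [{0, 1}, {2, 3}, {4, 5}]
print(variation_of_information(X, Y))   # ≈ 0.8676 (natural log)
```

Note that VI is symmetric and vanishes exactly when the two partitions coincide.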

### Explicit information content

We can rewrite this definition in terms that explicitly highlight the information content of this metric.

The set of all partitions of a set forms a compact lattice where the partial order induces two operations, the meet ${\displaystyle \wedge }$ and the join ${\displaystyle \vee }$. The maximum ${\displaystyle {\overline {\mathrm {1} }}}$ is the partition with only one block, i.e., all elements grouped together, and the minimum is ${\displaystyle {\overline {\mathrm {0} }}}$, the partition consisting of all elements as singletons. The meet of two partitions ${\displaystyle X}$ and ${\displaystyle Y}$ is easy to understand as the partition formed by all nonempty pairwise intersections of a block ${\displaystyle X_{i}}$ of ${\displaystyle X}$ with a block ${\displaystyle Y_{j}}$ of ${\displaystyle Y}$. It then follows that ${\displaystyle X\wedge Y\subseteq X}$ and ${\displaystyle X\wedge Y\subseteq Y}$.

Let's define the entropy of a partition ${\displaystyle X}$ as

${\displaystyle H\left(X\right)\,=\,-\sum _{i}\,p_{i}\log p_{i}}$,

where ${\displaystyle p_{i}=|X_{i}|/n}$. Clearly, ${\displaystyle H({\overline {\mathrm {1} }})=0}$ and ${\displaystyle H({\overline {\mathrm {0} }})=\log \,n}$. The entropy of a partition is a monotone function on the lattice of partitions, in the sense that ${\displaystyle X\subseteq Y\Rightarrow H(X)\geq H(Y)}$, where ${\displaystyle X\subseteq Y}$ means that ${\displaystyle X}$ refines ${\displaystyle Y}$.
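A minimal sketch of this partition entropy, checked on the extreme partitions of a four-element set (the example partitions are illustrative):

```python
from math import log

def partition_entropy(X, n):
    """H(X) = -sum_i p_i log p_i with p_i = |X_i| / n."""
    return -sum((len(b) / n) * log(len(b) / n) for b in X)

n = 4
top = [{0, 1, 2, 3}]            # the maximum: a single block, H = 0
bottom = [{0}, {1}, {2}, {3}]   # the minimum: all singletons, H = log n
mid = [{0, 1}, {2, 3}]          # refines `top`, is refined by `bottom`

print(partition_entropy(top, n))     # 0
print(partition_entropy(mid, n))     # log 2 ≈ 0.693
print(partition_entropy(bottom, n))  # log 4 ≈ 1.386
```

The three values increase as the partition gets finer, matching the monotonicity noted above.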

Then the VI distance between ${\displaystyle X}$ and ${\displaystyle Y}$ is given by

${\displaystyle \mathrm {VI} (X,Y)\,=\,2H(X\wedge Y)\,-\,H(X)\,-\,H(Y)}$.
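This meet-based formulation can be checked numerically. The sketch below builds the meet as the nonempty pairwise block intersections and evaluates 2H(X ∧ Y) − H(X) − H(Y); the partitions are illustrative:

```python
from math import log

def entropy(P, n):
    """Entropy of a partition P of n elements."""
    return -sum((len(b) / n) * log(len(b) / n) for b in P)

def meet(X, Y):
    """X ∧ Y: all nonempty pairwise intersections of blocks."""
    return [x & y for x in X for y in Y if x & y]

# Illustrative partitions of {0, ..., 5}
X = [{0, 1, 2}, {3, 4, 5}]
Y = [{0, 1}, {2, 3}, {4, 5}]
n = 6

vi = 2 * entropy(meet(X, Y), n) - entropy(X, n) - entropy(Y, n)
print(vi)   # ≈ 0.8676, matching the double-sum definition
```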

The difference ${\displaystyle d(X,Y)\equiv |H\left(X\right)-H\left(Y\right)|}$ is a pseudo-metric, as ${\displaystyle d(X,Y)=0}$ does not necessarily imply that ${\displaystyle X=Y}$. From the definition of ${\displaystyle {\overline {\mathrm {1} }}}$, it follows that ${\displaystyle \mathrm {VI} (X,{\overline {\mathrm {1} }})\,=\,H\left(X\right)}$.

If in the Hasse diagram we draw an edge from every partition to the maximum ${\displaystyle {\overline {\mathrm {1} }}}$ and assign it a weight equal to the VI distance between the given partition and ${\displaystyle {\overline {\mathrm {1} }}}$, we can interpret the VI distance as the sum of differences of edge weights to the maximum:

${\displaystyle \mathrm {VI} (X,Y)\,=\,|\mathrm {VI} (X,{\overline {\mathrm {1} }})\,-\,\mathrm {VI} (X\wedge Y,{\overline {\mathrm {1} }})|\,+\,|\mathrm {VI} (Y,{\overline {\mathrm {1} }})\,-\,\mathrm {VI} (X\wedge Y,{\overline {\mathrm {1} }})|\,=\,d(X,X\wedge Y)\,+\,d(Y,X\wedge Y)}$.

For ${\displaystyle H(X)}$ as defined above, it holds that the joint information of two partitions coincides with the entropy of the meet

${\displaystyle H(X,Y)\,=\,H(X\wedge Y)}$

and we also have that ${\displaystyle d(X,X\wedge Y)\,=\,H(X\wedge Y|X)}$ coincides with the conditional entropy of the meet (intersection) ${\displaystyle X\wedge Y}$ relative to ${\displaystyle X}$.
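The decomposition into the two pseudo-metric distances to the meet can likewise be verified. In the sketch below, d(X, X ∧ Y) = H(X ∧ Y) − H(X), since the meet refines X; the example partitions are illustrative:

```python
from math import log

def entropy(P, n):
    """Entropy of a partition P of n elements."""
    return -sum((len(b) / n) * log(len(b) / n) for b in P)

def meet(X, Y):
    """X ∧ Y: all nonempty pairwise intersections of blocks."""
    return [x & y for x in X for y in Y if x & y]

# Illustrative partitions of {0, ..., 5}
X = [{0, 1, 2}, {3, 4, 5}]
Y = [{0, 1}, {2, 3}, {4, 5}]
n = 6
M = meet(X, Y)

# The meet refines both X and Y, so the absolute values drop out
d_X = entropy(M, n) - entropy(X, n)   # d(X, X ∧ Y) = H(X ∧ Y | X)
d_Y = entropy(M, n) - entropy(Y, n)   # d(Y, X ∧ Y) = H(X ∧ Y | Y)
vi = 2 * entropy(M, n) - entropy(X, n) - entropy(Y, n)
assert abs(vi - (d_X + d_Y)) < 1e-12
```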

## Identities

The variation of information satisfies

${\displaystyle \mathrm {VI} (X;Y)=H(X)+H(Y)-2I(X,Y)}$,

where ${\displaystyle H(X)}$ is the entropy of ${\displaystyle X}$, and ${\displaystyle I(X,Y)}$ is the mutual information between ${\displaystyle X}$ and ${\displaystyle Y}$ with respect to the uniform probability measure on ${\displaystyle A}$. This can be rewritten as

${\displaystyle \mathrm {VI} (X;Y)=H(X,Y)-I(X,Y)}$,

where ${\displaystyle H(X,Y)}$ is the joint entropy of ${\displaystyle X}$ and ${\displaystyle Y}$, or

${\displaystyle \mathrm {VI} (X;Y)=H(X|Y)+H(Y|X)}$,

where ${\displaystyle H(X|Y)}$ and ${\displaystyle H(Y|X)}$ are the respective conditional entropies.
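These three identities can be confirmed to agree on a small example by computing the marginal, joint, and conditional quantities from the overlap proportions r_ij; the partitions are illustrative:

```python
from math import log

# Illustrative partitions of {0, ..., 5}
X = [{0, 1, 2}, {3, 4, 5}]
Y = [{0, 1}, {2, 3}, {4, 5}]
n = 6

p = [len(x) / n for x in X]                      # marginals of X
q = [len(y) / n for y in Y]                      # marginals of Y
r = [[len(x & y) / n for y in Y] for x in X]     # joint r_ij

H_X = -sum(pi * log(pi) for pi in p)
H_Y = -sum(qj * log(qj) for qj in q)
H_XY = -sum(rij * log(rij) for row in r for rij in row if rij > 0)
I = sum(r[i][j] * log(r[i][j] / (p[i] * q[j]))
        for i in range(len(X)) for j in range(len(Y)) if r[i][j] > 0)

vi1 = H_X + H_Y - 2 * I            # H(X) + H(Y) - 2 I(X, Y)
vi2 = H_XY - I                     # H(X, Y) - I(X, Y)
vi3 = (H_XY - H_Y) + (H_XY - H_X)  # H(X|Y) + H(Y|X)
print(vi1, vi2, vi3)               # all three agree
```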

The variation of information can also be bounded, either in terms of the number of elements:

${\displaystyle \mathrm {VI} (X;Y)\leq \log(n)}$,

or with respect to a maximum number of clusters, ${\displaystyle K^{*}}$:

${\displaystyle \mathrm {VI} (X;Y)\leq 2\log(K^{*})}$
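Both bounds can be spot-checked on random partitions. The sketch below draws random labelings with at most K* clusters and verifies VI ≤ log n and VI ≤ 2 log K*; all names and parameters are illustrative:

```python
import random
from math import log

def vi(X, Y):
    """Direct double-sum definition of the variation of information."""
    n = sum(len(b) for b in X)
    total = 0.0
    for x in X:
        for y in Y:
            r = len(x & y) / n
            if r > 0.0:
                # r/p_i = r*n/|X_i| and r/q_j = r*n/|Y_j|
                total -= r * (log(r * n / len(x)) + log(r * n / len(y)))
    return total

def random_partition(n, K, rng):
    """A random partition of {0, ..., n-1} into at most K blocks."""
    labels = [rng.randrange(K) for _ in range(n)]
    blocks = [{i for i, l in enumerate(labels) if l == k} for k in range(K)]
    return [b for b in blocks if b]   # drop empty blocks

rng = random.Random(0)
n, K = 12, 3
for _ in range(100):
    X = random_partition(n, K, rng)
    Y = random_partition(n, K, rng)
    v = vi(X, Y)
    assert v <= log(n) + 1e-9        # VI ≤ log n
    assert v <= 2 * log(K) + 1e-9    # VI ≤ 2 log K*
```

The second bound is the tighter one whenever the number of clusters is small relative to the number of elements.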

## References

1. ^ Arabie, P.; Boorman, S. A., "Multidimensional scaling of measures of distance between partitions", Journal of Mathematical Psychology (1973), vol. 10, no. 2, pp. 148–203, doi:10.1016/0022-2496(73)90012-6
2. ^ Zurek, W. H., Nature, vol. 341, p. 119 (1989); Zurek, W. H., Physical Review A, vol. 40, p. 4731 (1989)
3. ^ Marina Meila, "Comparing Clusterings by the Variation of Information", Learning Theory and Kernel Machines (2003), Lecture Notes in Computer Science, vol. 2777, pp. 173–187, doi:10.1007/978-3-540-45167-9_14, ISBN 978-3-540-40720-1