Interaction information

The interaction information (McGill 1954), or amounts of information (Hu Kuo Ting, 1962) or co-information (Bell 2003), is one of several generalizations of the mutual information.

Interaction information expresses the amount of information (redundancy or synergy) bound up in a set of variables, beyond that which is present in any subset of those variables. Unlike the mutual information, the interaction information can be either positive or negative. This confusing property has likely slowed its wider adoption as an information measure in machine learning and cognitive science. These functions, their negativity and their minima have a direct interpretation in algebraic topology (Baudot & Bennequin, 2015).

The three-variable case

For three variables ${\displaystyle \{X,Y,Z\}}$, the interaction information ${\displaystyle I(X;Y;Z)}$ is given by

${\displaystyle {\begin{matrix}I(X;Y;Z)&=&I(X;Y)-I(X;Y|Z)\\\ &=&I(X;Z)-I(X;Z|Y)\\\ &=&I(Y;Z)-I(Y;Z|X)\end{matrix}}}$

where, for example, ${\displaystyle I(X;Y)}$ is the mutual information between variables ${\displaystyle X}$ and ${\displaystyle Y}$, and ${\displaystyle I(X;Y|Z)}$ is the conditional mutual information between variables ${\displaystyle X}$ and ${\displaystyle Y}$ given ${\displaystyle Z}$. Formally,

${\displaystyle {\begin{aligned}I(X;Y|Z)&=H(X|Z)+H(Y|Z)-H(X,Y|Z)\\\ &=H(X|Z)-H(X|Y,Z)\\\ &=H(X,Z)+H(Y,Z)-H(Z)-H(X,Y,Z),\end{aligned}}}$

and

${\displaystyle {\begin{aligned}I(X;Y)&=H(X)+H(Y)-H(X,Y).\end{aligned}}}$

It thus follows that

${\displaystyle {\begin{alignedat}{3}I(X;Y;Z)=&\quad &&[H(X)+H(Y)+H(Z)]\\&-&&[H(X,Y)+H(X,Z)+H(Y,Z)]\\&+&&H(X,Y,Z)\end{alignedat}}}$

For the three-variable case, the interaction information ${\displaystyle I(X;Y;Z)}$ is the difference between the information shared by ${\displaystyle \{Y,X\}}$ when ${\displaystyle Z}$ has not been fixed and when ${\displaystyle Z}$ has been fixed. (See also Fano's 1961 textbook.) Interaction information measures the influence of a variable ${\displaystyle Z}$ on the amount of information shared between ${\displaystyle \{Y,X\}}$. The term ${\displaystyle I(X;Y|Z)}$ can be larger than ${\displaystyle I(X;Y)}$, for example when both ${\displaystyle X}$ and ${\displaystyle Y}$ have a joint effect on ${\displaystyle Z}$ but are independent of each other without knowing ${\displaystyle Z}$, so the interaction information can be negative as well as positive. Positive interaction information indicates that variable ${\displaystyle Z}$ inhibits (i.e., accounts for or explains some of) the correlation between ${\displaystyle \{Y,X\}}$, whereas negative interaction information indicates that variable ${\displaystyle Z}$ facilitates or enhances the correlation between ${\displaystyle \{Y,X\}}$.
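
As a minimal sketch of the definitions above, the following Python computes ${\displaystyle I(X;Y;Z)}$ for a discrete joint distribution in two ways: from the entropy expansion, and as ${\displaystyle I(X;Y)-I(X;Y|Z)}$. The function names and the example probabilities are illustrative assumptions and do not come from the cited sources.

```python
from collections import defaultdict
from math import log2, isclose

def entropy(joint, axes):
    """Entropy in bits of the marginal of `joint` over coordinate indices `axes`."""
    marginal = defaultdict(float)
    for outcome, p in joint.items():
        marginal[tuple(outcome[i] for i in axes)] += p
    return -sum(p * log2(p) for p in marginal.values() if p > 0)

def interaction_information_3(joint):
    """I(X;Y;Z) from the single, pairwise and triple entropies."""
    H = lambda *axes: entropy(joint, axes)
    return (H(0) + H(1) + H(2)) - (H(0, 1) + H(0, 2) + H(1, 2)) + H(0, 1, 2)

def mutual_information_difference(joint):
    """I(X;Y) - I(X;Y|Z), with both terms written out in entropies."""
    H = lambda *axes: entropy(joint, axes)
    i_xy = H(0) + H(1) - H(0, 1)
    i_xy_given_z = H(0, 2) + H(1, 2) - H(2) - H(0, 1, 2)
    return i_xy - i_xy_given_z

# Hypothetical joint distribution p(x, y, z) over three binary variables.
outcomes = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
probs = [0.10, 0.05, 0.15, 0.20, 0.05, 0.25, 0.10, 0.10]
joint = dict(zip(outcomes, probs))

ii_entropy = interaction_information_3(joint)
ii_mi = mutual_information_difference(joint)
print(ii_entropy, ii_mi)  # the two values agree up to rounding
```

Representing the joint distribution as a dictionary from outcome tuples to probabilities keeps the sketch dependency-free; an array-based version would be preferable for large alphabets.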

Interaction information is bounded. In the three-variable case, it is bounded by (Yeung 1991)

${\displaystyle -\min\{I(X;Y|Z),I(Y;Z|X),I(X;Z|Y)\}\leq I(X;Y;Z)\leq \min\{I(X;Y),I(Y;Z),I(X;Z)\}}$
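
The bounds follow from the non-negativity of mutual information and conditional mutual information applied to each of the three equivalent forms of ${\displaystyle I(X;Y;Z)}$. The sketch below spot-checks them on randomly generated binary distributions; the helper names, the fixed seed and the numerical tolerance are assumptions made for illustration.

```python
import random
from collections import defaultdict
from math import log2

def entropy(joint, axes):
    """Entropy in bits of the marginal of `joint` over coordinate indices `axes`."""
    marginal = defaultdict(float)
    for outcome, p in joint.items():
        marginal[tuple(outcome[i] for i in axes)] += p
    return -sum(p * log2(p) for p in marginal.values() if p > 0)

def mi(joint, a, b, given=()):
    """I(A;B|given) in bits; with `given` empty this is plain mutual information."""
    g = tuple(given)
    return (entropy(joint, (a,) + g) + entropy(joint, (b,) + g)
            - entropy(joint, g) - entropy(joint, (a, b) + g))

def random_joint(rng):
    """A random joint distribution over three binary variables."""
    outcomes = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
    weights = [rng.random() for _ in outcomes]
    total = sum(weights)
    return {o: w / total for o, w in zip(outcomes, weights)}

rng = random.Random(0)   # fixed seed so the check is repeatable
tol = 1e-9               # slack for floating-point rounding
bounds_hold = True
for _ in range(1000):
    joint = random_joint(rng)
    ii = mi(joint, 0, 1) - mi(joint, 0, 1, (2,))
    lower = -min(mi(joint, 0, 1, (2,)), mi(joint, 1, 2, (0,)), mi(joint, 0, 2, (1,)))
    upper = min(mi(joint, 0, 1), mi(joint, 1, 2), mi(joint, 0, 2))
    bounds_hold = bounds_hold and (lower - tol <= ii <= upper + tol)
print(bounds_hold)
```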

Example of positive interaction information

Positive interaction information seems much more natural than negative interaction information in the sense that such explanatory effects are typical of common-cause structures. For example, clouds cause rain and also block the sun; therefore, the correlation between rain and darkness is partly accounted for by the presence of clouds, ${\displaystyle I(rain;dark|cloud)<I(rain;dark)}$. The result is positive interaction information ${\displaystyle I(rain;dark;cloud)}$.

Example of negative interaction information

The case of negative interaction information seems a bit less natural. A prototypical example of negative ${\displaystyle I(X;Y;Z)}$ has ${\displaystyle X}$ as the output of an XOR gate to which ${\displaystyle Y}$ and ${\displaystyle Z}$ are independent, uniformly random binary inputs. In this case ${\displaystyle I(Y;Z)}$ will be zero, but ${\displaystyle I(Y;Z|X)}$ will be positive (1 bit), since once the output ${\displaystyle X}$ is known, the value on input ${\displaystyle Y}$ completely determines the value on input ${\displaystyle Z}$. Since ${\displaystyle I(Y;Z|X)>I(Y;Z)}$, the result is negative interaction information, ${\displaystyle I(X;Y;Z)=I(Y;Z)-I(Y;Z|X)=-1}$ bit. It may seem that this example relies on a peculiar ordering of ${\displaystyle X,Y,Z}$ to obtain the negative interaction, but the symmetry of the definition for ${\displaystyle I(X;Y;Z)}$ indicates that the same negative interaction information results regardless of which variable we consider as the interloper or conditioning variable. For example, input ${\displaystyle Y}$ and output ${\displaystyle X}$ are also independent until input ${\displaystyle Z}$ is fixed, at which time they are totally dependent, and we have the same negative interaction information as before, ${\displaystyle I(X;Y;Z)=I(X;Y)-I(X;Y|Z)=0-1=-1}$ bit.
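
The XOR example can be checked numerically. The following sketch assumes fair, independent input bits and uses illustrative helper names; it evaluates the interaction information with each of the three variables in turn as the conditioning variable.

```python
from collections import defaultdict
from math import log2

def entropy(joint, axes):
    """Entropy in bits of the marginal of `joint` over coordinate indices `axes`."""
    marginal = defaultdict(float)
    for outcome, p in joint.items():
        marginal[tuple(outcome[i] for i in axes)] += p
    return -sum(p * log2(p) for p in marginal.values() if p > 0)

def mi(joint, a, b, given=()):
    """I(A;B|given) in bits; with `given` empty this is plain mutual information."""
    g = tuple(given)
    return (entropy(joint, (a,) + g) + entropy(joint, (b,) + g)
            - entropy(joint, g) - entropy(joint, (a, b) + g))

# Outcomes are ordered (x, y, z) with x = y XOR z and y, z fair independent bits.
X, Y, Z = 0, 1, 2
xor_joint = {(y ^ z, y, z): 0.25 for y in (0, 1) for z in (0, 1)}

i_yz = mi(xor_joint, Y, Z)                  # inputs are independent
i_yz_given_x = mi(xor_joint, Y, Z, (X,))    # knowing the output couples them
ii_condition_on_x = i_yz - i_yz_given_x
ii_condition_on_z = mi(xor_joint, X, Y) - mi(xor_joint, X, Y, (Z,))
ii_condition_on_y = mi(xor_joint, X, Z) - mi(xor_joint, X, Z, (Y,))
print(i_yz, i_yz_given_x)
print(ii_condition_on_x, ii_condition_on_y, ii_condition_on_z)  # all equal -1 bit
```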

This situation is an instance where fixing the common effect ${\displaystyle X}$ of causes ${\displaystyle Y}$ and ${\displaystyle Z}$ induces a dependency among the causes that did not formerly exist. This behavior is colloquially referred to as explaining away and is thoroughly discussed in the Bayesian Network literature (e.g., Pearl 1988). Pearl's example is auto diagnostics: A car's engine can fail to start ${\displaystyle (X)}$ due either to a dead battery ${\displaystyle (Y)}$ or due to a blocked fuel pump ${\displaystyle (Z)}$. Ordinarily, we assume that battery death and fuel pump blockage are independent events, because of the essential modularity of such automotive systems. Thus, in the absence of other information, knowing whether or not the battery is dead gives us no information about whether or not the fuel pump is blocked. However, if we happen to know that the car fails to start (i.e., we fix common effect ${\displaystyle X}$), this information induces a dependency between the two causes battery death and fuel blockage. Thus, knowing that the car fails to start, if an inspection shows the battery to be in good health, we can conclude that the fuel pump must be blocked.

Battery death and fuel blockage are thus dependent, conditional on their common effect car starting. What the foregoing discussion indicates is that the obvious directionality in the common-effect graph belies a deep informational symmetry: If conditioning on a common effect increases the dependency between its two parent causes, then conditioning on one of the causes must create the same increase in dependency between the second cause and the common effect. In Pearl's automotive example, if conditioning on car starts induces ${\displaystyle |I(X;Y;Z)|}$ bits of dependency between the two causes battery dead and fuel blocked, then conditioning on fuel blocked must induce ${\displaystyle |I(X;Y;Z)|}$ bits of dependency between battery dead and car starts. (The interaction information itself is negative here, so the induced dependency is its magnitude.) This may seem odd because battery dead and car starts are already governed by the implication battery dead ${\displaystyle \rightarrow }$ car doesn't start. However, these variables are still not totally correlated because the converse is not true. Conditioning on fuel blocked removes the major alternative cause of failure to start, and strengthens the converse relation and therefore the association between battery dead and car starts. A paper by Tsujishita (1995) focuses in greater depth on the third-order mutual information.

Positivity for Markov chains

If three variables form a Markov chain ${\displaystyle X\to Y\to Z}$, then ${\displaystyle I(X;Z|Y)=0}$, while ${\displaystyle I(X;Z)\geq 0}$. Hence we conclude that

${\displaystyle I(X;Y;Z)=I(X;Z)-I(X;Z|Y)=I(X;Z)\geq 0.}$
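
A small numerical illustration, under assumed parameters: take ${\displaystyle X}$ to be a fair bit and let each arrow of the chain be a binary symmetric channel. The flip probabilities (0.1 and 0.2) and the helper names below are hypothetical choices for this sketch.

```python
from collections import defaultdict
from math import log2

def entropy(joint, axes):
    """Entropy in bits of the marginal of `joint` over coordinate indices `axes`."""
    marginal = defaultdict(float)
    for outcome, p in joint.items():
        marginal[tuple(outcome[i] for i in axes)] += p
    return -sum(p * log2(p) for p in marginal.values() if p > 0)

def mi(joint, a, b, given=()):
    """I(A;B|given) in bits; with `given` empty this is plain mutual information."""
    g = tuple(given)
    return (entropy(joint, (a,) + g) + entropy(joint, (b,) + g)
            - entropy(joint, g) - entropy(joint, (a, b) + g))

def flip(bit, p_flip):
    """Binary symmetric channel: output bit -> probability, given the input bit."""
    return {bit: 1 - p_flip, 1 - bit: p_flip}

# Assumed chain X -> Y -> Z; outcomes are ordered (x, y, z).
P_FLIP_XY, P_FLIP_YZ = 0.1, 0.2
joint = {}
for x in (0, 1):
    for y, p_y in flip(x, P_FLIP_XY).items():
        for z, p_z in flip(y, P_FLIP_YZ).items():
            joint[(x, y, z)] = 0.5 * p_y * p_z

i_xz = mi(joint, 0, 2)
i_xz_given_y = mi(joint, 0, 2, (1,))   # vanishes (up to rounding) for a Markov chain
ii = i_xz - i_xz_given_y
print(i_xz_given_y, i_xz, ii)
```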

The four-variable case

One can recursively define the n-dimensional interaction information in terms of the ${\displaystyle (n-1)}$-dimensional interaction information. For example, the four-dimensional interaction information can be defined as

${\displaystyle {\begin{aligned}I(W;X;Y;Z)&=-I(X;Y;Z|W)+I(X;Y;Z)\\\ &=I(X;Y|Z,W)-I(X;Y|W)-I(X;Y|Z)+I(X;Y)\end{aligned}}}$

or, equivalently,

${\displaystyle {\begin{aligned}I(W;X;Y;Z)=&\ H(W)+H(X)+H(Y)+H(Z)\\\ &-H(W,X)-H(W,Y)-H(W,Z)-H(X,Y)-H(X,Z)-H(Y,Z)\\\ &+H(W,X,Y)+H(W,X,Z)+H(W,Y,Z)+H(X,Y,Z)-H(W,X,Y,Z)\end{aligned}}}$

The n-variable case

It is possible to extend all of these results to an arbitrary number of dimensions. The general expression for interaction information on variable set ${\displaystyle {\mathcal {V}}=\{X_{1},X_{2},\ldots ,X_{n}\}}$ in terms of the marginal entropies is given by Hu Kuo Ting (1962) and Jakulin & Bratko (2003):

${\displaystyle I({\mathcal {V}})\equiv \sum _{{\mathcal {T}}\subseteq {\mathcal {V}}}(-1)^{\left\vert {\mathcal {T}}\right\vert -1}H({\mathcal {T}})}$

which is an alternating (inclusion-exclusion) sum over all subsets ${\displaystyle {\mathcal {T}}\subseteq {\mathcal {V}}}$, where ${\displaystyle \left\vert {\mathcal {V}}\right\vert =n}$. Note that this is the information-theoretic analog to the Kirkwood approximation.
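
A minimal sketch of the alternating sum for any number of discrete variables follows. The sign of each term is taken as ${\displaystyle (-1)^{\left\vert {\mathcal {T}}\right\vert -1}}$, chosen so that the result reproduces the three- and four-variable entropy expansions given earlier; the empty subset contributes zero entropy and is skipped. Function names and the test distributions are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import log2

def entropy(joint, axes):
    """Entropy in bits of the marginal of `joint` over coordinate indices `axes`."""
    marginal = defaultdict(float)
    for outcome, p in joint.items():
        marginal[tuple(outcome[i] for i in axes)] += p
    return -sum(p * log2(p) for p in marginal.values() if p > 0)

def interaction_information(joint, n):
    """Alternating sum of H(T) over all non-empty subsets T of the n variables,
    with sign (-1)**(len(T) - 1)."""
    total = 0.0
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            total += (-1) ** (k - 1) * entropy(joint, subset)
    return total

# Two identical fair bits: the sum reduces to the mutual information.
copy2 = {(0, 0): 0.5, (1, 1): 0.5}
# XOR of two fair bits, outcomes ordered (x, y, z) with x = y XOR z.
xor3 = {(y ^ z, y, z): 0.25 for y in (0, 1) for z in (0, 1)}
# Parity of three fair bits, outcomes ordered (w, x, y, z) with w = x XOR y XOR z.
parity4 = {(x ^ y ^ z, x, y, z): 0.125
           for x in (0, 1) for y in (0, 1) for z in (0, 1)}

ii2 = interaction_information(copy2, 2)
ii3 = interaction_information(xor3, 3)
ii4 = interaction_information(parity4, 4)
print(ii2, ii3, ii4)
```

The loop visits ${\displaystyle 2^{n}-1}$ subsets, so this direct evaluation is only practical for a small number of variables.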

Difficulties interpreting interaction information

The possible negativity of interaction information can be the source of some confusion (Bell 2003). As an example of this confusion, consider a set of eight independent binary variables ${\displaystyle \{X_{1},X_{2},X_{3},X_{4},X_{5},X_{6},X_{7},X_{8}\}}$. Agglomerate these variables as follows:

${\displaystyle {\begin{matrix}Y_{1}&=&\{X_{1},X_{2},X_{3},X_{4},X_{5},X_{6},X_{7}\}\\Y_{2}&=&\{X_{4},X_{5},X_{6},X_{7}\}\\Y_{3}&=&\{X_{5},X_{6},X_{7},X_{8}\}\end{matrix}}}$

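The figures in this example follow the sign convention opposite to the three-variable definition above, that is ${\displaystyle I(X;Y;Z)=I(X;Y|Z)-I(X;Y)}$, with each higher order defined as the conditional term minus the unconditional term, so that redundancy among three variables is negative. Because the ${\displaystyle Y_{i}}$'s overlap each other (are redundant) on the three binary variables ${\displaystyle \{X_{5},X_{6},X_{7}\}}$, we would expect the interaction information ${\displaystyle I(Y_{1};Y_{2};Y_{3})}$ to equal ${\displaystyle -3}$ bits, which it does. However, consider now the agglomerated variables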

${\displaystyle {\begin{matrix}Y_{1}&=&\{X_{1},X_{2},X_{3},X_{4},X_{5},X_{6},X_{7}\}\\Y_{2}&=&\{X_{4},X_{5},X_{6},X_{7}\}\\Y_{3}&=&\{X_{5},X_{6},X_{7},X_{8}\}\\Y_{4}&=&\{X_{7},X_{8}\}\end{matrix}}}$

These are the same variables as before with the addition of ${\displaystyle Y_{4}=\{X_{7},X_{8}\}}$. Because the ${\displaystyle Y_{i}}$'s now overlap each other (are redundant) on only one binary variable ${\displaystyle \{X_{7}\}}$, we would expect the interaction information ${\displaystyle I(Y_{1};Y_{2};Y_{3};Y_{4})}$ to equal ${\displaystyle -1}$ bit. However, ${\displaystyle I(Y_{1};Y_{2};Y_{3};Y_{4})}$ in this case is actually equal to ${\displaystyle +1}$ bit, indicating a synergy rather than a redundancy. This is correct in the sense that

${\displaystyle {\begin{matrix}I(Y_{1};Y_{2};Y_{3};Y_{4})&=&I(Y_{1};Y_{2};Y_{3}|Y_{4})-I(Y_{1};Y_{2};Y_{3})\\\ &=&-2+3\\\ &=&1\end{matrix}}}$

but it remains difficult to interpret.
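
Because the ${\displaystyle X_{i}}$ are independent and binary (taken here to be fair, so that each carries one bit), the joint entropy of any collection of the ${\displaystyle Y_{i}}$ is simply the number of distinct ${\displaystyle X_{i}}$ they contain, and the example can be evaluated by counting. The sketch below is an illustration with assumed function names. It prints the alternating sum with sign ${\displaystyle (-1)^{\left\vert {\mathcal {T}}\right\vert -1}}$ (the convention of the three-variable entropy expansion earlier in the article) and the same value multiplied by ${\displaystyle (-1)^{n}}$, which is the convention under which the three-variable value is ${\displaystyle -3}$ bits.

```python
from itertools import combinations

def union_entropy(groups):
    """Joint entropy in bits of groups of independent fair bits:
    the number of distinct underlying bits."""
    return len(set().union(*groups))

def alternating_sum(groups):
    """Sum over non-empty sub-collections T of (-1)**(len(T) - 1) * H(T)."""
    total = 0
    for k in range(1, len(groups) + 1):
        for subset in combinations(groups, k):
            total += (-1) ** (k - 1) * union_entropy(subset)
    return total

# Each Y_i is represented by the set of indices of the X_i it contains.
Y1 = {1, 2, 3, 4, 5, 6, 7}
Y2 = {4, 5, 6, 7}
Y3 = {5, 6, 7, 8}
Y4 = {7, 8}

for groups in ([Y1, Y2, Y3], [Y1, Y2, Y3, Y4]):
    value = alternating_sum(groups)
    flipped = (-1) ** len(groups) * value
    print(len(groups), value, flipped)
# prints "3 3 -3" then "4 1 1"
```

In the first convention both values equal the size of the common overlap (3 bits, then 1 bit); the sign change between the three- and four-variable cases appears only in the second convention, where odd orders are negated.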

Uses

• Jakulin and Bratko (2003b) provide a machine learning algorithm which uses interaction information.
• Killian, Kravitz and Gilson (2007) use mutual information expansion to extract entropy estimates from molecular simulations.
• LeVine and Weinstein (2014) use interaction information and other N-body information measures to quantify allosteric couplings in molecular simulations.
• Moore et al. (2006), Chanda et al. (2007) and Chanda et al. (2008) demonstrate the use of interaction information for analyzing gene-gene and gene-environmental interactions associated with complex diseases.
• Pandey and Sarkar (2017) use interaction information in cosmology to study the influence of large-scale environments on galaxy properties.

References

• Baudot, P.; Bennequin, D. (2015). "The homological nature of entropy". Entropy. 17 (5): 1–66. Bibcode:2015Entrp..17.3253B. doi:10.3390/e17053253.
• Bell, A J (2003), The co-information lattice.
• Fano, R M (1961), Transmission of Information: A Statistical Theory of Communications, MIT Press, Cambridge, MA.
• Garner W R (1962). Uncertainty and Structure as Psychological Concepts, John Wiley & Sons, New York.
• Han, T S (1978). "Nonnegative entropy measures of multivariate symmetric correlations". Information and Control. 36 (2): 133–156. doi:10.1016/s0019-9958(78)90275-9.
• Han, T S (1980). "Multiple mutual information and multiple interactions in frequency data". Information and Control. 46: 26–45. doi:10.1016/s0019-9958(80)90478-7.
• Hu Kuo Ting (1962), On the Amount of Information. Theory Probab. Appl., 7(4), 439-44.
• Jakulin A & Bratko I (2003a). Analyzing Attribute Dependencies, in N Lavrač, D Gamberger, L Todorovski & H Blockeel, eds, Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, Springer, Cavtat-Dubrovnik, Croatia, pp. 229–240.
• Jakulin A & Bratko I (2003b). Quantifying and visualizing attribute interactions.
• Margolin, A; Wang, K; Califano, A; Nemenman, I (2010). "Multivariate dependence and genetic networks inference". IET Syst Biol. 4 (6): 428–440. arXiv:1001.1681. doi:10.1049/iet-syb.2010.0009. PMID 21073241.
• McGill, W J (1954). "Multivariate information transmission". Psychometrika. 19 (2): 97–116. doi:10.1007/bf02289159.
• Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC (2006). A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility, Journal of Theoretical Biology 241, 252-261.
• Nemenman I (2004). Information theory, multivariate dependence, and genetic network inference.
• Pearl, J (1988), Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo, CA.
• Tsujishita, T (1995), On triple mutual information, Advances in applied mathematics 16, 269-274.
• Chanda, P; Zhang, A; Brazeau, D; Sucheston, L; Freudenheim, JL; Ambrosone, C; Ramanathan, M (2007). "Information-theoretic metrics for visualizing gene-environment interactions". American Journal of Human Genetics. 81 (5): 939–63. doi:10.1086/521878. PMC 2265645. PMID 17924337.
• Chanda, P; Sucheston, L; Zhang, A; Brazeau, D; Freudenheim, JL; Ambrosone, C; Ramanathan, M (2008). "AMBIENCE: a novel approach and efficient algorithm for identifying informative genetic and environmental associations with complex phenotypes". Genetics. 180 (2): 1191–210. doi:10.1534/genetics.108.088542. PMC 2567367. PMID 18780753.
• Killian, B J; Kravitz, J Y; Gilson, M K (2007). "Extraction of configurational entropy from molecular simulations via an expansion approximation". J. Chem. Phys. 127 (2): 024107. Bibcode:2007JChPh.127b4107K. doi:10.1063/1.2746329. PMC 2707031. PMID 17640119.
• LeVine MV, Weinstein H (2014), NbIT - A New Information Theory-Based Analysis of Allosteric Mechanisms Reveals Residues that Underlie Function in the Leucine Transporter LeuT. PLoS Computational Biology.
• Pandey, Biswajit; Sarkar, Suman (2017). "How much a galaxy knows about its large-scale environment?: An information theoretic perspective". Monthly Notices of the Royal Astronomical Society Letters. 467 (1): L6. arXiv:1611.00283. Bibcode:2017MNRAS.467L...6P. doi:10.1093/mnrasl/slw250.
• https://www3.nd.edu/~jnl/ee80653/Fall2005/tutorials/sunil.pdf
• Yeung R W (1991). A new outlook on Shannon's information measures. IEEE Transactions on Information Theory.