# Jensen–Shannon divergence

In probability theory and statistics, the Jensen–Shannon divergence is a method of measuring the similarity between two probability distributions. It is also known as the information radius (IRad)[1] or total divergence to the average.[2] It is based on the Kullback–Leibler divergence, with some notable (and useful) differences, including that it is symmetric and always takes a finite value. The square root of the Jensen–Shannon divergence is a metric often referred to as the Jensen–Shannon distance.[3][4][5]

## Definition

Consider the set ${\displaystyle M_{+}^{1}(A)}$ of probability distributions where A is a set provided with some σ-algebra of measurable subsets. In particular we can take A to be a finite or countable set with all subsets being measurable.

The Jensen–Shannon divergence (JSD) ${\displaystyle M_{+}^{1}(A)\times M_{+}^{1}(A)\rightarrow [0,\infty {})}$ is a symmetrized and smoothed version of the Kullback–Leibler divergence ${\displaystyle D(P\parallel Q)}$. It is defined by

${\displaystyle {\rm {JSD}}(P\parallel Q)={\frac {1}{2}}D(P\parallel M)+{\frac {1}{2}}D(Q\parallel M)}$

where ${\displaystyle M={\frac {1}{2}}(P+Q)}$ is the mixture of ${\displaystyle P}$ and ${\displaystyle Q}$.
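As an illustrative sketch (not from the cited sources), the two-distribution definition can be computed directly for discrete distributions over a common finite set; the function names here are hypothetical:

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q) in nats (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL divergence to the mixture M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.5, 0.5, 0.0]
q = [0.0, 0.1, 0.9]
print(jsd(p, q))   # symmetric: jsd(p, q) == jsd(q, p)
print(jsd(p, p))   # identical distributions give 0.0
```

Note that, unlike the KL divergence itself, this quantity stays finite even where one distribution assigns zero probability, because the mixture ${\displaystyle M}$ is nonzero wherever either input is.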

A more general definition, allowing for the comparison of more than two probability distributions, is:

${\displaystyle {\rm {JSD}}_{\pi _{1},\ldots ,\pi _{n}}(P_{1},P_{2},\ldots ,P_{n})=H\left(\sum _{i=1}^{n}\pi _{i}P_{i}\right)-\sum _{i=1}^{n}\pi _{i}H(P_{i})}$

where ${\displaystyle \pi _{1},\ldots ,\pi _{n}}$ are weights that are selected for the probability distributions ${\displaystyle P_{1},P_{2},\ldots ,P_{n}}$ and ${\displaystyle H(P)}$ is the Shannon entropy for distribution ${\displaystyle P}$. For the two-distribution case described above,

${\displaystyle P_{1}=P,P_{2}=Q,\pi _{1}=\pi _{2}={\frac {1}{2}}.\ }$
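The general entropy-based definition can be sketched in the same way; `jsd_general` is a hypothetical helper name, and the equal-weight two-distribution case reduces to the definition above:

```python
import math

def entropy(p):
    """Shannon entropy H(P) in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def jsd_general(dists, weights):
    """Weighted JSD: H(sum_i pi_i P_i) - sum_i pi_i H(P_i)."""
    mixture = [sum(w * d[k] for w, d in zip(weights, dists))
               for k in range(len(dists[0]))]
    return entropy(mixture) - sum(w * entropy(d) for w, d in zip(weights, dists))

p = [0.4, 0.6]
q = [0.9, 0.1]
r = [0.5, 0.5]
print(jsd_general([p, q, r], [1/3, 1/3, 1/3]))
```

By Jensen's inequality applied to the concave entropy function, this quantity is always non-negative: the entropy of the mixture is at least the weighted average of the entropies.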

## Bounds

The Jensen–Shannon divergence is bounded by 1 for two probability distributions, given that one uses the base 2 logarithm.[6]

${\displaystyle 0\leq {\rm {JSD}}(P\parallel Q)\leq 1}$

For log base e, or ln, which is commonly used in statistical thermodynamics, the upper bound is ln(2):

${\displaystyle 0\leq {\rm {JSD}}(P\parallel Q)\leq \ln(2)}$

More generally, the Jensen–Shannon divergence of ${\displaystyle n}$ probability distributions is bounded by ${\displaystyle \log _{2}(n)}$, given that one uses the base 2 logarithm:[6]

${\displaystyle 0\leq {\rm {JSD}}_{\pi _{1},\ldots ,\pi _{n}}(P_{1},P_{2},\ldots ,P_{n})\leq \log _{2}(n)}$
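The ${\displaystyle \log _{2}(n)}$ bound is attained by ${\displaystyle n}$ mutually disjoint distributions, as this small check illustrates (a sketch with hypothetical helper names, assuming equal weights):

```python
import math

def entropy(p, base):
    """Shannon entropy in the given log base."""
    return -sum(x * math.log(x, base) for x in p if x > 0)

def jsd_n(dists, base=2):
    """Equal-weight JSD of n distributions, in the given log base."""
    n = len(dists)
    mix = [sum(d[k] for d in dists) / n for k in range(len(dists[0]))]
    return entropy(mix, base) - sum(entropy(d, base) for d in dists) / n

# n disjoint point masses achieve the maximum log2(n)
dists = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(jsd_n(dists, base=2))   # approximately log2(3), about 1.585
```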

## Relation to mutual information

The Jensen–Shannon divergence is the mutual information between a random variable ${\displaystyle X}$ associated to a mixture distribution between ${\displaystyle P}$ and ${\displaystyle Q}$ and the binary indicator variable ${\displaystyle Z}$ that is used to switch between ${\displaystyle P}$ and ${\displaystyle Q}$ to produce the mixture. Let ${\displaystyle X}$ be some abstract function on the underlying set of events that discriminates well between events, and choose the value of ${\displaystyle X}$ according to ${\displaystyle P}$ if ${\displaystyle Z=0}$ and according to ${\displaystyle Q}$ if ${\displaystyle Z=1}$. That is, we are choosing ${\displaystyle X}$ according to the probability measure ${\displaystyle M=(P+Q)/2}$, and its distribution is the mixture distribution. We compute

${\displaystyle {\begin{aligned}I(X;Z)&=H(X)-H(X|Z)\\&=-\sum M\log M+{\frac {1}{2}}\left[\sum P\log P+\sum Q\log Q\right]\\&=-\sum {\frac {P}{2}}\log M-\sum {\frac {Q}{2}}\log M+{\frac {1}{2}}\left[\sum P\log P+\sum Q\log Q\right]\\&={\frac {1}{2}}\sum P\left(\log P-\log M\right)+{\frac {1}{2}}\sum Q\left(\log Q-\log M\right)\\&={\rm {JSD}}(P\parallel Q)\end{aligned}}}$

It follows from the above result that the Jensen–Shannon divergence between two distributions is bounded by 0 and 1, because mutual information is non-negative and bounded by ${\displaystyle H(Z)=1}$ when ${\displaystyle Z}$ is a uniform binary variable and the base 2 logarithm is used. The upper limit of 1 is specific to this two-distribution case; the general bound for ${\displaystyle n}$ distributions is ${\displaystyle \log _{2}(n)}$.
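The identity derived above can be checked numerically: computing ${\displaystyle I(X;Z)=H(X)-H(X|Z)}$ for a fair coin ${\displaystyle Z}$ switching between two example distributions gives the same value as the KL-divergence form of the JSD (a sketch with made-up example distributions):

```python
import math

def h(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p = [0.2, 0.5, 0.3]
q = [0.6, 0.3, 0.1]
m = [(a + b) / 2 for a, b in zip(p, q)]

# I(X;Z) = H(X) - H(X|Z), with Z a fair coin choosing between P and Q
i_xz = h(m) - 0.5 * (h(p) + h(q))

# JSD via the KL-divergence form
kl = lambda a, b: sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(abs(i_xz - jsd) < 1e-12)  # the two quantities coincide
```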

One can apply the same principle to a joint distribution and the product of its two marginal distributions (in analogy to the Kullback–Leibler divergence and mutual information) to measure how reliably one can decide whether a given response comes from the joint distribution or the product distribution, subject to the assumption that these are the only two possibilities.[7]

## Quantum Jensen–Shannon divergence

Generalizing from probability distributions to density matrices allows one to define the quantum Jensen–Shannon divergence (QJSD).[8][9] It is defined for a set of density matrices ${\displaystyle (\rho _{1},\ldots ,\rho _{n})}$ and a probability distribution ${\displaystyle \pi =(\pi _{1},\ldots ,\pi _{n})}$ as

${\displaystyle {\rm {QJSD}}(\rho _{1},\ldots ,\rho _{n})=S\left(\sum _{i=1}^{n}\pi _{i}\rho _{i}\right)-\sum _{i=1}^{n}\pi _{i}S(\rho _{i})}$

where ${\displaystyle S(\rho )}$ is the von Neumann entropy of ${\displaystyle \rho }$. This quantity was introduced in quantum information theory, where it is called the Holevo information: it gives an upper bound on the amount of classical information encoded by the quantum states ${\displaystyle (\rho _{1},\ldots ,\rho _{n})}$ under the prior distribution ${\displaystyle \pi }$ (see Holevo's theorem).[10] The quantum Jensen–Shannon divergence for ${\displaystyle \pi =\left({\frac {1}{2}},{\frac {1}{2}}\right)}$ and two density matrices is a symmetric function, everywhere defined, bounded, and equal to zero only if the two density matrices are the same. It is the square of a metric for pure states,[11] but it is unknown whether the metric property holds in general.[9] The Bures metric is closely related to the quantum JS divergence; it is the quantum analog of the Fisher information metric.
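For a concrete sketch of the equal-weight two-state case, the von Neumann entropies of 2×2 density matrices can be computed from their eigenvalues by hand (the helper names and example states below are illustrative, not from the cited sources):

```python
import math

def eigvals_2x2(rho):
    """Eigenvalues of a 2x2 Hermitian matrix [[a, b], [b*, d]]."""
    (a, b), (_, d) = rho
    disc = math.sqrt((a - d) ** 2 + 4 * abs(b) ** 2)
    return [(a + d - disc) / 2, (a + d + disc) / 2]

def von_neumann_entropy(rho):
    """S(rho) = -tr(rho log2 rho), computed via the eigenvalues."""
    return -sum(l * math.log2(l) for l in eigvals_2x2(rho) if l > 1e-12)

def qjsd(rho1, rho2):
    """Equal-weight QJSD: S((rho1+rho2)/2) - (S(rho1) + S(rho2))/2."""
    mix = [[(rho1[i][j] + rho2[i][j]) / 2 for j in range(2)] for i in range(2)]
    return von_neumann_entropy(mix) - 0.5 * (von_neumann_entropy(rho1)
                                             + von_neumann_entropy(rho2))

# Two pure states: |0><0| and |+><+|
rho0 = [[1.0, 0.0], [0.0, 0.0]]
rho_plus = [[0.5, 0.5], [0.5, 0.5]]
print(qjsd(rho0, rho_plus))   # strictly positive for distinct states
print(qjsd(rho0, rho0))       # identical states give 0
```

Pure states have zero von Neumann entropy, so for them the QJSD reduces to the entropy of the mixed state, mirroring the classical entropy form of the JSD.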

## Applications

The Jensen–Shannon divergence has been applied in bioinformatics and genome comparison,[12][13] in protein surface comparison,[14] in the social sciences,[15] in the quantitative study of history,[16] and in machine learning.[17]

## Notes

1. ^ Hinrich Schütze; Christopher D. Manning (1999). Foundations of Statistical Natural Language Processing. Cambridge, Mass: MIT Press. p. 304. ISBN 0-262-13360-1.
2. ^ Dagan, Ido; Lillian Lee; Fernando Pereira (1997). "Similarity-Based Methods For Word Sense Disambiguation". Proceedings of the Thirty-Fifth Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics: 56–63. doi:10.3115/979617.979625. Retrieved 2008-03-09.
3. ^ Endres, D. M.; J. E. Schindelin (2003). "A new metric for probability distributions". IEEE Trans. Inf. Theory. 49 (7): 1858–1860. doi:10.1109/TIT.2003.813506.
4. ^ Österreicher, F.; I. Vajda (2003). "A new class of metric divergences on probability spaces and its statistical applications". Ann. Inst. Statist. Math. 55 (3): 639–653. doi:10.1007/BF02517812.
5. ^ Fuglede, Bent; Topsøe, Flemming (2004). "Jensen-Shannon divergence and Hilbert space embedding - IEEE Conference Publication". ieeexplore.ieee.org. Retrieved 2017-12-24.
6. ^ a b Lin, J. (1991). "Divergence measures based on the Shannon entropy" (PDF). IEEE Transactions on Information Theory. 37 (1): 145–151. doi:10.1109/18.61115.
7. ^ Schneidman, Elad; Bialek, W; Berry, M.J. 2nd (2003). "Synergy, Redundancy, and Independence in Population Codes". Journal of Neuroscience. 23 (37): 11539–11553. PMID 14684857.
8. ^ Majtey, A.; Lamberti, P.; Prato, D. (2005). "Jensen-Shannon divergence as a measure of distinguishability between mixed quantum states". Physical Review A. 72 (5). doi:10.1103/PhysRevA.72.052310.
9. ^ a b Briët, Jop; Harremoës, Peter (2009). "Properties of classical and quantum Jensen-Shannon divergence". Physical Review A. 79 (5). doi:10.1103/PhysRevA.79.052311.
10. ^ Holevo, A. S. (1973), "Bounds for the quantity of information transmitted by a quantum communication channel", Problemy Peredachi Informatsii (in Russian), 9: 3–11. English translation: Probl. Inf. Transm., 9, 177–183 (1975)) MR456936
11. ^ Braunstein, Samuel; Caves, Carlton (1994). "Statistical distance and the geometry of quantum states". Physical Review Letters. 72 (22): 3439–3443. doi:10.1103/PhysRevLett.72.3439. PMID 10056200.
12. ^ Sims, GE; Jun, SR; Wu, GA; Kim, SH (2009). "Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions". Proceedings of the National Academy of Sciences of the United States of America. 106 (8): 2677–82. doi:10.1073/pnas.0813249106. PMID 19188606.
13. ^ Itzkovitz, S; Hodis, E; Segal, E (2010). "Overlapping codes within protein-coding sequences". Genome Research. 20 (11): 1582–9. doi:10.1101/gr.105072.110. PMID 20841429.
14. ^ Ofran, Y; Rost, B (2003). "Analysing six types of protein-protein interfaces". Journal of Molecular Biology. 325 (2): 377–87. doi:10.1016/s0022-2836(02)01223-8. PMID 12488102.
15. ^ DeDeo, Simon; Hawkins, Robert X. D.; Klingenstein, Sara; Hitchcock, Tim (2013). "Bootstrap Methods for the Empirical Study of Decision-Making and Information Flows in Social Systems". Entropy. 15 (6): 2246–2276. doi:10.3390/e15062246.
16. ^ Klingenstein, Sara; Hitchcock, Tim; DeDeo, Simon (2014). "The civilizing process in London's Old Bailey". Proceedings of the National Academy of Sciences. 111 (26): 9419–9424. doi:10.1073/pnas.1405984111.
17. ^ Goodfellow, Ian J.; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua (2014). "Generative Adversarial Networks". NIPS. arXiv:1406.2661.