# Jensen–Shannon divergence


In probability theory and statistics, the Jensen–Shannon divergence is a method of measuring the similarity between two probability distributions. It is also known as information radius (IRad)[1][2] or total divergence to the average.[3] It is based on the Kullback–Leibler divergence, with some notable (and useful) differences: it is symmetric, and it always has a finite value. The square root of the Jensen–Shannon divergence is a metric, often referred to as the Jensen–Shannon distance.[4][5][6]

## Definition

Consider the set ${\displaystyle M_{+}^{1}(A)}$ of probability distributions where ${\displaystyle A}$ is a set provided with some σ-algebra of measurable subsets. In particular we can take ${\displaystyle A}$ to be a finite or countable set with all subsets being measurable.

The Jensen–Shannon divergence (JSD) is a symmetrized and smoothed version of the Kullback–Leibler divergence ${\displaystyle D(P\parallel Q)}$. It is defined by

${\displaystyle {\rm {JSD}}(P\parallel Q)={\frac {1}{2}}D(P\parallel M)+{\frac {1}{2}}D(Q\parallel M),}$

where ${\displaystyle M={\frac {1}{2}}(P+Q)}$.
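The definition above can be sketched directly for discrete distributions. The following is a minimal, illustrative Python implementation (the names `kl_divergence` and `jsd` are this sketch's own, and the base-2 logarithm is used throughout):

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q) in bits.

    Terms with p_i = 0 contribute 0, by the convention 0 * log(0) = 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL divergence to the mixture M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.5, 0.5, 0.0]
q = [0.0, 0.1, 0.9]
print(jsd(p, q))   # symmetric: jsd(p, q) == jsd(q, p)
print(jsd(p, p))   # identical distributions give 0
```

Note that evaluating the KL terms against the mixture ${\displaystyle M}$ is always safe: wherever ${\displaystyle P}$ puts positive mass, ${\displaystyle M}$ does too.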

The geometric Jensen–Shannon divergence[7] (or G-Jensen–Shannon divergence), obtained by replacing the arithmetic mean ${\displaystyle M}$ with a geometric mean, yields a closed-form formula for the divergence between two Gaussian distributions.

A more general definition, allowing for the comparison of more than two probability distributions, is:

${\displaystyle {\begin{aligned}{\rm {JSD}}_{\pi _{1},\ldots ,\pi _{n}}(P_{1},P_{2},\ldots ,P_{n})&=\sum _{i}\pi _{i}D(P_{i}\parallel M)\\&=H\left(M\right)-\sum _{i=1}^{n}\pi _{i}H(P_{i})\end{aligned}}}$

where

${\displaystyle M:=\sum _{i=1}^{n}\pi _{i}P_{i}}$

and ${\displaystyle \pi _{1},\ldots ,\pi _{n}}$ are weights that are selected for the probability distributions ${\displaystyle P_{1},P_{2},\ldots ,P_{n}}$, and ${\displaystyle H(P)}$ is the Shannon entropy for distribution ${\displaystyle P}$. For the two-distribution case described above,

${\displaystyle P_{1}=P,P_{2}=Q,\pi _{1}=\pi _{2}={\frac {1}{2}}.\ }$

Hence, for those distributions ${\displaystyle P,Q}$,

${\displaystyle {\rm {JSD}}(P\parallel Q)=H(M)-{\frac {1}{2}}{\bigg (}H(P)+H(Q){\bigg )}.}$
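The weighted, multi-distribution form can be sketched the same way. This illustrative Python snippet (function names are this sketch's own) computes the JSD as the entropy of the mixture minus the mixture of the entropies, which for two equally weighted distributions coincides with the KL-based definition:

```python
import math

def entropy(p):
    """Shannon entropy in bits, with the convention 0 * log(0) = 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def jsd_weighted(dists, weights):
    """Generalized JSD: H(M) - sum_i pi_i * H(P_i), with M = sum_i pi_i * P_i."""
    k = len(dists[0])
    m = [sum(w * d[j] for w, d in zip(weights, dists)) for j in range(k)]
    return entropy(m) - sum(w * entropy(d) for w, d in zip(weights, dists))

p = [0.8, 0.2]
q = [0.3, 0.7]
r = [0.5, 0.5]
print(jsd_weighted([p, q, r], [1 / 3, 1 / 3, 1 / 3]))
```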

## Bounds

The Jensen–Shannon divergence is bounded by 1 for two probability distributions, provided that one uses the base-2 logarithm:[8]

${\displaystyle 0\leq {\rm {JSD}}(P\parallel Q)\leq 1}$

With this normalization, it is a lower bound on the total variation distance between P and Q:

${\displaystyle {\rm {JSD}}(P\parallel Q)\leq {\frac {1}{2}}\|P-Q\|_{1}={\frac {1}{2}}\sum _{\omega \in \Omega }|P(\omega )-Q(\omega )|.}$
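Both bounds are easy to check numerically. A small illustrative Python example (names here are this sketch's own, using the base-2 logarithm so the upper bound is 1):

```python
import math

def jsd_bits(p, q):
    """JSD in bits, via the entropy form H(M) - (H(P) + H(Q)) / 2."""
    h = lambda d: -sum(x * math.log2(x) for x in d if x > 0)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return h(m) - 0.5 * (h(p) + h(q))

def total_variation(p, q):
    """Total variation distance: half the L1 distance between P and Q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

p = [0.9, 0.1]
q = [0.2, 0.8]
print(jsd_bits(p, q), total_variation(p, q))
```

For these two distributions the JSD is roughly 0.40 bits, below the total variation distance of 0.7, consistent with the stated lower-bound relation.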

With base-e logarithm, which is commonly used in statistical thermodynamics, the upper bound is ${\displaystyle \ln(2)}$. In general, the bound in base b is ${\displaystyle \log _{b}(2)}$:

${\displaystyle 0\leq {\rm {JSD}}(P\parallel Q)\leq \log _{b}(2)}$

More generally, for ${\displaystyle n}$ probability distributions, the Jensen–Shannon divergence is bounded by ${\displaystyle \log _{b}(n)}$:[8]

${\displaystyle 0\leq {\rm {JSD}}_{\pi _{1},\ldots ,\pi _{n}}(P_{1},P_{2},\ldots ,P_{n})\leq \log _{b}(n)}$

## Relation to mutual information

The Jensen–Shannon divergence is the mutual information between a random variable ${\displaystyle X}$ associated to a mixture distribution between ${\displaystyle P}$ and ${\displaystyle Q}$ and the binary indicator variable ${\displaystyle Z}$ that is used to switch between ${\displaystyle P}$ and ${\displaystyle Q}$ to produce the mixture. Let ${\displaystyle X}$ be some abstract function on the underlying set of events that discriminates well between events, and choose the value of ${\displaystyle X}$ according to ${\displaystyle P}$ if ${\displaystyle Z=0}$ and according to ${\displaystyle Q}$ if ${\displaystyle Z=1}$, where ${\displaystyle Z}$ is equiprobable. That is, we are choosing ${\displaystyle X}$ according to the probability measure ${\displaystyle M=(P+Q)/2}$, and its distribution is the mixture distribution. We compute

${\displaystyle {\begin{aligned}I(X;Z)&=H(X)-H(X|Z)\\&=-\sum M\log M+{\frac {1}{2}}\left[\sum P\log P+\sum Q\log Q\right]\\&=-\sum {\frac {P}{2}}\log M-\sum {\frac {Q}{2}}\log M+{\frac {1}{2}}\left[\sum P\log P+\sum Q\log Q\right]\\&={\frac {1}{2}}\sum P\left(\log P-\log M\right)+{\frac {1}{2}}\sum Q\left(\log Q-\log M\right)\\&={\rm {JSD}}(P\parallel Q)\end{aligned}}}$

It follows from the above result that the Jensen–Shannon divergence is bounded between 0 and 1, because mutual information is non-negative and bounded by ${\displaystyle H(Z)=1}$ when using the base-2 logarithm.
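The identity ${\displaystyle I(X;Z)={\rm {JSD}}(P\parallel Q)}$ can be verified numerically. In this illustrative Python sketch (names are this sketch's own), the joint distribution of ${\displaystyle (X,Z)}$ is built with ${\displaystyle Z}$ equiprobable, so the marginal of ${\displaystyle X}$ is the mixture ${\displaystyle M}$:

```python
import math

p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

# Joint distribution of (X, Z) with Z equiprobable: Pr[X=x, Z=0] = p[x]/2, etc.
joint = [[px / 2 for px in p], [qx / 2 for qx in q]]
pz = [0.5, 0.5]
px = [(a + b) / 2 for a, b in zip(p, q)]   # marginal of X is the mixture M

def mutual_information(joint, px, pz):
    """I(X;Z) = sum over (x,z) of p(x,z) * log2(p(x,z) / (p(x) p(z)))."""
    mi = 0.0
    for z, row in enumerate(joint):
        for x, pxz in enumerate(row):
            if pxz > 0:
                mi += pxz * math.log2(pxz / (px[x] * pz[z]))
    return mi

h = lambda d: -sum(v * math.log2(v) for v in d if v > 0)
jsd = h(px) - 0.5 * (h(p) + h(q))
print(mutual_information(joint, px, pz), jsd)   # the two values agree
```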

One can apply the same principle to a joint distribution and the product of its two marginal distributions (in analogy to the Kullback–Leibler divergence and mutual information) to measure how reliably one can decide whether a given response comes from the joint distribution or the product distribution, subject to the assumption that these are the only two possibilities.[9]

## Quantum Jensen–Shannon divergence

Generalizing from probability distributions to density matrices allows one to define the quantum Jensen–Shannon divergence (QJSD).[10][11] It is defined for a set of density matrices ${\displaystyle (\rho _{1},\ldots ,\rho _{n})}$ and a probability distribution ${\displaystyle \pi =(\pi _{1},\ldots ,\pi _{n})}$ as

${\displaystyle {\rm {QJSD}}(\rho _{1},\ldots ,\rho _{n})=S\left(\sum _{i=1}^{n}\pi _{i}\rho _{i}\right)-\sum _{i=1}^{n}\pi _{i}S(\rho _{i})}$

where ${\displaystyle S(\rho )}$ is the von Neumann entropy of ${\displaystyle \rho }$. This quantity was introduced in quantum information theory, where it is called the Holevo information: it gives an upper bound for the amount of classical information encoded by the quantum states ${\displaystyle (\rho _{1},\ldots ,\rho _{n})}$ under the prior distribution ${\displaystyle \pi }$ (see Holevo's theorem).[12] The quantum Jensen–Shannon divergence for ${\displaystyle \pi =\left({\frac {1}{2}},{\frac {1}{2}}\right)}$ and two density matrices is a symmetric function, everywhere defined, bounded, and equal to zero only if the two density matrices are the same. It is the square of a metric for pure states,[13] and it was recently shown that this metric property holds for mixed states as well.[14][15] The Bures metric is closely related to the quantum JS divergence; it is the quantum analog of the Fisher information metric.
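The QJSD formula can be illustrated for the simplest case. This sketch (function names are this sketch's own) is restricted to real symmetric 2×2 density matrices, whose eigenvalues have a closed form, so the von Neumann entropy needs no linear-algebra library:

```python
import math

def eigvals_2x2_symmetric(m):
    """Eigenvalues of a real symmetric 2x2 matrix [[a, b], [b, d]]."""
    (a, b), (_, d) = m
    t = (a + d) / 2
    r = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return t - r, t + r

def von_neumann_entropy(rho):
    """S(rho) = -tr(rho log2 rho), computed from the eigenvalues of rho."""
    return -sum(l * math.log2(l) for l in eigvals_2x2_symmetric(rho) if l > 1e-12)

def qjsd(rhos, weights):
    """Holevo information: S(sum_i pi_i rho_i) - sum_i pi_i S(rho_i)."""
    mix = [[sum(w * r[i][j] for w, r in zip(weights, rhos)) for j in range(2)]
           for i in range(2)]
    return von_neumann_entropy(mix) - sum(w * von_neumann_entropy(r)
                                          for w, r in zip(weights, rhos))

# Two pure states |0><0| and |+><+| with equal prior probabilities.
rho0 = [[1.0, 0.0], [0.0, 0.0]]
rho_plus = [[0.5, 0.5], [0.5, 0.5]]
print(qjsd([rho0, rho_plus], [0.5, 0.5]))
```

Since both states here are pure, their individual entropies vanish and the QJSD reduces to the entropy of the equal-weight mixture; it is strictly positive because the two states are distinct, and at most 1 bit, the Holevo bound ${\displaystyle H(\pi )}$ for an equiprobable prior.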

## Jensen–Shannon centroid

The centroid C* of a finite set of probability distributions can be defined as the minimizer of the sum of the Jensen–Shannon divergences between a candidate distribution and the prescribed set of distributions:

${\displaystyle C^{*}=\arg \min _{Q}\sum _{i=1}^{n}{\rm {JSD}}(P_{i}\parallel Q)}$

An efficient algorithm[16] (CCCP), based on a difference of convex functions, has been reported to calculate the Jensen–Shannon centroid of a set of discrete distributions (histograms).
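The centroid objective can be illustrated with a naive brute-force search over two-bin distributions; this is purely a sketch of the minimization problem, not the far more efficient CCCP algorithm of the cited work:

```python
import math

def jsd(p, q):
    """JSD in bits, via the entropy form H(M) - (H(P) + H(Q)) / 2."""
    h = lambda d: -sum(x * math.log2(x) for x in d if x > 0)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return h(m) - 0.5 * (h(p) + h(q))

dists = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]

# Grid search over candidate centroids q = [t, 1 - t] on the 1-simplex.
best_q, best_cost = None, float("inf")
for i in range(1, 1000):
    t = i / 1000
    cost = sum(jsd(p, [t, 1 - t]) for p in dists)
    if cost < best_cost:
        best_q, best_cost = [t, 1 - t], cost
print(best_q, best_cost)
```

On higher-dimensional simplices this grid search becomes infeasible, which is precisely why iterative schemes such as CCCP are used in practice.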

## Applications

The Jensen–Shannon divergence has been applied in bioinformatics and genome comparison,[17][18] in protein surface comparison,[19] in the social sciences,[20] in the quantitative study of history,[21] in fire experiments,[22] and in machine learning.[23]

## Notes

1. ^ Frank Nielsen (2021). "On a variational definition for the Jensen-Shannon symmetrization of distances based on the information radius". Entropy. MDPI. 23 (4): 464. doi:10.3390/e21050485. PMID 33267199.
2. ^ Hinrich Schütze; Christopher D. Manning (1999). Foundations of Statistical Natural Language Processing. Cambridge, Mass: MIT Press. p. 304. ISBN 978-0-262-13360-9.
3. ^ Dagan, Ido; Lillian Lee; Fernando Pereira (1997). "Similarity-Based Methods For Word Sense Disambiguation". Proceedings of the Thirty-Fifth Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics: 56–63. arXiv:cmp-lg/9708010. Bibcode:1997cmp.lg....8010D. doi:10.3115/979617.979625. Retrieved 2008-03-09.
4. ^ Endres, D. M.; J. E. Schindelin (2003). "A new metric for probability distributions" (PDF). IEEE Trans. Inf. Theory. 49 (7): 1858–1860. doi:10.1109/TIT.2003.813506. hdl:10023/1591. S2CID 14437777.
5. ^ Österreicher, F.; I. Vajda (2003). "A new class of metric divergences on probability spaces and its statistical applications". Ann. Inst. Statist. Math. 55 (3): 639–653. doi:10.1007/BF02517812. S2CID 13085920.
6. ^ Fuglede, B.; Topsoe, F. (2004). "Jensen-Shannon divergence and Hilbert space embedding" (PDF). Proceedings of the International Symposium on Information Theory, 2004. IEEE. p. 30. doi:10.1109/ISIT.2004.1365067. ISBN 978-0-7803-8280-0. S2CID 7891037.
7. ^ Frank Nielsen (2019). "On the Jensen-Shannon symmetrization of distances relying on abstract means". Entropy. MDPI. 21 (5): 485. arXiv:1904.04017. Bibcode:2019Entrp..21..485N. doi:10.3390/e21050485.
8. ^ a b Lin, J. (1991). "Divergence measures based on the shannon entropy" (PDF). IEEE Transactions on Information Theory. 37 (1): 145–151. CiteSeerX 10.1.1.127.9167. doi:10.1109/18.61115.
9. ^ Schneidman, Elad; Bialek, W; Berry, M.J. 2nd (2003). "Synergy, Redundancy, and Independence in Population Codes". Journal of Neuroscience. 23 (37): 11539–11553. doi:10.1523/JNEUROSCI.23-37-11539.2003. PMC 6740962. PMID 14684857.
10. ^ Majtey, A.; Lamberti, P.; Prato, D. (2005). "Jensen-Shannon divergence as a measure of distinguishability between mixed quantum states". Physical Review A. 72 (5): 052310. arXiv:quant-ph/0508138. Bibcode:2005PhRvA..72e2310M. doi:10.1103/PhysRevA.72.052310. S2CID 32062112.
11. ^ Briët, Jop; Harremoës, Peter (2009). "Properties of classical and quantum Jensen-Shannon divergence". Physical Review A. 79 (5): 052311. arXiv:0806.4472. Bibcode:2009PhRvA..79e2311B. doi:10.1103/PhysRevA.79.052311.
12. ^ Holevo, A. S. (1973), "Bounds for the quantity of information transmitted by a quantum communication channel", Problemy Peredachi Informatsii (in Russian), 9: 3–11. English translation: Probl. Inf. Transm., 9: 177–183 (1975) MR456936
13. ^ Braunstein, Samuel; Caves, Carlton (1994). "Statistical distance and the geometry of quantum states". Physical Review Letters. 72 (22): 3439–3443. Bibcode:1994PhRvL..72.3439B. doi:10.1103/PhysRevLett.72.3439. PMID 10056200.
14. ^ Virosztek, Dániel (2021). "The metric property of the quantum Jensen-Shannon divergence". Advances in Mathematics. 380: 107595. arXiv:1910.10447. doi:10.1016/j.aim.2021.107595. S2CID 204837864.
15. ^ Sra, Suvrit (2019). "Metrics Induced by Quantum Jensen-Shannon-Renyí and Related Divergences". arXiv:1911.02643 [cs.IT].
16. ^ Frank Nielsen (2021). "On a generalization of the Jensen-Shannon divergence and the Jensen--Shannon centroid". Entropy. Springer. 22 (2): 221. doi:10.3390/e22020221. PMC 7516653. PMID 33285995.
17. ^ Sims, GE; Jun, SR; Wu, GA; Kim, SH (2009). "Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions". Proceedings of the National Academy of Sciences of the United States of America. 106 (8): 2677–82. Bibcode:2009PNAS..106.2677S. doi:10.1073/pnas.0813249106. PMC 2634796. PMID 19188606.
18. ^ Itzkovitz, S; Hodis, E; Segal, E (2010). "Overlapping codes within protein-coding sequences". Genome Research. 20 (11): 1582–9. doi:10.1101/gr.105072.110. PMC 2963821. PMID 20841429.
19. ^ Ofran, Y; Rost, B (2003). "Analysing six types of protein-protein interfaces". Journal of Molecular Biology. 325 (2): 377–87. CiteSeerX 10.1.1.6.9207. doi:10.1016/s0022-2836(02)01223-8. PMID 12488102.
20. ^ DeDeo, Simon; Hawkins, Robert X. D.; Klingenstein, Sara; Hitchcock, Tim (2013). "Bootstrap Methods for the Empirical Study of Decision-Making and Information Flows in Social Systems". Entropy. 15 (6): 2246–2276. arXiv:1302.0907. Bibcode:2013Entrp..15.2246D. doi:10.3390/e15062246.
21. ^ Klingenstein, Sara; Hitchcock, Tim; DeDeo, Simon (2014). "The civilizing process in London's Old Bailey". Proceedings of the National Academy of Sciences of the United States of America. 111 (26): 9419–9424. Bibcode:2014PNAS..111.9419K. doi:10.1073/pnas.1405984111. PMC 4084475. PMID 24979792.
22. ^ Flavia-Corina Mitroi-Symeonidis; Ion Anghel; Nicuşor Minculete (2020). "Parametric Jensen-Shannon statistical complexity and its applications on full-scale compartment fire data". Symmetry. 12 (1): 22. doi:10.3390/sym12010022.
23. ^ Goodfellow, Ian J.; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua (2014). Generative Adversarial Networks. NIPS. arXiv:1406.2661. Bibcode:2014arXiv1406.2661G.