Jensen–Shannon divergence

In probability theory and statistics, the Jensen–Shannon divergence is a method of measuring the similarity between two probability distributions. It is also known as information radius (IRad)^[1]^[2] or total divergence to the average.^[3] It is based on the Kullback–Leibler divergence, with some notable (and useful) differences, including that it is symmetric and it always has a finite value. The square root of the Jensen–Shannon divergence is a metric often referred to as Jensen–Shannon distance. The similarity between the distributions is greater when the Jensen-Shannon distance is closer to zero.^[4]^[5]^[6]

Definition[edit]

Consider the set $M_{+}^{1}(A)$ of probability distributions where $A$ is a set provided with some σ-algebra of measurable subsets. In particular we can take $A$ to be a finite or countable set with all subsets being measurable.

The Jensen–Shannon divergence (JSD) is a symmetrized and smoothed version of the Kullback–Leibler divergence $D(P\parallel Q)$ . It is defined by

{\rm {JSD}}(P\parallel Q)={\frac {1}{2}}D(P\parallel M)+{\frac {1}{2}}D(Q\parallel M),

where $M={\frac {1}{2}}(P+Q)$ is a mixture distribution of $P$ and $Q$ .

The geometric Jensen–Shannon divergence^[7] (or G-Jensen–Shannon divergence) yields a closed-form formula for divergence between two Gaussian distributions by taking the geometric mean.

A more general definition, allowing for the comparison of more than two probability distributions, is:

{\begin{aligned}{\rm {JSD}}_{\pi _{1},\ldots ,\pi _{n}}(P_{1},P_{2},\ldots ,P_{n})&=\sum _{i}\pi _{i}D(P_{i}\parallel M)\\&=H\left(M\right)-\sum _{i=1}^{n}\pi _{i}H(P_{i})\end{aligned}}

where

${\begin{aligned}M&:=\sum _{i=1}^{n}\pi _{i}P_{i}\end{aligned}}$

and $\pi _{1},\ldots ,\pi _{n}$ are weights that are selected for the probability distributions $P_{1},P_{2},\ldots ,P_{n}$ , and $H(P)$ is the Shannon entropy for distribution $P$ . For the two-distribution case described above,

$P_{1}=P,P_{2}=Q,\pi _{1}=\pi _{2}={\frac {1}{2}}.\$

Hence, for those distributions $P,Q$

$JSD=H(M)-{\frac {1}{2}}{\bigg (}H(P)+H(Q){\bigg )}$

Bounds[edit]

The Jensen–Shannon divergence is bounded by 1 for two probability distributions, given that one uses the base 2 logarithm:^[8]

0\leq {\rm {JSD}}(P\parallel Q)\leq 1

.

With this normalization, it is a lower bound on the total variation distance between P and Q:

{\rm {JSD}}(P\parallel Q)\leq {\frac {1}{2}}\|P-Q\|_{1}={\frac {1}{2}}\sum _{\omega \in \Omega }|P(\omega )-Q(\omega )|

.

With base-e logarithm, which is commonly used in statistical thermodynamics, the upper bound is $\ln(2)$ . In general, the bound in base b is $\log _{b}(2)$ :

0\leq {\rm {JSD}}(P\parallel Q)\leq \log _{b}(2)

.

A more general bound, the Jensen–Shannon divergence is bounded by $\log _{b}(n)$ for more than two probability distributions:^[8]

0\leq {\rm {JSD}}_{\pi _{1},\ldots ,\pi _{n}}(P_{1},P_{2},\ldots ,P_{n})\leq \log _{b}(n)

.

Relation to mutual information[edit]

The Jensen–Shannon divergence is the mutual information between a random variable $X$ associated to a mixture distribution between $P$ and $Q$ and the binary indicator variable $Z$ that is used to switch between $P$ and $Q$ to produce the mixture. Let $X$ be some abstract function on the underlying set of events that discriminates well between events, and choose the value of $X$ according to $P$ if $Z=0$ and according to $Q$ if $Z=1$ , where $Z$ is equiprobable. That is, we are choosing $X$ according to the probability measure $M=(P+Q)/2$ , and its distribution is the mixture distribution. We compute

{\begin{aligned}I(X;Z)&=H(X)-H(X|Z)\\&=-\sum M\log M+{\frac {1}{2}}\left[\sum P\log P+\sum Q\log Q\right]\\&=-\sum {\frac {P}{2}}\log M-\sum {\frac {Q}{2}}\log M+{\frac {1}{2}}\left[\sum P\log P+\sum Q\log Q\right]\\&={\frac {1}{2}}\sum P\left(\log P-\log M\right)+{\frac {1}{2}}\sum Q\left(\log Q-\log M\right)\\&={\rm {JSD}}(P\parallel Q)\end{aligned}}

It follows from the above result that the Jensen–Shannon divergence is bounded by 0 and 1 because mutual information is non-negative and bounded by $H(Z)=1$ in base 2 logarithm.

One can apply the same principle to a joint distribution and the product of its two marginal distribution (in analogy to Kullback–Leibler divergence and mutual information) and to measure how reliably one can decide if a given response comes from the joint distribution or the product distribution—subject to the assumption that these are the only two possibilities.^[9]

Quantum Jensen–Shannon divergence[edit]

The generalization of probability distributions on density matrices allows to define quantum Jensen–Shannon divergence (QJSD).^[10]^[11] It is defined for a set of density matrices $(\rho _{1},\ldots ,\rho _{n})$ and a probability distribution $\pi =(\pi _{1},\ldots ,\pi _{n})$ as

{\rm {QJSD}}(\rho _{1},\ldots ,\rho _{n})=S\left(\sum _{i=1}^{n}\pi _{i}\rho _{i}\right)-\sum _{i=1}^{n}\pi _{i}S(\rho _{i})

where $S(\rho )$ is the von Neumann entropy of $\rho$ . This quantity was introduced in quantum information theory, where it is called the Holevo information: it gives the upper bound for amount of classical information encoded by the quantum states $(\rho _{1},\ldots ,\rho _{n})$ under the prior distribution $\pi$ (see Holevo's theorem).^[12] Quantum Jensen–Shannon divergence for $\pi =\left({\frac {1}{2}},{\frac {1}{2}}\right)$ and two density matrices is a symmetric function, everywhere defined, bounded and equal to zero only if two density matrices are the same. It is a square of a metric for pure states,^[13] and it was recently shown that this metric property holds for mixed states as well.^[14]^[15] The Bures metric is closely related to the quantum JS divergence; it is the quantum analog of the Fisher information metric.

Jensen–Shannon centroid[edit]

The centroid C* of a finite set of probability distributions can be defined as the minimizer of the average sum of the Jensen-Shannon divergences between a probability distribution and the prescribed set of distributions:

C^{*}=\arg \min _{Q}\sum _{i=1}^{n}{\rm {JSD}}(P_{i}\parallel Q)

An efficient algorithm^[16] (CCCP) based on difference of convex functions is reported to calculate the Jensen-Shannon centroid of a set of discrete distributions (histograms).

Applications[edit]

The Jensen–Shannon divergence has been applied in bioinformatics and genome comparison,^[17]^[18] in protein surface comparison,^[19] in the social sciences,^[20] in the quantitative study of history,^[21] in fire experiments,^[22] and in machine learning.^[23]

Notes[edit]

^ Frank Nielsen (2021). "On a variational definition for the Jensen-Shannon symmetrization of distances based on the information radius". Entropy. 23 (4). MDPI: 464. doi:10.3390/e21050485. PMC 7514974. PMID 33267199.
^ Hinrich Schütze; Christopher D. Manning (1999). Foundations of Statistical Natural Language Processing. Cambridge, Mass: MIT Press. p. 304. ISBN 978-0-262-13360-9.
^ Dagan, Ido; Lillian Lee; Fernando Pereira (1997). "Similarity-based methods for word sense disambiguation". Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics -. pp. 56–63. arXiv:cmp-lg/9708010. Bibcode:1997cmp.lg....8010D. doi:10.3115/979617.979625. Retrieved 2008-03-09. {{cite book}}: |journal= ignored (help)
^ Endres, D. M.; J. E. Schindelin (2003). "A new metric for probability distributions" (PDF). IEEE Trans. Inf. Theory. 49 (7): 1858–1860. doi:10.1109/TIT.2003.813506. hdl:10023/1591. S2CID 14437777.
^ Ôsterreicher, F.; I. Vajda (2003). "A new class of metric divergences on probability spaces and its statistical applications". Ann. Inst. Statist. Math. 55 (3): 639–653. doi:10.1007/BF02517812. S2CID 13085920.
^ Fuglede, B.; Topsoe, F. (2004). "Jensen-Shannon divergence and Hilbert space embedding" (PDF). Proceedings of the International Symposium on Information Theory, 2004. IEEE. p. 30. doi:10.1109/ISIT.2004.1365067. ISBN 978-0-7803-8280-0. S2CID 7891037.
^ Frank Nielsen (2019). "On the Jensen-Shannon symmetrization of distances relying on abstract means". Entropy. 21 (5). MDPI: 485. arXiv:1904.04017. Bibcode:2019Entrp..21..485N. doi:10.3390/e21050485. PMC 7514974. PMID 33267199.
^ ^a ^b Lin, J. (1991). "Divergence measures based on the shannon entropy" (PDF). IEEE Transactions on Information Theory. 37 (1): 145–151. CiteSeerX 10.1.1.127.9167. doi:10.1109/18.61115. S2CID 12121632.
^ Schneidman, Elad; Bialek, W; Berry, M.J. 2nd (2003). "Synergy, Redundancy, and Independence in Population Codes". Journal of Neuroscience. 23 (37): 11539–11553. doi:10.1523/JNEUROSCI.23-37-11539.2003. PMC 6740962. PMID 14684857.{{cite journal}}: CS1 maint: numeric names: authors list (link)
^ Majtey, A.; Lamberti, P.; Prato, D. (2005). "Jensen-Shannon divergence as a measure of distinguishability between mixed quantum states". Physical Review A. 72 (5): 052310. arXiv:quant-ph/0508138. Bibcode:2005PhRvA..72e2310M. doi:10.1103/PhysRevA.72.052310. S2CID 32062112.
^ Briët, Jop; Harremoës, Peter (2009). "Properties of classical and quantum Jensen-Shannon divergence". Physical Review A. 79 (5): 052311. arXiv:0806.4472. Bibcode:2009PhRvA..79e2311B. doi:10.1103/PhysRevA.79.052311.
^ Holevo, A. S. (1973), "Bounds for the quantity of information transmitted by a quantum communication channel", Problemy Peredachi Informatsii (in Russian), 9: 3–11. English translation: Probl. Inf. Transm., 9: 177–183 (1975) MR456936
^ Braunstein, Samuel; Caves, Carlton (1994). "Statistical distance and the geometry of quantum states". Physical Review Letters. 72 (22): 3439–3443. Bibcode:1994PhRvL..72.3439B. doi:10.1103/PhysRevLett.72.3439. PMID 10056200.
^ Virosztek, Dániel (2021). "The metric property of the quantum Jensen-Shannon divergence". Advances in Mathematics. 380: 107595. arXiv:1910.10447. doi:10.1016/j.aim.2021.107595. S2CID 204837864.
^ Sra, Suvrit (2019). "Metrics Induced by Quantum Jensen-Shannon-Renyí and Related Divergences". arXiv:1911.02643 [cs.IT].
^ Frank Nielsen (2021). "On a generalization of the Jensen-Shannon divergence and the Jensen--Shannon centroid". Entropy. 22 (2). MDPI: 221. doi:10.3390/e22020221. PMC 7516653. PMID 33285995.
^ Sims, GE; Jun, SR; Wu, GA; Kim, SH (2009). "Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions". Proceedings of the National Academy of Sciences of the United States of America. 106 (8): 2677–82. Bibcode:2009PNAS..106.2677S. doi:10.1073/pnas.0813249106. PMC 2634796. PMID 19188606.
^ Itzkovitz, S; Hodis, E; Segal, E (2010). "Overlapping codes within protein-coding sequences". Genome Research. 20 (11): 1582–9. doi:10.1101/gr.105072.110. PMC 2963821. PMID 20841429.
^ Ofran, Y; Rost, B (2003). "Analysing six types of protein-protein interfaces". Journal of Molecular Biology. 325 (2): 377–87. CiteSeerX 10.1.1.6.9207. doi:10.1016/s0022-2836(02)01223-8. PMID 12488102.
^ DeDeo, Simon; Hawkins, Robert X. D.; Klingenstein, Sara; Hitchcock, Tim (2013). "Bootstrap Methods for the Empirical Study of Decision-Making and Information Flows in Social Systems". Entropy. 15 (6): 2246–2276. arXiv:1302.0907. Bibcode:2013Entrp..15.2246D. doi:10.3390/e15062246.
^ Klingenstein, Sara; Hitchcock, Tim; DeDeo, Simon (2014). "The civilizing process in London's Old Bailey". Proceedings of the National Academy of Sciences of the United States of America. 111 (26): 9419–9424. Bibcode:2014PNAS..111.9419K. doi:10.1073/pnas.1405984111. PMC 4084475. PMID 24979792.
^ Flavia-Corina Mitroi-Symeonidis; Ion Anghel; Nicuşor Minculete (2020). "Parametric Jensen-Shannon statistical complexity and its applications on full-scale compartment fire data". Symmetry. 12 (1): 22. doi:10.3390/sym12010022.
^ Goodfellow, Ian J.; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua (2014). Generative Adversarial Networks. NIPS. arXiv:1406.2661. Bibcode:2014arXiv1406.2661G.

External links[edit]

[1] Frank Nielsen (2021). "On a variational definition for the Jensen-Shannon symmetrization of distances based on the information radius". Entropy. 23 (4). MDPI: 464. doi:10.3390/e21050485. PMC 7514974. PMID 33267199.

[2] Hinrich Schütze; Christopher D. Manning (1999). Foundations of Statistical Natural Language Processing. Cambridge, Mass: MIT Press. p. 304. ISBN 978-0-262-13360-9.

[3] Dagan, Ido; Lillian Lee; Fernando Pereira (1997). "Similarity-based methods for word sense disambiguation". Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics -. pp. 56–63. arXiv:cmp-lg/9708010. Bibcode:1997cmp.lg....8010D. doi:10.3115/979617.979625. Retrieved 2008-03-09. {{cite book}}: |journal= ignored (help)

[4] Endres, D. M.; J. E. Schindelin (2003). "A new metric for probability distributions" (PDF). IEEE Trans. Inf. Theory. 49 (7): 1858–1860. doi:10.1109/TIT.2003.813506. hdl:10023/1591. S2CID 14437777.

[5] Ôsterreicher, F.; I. Vajda (2003). "A new class of metric divergences on probability spaces and its statistical applications". Ann. Inst. Statist. Math. 55 (3): 639–653. doi:10.1007/BF02517812. S2CID 13085920.

[6] Fuglede, B.; Topsoe, F. (2004). "Jensen-Shannon divergence and Hilbert space embedding" (PDF). Proceedings of the International Symposium on Information Theory, 2004. IEEE. p. 30. doi:10.1109/ISIT.2004.1365067. ISBN 978-0-7803-8280-0. S2CID 7891037.

[7] Frank Nielsen (2019). "On the Jensen-Shannon symmetrization of distances relying on abstract means". Entropy. 21 (5). MDPI: 485. arXiv:1904.04017. Bibcode:2019Entrp..21..485N. doi:10.3390/e21050485. PMC 7514974. PMID 33267199.

[Lin-8] Lin, J. (1991). "Divergence measures based on the shannon entropy" (PDF). IEEE Transactions on Information Theory. 37 (1): 145–151. CiteSeerX 10.1.1.127.9167. doi:10.1109/18.61115. S2CID 12121632.

[9] Schneidman, Elad; Bialek, W; Berry, M.J. 2nd (2003). "Synergy, Redundancy, and Independence in Population Codes". Journal of Neuroscience. 23 (37): 11539–11553. doi:10.1523/JNEUROSCI.23-37-11539.2003. PMC 6740962. PMID 14684857.{{cite journal}}: CS1 maint: numeric names: authors list (link)

[10] Majtey, A.; Lamberti, P.; Prato, D. (2005). "Jensen-Shannon divergence as a measure of distinguishability between mixed quantum states". Physical Review A. 72 (5): 052310. arXiv:quant-ph/0508138. Bibcode:2005PhRvA..72e2310M. doi:10.1103/PhysRevA.72.052310. S2CID 32062112.

[briet-11] Briët, Jop; Harremoës, Peter (2009). "Properties of classical and quantum Jensen-Shannon divergence". Physical Review A. 79 (5): 052311. arXiv:0806.4472. Bibcode:2009PhRvA..79e2311B. doi:10.1103/PhysRevA.79.052311.

[12] Holevo, A. S. (1973), "Bounds for the quantity of information transmitted by a quantum communication channel", Problemy Peredachi Informatsii (in Russian), 9: 3–11. English translation: Probl. Inf. Transm., 9: 177–183 (1975) MR456936

[13] Braunstein, Samuel; Caves, Carlton (1994). "Statistical distance and the geometry of quantum states". Physical Review Letters. 72 (22): 3439–3443. Bibcode:1994PhRvL..72.3439B. doi:10.1103/PhysRevLett.72.3439. PMID 10056200.

[14] Virosztek, Dániel (2021). "The metric property of the quantum Jensen-Shannon divergence". Advances in Mathematics. 380: 107595. arXiv:1910.10447. doi:10.1016/j.aim.2021.107595. S2CID 204837864.

[15] Sra, Suvrit (2019). "Metrics Induced by Quantum Jensen-Shannon-Renyí and Related Divergences". arXiv:1911.02643 [cs.IT].

[16] Frank Nielsen (2021). "On a generalization of the Jensen-Shannon divergence and the Jensen--Shannon centroid". Entropy. 22 (2). MDPI: 221. doi:10.3390/e22020221. PMC 7516653. PMID 33285995.

[Sims-17] Sims, GE; Jun, SR; Wu, GA; Kim, SH (2009). "Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions". Proceedings of the National Academy of Sciences of the United States of America. 106 (8): 2677–82. Bibcode:2009PNAS..106.2677S. doi:10.1073/pnas.0813249106. PMC 2634796. PMID 19188606.

[It-18] Itzkovitz, S; Hodis, E; Segal, E (2010). "Overlapping codes within protein-coding sequences". Genome Research. 20 (11): 1582–9. doi:10.1101/gr.105072.110. PMC 2963821. PMID 20841429.

[Ofran-19] Ofran, Y; Rost, B (2003). "Analysing six types of protein-protein interfaces". Journal of Molecular Biology. 325 (2): 377–87. CiteSeerX 10.1.1.6.9207. doi:10.1016/s0022-2836(02)01223-8. PMID 12488102.

[DeDeo-20] DeDeo, Simon; Hawkins, Robert X. D.; Klingenstein, Sara; Hitchcock, Tim (2013). "Bootstrap Methods for the Empirical Study of Decision-Making and Information Flows in Social Systems". Entropy. 15 (6): 2246–2276. arXiv:1302.0907. Bibcode:2013Entrp..15.2246D. doi:10.3390/e15062246.

[Klingenstein-21] Klingenstein, Sara; Hitchcock, Tim; DeDeo, Simon (2014). "The civilizing process in London's Old Bailey". Proceedings of the National Academy of Sciences of the United States of America. 111 (26): 9419–9424. Bibcode:2014PNAS..111.9419K. doi:10.1073/pnas.1405984111. PMC 4084475. PMID 24979792.

[22] Flavia-Corina Mitroi-Symeonidis; Ion Anghel; Nicuşor Minculete (2020). "Parametric Jensen-Shannon statistical complexity and its applications on full-scale compartment fire data". Symmetry. 12 (1): 22. doi:10.3390/sym12010022.

[23] Goodfellow, Ian J.; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua (2014). Generative Adversarial Networks. NIPS. arXiv:1406.2661. Bibcode:2014arXiv1406.2661G.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]