Jensen–Shannon divergence

From Wikipedia, the free encyclopedia
Jump to: navigation, search

In probability theory and statistics, the JensenShannon divergence is a popular method of measuring the similarity between two probability distributions. It is also known as information radius (IRad)[1] or total divergence to the average.[2] It is based on the Kullback–Leibler divergence, with some notable (and useful) differences, including that it is symmetric and it is always a finite value. The square root of the Jensen–Shannon divergence is a metric often referred to as Jensen-Shannon distance.[3][4]


Consider the set M_+^1(A) of probability distributions where A is a set provided with some σ-algebra of measurable subsets. In particular we can take A to be a finite or countable set with all subsets being measurable.

The Jensen–Shannon divergence (JSD) M_+^1(A) \times M_+^1(A) \rightarrow [0,\infty{}) is a symmetrized and smoothed version of the Kullback–Leibler divergence D(P \parallel Q). It is defined by

{\rm JSD}(P \parallel Q)= \frac{1}{2}D(P \parallel M)+\frac{1}{2}D(Q \parallel M)

where M=\frac{1}{2}(P+Q)

A more general definition, allowing for the comparison of more than two probability distributions, is:

{\rm JSD}_{\pi_1, \ldots, \pi_n}(P_1, P_2, \ldots, P_n) = H\left(\sum_{i=1}^n \pi_i P_i\right) - \sum_{i=1}^n \pi_i H(P_i)

where \pi_1, \ldots, \pi_n are weights that are selected for the probability distributions P_1, P_2, \ldots, P_n and H(P) is the Shannon entropy for distribution P. For the two-distribution case described above,

P_1=P, P_2=Q, \pi_1 = \pi_2 = \frac{1}{2}.\


The Jensen–Shannon divergence is bounded by 1 for two probability distributions, given that one uses the base 2 logarithm.[5]

0 \leq {\rm JSD}( P \parallel Q ) \leq 1

For log base e, or ln, which is commonly used in statistical thermodynamics, the upper bound is ln(2):

0 \leq {\rm JSD}( P \parallel Q ) \leq \ln(2)
A more general bounds, the Jensen–Shannon divergence is bounded by log2(n) for more than two probability distributions, given that one uses the base 2 logarithm.[5]
0 \leq {\rm JSD}_{\pi_1, \ldots, \pi_n}(P_1, P_2, \ldots, P_n) \leq \log_{2}(n)

Relation to mutual information[edit]

The Jensen–Shannon divergence is the mutual information between a random variable X associated to a mixture distribution between P and Q and the binary indicator variable Z that is used to switch between P and Q to produce the mixture. Let X be some abstract function on the underlying set of events that discriminates well between events, and choose the value of X according to P if Z = 0 and according to Q if Z = 1. That is, we are choosing X according to the probability measure M=(P+Q)/2, and its distribution is the mixture distribution. We compute

I(X; Z) &= H(X) - H(X|Z)\\
&= -\sum M \log M + \frac{1}{2} \left[ \sum P \log P + \sum Q \log Q \right] \\
&= -\sum \frac{P}{2} \log M - \sum \frac{Q}{2} \log M + \frac{1}{2} \left[ \sum P \log P + \sum Q \log Q \right] \\
&= \frac{1}{2} \sum P \left( \log P - \log M\right ) + \frac{1}{2} \sum Q  \left( \log Q - \log M \right) \\
&= {\rm JSD}(P \parallel Q)

It follows from the above result that the Jensen–Shannon divergence is bounded by 0 and 1 because mutual information is non-negative and bounded by H(Z) = 1. The JSD is not always bounded by 0 and 1: the upper limit of 1 arises here because we are considering the specific case involving the binary variable Z.

One can apply the same principle to a joint distribution and the product of its two marginal distribution (in analogy to Kullback–Leibler divergence and mutual information) and to measure how reliably one can decide if a given response comes from the joint distribution or the product distribution—subject to the assumption that these are the only two possibilities.[6]

Quantum Jensen–Shannon divergence[edit]

The generalization of probability distributions on density matrices allows to define quantum Jensen–Shannon divergence (QJSD).[7][8] It is defined for a set of density matrices (\rho_1,\ldots,\rho_n) and probability distribution \pi=(\pi_1,\ldots,\pi_n) as

{\rm QJSD}(\rho_1,\ldots,\rho_n)= S\left(\sum_{i=1}^n \pi_i \rho_i\right)-\sum_{i=1}^n \pi_i S(\rho_i)

where S(\pi_i) is the von Neumann entropy. This quantity was introduced in quantum information theory, where it is called the Holevo information: it gives the upper bound for amount of classical information encoded by the quantum states (\rho_1,\ldots,\rho_n) under the prior distribution \pi (see Holevo's theorem)[9] Quantum Jensen–Shannon divergence for \pi=\left(\frac{1}{2},\frac{1}{2}\right) and two density matrices is a symmetric function, everywhere defined, bounded and equal to zero only if two density matrices are the same. It is a square of a metric for pure states[10] but it is unknown whether the metric property holds in general.[8] The Bures metric is closely related to the quantum JS divergence; it is the quantum analog of the Fisher information metric.


The Jensen–Shannon divergence has been applied in bioinformatics and genome comparison,[11][12] in protein surface comparison,[13] in the social sciences,[14] in the quantitative study of history,[15] and in machine learning.[16]


  1. ^ Hinrich Schütze; Christopher D. Manning (1999). Foundations of Statistical Natural Language Processing. Cambridge, Mass: MIT Press. p. 304. ISBN 0-262-13360-1. 
  2. ^ Dagan, Ido; Lillian Lee; Fernando Pereira (1997). "Similarity-Based Methods For Word Sense Disambiguation". Proceedings of the Thirty-Fifth Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics: pp. 56–63. doi:10.3115/979617.979625. Retrieved 2008-03-09. 
  3. ^ Endres, D. M.; J. E. Schindelin (2003). "A new metric for probability distributions". IEEE Trans. Inf. Theory 49 (7): pp. 1858–1860. doi:10.1109/TIT.2003.813506. 
  4. ^ Ôsterreicher, F.; I. Vajda (2003). "A new class of metric divergences on probability spaces and its statistical applications". Ann. Inst. Statist. Math. 55 (3): pp. 639–653. doi:10.1007/BF02517812. 
  5. ^ a b Lin, J. (1991). "Divergence measures based on the shannon entropy" (PDF). IEEE Transactions on Information Theory 37 (1): 145–151. doi:10.1109/18.61115. 
  6. ^ Schneidman, Elad; Bialek, W; Berry, M.J. 2nd (2003). "Synergy, Redundancy, and Independence in Population Codes". Journal of Neuroscience 23 (37): 11539–11553. PMID 14684857. 
  7. ^ Majtey, A.; Lamberti, P.; Prato, D. (2005). "Jensen-Shannon divergence as a measure of distinguishability between mixed quantum states". Physical Review A 72 (5). doi:10.1103/PhysRevA.72.052310. 
  8. ^ a b Briët, Jop; Harremoës, Peter (2009). "Properties of classical and quantum Jensen-Shannon divergence". Physical Review A 79 (5). doi:10.1103/PhysRevA.79.052311. 
  9. ^ Holevo, A. S. (1973), "Bounds for the quantity of information transmitted by a quantum communication channel", Problemy Peredachi Informatsii (in Russian) 9: 3–11 . English translation: Probl. Inf. Transm., 9, 177–183 (1975)) MR 456936
  10. ^ Braunstein, Samuel; Caves, Carlton (1994). "Statistical distance and the geometry of quantum states". Physical Review Letters 72 (22): 3439–3443. doi:10.1103/PhysRevLett.72.3439. PMID 10056200. 
  11. ^ Sims, GE; Jun, SR; Wu, GA; Kim, SH (2009). "Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions". Proceedings of the National Academy of Sciences of the United States of America 106 (8): 2677–82. doi:10.1073/pnas.0813249106. PMC 2634796. PMID 19188606. 
  12. ^ Itzkovitz, S; Hodis, E; Segal, E (2010). "Overlapping codes within protein-coding sequences". Genome Research 20 (11): 1582–9. doi:10.1101/gr.105072.110. PMC 2963821. PMID 20841429. 
  13. ^ Ofran, Y; Rost, B (2003). "Analysing six types of protein-protein interfaces". Journal of Molecular Biology 325 (2): 377–87. doi:10.1016/s0022-2836(02)01223-8. PMID 12488102. 
  14. ^ DeDeo, Simon; Hawkins, Robert X. D.; Klingenstein, Sara; Hitchcock, Tim (2013). "Bootstrap Methods for the Empirical Study of Decision-Making and Information Flows in Social Systems". "Entropy" 15 (6): 2246–2276. doi:10.3390/e15062246. 
  15. ^ Klingenstein, Sara; Hitchcock, Tim; DeDeo, Simon (2014). "The civilizing process in London’s Old Bailey". "Proceedings of the National Academy of Sciences" 111 (26): 9419–9424. doi:10.1073/pnas.1405984111. 
  16. ^ Goodfellow, Ian J.; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua (2014). Generative Adversarial Networks. NIPS. arXiv:1406.2661.

Further reading[edit]

External links[edit]