# Jensen–Shannon divergence

In probability theory and statistics, the Jensen–Shannon divergence is a method of measuring the similarity between two probability distributions. It is also known as information radius (IRad)[1][2] or total divergence to the average.[3] It is based on the Kullback–Leibler divergence, with some notable (and useful) differences, including that it is symmetric and always has a finite value. The square root of the Jensen–Shannon divergence is a metric, often referred to as the Jensen–Shannon distance.[4][5][6]

## Definition

Consider the set ${\displaystyle M_{+}^{1}(A)}$ of probability distributions where ${\displaystyle A}$ is a set provided with some σ-algebra of measurable subsets. In particular we can take ${\displaystyle A}$ to be a finite or countable set with all subsets being measurable.

The Jensen–Shannon divergence (JSD) is a symmetrized and smoothed version of the Kullback–Leibler divergence ${\displaystyle D(P\parallel Q)}$. It is defined by

${\displaystyle {\rm {JSD}}(P\parallel Q)={\frac {1}{2}}D(P\parallel M)+{\frac {1}{2}}D(Q\parallel M),}$

where ${\displaystyle M={\frac {1}{2}}(P+Q)}$.
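The definition above can be illustrated with a minimal Python sketch for discrete distributions given as lists of probabilities. The function names `kl_divergence` and `jsd` and the `base` parameter are illustrative choices, not part of any particular library.

```python
import math

def kl_divergence(p, q, base=2.0):
    """Kullback-Leibler divergence D(p || q) for discrete distributions.

    Assumes q[i] > 0 wherever p[i] > 0; otherwise the divergence is infinite
    and this sketch raises ZeroDivisionError.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:  # terms with p_i = 0 contribute 0 by convention
            total += pi * math.log(pi / qi, base)
    return total

def jsd(p, q, base=2.0):
    """Jensen-Shannon divergence using the mixture M = (P + Q) / 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    # M is positive wherever P or Q is, so both KL terms are finite.
    return 0.5 * kl_divergence(p, m, base) + 0.5 * kl_divergence(q, m, base)

p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
print(jsd(p, q))  # same value as jsd(q, p), since the definition is symmetric
print(jsd(q, p))
```

Because the mixture is nonzero wherever either input is nonzero, the result stays finite even when the two supports differ, unlike the plain Kullback–Leibler divergence.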

The geometric Jensen–Shannon divergence[7] (or G-Jensen–Shannon divergence) yields a closed-form formula for the divergence between two Gaussian distributions by taking the geometric mean of the two distributions in place of the arithmetic mean ${\displaystyle M}$.

A more general definition, allowing for the comparison of more than two probability distributions, is:

${\displaystyle {\begin{aligned}{\rm {JSD}}_{\pi _{1},\ldots ,\pi _{n}}(P_{1},P_{2},\ldots ,P_{n})&=\sum _{i}\pi _{i}D(P_{i}\parallel M)\\&=H\left(M\right)-\sum _{i=1}^{n}\pi _{i}H(P_{i})\end{aligned}}}$

where

${\displaystyle {\begin{aligned}M&:=\sum _{i=1}^{n}\pi _{i}P_{i}\end{aligned}}}$

and ${\displaystyle \pi _{1},\ldots ,\pi _{n}}$ are weights that are selected for the probability distributions ${\displaystyle P_{1},P_{2},\ldots ,P_{n}}$, and ${\displaystyle H(P)}$ is the Shannon entropy for distribution ${\displaystyle P}$. For the two-distribution case described above,

${\displaystyle P_{1}=P,P_{2}=Q,\pi _{1}=\pi _{2}={\frac {1}{2}}.\ }$

Hence, for those distributions ${\displaystyle P,Q}$

${\displaystyle {\rm {JSD}}(P\parallel Q)=H(M)-{\frac {1}{2}}{\bigg (}H(P)+H(Q){\bigg )}.}$
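The entropy form of the general definition can be sketched in Python as follows. The names `entropy` and `generalized_jsd` are hypothetical helpers for this illustration, and the distributions are assumed to be discrete and defined over the same finite set of outcomes.

```python
import math

def entropy(p, base=2.0):
    """Shannon entropy H(P) of a discrete distribution (0 log 0 taken as 0)."""
    return -sum(x * math.log(x, base) for x in p if x > 0.0)

def generalized_jsd(dists, weights, base=2.0):
    """Weighted JSD: H(sum_i pi_i P_i) - sum_i pi_i H(P_i)."""
    n_outcomes = len(dists[0])
    # Mixture M = sum_i pi_i P_i, computed outcome by outcome.
    mixture = [
        sum(w * d[k] for w, d in zip(weights, dists))
        for k in range(n_outcomes)
    ]
    weighted_entropies = sum(w * entropy(d, base) for w, d in zip(weights, dists))
    return entropy(mixture, base) - weighted_entropies

# Two-distribution case with equal weights.
two = generalized_jsd([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]], [0.5, 0.5])

# Three distributions with disjoint supports and uniform weights.
three = generalized_jsd(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [1.0 / 3.0] * 3,
)
print(two, three)
```

With weights of one half each, the function reduces to the two-distribution formula above.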

## Bounds

The Jensen–Shannon divergence is bounded by 1 for two probability distributions, given that one uses the base 2 logarithm.[8]

${\displaystyle 0\leq {\rm {JSD}}(P\parallel Q)\leq 1}$

With this normalization, it is a lower bound on the total variation distance between P and Q:

${\displaystyle {\rm {JSD}}(P\parallel Q)\leq {\frac {1}{2}}\|P-Q\|_{1}={\frac {1}{2}}\sum _{\omega \in \Omega }|P(\omega )-Q(\omega )|.}$

With base-e logarithm, which is commonly used in statistical thermodynamics, the upper bound is ${\displaystyle \ln(2)}$. In general, the bound in base b is ${\displaystyle \log _{b}(2)}$:

${\displaystyle 0\leq {\rm {JSD}}(P\parallel Q)\leq \log _{b}(2)}$

More generally, for ${\displaystyle n}$ probability distributions the Jensen–Shannon divergence is bounded by ${\displaystyle \log _{b}(n)}$:[8]

${\displaystyle 0\leq {\rm {JSD}}_{\pi _{1},\ldots ,\pi _{n}}(P_{1},P_{2},\ldots ,P_{n})\leq \log _{b}(n)}$
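The bounds in this section can be spot-checked numerically. The following is a small sketch, not a proof: it draws random pairs of distributions over four outcomes (an arbitrary choice) and counts how often the inequalities fail beyond a floating-point tolerance. All helper names are illustrative.

```python
import math
import random

def entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution (0 log 0 taken as 0)."""
    return -sum(x * math.log(x, base) for x in p if x > 0.0)

def jsd(p, q, base=2.0):
    """Two-distribution JSD in its entropy form H(M) - (H(P) + H(Q)) / 2."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return entropy(m, base) - 0.5 * (entropy(p, base) + entropy(q, base))

def total_variation(p, q):
    """Total variation distance, i.e. half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def random_distribution(rng, k):
    """Normalize k uniform random numbers into a probability vector."""
    raw = [rng.random() for _ in range(k)]
    s = sum(raw)
    return [x / s for x in raw]

rng = random.Random(0)  # fixed seed so the check is reproducible
tol = 1e-12             # slack for floating-point rounding
violations = 0
for _ in range(1000):
    p = random_distribution(rng, 4)
    q = random_distribution(rng, 4)
    d = jsd(p, q)  # base 2
    tv = total_variation(p, q)
    if d < -tol or d > tv + tol or tv > 1.0 + tol:
        violations += 1
    if jsd(p, q, base=math.e) > math.log(2) + tol:
        violations += 1
print(violations)
```

Distributions with disjoint supports, such as `[1.0, 0.0]` and `[0.0, 1.0]`, attain the upper bound in either base.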

## Relation to mutual information

The Jensen–Shannon divergence is the mutual information between a random variable ${\displaystyle X}$ associated to a mixture distribution between ${\displaystyle P}$ and ${\displaystyle Q}$ and the binary indicator variable ${\displaystyle Z}$ that is used to switch between ${\displaystyle P}$ and ${\displaystyle Q}$ to produce the mixture. Let ${\displaystyle X}$ be some abstract function on the underlying set of events that discriminates well between events, and choose the value of ${\displaystyle X}$ according to ${\displaystyle P}$ if ${\displaystyle Z=0}$ and according to ${\displaystyle Q}$ if ${\displaystyle Z=1}$, where ${\displaystyle Z}$ is equiprobable. That is, we are choosing ${\displaystyle X}$ according to the probability measure ${\displaystyle M=(P+Q)/2}$, and its distribution is the mixture distribution. We compute

${\displaystyle {\begin{aligned}I(X;Z)&=H(X)-H(X|Z)\\&=-\sum M\log M+{\frac {1}{2}}\left[\sum P\log P+\sum Q\log Q\right]\\&=-\sum {\frac {P}{2}}\log M-\sum {\frac {Q}{2}}\log M+{\frac {1}{2}}\left[\sum P\log P+\sum Q\log Q\right]\\&={\frac {1}{2}}\sum P\left(\log P-\log M\right)+{\frac {1}{2}}\sum Q\left(\log Q-\log M\right)\\&={\rm {JSD}}(P\parallel Q)\end{aligned}}}$

It follows from the above result that the Jensen–Shannon divergence lies between 0 and 1 when the base-2 logarithm is used, because mutual information is non-negative and bounded above by ${\displaystyle H(Z)=1}$.
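The identity between the Jensen–Shannon divergence and the mutual information ${\displaystyle I(X;Z)}$ can be checked numerically. The sketch below builds the joint distribution of ${\displaystyle (X,Z)}$ explicitly, assuming discrete ${\displaystyle P}$ and ${\displaystyle Q}$ on the same outcomes; the function names are illustrative.

```python
import math

def jsd(p, q):
    """JSD in bits, written as the average of two KL terms against M."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    total = 0.0
    for dist in (p, q):
        for x, mx in zip(dist, m):
            if x > 0.0:
                total += 0.5 * x * math.log2(x / mx)
    return total

def mutual_information_with_switch(p, q):
    """I(X;Z) in bits, where Z is a fair coin and X ~ P if Z = 0, X ~ Q if Z = 1."""
    # joint[z][x] = Pr(Z = z) * Pr(X = x | Z = z)
    joint = [[0.5 * a for a in p], [0.5 * b for b in q]]
    p_z = [sum(row) for row in joint]                       # marginal of Z
    p_x = [joint[0][k] + joint[1][k] for k in range(len(p))]  # marginal of X (the mixture M)
    total = 0.0
    for z in range(2):
        for x in range(len(p)):
            if joint[z][x] > 0.0:
                total += joint[z][x] * math.log2(joint[z][x] / (p_z[z] * p_x[x]))
    return total

p = [0.1, 0.4, 0.5]
q = [0.3, 0.3, 0.4]
print(jsd(p, q), mutual_information_with_switch(p, q))
```

The two printed values agree up to floating-point rounding, and neither can exceed the one bit of entropy carried by the switch variable.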

One can apply the same principle to a joint distribution and the product of its two marginal distributions (in analogy to Kullback–Leibler divergence and mutual information) to measure how reliably one can decide whether a given response comes from the joint distribution or the product distribution, subject to the assumption that these are the only two possibilities.[9]

## Quantum Jensen–Shannon divergence

The generalization of probability distributions to density matrices allows one to define the quantum Jensen–Shannon divergence (QJSD).[10][11] It is defined for a set of density matrices ${\displaystyle (\rho _{1},\ldots ,\rho _{n})}$ and a probability distribution ${\displaystyle \pi =(\pi _{1},\ldots ,\pi _{n})}$ as

${\displaystyle {\rm {QJSD}}(\rho _{1},\ldots ,\rho _{n})=S\left(\sum _{i=1}^{n}\pi _{i}\rho _{i}\right)-\sum _{i=1}^{n}\pi _{i}S(\rho _{i})}$

where ${\displaystyle S(\rho )}$ is the von Neumann entropy of ${\displaystyle \rho }$. This quantity was introduced in quantum information theory, where it is called the Holevo information: it gives an upper bound on the amount of classical information encoded by the quantum states ${\displaystyle (\rho _{1},\ldots ,\rho _{n})}$ under the prior distribution ${\displaystyle \pi }$ (see Holevo's theorem).[12] The quantum Jensen–Shannon divergence for ${\displaystyle \pi =\left({\frac {1}{2}},{\frac {1}{2}}\right)}$ and two density matrices is a symmetric function that is everywhere defined, bounded, and equal to zero only if the two density matrices are the same. It is the square of a metric for pure states,[13] and it was recently shown that this metric property holds for mixed states as well.[14][15] The Bures metric is closely related to the quantum JS divergence; it is the quantum analog of the Fisher information metric.
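As a minimal illustration of the QJSD formula, the sketch below restricts attention to single-qubit states, which is an assumption made here to keep the example free of matrix libraries. A qubit density matrix can be written as ${\displaystyle \rho =(I+{\vec {r}}\cdot {\vec {\sigma }})/2}$ with Bloch vector ${\displaystyle {\vec {r}}}$, its eigenvalues are ${\displaystyle (1\pm |{\vec {r}}|)/2}$, and a mixture of density matrices corresponds to the same mixture of Bloch vectors. The function names are hypothetical.

```python
import math

def binary_entropy(x):
    """Entropy in bits of the two-point distribution (x, 1 - x)."""
    return -sum(t * math.log2(t) for t in (x, 1.0 - x) if t > 0.0)

def von_neumann_entropy_qubit(r):
    """Von Neumann entropy of a qubit state given by its Bloch vector r."""
    # Clip the norm at 1 to guard against rounding slightly outside the Bloch ball.
    norm = min(1.0, math.sqrt(sum(c * c for c in r)))
    # The eigenvalues of rho are (1 + |r|) / 2 and (1 - |r|) / 2.
    return binary_entropy((1.0 + norm) / 2.0)

def qjsd_qubits(bloch_vectors, weights):
    """QJSD = S(sum_i pi_i rho_i) - sum_i pi_i S(rho_i), for qubit states."""
    # The mixed state's Bloch vector is the weighted average of the inputs.
    mixed = [
        sum(w * r[k] for w, r in zip(weights, bloch_vectors))
        for k in range(3)
    ]
    average_entropy = sum(
        w * von_neumann_entropy_qubit(r) for w, r in zip(weights, bloch_vectors)
    )
    return von_neumann_entropy_qubit(mixed) - average_entropy

ket0 = (0.0, 0.0, 1.0)      # |0>
ket1 = (0.0, 0.0, -1.0)     # |1>, orthogonal to |0>
ket_plus = (1.0, 0.0, 0.0)  # |+>

print(qjsd_qubits([ket0, ket1], [0.5, 0.5]))      # orthogonal pure states
print(qjsd_qubits([ket0, ket_plus], [0.5, 0.5]))  # non-orthogonal pure states
print(qjsd_qubits([ket0, ket0], [0.5, 0.5]))      # identical states
```

For general density matrices one would instead diagonalize each matrix and the mixture to obtain the eigenvalues entering the von Neumann entropy.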

## Jensen–Shannon centroid

The centroid C* of a finite set of probability distributions can be defined as the minimizer of the sum (equivalently, the average) of the Jensen–Shannon divergences between a probability distribution and the prescribed set of distributions:

${\displaystyle C^{*}=\arg \min _{Q}\sum _{i=1}^{n}{\rm {JSD}}(P_{i}\parallel Q)}$

An efficient algorithm (CCCP), based on a difference of convex functions, has been reported for computing the Jensen–Shannon centroid of a set of discrete distributions (histograms).[16]
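To make the minimization concrete, the following sketch finds an approximate centroid by brute-force grid search. It is not the CCCP algorithm mentioned above, and it assumes distributions over only two outcomes, each described by a single parameter ${\displaystyle p}$; the function name `js_centroid_bernoulli` and the grid resolution are illustrative choices.

```python
import math

def entropy(p):
    """Shannon entropy in bits (0 log 0 taken as 0)."""
    return -sum(x * math.log2(x) for x in p if x > 0.0)

def jsd(p, q):
    """Two-distribution JSD in its entropy form."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return entropy(m) - 0.5 * (entropy(p) + entropy(q))

def js_centroid_bernoulli(params, steps=1000):
    """Grid search for q minimizing sum_i JSD(Bernoulli(p_i) || Bernoulli(q))."""
    best_q, best_cost = None, math.inf
    for i in range(steps + 1):
        q = i / steps
        cost = sum(jsd([p, 1.0 - p], [q, 1.0 - q]) for p in params)
        if cost < best_cost:
            best_q, best_cost = q, cost
    return best_q

print(js_centroid_bernoulli([0.2, 0.8]))       # symmetric pair of inputs
print(js_centroid_bernoulli([0.1, 0.2, 0.6]))
```

Grid search scales poorly with the number of outcomes, which is one reason a dedicated optimization method is preferable for histograms with many bins.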

## Applications

The Jensen–Shannon divergence has been applied in bioinformatics and genome comparison,[17][18] in protein surface comparison,[19] in the social sciences,[20] in the quantitative study of history,[21] in fire experiments,[22] and in machine learning.[23]

