# Hellinger distance

In probability and statistics, the Hellinger distance (closely related to, although different from, the Bhattacharyya distance) is used to quantify the similarity between two probability distributions. It is a type of f-divergence. The Hellinger distance is defined in terms of the Hellinger integral, which was introduced by Ernst Hellinger in 1909.[1][2]

It is sometimes called the Jeffreys distance.[3][4]

## Definition

### Measure theory

To define the Hellinger distance in terms of measure theory, let ${\displaystyle P}$ and ${\displaystyle Q}$ denote two probability measures on a measure space ${\displaystyle {\mathcal {X}}}$ that are absolutely continuous with respect to an auxiliary measure ${\displaystyle \lambda }$. Such a measure always exists, for example ${\displaystyle \lambda =P+Q}$. The square of the Hellinger distance between ${\displaystyle P}$ and ${\displaystyle Q}$ is defined as the quantity

${\displaystyle H^{2}(P,Q)={\frac {1}{2}}\displaystyle \int _{\mathcal {X}}\left({\sqrt {p(x)}}-{\sqrt {q(x)}}\right)^{2}\lambda (dx).}$

Here, ${\displaystyle P(dx)=p(x)\lambda (dx)}$ and ${\displaystyle Q(dx)=q(x)\lambda (dx)}$, i.e. ${\displaystyle p}$ and ${\displaystyle q}$ are the Radon–Nikodym derivatives of P and Q, respectively, with respect to ${\displaystyle \lambda }$. This definition does not depend on ${\displaystyle \lambda }$, i.e. the Hellinger distance between P and Q does not change if ${\displaystyle \lambda }$ is replaced with a different measure with respect to which both P and Q are absolutely continuous. For compactness, the above formula is often written as

${\displaystyle H^{2}(P,Q)={\frac {1}{2}}\int _{\mathcal {X}}\left({\sqrt {P(dx)}}-{\sqrt {Q(dx)}}\right)^{2}.}$

### Probability theory using Lebesgue measure

To define the Hellinger distance in terms of elementary probability theory, we take λ to be the Lebesgue measure, so that dP / dλ and dQ / dλ are simply probability density functions. If we denote the densities as f and g, respectively, the squared Hellinger distance can be expressed as a standard calculus integral

${\displaystyle H^{2}(f,g)={\frac {1}{2}}\int \left({\sqrt {f(x)}}-{\sqrt {g(x)}}\right)^{2}\,dx=1-\int {\sqrt {f(x)g(x)}}\,dx,}$

where the second form can be obtained by expanding the square and using the fact that the integral of a probability density over its domain equals 1.
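The equality of the two forms can be checked numerically. The following is a minimal sketch, not a reference implementation: the densities f(x) = 2x and g(x) = 2(1 − x) on [0, 1] and the helper `midpoint_integral` are illustrative assumptions that do not come from the text above.

```python
import math

def midpoint_integral(func, lo, hi, n=200000):
    """Approximate the integral of func over [lo, hi] with the midpoint rule."""
    h = (hi - lo) / n
    return h * sum(func(lo + (i + 0.5) * h) for i in range(n))

# Illustrative densities on [0, 1] (assumed for this sketch only).
def f(x):
    return 2.0 * x

def g(x):
    return 2.0 * (1.0 - x)

# First form: half the integral of the squared difference of square roots.
h2_first = 0.5 * midpoint_integral(
    lambda x: (math.sqrt(f(x)) - math.sqrt(g(x))) ** 2, 0.0, 1.0)

# Second form: one minus the integral of sqrt(f * g).
# For this pair the integral of sqrt(f * g) equals pi / 4.
h2_second = 1.0 - midpoint_integral(lambda x: math.sqrt(f(x) * g(x)), 0.0, 1.0)

print(h2_first, h2_second)
```

Both printed values should agree up to discretization error, since the two forms differ only by the integrals of f and g, each of which equals 1.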

The Hellinger distance H(P, Q) satisfies the property (derivable from the Cauchy–Schwarz inequality)

${\displaystyle 0\leq H(P,Q)\leq 1.}$
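A short sketch of the argument, written for the densities f and g used above: the integrand ${\displaystyle {\sqrt {f(x)g(x)}}}$ is non-negative, and the Cauchy–Schwarz inequality applied to ${\displaystyle {\sqrt {f}}}$ and ${\displaystyle {\sqrt {g}}}$ gives

${\displaystyle 0\leq \int {\sqrt {f(x)g(x)}}\,dx\leq {\sqrt {\int f(x)\,dx}}\;{\sqrt {\int g(x)\,dx}}=1.}$

Substituting this into the second form of the squared distance yields

${\displaystyle 0\leq H^{2}(f,g)=1-\int {\sqrt {f(x)g(x)}}\,dx\leq 1,}$

and taking square roots gives the stated bounds.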

### Discrete distributions

For two discrete probability distributions ${\displaystyle P=(p_{1},\ldots ,p_{k})}$ and ${\displaystyle Q=(q_{1},\ldots ,q_{k})}$, their Hellinger distance is defined as

${\displaystyle H(P,Q)={\frac {1}{\sqrt {2}}}\;{\sqrt {\sum _{i=1}^{k}({\sqrt {p_{i}}}-{\sqrt {q_{i}}})^{2}}},}$

which is directly related to the Euclidean norm of the difference of the square root vectors, i.e.

${\displaystyle H(P,Q)={\frac {1}{\sqrt {2}}}\;{\bigl \|}{\sqrt {P}}-{\sqrt {Q}}{\bigr \|}_{2}.}$

Also, ${\displaystyle 1-H^{2}(P,Q)=\sum _{i=1}^{k}{\sqrt {p_{i}q_{i}}}.}$
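As an illustration, the discrete formulas can be evaluated directly. This is a minimal sketch using only the Python standard library; the function names `hellinger` and `root_product_sum` and the example vectors are assumptions made for the example.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as
    equal-length sequences of probabilities."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    # The factor 1/sqrt(2) outside the root equals a factor 1/2 inside it.
    return math.sqrt(total / 2.0)

def root_product_sum(p, q):
    """Sum of sqrt(p_i * q_i), the right-hand side of the identity above."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

# Example distributions (illustrative values only).
p = [0.5, 0.3, 0.2]
q = [0.1, 0.4, 0.5]

h = hellinger(p, q)
print(h)
print(1.0 - h ** 2, root_product_sum(p, q))  # the two values should agree
```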

## Properties

The Hellinger distance forms a bounded metric on the space of probability distributions over a given probability space.

The maximum distance 1 is achieved when P assigns probability zero to every set to which Q assigns a positive probability, and vice versa.

Sometimes the factor ${\displaystyle 1/2}$ in front of the integral is omitted, in which case the Hellinger distance ranges from zero to the square root of two.

The Hellinger distance is related to the Bhattacharyya coefficient ${\displaystyle BC(P,Q)}$ by

${\displaystyle H(P,Q)={\sqrt {1-BC(P,Q)}}.}$
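The properties above can be illustrated for discrete distributions. The sketch below assumes two example distributions with disjoint supports and uses hypothetical function names: the Bhattacharyya coefficient is then zero, the normalized distance attains its maximum of 1, and the variant without the factor 1/2 attains the square root of two.

```python
import math

def bhattacharyya_coefficient(p, q):
    """BC(P, Q) for discrete distributions: sum of sqrt(p_i * q_i)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def hellinger(p, q):
    """Hellinger distance computed as sqrt(1 - BC(P, Q))."""
    # max(0, ...) guards against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q)))

def hellinger_unnormalized(p, q):
    """Variant that omits the factor 1/2, so it ranges from 0 to sqrt(2)."""
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

# Illustrative distributions with disjoint supports.
p = [0.5, 0.5, 0.0, 0.0]
q = [0.0, 0.0, 0.25, 0.75]

print(bhattacharyya_coefficient(p, q))  # no overlap between p and q
print(hellinger(p, q))                  # maximum of the normalized distance
print(hellinger_unnormalized(p, q))     # close to sqrt(2)
```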

Hellinger distances are used in the theory of sequential and asymptotic statistics.[5][6]

The squared Hellinger distance between two normal distributions ${\displaystyle P\sim {\mathcal {N}}(\mu _{1},\sigma _{1}^{2})}$ and ${\displaystyle Q\sim {\mathcal {N}}(\mu _{2},\sigma _{2}^{2})}$ is:

${\displaystyle H^{2}(P,Q)=1-{\sqrt {\frac {2\sigma _{1}\sigma _{2}}{\sigma _{1}^{2}+\sigma _{2}^{2}}}}\,e^{-{\frac {1}{4}}{\frac {(\mu _{1}-\mu _{2})^{2}}{\sigma _{1}^{2}+\sigma _{2}^{2}}}}.}$
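The closed form for normal distributions can be compared with a direct numerical integration of ${\displaystyle 1-\int {\sqrt {f(x)g(x)}}\,dx}$. The following sketch makes that comparison; the function names, the midpoint rule and the integration window of 12 standard deviations are assumptions chosen for the example.

```python
import math

def hellinger_sq_normal(mu1, sigma1, mu2, sigma2):
    """Closed-form squared Hellinger distance between two univariate normals."""
    var_sum = sigma1 ** 2 + sigma2 ** 2
    bc = math.sqrt(2.0 * sigma1 * sigma2 / var_sum) * math.exp(
        -0.25 * (mu1 - mu2) ** 2 / var_sum)
    return 1.0 - bc

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def hellinger_sq_normal_numeric(mu1, sigma1, mu2, sigma2, n=20000):
    """Midpoint-rule estimate of 1 - integral of sqrt(f * g)."""
    spread = 12.0 * max(sigma1, sigma2)  # tails beyond this are negligible
    lo = min(mu1, mu2) - spread
    hi = max(mu1, mu2) + spread
    h = (hi - lo) / n
    bc = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        bc += math.sqrt(normal_pdf(x, mu1, sigma1) * normal_pdf(x, mu2, sigma2))
    return 1.0 - bc * h

# Illustrative parameters: N(0, 1) and N(1, 4).
closed = hellinger_sq_normal(0.0, 1.0, 1.0, 2.0)
numeric = hellinger_sq_normal_numeric(0.0, 1.0, 1.0, 2.0)
print(closed, numeric)  # the two values should agree closely
```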

The squared Hellinger distance between two multivariate normal distributions ${\displaystyle P\sim {\mathcal {N}}(\mu _{1},\Sigma _{1})}$ and ${\displaystyle Q\sim {\mathcal {N}}(\mu _{2},\Sigma _{2})}$ is:[7]

${\displaystyle H^{2}(P,Q)=1-{\frac {\det(\Sigma _{1})^{1/4}\det(\Sigma _{2})^{1/4}}{\det \left({\frac {\Sigma _{1}+\Sigma _{2}}{2}}\right)^{1/2}}}\exp \left\{-{\frac {1}{8}}(\mu _{1}-\mu _{2})^{T}\left({\frac {\Sigma _{1}+\Sigma _{2}}{2}}\right)^{-1}(\mu _{1}-\mu _{2})\right\}}$
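A minimal sketch of the multivariate formula, restricted to two dimensions so that the determinant and the inverse can be written out explicitly without a linear algebra library. The restriction to 2×2 covariance matrices and all function names are assumptions of this example. For diagonal covariances the result should factor into the product of the univariate coefficients, which the last lines check.

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def hellinger_sq_mvn_2d(mu1, cov1, mu2, cov2):
    """Squared Hellinger distance between two bivariate normal distributions."""
    # Average covariance (Sigma_1 + Sigma_2) / 2.
    avg = [[(cov1[i][j] + cov2[i][j]) / 2.0 for j in range(2)] for i in range(2)]
    det_avg = det2(avg)
    dx, dy = mu1[0] - mu2[0], mu1[1] - mu2[1]
    # Quadratic form d^T avg^{-1} d, using the explicit inverse of a 2x2 matrix.
    quad = (avg[1][1] * dx * dx
            - (avg[0][1] + avg[1][0]) * dx * dy
            + avg[0][0] * dy * dy) / det_avg
    prefactor = (det2(cov1) * det2(cov2)) ** 0.25 / math.sqrt(det_avg)
    return 1.0 - prefactor * math.exp(-quad / 8.0)

def bc_normal_1d(mu1, sigma1, mu2, sigma2):
    """Univariate coefficient 1 - H^2 for two normal distributions."""
    var_sum = sigma1 ** 2 + sigma2 ** 2
    return math.sqrt(2.0 * sigma1 * sigma2 / var_sum) * math.exp(
        -0.25 * (mu1 - mu2) ** 2 / var_sum)

# Diagonal covariances (illustrative values): the coordinates are independent.
h2 = hellinger_sq_mvn_2d([0.0, 0.0], [[1.0, 0.0], [0.0, 4.0]],
                         [1.0, -1.0], [[4.0, 0.0], [0.0, 1.0]])
product = bc_normal_1d(0.0, 1.0, 1.0, 2.0) * bc_normal_1d(0.0, 2.0, -1.0, 1.0)
print(h2, 1.0 - product)  # the two values should agree
```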

The squared Hellinger distance between two exponential distributions ${\displaystyle P\sim \mathrm {Exp} (\alpha )}$ and ${\displaystyle Q\sim \mathrm {Exp} (\beta )}$ is:

${\displaystyle H^{2}(P,Q)=1-{\frac {2{\sqrt {\alpha \beta }}}{\alpha +\beta }}.}$
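The exponential formula can be checked against numerical integration. This sketch assumes that α and β are rate parameters, so that the density is α·exp(−αx); the closed form is unchanged if both are instead read as scale parameters. Function names and the integration cutoff are choices made for the example.

```python
import math

def hellinger_sq_exponential(alpha, beta):
    """Closed-form squared Hellinger distance between Exp(alpha) and Exp(beta)."""
    return 1.0 - 2.0 * math.sqrt(alpha * beta) / (alpha + beta)

def hellinger_sq_exponential_numeric(alpha, beta, n=200000):
    """Midpoint-rule estimate of 1 - integral of sqrt(f * g) over [0, upper]."""
    # sqrt(f * g) decays like exp(-(alpha + beta) * x / 2), so at this cutoff
    # it has dropped by a factor of about exp(-40).
    upper = 80.0 / (alpha + beta)
    h = upper / n
    bc = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        f = alpha * math.exp(-alpha * x)  # density of Exp(alpha), rate form
        g = beta * math.exp(-beta * x)    # density of Exp(beta), rate form
        bc += math.sqrt(f * g)
    return 1.0 - bc * h

# Illustrative rates.
closed = hellinger_sq_exponential(1.0, 3.0)
numeric = hellinger_sq_exponential_numeric(1.0, 3.0)
print(closed, numeric)  # the two values should agree closely
```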

The squared Hellinger distance between two Weibull distributions ${\displaystyle P\sim \mathrm {W} (k,\alpha )}$ and ${\displaystyle Q\sim \mathrm {W} (k,\beta )}$ (where ${\displaystyle k}$ is a common shape parameter and ${\displaystyle \alpha ,\beta }$ are the respective scale parameters) is:

${\displaystyle H^{2}(P,Q)=1-{\frac {2(\alpha \beta )^{k/2}}{\alpha ^{k}+\beta ^{k}}}.}$
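A sketch of the Weibull formula, together with a numerical check for shape k = 2. The density used here, (k/s)(x/s)^(k−1)·exp(−(x/s)^k) with scale s, is the standard scale parameterization and is assumed to be the one intended above; function names and the integration cutoff are illustrative.

```python
import math

def hellinger_sq_weibull(k, alpha, beta):
    """Closed-form squared Hellinger distance for a common shape k
    and scale parameters alpha and beta."""
    return 1.0 - 2.0 * (alpha * beta) ** (k / 2.0) / (alpha ** k + beta ** k)

def weibull_pdf(x, k, scale):
    """Weibull density with shape k and the given scale, for x > 0."""
    return (k / scale) * (x / scale) ** (k - 1) * math.exp(-((x / scale) ** k))

def hellinger_sq_weibull_numeric(k, alpha, beta, upper, n=200000):
    """Midpoint-rule estimate of 1 - integral of sqrt(f * g) over [0, upper]."""
    h = upper / n
    bc = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        bc += math.sqrt(weibull_pdf(x, k, alpha) * weibull_pdf(x, k, beta))
    return 1.0 - bc * h

# Illustrative parameters: shape 2, scales 1 and 2; both tails are
# negligible beyond x = 20.
closed = hellinger_sq_weibull(2.0, 1.0, 2.0)
numeric = hellinger_sq_weibull_numeric(2.0, 1.0, 2.0, upper=20.0)
print(closed, numeric)  # the two values should agree closely
```

For k = 1 the Weibull distribution reduces to an exponential distribution, and the formula reduces to the exponential case given earlier.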

The squared Hellinger distance between two Poisson distributions with rate parameters ${\displaystyle \alpha }$ and ${\displaystyle \beta }$, so that ${\displaystyle P\sim \mathrm {Poisson} (\alpha )}$ and ${\displaystyle Q\sim \mathrm {Poisson} (\beta )}$, is:

${\displaystyle H^{2}(P,Q)=1-e^{-{\frac {1}{2}}({\sqrt {\alpha }}-{\sqrt {\beta }})^{2}}.}$
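The Poisson formula follows from summing ${\displaystyle {\sqrt {p_{k}q_{k}}}=e^{-(\alpha +\beta )/2}({\sqrt {\alpha \beta }})^{k}/k!}$ over k, which is an exponential series. The sketch below compares the closed form with a truncated version of that sum; the function names and the truncation at 200 terms are assumptions suitable for moderate rates only.

```python
import math

def hellinger_sq_poisson(alpha, beta):
    """Closed-form squared Hellinger distance between Poisson(alpha)
    and Poisson(beta)."""
    return 1.0 - math.exp(-0.5 * (math.sqrt(alpha) - math.sqrt(beta)) ** 2)

def hellinger_sq_poisson_series(alpha, beta, terms=200):
    """Same quantity from a truncated sum of sqrt(p_k * q_k) over k."""
    r = math.sqrt(alpha * beta)
    term = math.exp(-0.5 * (alpha + beta))  # the k = 0 term
    bc = 0.0
    for k in range(terms):
        bc += term
        term *= r / (k + 1)  # ratio of consecutive terms is r / (k + 1)
    return 1.0 - bc

# Illustrative rates.
closed = hellinger_sq_poisson(2.0, 5.0)
series = hellinger_sq_poisson_series(2.0, 5.0)
print(closed, series)  # the two values should agree closely
```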

The squared Hellinger distance between two beta distributions ${\displaystyle P\sim {\text{Beta}}(a_{1},b_{1})}$ and ${\displaystyle Q\sim {\text{Beta}}(a_{2},b_{2})}$ is:

${\displaystyle H^{2}(P,Q)=1-{\frac {B\left({\frac {a_{1}+a_{2}}{2}},{\frac {b_{1}+b_{2}}{2}}\right)}{\sqrt {B(a_{1},b_{1})B(a_{2},b_{2})}}}}$

where ${\displaystyle B}$ is the beta function.
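A sketch of the beta formula. It evaluates the beta function through `math.lgamma` on a logarithmic scale, a design choice made here to avoid overflow for large parameters; the function names are illustrative. As a check, for Beta(2, 1) and Beta(1, 2) the densities are 2x and 2(1 − x), and the formula gives 1 − π/4.

```python
import math

def log_beta(a, b):
    """Logarithm of the beta function B(a, b), via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def hellinger_sq_beta(a1, b1, a2, b2):
    """Squared Hellinger distance between Beta(a1, b1) and Beta(a2, b2)."""
    log_bc = (log_beta((a1 + a2) / 2.0, (b1 + b2) / 2.0)
              - 0.5 * (log_beta(a1, b1) + log_beta(a2, b2)))
    return 1.0 - math.exp(log_bc)

h2 = hellinger_sq_beta(2.0, 1.0, 1.0, 2.0)
print(h2, 1.0 - math.pi / 4.0)  # the two values should agree closely
```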

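The squared Hellinger distance between two gamma distributions ${\displaystyle P\sim {\text{Gamma}}(a_{1},b_{1})}$ and ${\displaystyle Q\sim {\text{Gamma}}(a_{2},b_{2})}$, with shape parameters ${\displaystyle a_{1},a_{2}}$ and rate parameters ${\displaystyle b_{1},b_{2}}$, is: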

${\displaystyle H^{2}(P,Q)=1-\Gamma \left({\scriptstyle {\frac {a_{1}+a_{2}}{2}}}\right)\left({\frac {b_{1}+b_{2}}{2}}\right)^{-(a_{1}+a_{2})/2}{\sqrt {\frac {b_{1}^{a_{1}}b_{2}^{a_{2}}}{\Gamma (a_{1})\Gamma (a_{2})}}}}$

where ${\displaystyle \Gamma }$ is the gamma function.
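A sketch of the gamma formula, evaluated on a logarithmic scale with `math.lgamma` for numerical stability. It assumes the shape and rate parameterization, with density proportional to x^(a−1)·exp(−bx), under which the formula above integrates correctly; the function name is illustrative. With both shapes equal to 1 the gamma distribution is exponential, so the result should match the exponential formula.

```python
import math

def hellinger_sq_gamma(a1, b1, a2, b2):
    """Squared Hellinger distance between Gamma(a1, b1) and Gamma(a2, b2),
    where a is the shape and b is the rate (an assumption of this sketch)."""
    a_mid = (a1 + a2) / 2.0
    log_bc = (math.lgamma(a_mid)
              - a_mid * math.log((b1 + b2) / 2.0)
              + 0.5 * (a1 * math.log(b1) + a2 * math.log(b2)
                       - math.lgamma(a1) - math.lgamma(a2)))
    return 1.0 - math.exp(log_bc)

# Shape 1 reduces to exponential distributions with rates 1 and 3.
h2 = hellinger_sq_gamma(1.0, 1.0, 1.0, 3.0)
exponential_form = 1.0 - 2.0 * math.sqrt(1.0 * 3.0) / (1.0 + 3.0)
print(h2, exponential_form)  # the two values should agree closely
```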

## Connection with total variation distance

The Hellinger distance ${\displaystyle H(P,Q)}$ and the total variation distance (or statistical distance) ${\displaystyle \delta (P,Q)}$ are related as follows:[8]

${\displaystyle H^{2}(P,Q)\leq \delta (P,Q)\leq {\sqrt {2}}H(P,Q)\,.}$

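The constants in this inequality depend on the normalization convention used for the Hellinger distance (a factor of ${\displaystyle 1/2}$ in the squared distance, equivalently ${\displaystyle 1/{\sqrt {2}}}$ in the distance itself, or no factor at all).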

These inequalities follow immediately from the inequalities between the 1-norm and the 2-norm.
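The inequalities can be spot-checked on randomly generated discrete distributions. The sketch below is an empirical check, not a proof; the total variation distance is taken as half the 1-norm of the difference, and the sampling scheme, seed and function names are assumptions of the example.

```python
import math
import random

def hellinger(p, q):
    """Hellinger distance with the 1/sqrt(2) normalization."""
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def total_variation(p, q):
    """Total variation distance: half the 1-norm of p - q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def random_distribution(k, rng):
    """Random probability vector of length k (normalized uniform weights)."""
    weights = [rng.random() for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)  # fixed seed so the run is reproducible
tolerance = 1e-12       # slack for floating-point rounding
violations = 0
for _ in range(1000):
    p = random_distribution(5, rng)
    q = random_distribution(5, rng)
    h = hellinger(p, q)
    tv = total_variation(p, q)
    lower_ok = h * h <= tv + tolerance
    upper_ok = tv <= math.sqrt(2.0) * h + tolerance
    if not (lower_ok and upper_ok):
        violations += 1

print(violations)  # number of sampled pairs violating either inequality
```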

## Notes

1. ^ Nikulin, M.S. (2001) [1994], "Hellinger distance", Encyclopedia of Mathematics, EMS Press
2. ^ Hellinger, Ernst (1909), "Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen", Journal für die reine und angewandte Mathematik (in German), 1909 (136): 210–271, doi:10.1515/crll.1909.136.210, JFM 40.0393.01, S2CID 121150138
3. ^ "Jeffreys distance - Encyclopedia of Mathematics". encyclopediaofmath.org. Retrieved 2022-05-24.
4. ^ Jeffreys, Harold (1946-09-24). "An invariant form for the prior probability in estimation problems". Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences. 186 (1007): 453–461. Bibcode:1946RSPSA.186..453J. doi:10.1098/rspa.1946.0056. ISSN 0080-4630. PMID 20998741. S2CID 19490929.
5. ^ Torgerson, Erik (1991). "Comparison of Statistical Experiments". Encyclopedia of Mathematics. Vol. 36. Cambridge University Press.
6. ^ Liese, Friedrich; Miescke, Klaus-J. (2008). Statistical Decision Theory: Estimation, Testing, and Selection. Springer. ISBN 978-0-387-73193-3.
7. ^ Pardo, L. (2006). Statistical Inference Based on Divergence Measures. New York: Chapman and Hall/CRC. p. 51. ISBN 1-58488-600-5.
8. ^ Harsha, Prahladh (September 23, 2011). "Lecture notes on communication complexity" (PDF).

## References

• Yang, Grace Lo; Le Cam, Lucien M. (2000). Asymptotics in Statistics: Some Basic Concepts. Berlin: Springer. ISBN 0-387-95036-2.
• Vaart, A. W. van der (19 June 2000). Asymptotic Statistics (Cambridge Series in Statistical and Probabilistic Mathematics). Cambridge, UK: Cambridge University Press. ISBN 0-521-78450-6.
• Pollard, David E. (2002). A user's guide to measure theoretic probability. Cambridge, UK: Cambridge University Press. ISBN 0-521-00289-3.