Divergence (statistics)

In statistics and information geometry, a divergence or contrast function is a function that measures the "distance" of one probability distribution from another on a statistical manifold. A divergence is a weaker notion than a distance (metric) in mathematics: in particular, a divergence need not be symmetric (in general, the divergence from p to q is not equal to the divergence from q to p), and it need not satisfy the triangle inequality.

Definition

Suppose S is the space of all probability distributions with common support. Then a divergence on S is a function D(· || ·): S × S → R satisfying [1]

1. D(p || q) ≥ 0 for all p, q ∈ S,
2. D(p || q) = 0 if and only if p = q,
3. The matrix g(D) (see definition in the "geometrical properties" section) is strictly positive-definite everywhere on S.[2]

The dual divergence D* is defined as

$D^*(p \parallel q) = D(q \parallel p).$
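To make the asymmetry and the dual concrete, here is a minimal numerical sketch using the Kullback–Leibler divergence (one of the examples treated below) on discrete distributions; the particular distributions p and q are assumptions of the illustration:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def kl_dual(p, q):
    """Dual divergence D*(p || q) = D(q || p)."""
    return kl(q, p)

p = [0.5, 0.5]
q = [0.9, 0.1]

d = kl(p, q)          # nonnegative, and positive here since p != q
d_star = kl_dual(p, q)
# d != d_star in general: a divergence need not be symmetric
```

Running this shows D(p || q) ≈ 0.51 while D*(p || q) = D(q || p) ≈ 0.37, illustrating that the two defining conditions above do not force symmetry.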

Geometrical properties

Many properties of divergences can be derived if we restrict S to be a statistical manifold, meaning that it can be parametrized with a finite-dimensional coordinate system θ, so that for a distribution pS we can write p = p(θ).

For a pair of points p, qS with coordinates θp and θq, denote the partial derivatives of D(p || q) as

\begin{align} D((\partial_i)_p \parallel q) \ \ &\stackrel{\mathrm{def}}{=}\ \ \tfrac{\partial}{\partial\theta^i_p} D(p \parallel q), \\ D((\partial_i\partial_j)_p \parallel (\partial_k)_q) \ \ &\stackrel{\mathrm{def}}{=}\ \ \tfrac{\partial}{\partial\theta^i_p} \tfrac{\partial}{\partial\theta^j_p}\tfrac{\partial}{\partial\theta^k_q}D(p \parallel q), \ \ \mathrm{etc.} \end{align}

Now we restrict these functions to the diagonal p = q, and denote [3]

\begin{align} D[\partial_i\parallel\cdot]\ &:\ p \mapsto D((\partial_i)_p \parallel p), \\ D[\partial_i\parallel\partial_j]\ &:\ p \mapsto D((\partial_i)_p \parallel (\partial_j)_p),\ \ \mathrm{etc.} \end{align}

By definition, the function D(p || q) is minimized at p = q, and therefore

\begin{align} & D[\partial_i\parallel\cdot] = D[\cdot\parallel\partial_i] = 0, \\ & D[\partial_i\partial_j\parallel\cdot] = D[\cdot\parallel\partial_i\partial_j] = -D[\partial_i\parallel\partial_j] \ \equiv\ g_{ij}^{(D)}, \end{align}

where the matrix g(D) is positive-definite (by the third defining condition above) and defines a unique Riemannian metric on the manifold S.
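As a one-parameter sketch of this construction (the Bernoulli family, the evaluation point, and the finite-difference step are assumptions of the illustration), the metric induced by the Kullback–Leibler divergence on the Bernoulli family can be checked numerically against the Fisher information 1/(θ(1−θ)):

```python
import math

def kl_bernoulli(tp, tq):
    """KL divergence between Bernoulli(tp) and Bernoulli(tq)."""
    return tp * math.log(tp / tq) + (1 - tp) * math.log((1 - tp) / (1 - tq))

def induced_metric(D, theta, h=1e-4):
    """g = -d^2 D / (dtheta_p dtheta_q) on the diagonal, via central differences."""
    return -(D(theta + h, theta + h) - D(theta + h, theta - h)
             - D(theta - h, theta + h) + D(theta - h, theta - h)) / (4 * h * h)

theta = 0.3
g = induced_metric(kl_bernoulli, theta)
fisher = 1 / (theta * (1 - theta))  # Fisher information of Bernoulli(theta)
# g agrees with fisher up to finite-difference error
```

The agreement of g with the Fisher information here is an instance of the general f-divergence result stated at the end of this section.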

Divergence D(· || ·) also defines a unique torsion-free affine connection ∇(D) with coefficients

$\Gamma_{ij,k}^{(D)} = -D[\partial_i\partial_j\parallel\partial_k],$

and the dual to this connection ∇* is generated by the dual divergence D*.

Thus, a divergence D(· || ·) generates on a statistical manifold a unique dualistic structure (g(D), ∇(D), ∇(D*)). The converse is also true: every torsion-free dualistic structure on a statistical manifold is induced from some globally defined divergence function (which however need not be unique).[4]
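As a small numerical check of the connection coefficients (again an illustrative sketch; the Bernoulli family, the evaluation point, and the finite-difference scheme are assumptions of the example), the single coefficient Γ(D) induced by the Kullback–Leibler divergence on the Bernoulli family vanishes in the mean parametrization θ, i.e. θ is an affine coordinate for the induced connection:

```python
import math

def kl_bernoulli(tp, tq):
    """KL divergence between Bernoulli(tp) and Bernoulli(tq)."""
    return tp * math.log(tp / tq) + (1 - tp) * math.log((1 - tp) / (1 - tq))

def connection_coefficient(D, theta, h=1e-3):
    """Gamma = -d^2/dtheta_p^2 d/dtheta_q D on the diagonal, via central differences."""
    def dq(tp, tq):  # first derivative in the second argument
        return (D(tp, tq + h) - D(tp, tq - h)) / (2 * h)
    # second derivative in the first argument, applied to dq
    return -(dq(theta + h, theta) - 2 * dq(theta, theta) + dq(theta - h, theta)) / (h * h)

gamma = connection_coefficient(kl_bernoulli, 0.3)
# gamma is numerically ~0 in this parametrization
```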

For example, when D is an f-divergence for some function ƒ(·), then it generates the metric g(Df) = c·g and the connection ∇(Df) = ∇(α), where g is the canonical Fisher information metric, ∇(α) is the α-connection, c = ƒ′′(1), and α = 3 + 2ƒ′′′(1)/ƒ′′(1).
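As a short worked check (using the integral convention for f-divergences given in the Examples section below), the Kullback–Leibler divergence corresponds to $f(u) = -\ln u$, so $f''(1) = 1$ and $f'''(1) = -2$; the formulas above then give $c = 1$, i.e. the Fisher metric itself, and $\alpha = 3 + 2\cdot(-2)/1 = -1$.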

Examples

The largest and most frequently used class of divergences is that of the so-called f-divergences; however, other types of divergence functions are also encountered in the literature.

f-divergences

Main article: f-divergence

This family of divergences is generated by functions f(u) that are convex for u > 0 and satisfy f(1) = 0. An f-divergence is then defined as

$D_f(p\parallel q) = \int p(x)f\bigg(\frac{q(x)}{p(x)}\bigg) dx$
Kullback–Leibler divergence: $D_\mathrm{KL}(p \parallel q) = \int p(x)\ln\left( \frac{p(x)}{q(x)}\right) dx$

squared Hellinger distance: $H^2(p,\, q) = 2 \int \Big( \sqrt{p(x)} - \sqrt{q(x)}\, \Big)^2 dx$

Jeffreys divergence: $D_J(p \parallel q) = \int (p(x) - q(x))\big( \ln p(x) - \ln q(x) \big) dx$

Chernoff's α-divergence: $D^{(\alpha)}(p \parallel q) = \frac{4}{1-\alpha^2}\bigg(1 - \int p(x)^\frac{1-\alpha}{2} q(x)^\frac{1+\alpha}{2} dx \bigg)$

exponential divergence: $D_e(p \parallel q) = \int p(x)\big( \ln p(x) - \ln q(x) \big)^2 dx$

Kagan's divergence: $D_{\chi^2}(p \parallel q) = \frac12 \int \frac{(p(x) - q(x))^2}{p(x)} dx$

(α,β)-product divergence: $D_{\alpha,\beta}(p \parallel q) = \frac{2}{(1-\alpha)(1-\beta)} \int \Big(1 - \Big(\tfrac{q(x)}{p(x)}\Big)^{\!\!\frac{1-\alpha}{2}} \Big) \Big(1 - \Big(\tfrac{q(x)}{p(x)}\Big)^{\!\!\frac{1-\beta}{2}} \Big) p(x) dx$
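A generic implementation makes the family concrete. The following sketch (for discrete distributions; the particular generator functions and test distributions are assumptions matching the integral convention above) verifies that suitable choices of f recover the Kullback–Leibler divergence and the squared Hellinger distance:

```python
import math

def f_divergence(f, p, q):
    """D_f(p || q) = sum_x p(x) * f(q(x)/p(x)) for discrete distributions."""
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))

p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]

# f(u) = -ln(u): recovers the Kullback-Leibler divergence
kl = f_divergence(lambda u: -math.log(u), p, q)
kl_direct = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# f(u) = 2*(sqrt(u) - 1)^2: recovers the squared Hellinger distance
# (with the factor 2 used in the formula above)
h2 = f_divergence(lambda u: 2 * (math.sqrt(u) - 1) ** 2, p, q)
h2_direct = 2 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
```

Both pairs of values agree; note that each generator satisfies f(1) = 0 and is convex for u > 0, as the definition requires.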