Divergence (statistics)


In statistics and information geometry, a divergence or contrast function is a function that measures the "distance" of one probability distribution from another on a statistical manifold. A divergence is a weaker notion than that of a distance (metric) in mathematics: in particular, a divergence need not be symmetric (that is, in general the divergence from p to q is not equal to the divergence from q to p), and it need not satisfy the triangle inequality.


Suppose S is a space of all probability distributions with common support. Then a divergence on S is a function D(· || ·): S × S → R satisfying [1]

  1. D(p || q) ≥ 0 for all p, q ∈ S,
  2. D(p || q) = 0 if and only if p = q,
  3. The matrix g(D) (see definition in the "geometrical properties" section) is strictly positive-definite everywhere on S.[2]

The dual divergence D* is defined as

    D^*(p \parallel q) = D(q \parallel p).
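As an illustrative sketch (not part of the article; the distribution values below are arbitrary), the defining properties and the asymmetry of a divergence can be checked numerically for the Kullback–Leibler divergence on discrete distributions:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) between two discrete
    distributions given as lists of probabilities with common support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

d_pq = kl(p, q)   # divergence from p to q
d_qp = kl(q, p)   # dual divergence: D*(p || q) = D(q || p)

print(d_pq >= 0 and d_qp >= 0)   # property 1: non-negativity
print(kl(p, p) == 0.0)           # property 2: zero exactly when p = q
print(d_pq != d_qp)              # asymmetry: a divergence is not a metric
```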

Geometrical properties

Many properties of divergences can be derived if we restrict S to be a statistical manifold, meaning that it can be parametrized with a finite-dimensional coordinate system θ, so that for a distribution pS we can write p = p(θ).

For a pair of points p, qS with coordinates θp and θq, denote the partial derivatives of D(p || q) as

    \begin{align}
    D((\partial_i)_p \parallel q) \ \ &\stackrel{\mathrm{def}}{=}\ \ \tfrac{\partial}{\partial\theta^i_p} D(p \parallel q), \\
    D((\partial_i\partial_j)_p \parallel (\partial_k)_q) \ \ &\stackrel{\mathrm{def}}{=}\ \ \tfrac{\partial}{\partial\theta^i_p} \tfrac{\partial}{\partial\theta^j_p}\tfrac{\partial}{\partial\theta^k_q}D(p \parallel q), \ \ \mathrm{etc.}
    \end{align}

Now we restrict these functions to a diagonal p = q, and denote [3]

    \begin{align}
    D[\partial_i\parallel\cdot]\ &:\ p \mapsto D((\partial_i)_p \parallel p), \\
    D[\partial_i\parallel\partial_j]\ &:\ p \mapsto D((\partial_i)_p \parallel (\partial_j)_p),\ \ \mathrm{etc.}
    \end{align}

By definition, the function D(p || q) is minimized at p = q, and therefore

    \begin{align}
    & D[\partial_i\parallel\cdot] = D[\cdot\parallel\partial_i] = 0, \\
    & D[\partial_i\partial_j\parallel\cdot] = D[\cdot\parallel\partial_i\partial_j] = -D[\partial_i\parallel\partial_j] \ \equiv\ g_{ij}^{(D)},
    \end{align}

where the matrix g(D) is positive-definite (by condition 3 above) and defines a unique Riemannian metric on the manifold S.
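This relation can be checked numerically in a simple case (a sketch with illustrative names, not part of the article): for the one-parameter Bernoulli family under the Kullback–Leibler divergence, minus the mixed second derivative on the diagonal should reproduce the Fisher information 1/(θ(1 − θ)).

```python
import math

def kl_bern(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def metric_from_divergence(theta, h=1e-4):
    """g^(D) = -D[d_i || d_j]: minus the mixed second derivative of
    D(p_a || p_b) with respect to (a, b), evaluated on the diagonal
    a = b = theta, approximated by a central finite difference."""
    d = kl_bern
    mixed = (d(theta + h, theta + h) - d(theta + h, theta - h)
             - d(theta - h, theta + h) + d(theta - h, theta - h)) / (4 * h * h)
    return -mixed

theta = 0.3
g = metric_from_divergence(theta)
fisher = 1 / (theta * (1 - theta))  # Fisher information of the Bernoulli family
print(abs(g - fisher) < 1e-3)       # the induced metric matches the Fisher metric
```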

Divergence D(· || ·) also defines a unique torsion-free affine connection ∇(D) with coefficients

    \Gamma_{ij,k}^{(D)} = -D[\partial_i\partial_j\parallel\partial_k],

and the dual to this connection ∇* is generated by the dual divergence D*.

Thus, a divergence D(· || ·) generates on a statistical manifold a unique dualistic structure (g(D), ∇(D), ∇(D*)). The converse is also true: every torsion-free dualistic structure on a statistical manifold is induced from some globally defined divergence function (which however need not be unique).[4]

For example, when D is an f-divergence for some function ƒ(·), then it generates the metric g(Df) = c·g and the connection ∇(Df) = ∇(α), where g is the canonical Fisher information metric, ∇(α) is the α-connection, c = ƒ′′(1), and α = 3 + 2ƒ′′′(1)/ƒ′′(1).
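The stated formulas for c and α can be checked numerically for a concrete generator (a sketch, not part of the article): in the convention Df(p || q) = ∫ p ƒ(q/p) used below, the Kullback–Leibler divergence corresponds to ƒ(u) = −ln u, for which ƒ′′(u) = 1/u² and ƒ′′′(u) = −2/u³ give the exact values c = 1 and α = −1.

```python
import math

def f(u):
    """Generator of the KL divergence in the convention
    D_f(p || q) = \u222b p f(q/p): here f(u) = -ln u."""
    return -math.log(u)

h = 1e-3
# central finite differences for f''(1) and f'''(1)
f2 = (f(1 + h) - 2 * f(1) + f(1 - h)) / h**2
f3 = (f(1 + 2*h) - 2*f(1 + h) + 2*f(1 - h) - f(1 - 2*h)) / (2 * h**3)

c = f2                   # metric scale g^(D_f) = c * g; exact value 1
alpha = 3 + 2 * f3 / f2  # connection parameter; exact value -1 for this f
print(abs(c - 1) < 1e-4, abs(alpha + 1) < 1e-3)
```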


Examples

The largest and most frequently used class of divergences is formed by the so-called f-divergences; however, other types of divergence functions are also encountered in the literature.


f-divergences

Main article: f-divergence

This family of divergences is generated by functions f(u) that are convex on u > 0 and satisfy f(1) = 0. Then an f-divergence is defined as

    D_f(p\parallel q) = \int p(x)f\bigg(\frac{q(x)}{p(x)}\bigg) dx
Kullback–Leibler divergence: 
    D_\mathrm{KL}(p \parallel q) = \int p(x)\ln\left( \frac{p(x)}{q(x)}\right) dx
squared Hellinger distance: 
    H^2(p,\, q) = 2 \int \Big( \sqrt{p(x)} - \sqrt{q(x)}\, \Big)^2 dx
Jeffreys divergence: 
    D_J(p \parallel q) = \int (p(x) - q(x))\big( \ln p(x) - \ln q(x) \big) dx
Chernoff's α-divergence: 
    D^{(\alpha)}(p \parallel q) = \frac{4}{1-\alpha^2}\bigg(1 - \int p(x)^\frac{1-\alpha}{2} q(x)^\frac{1+\alpha}{2} dx \bigg)
exponential divergence: 
    D_e(p \parallel q) = \int p(x)\big( \ln p(x) - \ln q(x) \big)^2 dx
Kagan's divergence: 
    D_{\chi^2}(p \parallel q) = \frac12 \int \frac{(p(x) - q(x))^2}{p(x)} dx
(α,β)-product divergence: 
    D_{\alpha,\beta}(p \parallel q) = \frac{2}{(1-\alpha)(1-\beta)} \int 
        \Big(1 - \Big(\tfrac{q(x)}{p(x)}\Big)^{\!\!\frac{1-\alpha}{2}} \Big) 
        \Big(1 - \Big(\tfrac{q(x)}{p(x)}\Big)^{\!\!\frac{1-\beta}{2}} \Big) 
        p(x) dx
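For discrete distributions, the general formula above can be implemented directly and checked against the closed forms in the list (a sketch with arbitrary example distributions, not part of the article). The generators used below are ƒ(u) = −ln u for Kullback–Leibler, ƒ(u) = 2(√u − 1)² for the squared Hellinger distance, and ƒ(u) = ½(1 − u)² for Kagan's divergence:

```python
import math

def f_div(p, q, f):
    """Discrete f-divergence: D_f(p || q) = sum_x p(x) f(q(x)/p(x))."""
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

# the same three divergences via their generators...
kl    = f_div(p, q, lambda u: -math.log(u))                  # Kullback-Leibler
hell2 = f_div(p, q, lambda u: 2 * (math.sqrt(u) - 1) ** 2)   # squared Hellinger
kagan = f_div(p, q, lambda u: 0.5 * (1 - u) ** 2)            # Kagan's divergence

# ...and via the direct formulas from the list above
kl_direct    = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
hell2_direct = 2 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
kagan_direct = 0.5 * sum((pi - qi) ** 2 / pi for pi, qi in zip(p, q))

print(math.isclose(kl, kl_direct),
      math.isclose(hell2, hell2_direct),
      math.isclose(kagan, kagan_direct))
```

Expanding ƒ inside the sum recovers each closed form, e.g. p · ½(1 − q/p)² = ½(p − q)²/p for Kagan's divergence.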



See also