Divergence (statistics)

From Wikipedia, the free encyclopedia
Jump to: navigation, search

In statistics and information geometry, divergence or a contrast function is a function which establishes the "distance" of one probability distribution to the other on a statistical manifold. The divergence is a weaker notion than that of the distance, in particular the divergence need not be symmetric (that is, in general the divergence from p to q is not equal to the divergence from q to p), and need not satisfy the triangle inequality.

Definition[edit]

Suppose S is a space of all probability distributions with common support. Then a divergence on S is a function D(· || ·): S×SR satisfying [1]

  1. D(p || q) ≥ 0 for all p, qS,
  2. D(p || q) = 0 if and only if p = q,

The dual divergence D* is defined as

Geometrical properties[edit]

Many properties of divergences can be derived if we restrict S to be a statistical manifold, meaning that it can be parametrized with a finite-dimensional coordinate system θ, so that for a distribution pS we can write p = p(θ).

For a pair of points p, qS with coordinates θp and θq, denote the partial derivatives of D(p || q) as

Now we restrict these functions to a diagonal p = q, and denote [2]

By definition, the function D(p || q) is minimized at p = q, and therefore

where matrix g(D) is positive semi-definite and defines a unique Riemannian metric on the manifold S.

Divergence D(· || ·) also defines a unique torsion-free affine connection(D) with coefficients

and the dual to this connection ∇* is generated by the dual divergence D*.

Thus, a divergence D(· || ·) generates on a statistical manifold a unique dualistic structure (g(D), ∇(D), ∇(D*)). The converse is also true: every torsion-free dualistic structure on a statistical manifold is induced from some globally defined divergence function (which however need not be unique).[3]

For example, when D is an f-divergence for some function ƒ(·), then it generates the metric g(Df) = c·g and the connection (Df) = ∇(α), where g is the canonical Fisher information metric, ∇(α) is the α-connection, c = ƒ′′(1), and α = 3 + 2ƒ′′′(1)/ƒ′′(1).

Examples[edit]

The largest and most frequently used class of divergences form the so-called f-divergences, however other types of divergence functions are also encountered in the literature.

f-divergences[edit]

This family of divergences are generated through functions f(u), convex on u > 0 and such that f(1) = 0. Then an f-divergence is defined as

Kullback–Leibler divergence:
squared Hellinger distance:
Jeffreys divergence:
Chernoff's α-divergence:
exponential divergence:
Kagan's divergence:
(α,β)-product divergence:

M-divergences[edit]

S-divergences[edit]

See also[edit]

References[edit]