f-divergence

From Wikipedia, the free encyclopedia
Jump to navigation Jump to search


In probability theory, an ƒ-divergence is a function Df (P  || Q) that measures the difference between two probability distributions P and Q. It helps the intuition to think of the divergence as an average, weighted by the function f, of the odds ratio given by P and Q[citation needed].

These divergences were introduced by Alfréd Rényi[1] in the same paper where he introduced the well-known Rényi entropy. He proved that these divergences decrease in Markov Processes. f-divergences were studied further independently by Csiszár (1963), Morimoto (1963) and Ali & Silvey (1966) and are sometimes known as Csiszár ƒ-divergences, Csiszár-Morimoto divergences or Ali-Silvey distances.

Definition[edit]

Let P and Q be two probability distributions over a space Ω such that P is absolutely continuous with respect to Q. Then, for a convex function f such that f(1) = 0, the f-divergence of P from Q is defined as

If P and Q are both absolutely continuous with respect to a reference distribution μ on Ω then their probability densities p and q satisfy dP = p dμ and dQ = q dμ. In this case the f-divergence can be written as

The f-divergences can be expressed using Taylor series and rewritten using a weighted sum of chi-type distances (Nielsen & Nock (2013)).

Instances of f-divergences[edit]

Many common divergences, such as KL-divergence, Hellinger distance, and total variation distance, are special cases of f-divergence, coinciding with a particular choice of f. The following table lists many of the common divergences between probability distributions and the f function to which they correspond (cf. Liese & Vajda (2006)).

Divergence Corresponding f(t)
KL-divergence
reverse KL-divergence
squared Hellinger distance
Total variation distance
Pearson -divergence
Neyman -divergence (reverse Pearson)
α-divergence
Jensen-Shannon Divergence
α-divergence (other designation)

The function is defined up to the summand , where is any constant.

Properties[edit]

  • Non-negativity: the ƒ-divergence is always positive; it is zero if the measures P and Q coincide. This follows immediately from Jensen’s inequality:
  • Monotonicity: if κ is an arbitrary transition probability that transforms measures P and Q into Pκ and Qκ correspondingly, then
    The equality here holds if and only if the transition is induced from a sufficient statistic with respect to {P, Q}.
  • Joint Convexity: for any 0 ≤ λ ≤ 1
    This follows from the convexity of the mapping on .

In particular, the monotonicity implies that if a Markov process has a positive equilibrium probability distribution then is a monotonic (non-increasing) function of time, where the probability distribution is a solution of the Kolmogorov forward equations (or Master equation), used to describe the time evolution of the probability distribution in the Markov process. This means that all f-divergences are the Lyapunov functions of the Kolmogorov forward equations. Reverse statement is also true: If is a Lyapunov function for all Markov chains with positive equilibrium and is of the trace-form () then , for some convex function f.[2][3] For example, Bregman divergences in general do not have such property and can increase in Markov processes.[4]

Financial interpretation[edit]

A pair of probability distributions can be viewed as a game of chance in which one of the distributions defines the official odds and the other contains the actual probabilities. Knowledge of the actual probabilities allows a player to profit from the game. For a large class of rational players the expected profit rate has the same general form as the ƒ-divergence.[5]


See also[edit]

References[edit]

  1. ^ Rényi, Alfréd (1961). On measures of entropy and information (PDF). The 4th Berkeley Symposium on Mathematics, Statistics and Probability, 1960. Berkeley, CA: University of California Press. pp. 547–561. Eq. (4.20)
  2. ^ Gorban, Pavel A. (15 October 2003). "Monotonically equivalent entropies and solution of additivity equation". Physica A. 328 (3–4): 380–390. arXiv:cond-mat/0304131. doi:10.1016/S0378-4371(03)00578-8.
  3. ^ Amari, Shun'ichi (2009). Leung, C.S.; Lee, M.; Chan, J.H. (eds.). Divergence, Optimization, Geometry. The 16th International Conference on Neural Information Processing (ICONIP 20009), Bangkok, Thailand, 1--5 December 2009. Lecture Notes in Computer Science, vol 5863. Berlin, Heidelberg: Springer. pp. 185–193. doi:10.1007/978-3-642-10677-4_21.
  4. ^ Gorban, Alexander N. (29 April 2014). "General H-theorem and Entropies that Violate the Second Law". Entropy. 16 (5): 2408–2432. arXiv:1212.6767. doi:10.3390/e16052408.
  5. ^ Soklakov, Andrei N. (2020). "Economics of Disagreement—Financial Intuition for the Rényi Divergence". Entropy. 22 (8): 860. doi:10.3390/e22080860.