# f-divergence


In probability theory, an f-divergence is a function $D_f(P \parallel Q)$ that measures the difference between two probability distributions P and Q. It helps the intuition to think of the divergence as an average, taken with respect to Q, of the function f applied to the likelihood ratio dP/dQ.

These divergences were introduced by Alfréd Rényi[1] in the same paper where he introduced the well-known Rényi entropy. He proved that these divergences decrease in Markov processes. f-divergences were studied further independently by Csiszár (1963), Morimoto (1963) and Ali & Silvey (1966), and are sometimes known as Csiszár f-divergences, Csiszár-Morimoto divergences, or Ali-Silvey distances.

## Definition

Let P and Q be two probability distributions over a space Ω such that P is absolutely continuous with respect to Q. Then, for a convex function f such that f(1) = 0, the f-divergence of P from Q is defined as

$$D_f(P \parallel Q) \equiv \int_{\Omega} f\!\left(\frac{dP}{dQ}\right) dQ.$$

If P and Q are both absolutely continuous with respect to a common reference measure μ on Ω, then their probability densities p and q satisfy dP = p dμ and dQ = q dμ. In this case the f-divergence can be written as

$$D_f(P \parallel Q) = \int_{\Omega} f\!\left(\frac{p(x)}{q(x)}\right) q(x)\, d\mu(x).$$
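
For discrete distributions the integral becomes a sum, and the definition is straightforward to evaluate numerically. Below is a minimal sketch (not from the article; the helper name `f_divergence` is ours), assuming $q(x) > 0$ wherever $p(x) > 0$ so that the likelihood ratio is well defined:

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)) for discrete P and Q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = q > 0  # restrict to the support of Q
    return float(np.sum(q[mask] * f(p[mask] / q[mask])))

# The KL-divergence corresponds to f(t) = t log t (see the table below).
f_kl = lambda t: t * np.log(t)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(f_divergence(p, q, f_kl))  # equals sum_x p(x) log(p(x)/q(x))
```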

An f-divergence can be expanded in a Taylor series and rewritten as a weighted sum of chi-type distances (Nielsen & Nock (2013)).

## Instances of f-divergences

Many common divergences, such as the KL-divergence, the Hellinger distance, and the total variation distance, are special cases of f-divergence, each coinciding with a particular choice of f. The following table lists many of the common divergences between probability distributions and the function f to which they correspond (cf. Liese & Vajda (2006)).

| Divergence | Corresponding $f(t)$ |
| --- | --- |
| KL-divergence | $t\log t$ |
| reverse KL-divergence | $-\log t$ |
| squared Hellinger distance | $(\sqrt{t}-1)^2,\ 2(1-\sqrt{t})$ |
| Total variation distance | $\frac{1}{2}\lvert t-1\rvert$ |
| Pearson $\chi^2$-divergence | $(t-1)^2,\ t^2-1,\ t^2-t$ |
| Neyman $\chi^2$-divergence (reverse Pearson) | $\frac{1}{t}-1,\ \frac{1}{t}-t$ |
| α-divergence | $\begin{cases}\frac{4}{1-\alpha^{2}}\big(1-t^{(1+\alpha)/2}\big), & \text{if }\alpha\neq\pm 1\\ t\ln t, & \text{if }\alpha=1\\ -\ln t, & \text{if }\alpha=-1\end{cases}$ |
| Jensen-Shannon divergence | $\frac{1}{2}\left[(t+1)\log\!\left(\frac{2}{t+1}\right)+t\log t\right]$ |
| α-divergence (other designation) | $\begin{cases}\frac{t^{\alpha}-t}{\alpha(\alpha-1)}, & \text{if }\alpha\neq 0,\ \alpha\neq 1\\ t\ln t, & \text{if }\alpha=1\\ -\ln t, & \text{if }\alpha=0\end{cases}$ |

The function $f(t)$ is defined only up to an additive term $c(t-1)$, where $c$ is an arbitrary constant: replacing $f(t)$ by $f(t) + c(t-1)$ leaves the divergence unchanged, because $\int\left(\frac{dP}{dQ}-1\right)dQ = 1 - 1 = 0$.
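
This invariance is easy to verify numerically. The following check (illustrative only, reusing the `f_divergence` helper sketched in the Definition section) compares the KL choice $f(t) = t\log t$ with a shifted variant:

```python
import numpy as np

def f_divergence(p, q, f):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

f_kl = lambda t: t * np.log(t)                     # KL-divergence
c = 3.7                                            # arbitrary constant
f_shifted = lambda t: t * np.log(t) + c * (t - 1)  # f plus c*(t - 1)

print(f_divergence(p, q, f_kl))       # both lines print the same value,
print(f_divergence(p, q, f_shifted))  # since E_Q[dP/dQ - 1] = 0
```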

## Properties

• Non-negativity: the f-divergence is always non-negative, and it equals zero when the measures P and Q coincide. This follows immediately from Jensen's inequality applied to the convex function f:
$$D_f(P \parallel Q) = \int f\!\left(\frac{dP}{dQ}\right) dQ \;\geq\; f\!\left(\int \frac{dP}{dQ}\, dQ\right) = f(1) = 0.$$
• Monotonicity: if κ is an arbitrary transition probability that transforms the measures P and Q into $P_\kappa$ and $Q_\kappa$ respectively, then
$$D_f(P \parallel Q) \geq D_f(P_\kappa \parallel Q_\kappa).$$
Equality holds if and only if the transition is induced from a sufficient statistic with respect to {P, Q}. (A numeric illustration of this property follows the list.)
• Joint convexity: for any 0 ≤ λ ≤ 1,
$$D_f\big(\lambda P_1 + (1-\lambda)P_2 \parallel \lambda Q_1 + (1-\lambda)Q_2\big) \leq \lambda D_f(P_1 \parallel Q_1) + (1-\lambda) D_f(P_2 \parallel Q_2).$$
This follows from the convexity of the mapping $(p, q) \mapsto q f(p/q)$ on $\mathbb{R}_+^{2}$.
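
As a numeric illustration of the monotonicity property (a sketch under simplifying assumptions: finite state spaces, with a randomly drawn row-stochastic matrix `K` playing the role of the transition probability κ):

```python
import numpy as np

def f_divergence(p, q, f):
    return float(np.sum(q * f(p / q)))

f_kl = lambda t: t * np.log(t)

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(4))            # two distributions on 4 states
q = rng.dirichlet(np.ones(4))
K = rng.dirichlet(np.ones(5), size=4)    # 4x5 row-stochastic channel

p_out, q_out = p @ K, q @ K              # transformed measures P_kappa, Q_kappa

print(f_divergence(p, q, f_kl))          # this value is always >= ...
print(f_divergence(p_out, q_out, f_kl))  # ... the value after the channel
```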

In particular, monotonicity implies that if a Markov process has a positive equilibrium probability distribution $P^*$, then $D_f(P(t) \parallel P^*)$ is a monotonic (non-increasing) function of time, where the probability distribution $P(t)$ is a solution of the Kolmogorov forward equations (or master equation) that describe the time evolution of the probability distribution in the Markov process. This means that all f-divergences $D_f(P(t) \parallel P^*)$ are Lyapunov functions of the Kolmogorov forward equations. The reverse statement is also true: if $H(P)$ is a Lyapunov function for all Markov chains with positive equilibrium $P^*$ and is of trace form ($H(P) = \sum_i f(P_i, P_i^*)$), then $H(P) = D_f(P(t) \parallel P^*)$ for some convex function f.[2][3] Bregman divergences, for example, do not in general have this property and can increase in Markov processes.[4]
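
The decay of $D_f(P(t) \parallel P^*)$ can be observed directly. The sketch below (a hypothetical 3-state generator and explicit Euler time-stepping of the master equation, for illustration only) tracks the KL instance along a trajectory:

```python
import numpy as np

def f_divergence(p, q, f):
    return float(np.sum(q * f(p / q)))

f_kl = lambda t: t * np.log(t)

# Generator of a 3-state continuous-time Markov chain: non-negative
# off-diagonal rates, each row summing to zero.
Q_rate = np.array([[-1.0,  0.6,  0.4],
                   [ 0.3, -0.8,  0.5],
                   [ 0.2,  0.7, -0.9]])

# Equilibrium P*: the stationary distribution solving P* Q_rate = 0,
# i.e. the eigenvector of Q_rate^T with eigenvalue 0.
w, v = np.linalg.eig(Q_rate.T)
p_star = np.real(v[:, np.argmin(np.abs(w))])
p_star /= p_star.sum()

p, dt = np.array([0.9, 0.05, 0.05]), 1e-3
for step in range(5001):
    if step % 1000 == 0:
        print(step * dt, f_divergence(p, p_star, f_kl))  # non-increasing
    p = p + dt * (p @ Q_rate)  # Euler step of dP/dt = P Q_rate
```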

## Financial interpretation

A pair of probability distributions can be viewed as a game of chance in which one distribution defines the official odds and the other contains the actual probabilities. Knowledge of the actual probabilities allows a player to profit from the game. For a large class of rational players, the expected profit rate has the same general form as the f-divergence.[5]

## References

1. ^ Rényi, Alfréd (1961). "On measures of entropy and information" (PDF). Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, 1960. Berkeley, CA: University of California Press. pp. 547–561. Eq. (4.20).
2. ^ Gorban, Pavel A. (15 October 2003). "Monotonically equivalent entropies and solution of additivity equation". Physica A. 328 (3–4): 380–390. arXiv:cond-mat/0304131. doi:10.1016/S0378-4371(03)00578-8.
3. ^ Amari, Shun'ichi (2009). Leung, C.S.; Lee, M.; Chan, J.H. (eds.). Divergence, Optimization, Geometry. The 16th International Conference on Neural Information Processing (ICONIP 2009), Bangkok, Thailand, 1–5 December 2009. Lecture Notes in Computer Science, vol. 5863. Berlin, Heidelberg: Springer. pp. 185–193. doi:10.1007/978-3-642-10677-4_21.
4. ^ Gorban, Alexander N. (29 April 2014). "General H-theorem and Entropies that Violate the Second Law". Entropy. 16 (5): 2408–2432. arXiv:1212.6767. doi:10.3390/e16052408.
5. ^ Soklakov, Andrei N. (2020). "Economics of Disagreement—Financial Intuition for the Rényi Divergence". Entropy. 22 (8): 860. doi:10.3390/e22080860.