f-divergence

In probability theory, an ${\displaystyle f}$-divergence is a function ${\displaystyle D_{f}(P\|Q)}$ that measures the difference between two probability distributions ${\displaystyle P}$ and ${\displaystyle Q}$. Many common divergences, such as KL-divergence, Hellinger distance, and total variation distance, are special cases of ${\displaystyle f}$-divergence.

History

These divergences were introduced by Alfréd Rényi[1] in the same paper where he introduced the well-known Rényi entropy. He proved that these divergences decrease in Markov processes. f-divergences were studied further independently by Csiszár (1963), Morimoto (1963) and Ali & Silvey (1966) and are sometimes known as Csiszár ${\displaystyle f}$-divergences, Csiszár-Morimoto divergences, or Ali-Silvey distances.

Definition

Non-singular case

Let ${\displaystyle P}$ and ${\displaystyle Q}$ be two probability distributions over a space ${\displaystyle \Omega }$, such that ${\displaystyle P\ll Q}$, that is, ${\displaystyle P}$ is absolutely continuous with respect to ${\displaystyle Q}$. Then, for a convex function ${\displaystyle f:[0,\infty )\to (-\infty ,\infty ]}$ such that ${\displaystyle f(x)}$ is finite for all ${\displaystyle x>0}$, ${\displaystyle f(1)=0}$, and ${\displaystyle f(0)=\lim _{t\to 0^{+}}f(t)}$ (which could be infinite), the ${\displaystyle f}$-divergence of ${\displaystyle P}$ from ${\displaystyle Q}$ is defined as

${\displaystyle D_{f}(P\parallel Q)\equiv \int _{\Omega }f\left({\frac {dP}{dQ}}\right)\,dQ.}$

${\displaystyle f}$ is called the generator of ${\displaystyle D_{f}}$.

In concrete applications, there is usually a reference distribution ${\displaystyle \mu }$ on ${\displaystyle \Omega }$ (for example, when ${\displaystyle \Omega =\mathbb {R} ^{n}}$, the reference distribution is the Lebesgue measure), such that ${\displaystyle P,Q\ll \mu }$, then we can use Radon-Nikodym theorem to take their probability densities ${\displaystyle p}$ and ${\displaystyle q}$, giving

${\displaystyle D_{f}(P\parallel Q)=\int _{\Omega }f\left({\frac {p(x)}{q(x)}}\right)q(x)\,d\mu (x).}$

When there is no such reference distribution ready at hand, we can simply define ${\displaystyle \mu =P+Q}$, and proceed as above. This is a useful technique in more abstract proofs.

Extension to singular measures

The above definition can be extended to cases where ${\displaystyle P\ll Q}$ is no longer satisfied. (Definition 7.1 of [2])

Since ${\displaystyle f}$ is convex, and ${\displaystyle f(1)=0}$ , the function ${\displaystyle {\frac {f(x)}{x-1}}}$ must nondecrease, so there exists ${\displaystyle f'(\infty ):=\lim _{x\to \infty }f(x)/x}$, taking value in ${\displaystyle (-\infty ,+\infty ]}$.

Since for any ${\displaystyle p(x)>0}$, we have ${\displaystyle \lim _{q(x)\to 0}q(x)f\left({\frac {p(x)}{q(x)}}\right)=p(x)f'(\infty )}$ , we can extend f-divergence to the ${\displaystyle P\not \ll Q}$.

Properties

Basic properties

• Linearity: ${\displaystyle D_{\sum _{i}a_{i}f_{i}}=\sum _{i}a_{i}D_{f_{i}}}$ given a finite sequence of nonnegative real numbers ${\displaystyle a_{i}}$ and generators ${\displaystyle f_{i}}$.
• ${\displaystyle D_{f}=D_{g}}$ iff ${\displaystyle f(x)=g(x)+c(x-1)}$ for some ${\displaystyle c\in \mathbb {R} }$.
Proof

If ${\displaystyle f(x)=g(x)+c(x-1)}$, then ${\displaystyle D_{f}=D_{g}}$ by definition.

Conversely, if ${\displaystyle D_{f}-D_{g}=0}$, then let ${\displaystyle h=f-g}$. For any two probability measures ${\displaystyle P,Q}$ on the set ${\displaystyle \{0,1\}}$, since ${\displaystyle D_{f}(P\|Q)-D_{g}(P\|Q)=0}$, we get ${\displaystyle h(P_{1}/Q_{1})=-{\frac {Q_{0}}{Q_{1}}}h(P_{0}/Q_{0})}$

Since each probability measure ${\displaystyle P,Q}$ has one degree of freedom, we can solve ${\displaystyle {\frac {P_{0}}{Q_{0}}}=a,{\frac {P_{1}}{Q_{1}}}=x}$ for every choice of ${\displaystyle 0.

Linear algebra yields ${\displaystyle Q_{0}={\frac {x-1}{x-a}},Q_{1}={\frac {1-a}{x-a}}}$, which is a valid probability measure. Then we obtain ${\displaystyle h(x)={\frac {h(a)}{a-1}}(x-1),h(a)={\frac {h(x)}{x-1}}(a-1)}$.

Thus ${\displaystyle h(x)={\begin{cases}c_{1}(x-1)\quad {\text{if }}x>1,\\c_{0}(x-1)\quad {\text{if }}0 for some constants ${\displaystyle c_{0},c_{1}}$. Plugging the formula into ${\displaystyle h(x)={\frac {h(a)}{a-1}}(x-1)}$ yields ${\displaystyle c_{0}=c_{1}}$.

• Non-negativity: the ƒ-divergence is always positive; it is zero if the measures P and Q coincide. This follows immediately from Jensen’s inequality:
${\displaystyle D_{f}(P\!\parallel \!Q)=\int \!f{\bigg (}{\frac {dP}{dQ}}{\bigg )}dQ\geq f{\bigg (}\int {\frac {dP}{dQ}}dQ{\bigg )}=f(1)=0.}$
• Data processing inequality: if κ is an arbitrary transition probability that transforms measures P and Q into Pκ and Qκ correspondingly, then
${\displaystyle D_{f}(P\!\parallel \!Q)\geq D_{f}(P_{\kappa }\!\parallel \!Q_{\kappa }).}$
The equality here holds if and only if the transition is induced from a sufficient statistic with respect to {P, Q}.
• Joint Convexity: for any 0 ≤ λ ≤ 1
${\displaystyle D_{f}{\Big (}\lambda P_{1}+(1-\lambda )P_{2}\parallel \lambda Q_{1}+(1-\lambda )Q_{2}{\Big )}\leq \lambda D_{f}(P_{1}\!\parallel \!Q_{1})+(1-\lambda )D_{f}(P_{2}\!\parallel \!Q_{2}).}$
This follows from the convexity of the mapping ${\displaystyle (p,q)\mapsto qf(p/q)}$ on ${\displaystyle \mathbb {R} _{+}^{2}}$.
• Reversal by convex inversion: for any function ${\displaystyle f}$, its convex inversion is defined as ${\displaystyle g(t):=tf(1/t)}$. When ${\displaystyle f}$ satisfies the defining features of a f-divergence generator (${\displaystyle f(x)}$ is finite for all ${\displaystyle x>0}$, ${\displaystyle f(1)=0}$, and ${\displaystyle f(0)=\lim _{t\to 0^{+}}f(t)}$), then ${\displaystyle g}$ satisfies the same features, and thus defines a f-divergence ${\displaystyle D_{g}}$. This is the "reverse" of ${\displaystyle D_{f}}$, in the sense that ${\displaystyle D_{g}(P\|Q)=D_{f}(Q\|P)}$ for all ${\displaystyle P,Q}$ that are absolutely continuous with respect to each other. In this way, every f-divergence ${\displaystyle D_{f}}$ can be turned symmetric by ${\displaystyle D_{{\frac {1}{2}}(f+g)}}$. For example, performing this symmetrization turns KL-divergence into Jensen-Shannon divergence.

In particular, the monotonicity implies that if a Markov process has a positive equilibrium probability distribution ${\displaystyle P^{*}}$ then ${\displaystyle D_{f}(P(t)\parallel P^{*})}$ is a monotonic (non-increasing) function of time, where the probability distribution ${\displaystyle P(t)}$ is a solution of the Kolmogorov forward equations (or Master equation), used to describe the time evolution of the probability distribution in the Markov process. This means that all f-divergences ${\displaystyle D_{f}(P(t)\parallel P^{*})}$ are the Lyapunov functions of the Kolmogorov forward equations. Reverse statement is also true: If ${\displaystyle H(P)}$ is a Lyapunov function for all Markov chains with positive equilibrium ${\displaystyle P^{*}}$ and is of the trace-form (${\displaystyle H(P)=\sum _{i}f(P_{i},P_{i}^{*})}$) then ${\displaystyle H(P)=D_{f}(P(t)\parallel P^{*})}$, for some convex function f.[3][4] For example, Bregman divergences in general do not have such property and can increase in Markov processes.[5]

Analytic properties

The f-divergences can be expressed using Taylor series and rewritten using a weighted sum of chi-type distances (Nielsen & Nock (2013)).

Variational representations

Let ${\displaystyle f^{*}}$ be the convex conjugate of ${\displaystyle f}$. Let ${\displaystyle \mathrm {effdom} (f^{*})}$ be the effective domain of ${\displaystyle f^{*}}$, that is, ${\displaystyle \mathrm {effdom} (f^{*})=\{y:f^{*}(y)<\infty \}}$. Then we have two variational representations of ${\displaystyle D_{f}}$:

Theorem — ${\displaystyle D_{f}(P;Q)=\sup _{g:\Omega \to \mathrm {effdom} (f^{*})}E_{P}[g]-E_{Q}[f^{*}\circ g]}$

This is Theorem 7.24 in.[2]

Example applications

Using this theorem on total variation distance, with generator ${\displaystyle f(x)={\frac {1}{2}}|x-1|,}$ its convex conjugate is ${\displaystyle f^{*}(x^{*})={\begin{cases}x^{*}{\text{ on }}[-1/2,1/2],\\+\infty {\text{ else.}}\end{cases}}}$, and we obtain

${\displaystyle TV(P\|Q)=\sup _{|g|\leq 1/2}E_{P}[g(X)]-E_{Q}[g(X)]}$
For chi-squared divergence, defined by ${\displaystyle f(x)=(x-1)^{2},f^{*}(y)=y^{2}/4+y}$, we obtain
${\displaystyle \chi ^{2}(P;Q)=\sup _{g}E_{P}[g(X)]-E_{Q}[g(X)^{2}/4+g(X)]}$
Since the variation term is not affine-invariant in ${\displaystyle g}$, even though the domain over which ${\displaystyle g}$ varies is affine-invariant, we can use up the affine-invariance to obtain a leaner expression.

Replace ${\displaystyle g}$ by ${\displaystyle ag+b}$, and take maximum over ${\displaystyle a,b\in \mathbb {R} }$, we obtain

${\displaystyle \chi ^{2}(P;Q)=\sup _{g}{\frac {(E_{P}[g(X)]-E_{Q}[g(X)])^{2}}{Var_{Q}[g(X)]}}}$
which is just a few steps away from the Hammersley–Chapman–Robbins bound and the Cramér–Rao bound (Theorem 29.1 and its corollary in [2]).

For ${\displaystyle \alpha }$-divergence with ${\displaystyle \alpha \in (-\infty ,0)\cup (0,1)}$, we have ${\displaystyle f_{\alpha }(x)={\frac {x^{\alpha }-\alpha x-(1-\alpha )}{\alpha (\alpha -1)}}}$, with range ${\displaystyle x\in [0,\infty )}$. Its convex conjugate is ${\displaystyle f_{\alpha }^{*}(y)={\frac {1}{\alpha }}(x(y)^{\alpha }-1)}$ with range ${\displaystyle y\in (-\infty ,(1-\alpha )^{-1})}$, where ${\displaystyle x(y)=((\alpha -1)y+1)^{\frac {1}{\alpha -1}}}$.

Applying this theorem yields, after substitution with ${\displaystyle h=((\alpha -1)g+1)^{\frac {1}{\alpha -1}}}$,

${\displaystyle D_{\alpha }(P\|Q)={\frac {1}{\alpha (1-\alpha )}}-\inf _{h:\Omega \to (0,\infty )}\left(E_{Q}\left[{\frac {h^{\alpha }}{\alpha }}\right]+E_{P}\left[{\frac {h^{\alpha -1}}{1-\alpha }}\right]\right)}$
or, releasing the constraint on ${\displaystyle h}$,
${\displaystyle D_{\alpha }(P\|Q)={\frac {1}{\alpha (1-\alpha )}}-\inf _{h:\Omega \to \mathbb {R} }\left(E_{Q}\left[{\frac {|h|^{\alpha }}{\alpha }}\right]+E_{P}\left[{\frac {|h|^{\alpha -1}}{1-\alpha }}\right]\right)}$
Setting ${\displaystyle \alpha =-1}$ yields the variational representation of ${\displaystyle \chi ^{2}}$-divergence obtained above.

The domain over which ${\displaystyle h}$ varies is not affine-invariant in general, unlike the ${\displaystyle \chi ^{2}}$-divergence case. The ${\displaystyle \chi ^{2}}$-divergence is special, since in that case, we can remove the ${\displaystyle |\cdot |}$ from ${\displaystyle |h|}$.

For general ${\displaystyle \alpha \in (-\infty ,0)\cup (0,1)}$, the domain over which ${\displaystyle h}$ varies is merely scale-invariant. Similar to above, we can replace ${\displaystyle h}$ by ${\displaystyle ah}$, and take minimum over ${\displaystyle a>0}$ to obtain

${\displaystyle D_{\alpha }(P\|Q)=\sup _{h>0}\left[{\frac {1}{\alpha (1-\alpha )}}\left(1-{\frac {E_{P}[h^{\alpha -1}]^{\alpha }}{E_{Q}[h^{\alpha }]^{\alpha -1}}}\right)\right]}$
Setting ${\displaystyle \alpha ={\frac {1}{2}}}$, and performing another substitution by ${\displaystyle g={\sqrt {h}}}$, yields two variational representations of the squared Hellinger distance:
${\displaystyle H^{2}(P\|Q)={\frac {1}{2}}D_{1/2}(P\|Q)=2-\inf _{h>0}\left(E_{Q}\left[h(X)\right]+E_{P}\left[h(X)^{-1}\right]\right)}$
${\displaystyle H^{2}(P\|Q)=2\sup _{h>0}\left(1-{\sqrt {E_{P}[h^{-1}]E_{Q}[h]}}\right)}$
Applying this theorem to the KL-divergence, defined by ${\displaystyle f(x)=x\ln x,f^{*}(y)=e^{y-1}}$ yields

${\displaystyle D_{KL}(P;Q)=\sup _{g}E_{P}[g(X)]-e^{-1}E_{Q}[e^{g(X)}]}$
This is strictly less efficient than the Donsker-Varadhan representation
${\displaystyle D_{KL}(P;Q)=\sup _{g}E_{P}[g(X)]-\ln E_{Q}[e^{g(X)}]}$
This defect is fixed by the next theorem.

Theorem — If ${\displaystyle f(x)=+\infty }$ on ${\displaystyle x<0}$ (redefine ${\displaystyle f}$ if necessary), then

${\displaystyle D_{f}(P\|Q)=f^{\prime }(\infty )P\left[S^{c}\right]+\sup _{g}\mathbb {E} _{P}\left[g1_{S}\right]-\Psi _{Q,P}^{*}(g)}$

where ${\displaystyle \Psi _{Q,P}^{*}(g):=\inf _{a\in \mathbb {R} }\mathbb {E} _{Q}\left[f^{*}(g(X)-a)\right]+aP[S]}$ and ${\displaystyle S:=\{q>0\}}$, where ${\displaystyle q}$ is the probability density function of ${\displaystyle Q}$ with respect to some underlying measure.

In the special case of ${\displaystyle f^{\prime }(\infty )=\infty }$, we have

${\displaystyle D_{f}(P\|Q)=\sup _{g}\mathbb {E} _{P}[g]-\Psi _{Q}^{*}(g),\quad \Psi _{Q}^{*}(g):=\inf _{a\in \mathbb {R} }\mathbb {E} _{Q}\left[f^{*}(g(X)-a)\right]+a}$

This is Theorem 7.25 in.[2]

Example applications

Applying this theorem to KL-divergence yields the Donsker-Varadhan representation.

Attempting to apply this theorem to general ${\displaystyle \alpha }$-divergence with ${\displaystyle \alpha \in (-\infty ,0)\cup (0,1)}$ does not yield a closed-form solution.

Common examples of f-divergences

The following table lists many of the common divergences between probability distributions and the possible generating functions to which they correspond. Notably, except for total variation distance, all others are special cases of ${\displaystyle \alpha }$-divergence, or linear sums of ${\displaystyle \alpha }$-divergences.

For each f-divergence ${\displaystyle D_{f}}$, its generating function is not uniquely defined, but only up to ${\displaystyle c\cdot (t-1)}$, where ${\displaystyle c}$ is any real constant. That is, for any ${\displaystyle f}$ that generates an f-divergence, we have ${\displaystyle D_{f(t)}=D_{f(t)+c\cdot (t-1)}}$. This freedom is not only convenient, but actually necessary.

Divergence Corresponding f(t)
Total variation distance ${\displaystyle {\frac {1}{2}}|t-1|\,}$
α-divergence ${\displaystyle {\begin{cases}{\frac {t^{\alpha }-\alpha t-\left(1-\alpha \right)}{\alpha \left(\alpha -1\right)}}&{\text{if}}\ \alpha \neq 0,\,\alpha \neq 1,\\t\ln t-t+1,&{\text{if}}\ \alpha =1,\\-\ln t+t-1,&{\text{if}}\ \alpha =0\end{cases}}}$
KL-divergence (${\displaystyle \alpha =1}$) ${\displaystyle t\ln t}$
reverse KL-divergence (${\displaystyle \alpha =0}$) ${\displaystyle -\ln t}$
Jensen-Shannon divergence ${\displaystyle -(t+1)\ln \left({\frac {t+1}{2}}\right)+t\ln t}$
squared Hellinger distance (${\displaystyle \alpha ={\frac {1}{2}}}$) ${\displaystyle ({\sqrt {t}}-1)^{2},\,2(1-{\sqrt {t}})}$
Pearson ${\displaystyle \chi ^{2}}$-divergence (${\displaystyle \alpha =2}$) ${\displaystyle (t-1)^{2},\,t^{2}-1,\,t^{2}-t}$
Neyman ${\displaystyle \chi ^{2}}$-divergence (reverse Pearson)

(${\displaystyle \alpha =-1}$)

${\displaystyle {\frac {1}{t}}-1,\,{\frac {1}{t}}-t}$
Comparison between the generators of alpha-divergences, as alpha varies from -1 to 2.

Let ${\displaystyle f_{\alpha }}$ be the generator of ${\displaystyle \alpha }$-divergence, then ${\displaystyle f_{\alpha }}$ and ${\displaystyle f_{1-\alpha }}$ are convex inversions of each other, so ${\displaystyle D_{\alpha }(P\|Q)=D_{1-\alpha }(Q\|P)}$. In particular, this shows that the squared Hellinger distance and Jensen-Shannon divergence are symmetric.

In the literature, the ${\displaystyle \alpha }$-divergences are sometimes parametrized as

${\displaystyle {\begin{cases}{\frac {4}{1-\alpha ^{2}}}{\big (}1-t^{(1+\alpha )/2}{\big )},&{\text{if}}\ \alpha \neq \pm 1,\\t\ln t,&{\text{if}}\ \alpha =1,\\-\ln t,&{\text{if}}\ \alpha =-1\end{cases}}}$

which is equivalent to the parametrization in this page by substituting ${\displaystyle \alpha \leftarrow {\frac {\alpha +1}{2}}}$.

Relations to other statistical divergences

Rényi divergence

The Rényi divergences is a family of divergences defined by

${\displaystyle R_{\alpha }(P\|Q)={\frac {1}{\alpha -1}}\log {\Bigg (}E_{Q}\left[\left({\frac {dP}{dQ}}\right)^{\alpha }\right]{\Bigg )}\,}$

when ${\displaystyle \alpha \in (0,1)\cup (1,\infty )}$. It is extended to the cases of ${\displaystyle \alpha =0,1,\infty }$ by taking the limit.

Simple algebra shows that ${\displaystyle R_{\alpha }(P\|Q)={\frac {1}{\alpha -1}}\ln(1+\alpha (\alpha -1)D_{\alpha }(P\|Q))}$, where ${\displaystyle D_{\alpha }}$ is the ${\displaystyle \alpha }$-divergence defined above.

KL divergence

The KL divergence is the f-divergence generated by ${\displaystyle f(x)=x\ln x}$.

Bregman divergence

The only f-divergence that is also a Bregman divergence is the KL divergence.[6]

Financial interpretation

A pair of probability distributions can be viewed as a game of chance in which one of the distributions defines the official odds and the other contains the actual probabilities. Knowledge of the actual probabilities allows a player to profit from the game. For a large class of rational players the expected profit rate has the same general form as the ƒ-divergence.[7]