f-divergence

In probability theory, an ƒ-divergence is a function Df(P || Q) that measures the difference between two probability distributions P and Q. Intuitively, the divergence is an average of the likelihood ratio dP/dQ, transformed by the function f and weighted by the distribution Q.

These divergences were introduced and studied independently by Csiszár (1963), Morimoto (1963) and Ali & Silvey (1966) and are sometimes known as Csiszár ƒ-divergences, Csiszár-Morimoto divergences or Ali-Silvey distances.

Definition

Let P and Q be two probability distributions over a space Ω such that P is absolutely continuous with respect to Q. Then, for a convex function f such that f(1) = 0, the f-divergence of P from Q is defined as

$D_{f}(P\parallel Q)\equiv \int _{\Omega }f\left({\frac {dP}{dQ}}\right)\,dQ.$

If P and Q are both absolutely continuous with respect to a reference distribution μ on Ω, then their probability densities p and q satisfy dP = p dμ and dQ = q dμ. In this case the f-divergence can be written as

$D_{f}(P\parallel Q)=\int _{\Omega }f\left({\frac {p(x)}{q(x)}}\right)q(x)\,d\mu (x).$

The f-divergences can be expressed using Taylor series and rewritten as a weighted sum of chi-type distances (Nielsen & Nock (2013)).
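For discrete distributions the integral above becomes a sum over outcomes. The following is a minimal sketch in Python; the helper `f_divergence` and the example distributions are illustrative, not from the source.

```python
import math

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x)/q(x)) for discrete distributions.

    Requires P to be absolutely continuous w.r.t. Q:
    p[i] must be 0 wherever q[i] is 0.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if qi > 0:
            total += qi * f(pi / qi)
        elif pi > 0:
            raise ValueError("P is not absolutely continuous w.r.t. Q")
    return total

# KL-divergence corresponds to f(t) = t log t (with 0 log 0 := 0)
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
kl = f_divergence(p, q, lambda t: t * math.log(t) if t > 0 else 0.0)
```

Note that f(1) = 0 guarantees the divergence vanishes when the two distributions coincide, since every ratio p(x)/q(x) is then 1.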

Instances of f-divergences

Many common divergences, such as KL-divergence, Hellinger distance, and total variation distance, are special cases of f-divergence, coinciding with a particular choice of f. The following table lists many of the common divergences between probability distributions and the f function to which they correspond (cf. Liese & Vajda (2006)).

Divergence and corresponding f(t):

• KL-divergence: $t\log t$
• reverse KL-divergence: $-\log t$
• Hellinger distance: $({\sqrt {t}}-1)^{2},\,2(1-{\sqrt {t}})$
• Total variation distance: ${\frac {1}{2}}|t-1|$
• Pearson $\chi ^{2}$ -divergence: $(t-1)^{2},\,t^{2}-1,\,t^{2}-t$
• Neyman $\chi ^{2}$ -divergence (reverse Pearson): ${\frac {1}{t}}-1,\,{\frac {1}{t}}-t$
• α-divergence: ${\begin{cases}{\frac {4}{1-\alpha ^{2}}}{\big (}1-t^{(1+\alpha )/2}{\big )},&{\text{if}}\ \alpha \neq \pm 1,\\t\ln t,&{\text{if}}\ \alpha =1,\\-\ln t,&{\text{if}}\ \alpha =-1\end{cases}}$
• α-divergence (other designation): ${\begin{cases}{\frac {t^{\alpha }-t}{\alpha (\alpha -1)}},&{\text{if}}\ \alpha \neq 0,\,\alpha \neq 1,\\t\ln t,&{\text{if}}\ \alpha =1,\\-\ln t,&{\text{if}}\ \alpha =0\end{cases}}$

The function $f(t)$ is defined up to the summand $c(t-1)$ , where $c$ is any constant.
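Several entries of the table can be checked numerically for discrete distributions with strictly positive q. The sketch below is illustrative (the distributions are made up); it also verifies the remark that f is defined only up to a summand c(t − 1), using the two Pearson generators.

```python
import math

def f_divergence(p, q, f):
    # D_f(P || Q) = sum_x q(x) * f(p(x)/q(x)); assumes q[i] > 0 everywhere
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

# Generators f(t) taken from the table
kl      = f_divergence(p, q, lambda t: t * math.log(t))   # KL-divergence
rev_kl  = f_divergence(p, q, lambda t: -math.log(t))      # reverse KL-divergence
tv      = f_divergence(p, q, lambda t: 0.5 * abs(t - 1))  # total variation
pearson = f_divergence(p, q, lambda t: (t - 1) ** 2)      # Pearson chi^2

# Total variation agrees with its direct definition (1/2) * sum |p - q|
assert abs(tv - 0.5 * sum(abs(a - b) for a, b in zip(p, q))) < 1e-12

# t^2 - 1 = (t - 1)^2 + 2(t - 1): differs by c(t - 1), so same divergence
pearson2 = f_divergence(p, q, lambda t: t ** 2 - 1)
assert abs(pearson - pearson2) < 1e-12
```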

Properties

• Non-negativity: the ƒ-divergence is always nonnegative, and it is zero whenever the measures P and Q coincide; if f is strictly convex at t = 1, it is zero only when P and Q coincide. Non-negativity follows immediately from Jensen's inequality:
$D_{f}(P\!\parallel \!Q)=\int \!f{\bigg (}{\frac {dP}{dQ}}{\bigg )}dQ\geq f{\bigg (}\int {\frac {dP}{dQ}}dQ{\bigg )}=f(1)=0.$
• Monotonicity: if κ is an arbitrary transition probability that transforms the measures P and Q into Pκ and Qκ correspondingly, then
$D_{f}(P\!\parallel \!Q)\geq D_{f}(P_{\kappa }\!\parallel \!Q_{\kappa }).$
Equality holds if and only if the transition is induced from a sufficient statistic with respect to {P, Q}.
• Joint convexity: for any 0 ≤ λ ≤ 1,
$D_{f}{\Big (}\lambda P_{1}+(1-\lambda )P_{2}\parallel \lambda Q_{1}+(1-\lambda )Q_{2}{\Big )}\leq \lambda D_{f}(P_{1}\!\parallel \!Q_{1})+(1-\lambda )D_{f}(P_{2}\!\parallel \!Q_{2}).$
This follows from the joint convexity of the mapping $(p,q)\mapsto qf(p/q)$ on $\mathbb {R} _{+}^{2}$ .
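The monotonicity property can be illustrated numerically with the simplest kind of transition: a deterministic coarse-graining that merges two outcomes into one. A small sketch, assuming discrete distributions (the example values are illustrative):

```python
import math

def kl(p, q):
    # D_f with f(t) = t log t, i.e. the KL divergence
    return sum(qi * (pi / qi) * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

# Deterministic transition kernel: merge the first two outcomes into one
p_k = [p[0] + p[1], p[2]]
q_k = [q[0] + q[1], q[2]]

# Monotonicity: passing P and Q through the same channel cannot increase D_f
assert kl(p_k, q_k) <= kl(p, q) + 1e-12
```

Here the merged distributions happen to coincide, so the coarse-grained divergence drops to zero: the merging kernel discards exactly the coordinates where P and Q differ, an extreme case of the data-processing inequality.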