F-divergence

In probability theory, an ƒ-divergence is a function D_f(P || Q) that measures the difference between two probability distributions P and Q. Intuitively, the divergence can be thought of as an average, weighted by the function f, of the odds ratio given by P and Q[citation needed].

These divergences were introduced by Alfréd Rényi[1] in the same paper where he introduced the well-known Rényi entropy. He proved that these divergences decrease in Markov processes. f-divergences were studied further independently by Csiszár (1963), Morimoto (1963) and Ali & Silvey (1966) and are sometimes known as Csiszár ƒ-divergences, Csiszár–Morimoto divergences or Ali–Silvey distances.

Definition

Let P and Q be two probability distributions over a space Ω such that P is absolutely continuous with respect to Q. Then, for a convex function f such that f(1) = 0, the f-divergence of P from Q is defined as

    D_f(P || Q) ≡ ∫_Ω f(dP/dQ) dQ.

If P and Q are both absolutely continuous with respect to a reference distribution μ on Ω then their probability densities p and q satisfy dP = p dμ and dQ = q dμ. In this case the f-divergence can be written as

    D_f(P || Q) = ∫_Ω q(x) f(p(x)/q(x)) dμ(x).
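For discrete distributions the integral reduces to the sum D_f(P || Q) = Σᵢ qᵢ f(pᵢ/qᵢ). The following Python sketch (an illustration only; the helper names f_divergence, f_kl and f_tv are ad hoc, not from any library, and strictly positive probability vectors are assumed) evaluates this sum for two choices of f and compares the results with the familiar closed forms:

    import numpy as np

    def f_divergence(p, q, f):
        # discrete f-divergence: D_f(P || Q) = sum_i q_i * f(p_i / q_i)
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return float(np.sum(q * f(p / q)))

    f_kl = lambda t: t * np.log(t)          # generator of the KL-divergence
    f_tv = lambda t: 0.5 * np.abs(t - 1.0)  # generator of the total variation distance

    p = np.array([0.2, 0.5, 0.3])
    q = np.array([0.1, 0.6, 0.3])

    print(f_divergence(p, q, f_kl))     # matches sum_i p_i * log(p_i / q_i)
    print(np.sum(p * np.log(p / q)))
    print(f_divergence(p, q, f_tv))     # matches 0.5 * sum_i |p_i - q_i|
    print(0.5 * np.sum(np.abs(p - q)))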

The f-divergences can be expressed via Taylor series and rewritten as a weighted sum of chi-type distances (Nielsen & Nock (2013)).

Instances of f-divergences

Many common divergences, such as KL-divergence, Hellinger distance, and total variation distance, are special cases of f-divergence, coinciding with a particular choice of f. The following table lists many of the common divergences between probability distributions and the f function to which they correspond (cf. Liese & Vajda (2006)).

Divergence                                Corresponding f(t)
KL-divergence                             t log t
reverse KL-divergence                     −log t
squared Hellinger distance                (√t − 1)²
Total variation distance                  ½ |t − 1|
Pearson χ²-divergence                     (t − 1)²
Neyman χ²-divergence (reverse Pearson)    (1 − t)² / t
α-divergence
α-divergence (other designation)

The function f is defined up to the summand c (t − 1), where c is an arbitrary constant.
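As a worked check, take the Kullback–Leibler row of the table, f(t) = t log t, and substitute the densities p and q from the definition above:

    D_f(P || Q) = ∫_Ω q (p/q) log(p/q) dμ = ∫_Ω p log(p/q) dμ,

which is exactly the KL-divergence. Likewise, replacing f(t) by f(t) + c (t − 1) leaves every f-divergence unchanged, because

    ∫_Ω q · c (p/q − 1) dμ = c (∫_Ω p dμ − ∫_Ω q dμ) = c (1 − 1) = 0.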

Properties

  • Non-negativity: the ƒ-divergence is always non-negative; it is zero if and only if the measures P and Q coincide. This follows immediately from Jensen's inequality:
        D_f(P || Q) = ∫_Ω f(dP/dQ) dQ ≥ f(∫_Ω (dP/dQ) dQ) = f(1) = 0.
  • Monotonicity: if κ is an arbitrary transition probability that transforms measures P and Q into Pκ and Qκ correspondingly, then
        D_f(Pκ || Qκ) ≤ D_f(P || Q).
    The equality here holds if and only if the transition is induced from a sufficient statistic with respect to {P, Q} (see the numerical sketch below the list).
  • Joint convexity: for any 0 ≤ λ ≤ 1,
        D_f(λP₁ + (1 − λ)P₂ || λQ₁ + (1 − λ)Q₂) ≤ λ D_f(P₁ || Q₁) + (1 − λ) D_f(P₂ || Q₂).
    This follows from the convexity of the mapping (p, q) ↦ q f(p/q) on ℝ₊².
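For finite distributions a transition probability κ is simply a row-stochastic matrix, and the monotonicity (data-processing) inequality can be checked directly. The following Python sketch uses an arbitrary randomly generated kernel K and the KL generator f(t) = t log t; the names and the example are illustrative only, not taken from the cited literature:

    import numpy as np

    rng = np.random.default_rng(0)

    def f_divergence(p, q, f):
        # discrete f-divergence: sum_i q_i * f(p_i / q_i)
        return float(np.sum(q * f(p / q)))

    f_kl = lambda t: t * np.log(t)

    # two distributions on 5 states and a random row-stochastic kernel K
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    K = rng.dirichlet(np.ones(5), size=5)   # K[i, j] = Pr(next state j | current state i)

    # push both distributions through the same kernel
    pK = p @ K
    qK = q @ K

    # monotonicity: D_f(Pκ || Qκ) <= D_f(P || Q)
    print(f_divergence(pK, qK, f_kl), "<=", f_divergence(p, q, f_kl))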

In particular, the monotonicity implies that if a Markov process has a positive equilibrium probability distribution P* then D_f(P(t) || P*) is a monotonic (non-increasing) function of time, where the probability distribution P(t) is a solution of the Kolmogorov forward equations (or Master equation), used to describe the time evolution of the probability distribution in the Markov process. This means that all f-divergences D_f(P(t) || P*) are Lyapunov functions of the Kolmogorov forward equations. The reverse statement is also true: if H(P) is a Lyapunov function for all Markov chains with positive equilibrium P* and is of the trace form (H(P) = Σᵢ f(Pᵢ, Pᵢ*)), then H(P) = D_f(P(t) || P*) for some convex function f.[2][3] For example, Bregman divergences in general do not have this property and can increase in Markov processes.[4]
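This Lyapunov behaviour is easy to observe numerically. The sketch below (Python; the 3-state rate matrix Q_rate, the initial distribution and the Euler step are arbitrary choices made for illustration, not taken from the cited papers) integrates the forward equation dP/dt = P·Q_rate and tracks the KL-divergence to the equilibrium P*. Each Euler step multiplies P by the stochastic matrix I + dt·Q_rate, which fixes P*, so the decrease also follows from the monotonicity property above:

    import numpy as np

    # arbitrary 3-state rate matrix: rows sum to zero, off-diagonal entries >= 0
    Q_rate = np.array([[-0.9,  0.6,  0.3],
                       [ 0.2, -0.5,  0.3],
                       [ 0.4,  0.4, -0.8]])

    # equilibrium P*: normalised left null vector of Q_rate
    evals, evecs = np.linalg.eig(Q_rate.T)
    p_star = np.real(evecs[:, np.argmin(np.abs(evals))])
    p_star = p_star / p_star.sum()

    def kl(p, q):
        # KL-divergence, i.e. the f-divergence with f(t) = t log t
        return float(np.sum(p * np.log(p / q)))

    # Euler integration of the forward (master) equation dP/dt = P Q_rate
    p = np.array([0.8, 0.1, 0.1])
    dt = 0.01
    values = []
    for _ in range(1000):
        values.append(kl(p, p_star))
        p = p + dt * (p @ Q_rate)

    # D_KL(P(t) || P*) is non-increasing along the trajectory
    print(all(a >= b - 1e-12 for a, b in zip(values, values[1:])))   # True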


See also

References

  • Csiszár, I. (1963). "Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten". Magyar. Tud. Akad. Mat. Kutato Int. Kozl. 8: 85–108.
  • Morimoto, T. (1963). "Markov processes and the H-theorem". J. Phys. Soc. Jpn. 18 (3): 328–331. Bibcode:1963JPSJ...18..328M. doi:10.1143/JPSJ.18.328.
  • Ali, S. M.; Silvey, S. D. (1966). "A general class of coefficients of divergence of one distribution from another". Journal of the Royal Statistical Society, Series B. 28 (1): 131–142. JSTOR 2984279. MR 0196777.
  • Csiszár, I. (1967). "Information-type measures of difference of probability distributions and indirect observation". Studia Scientiarum Mathematicarum Hungarica. 2: 229–318.
  • Csiszár, I.; Shields, P. (2004). "Information Theory and Statistics: A Tutorial" (PDF). Foundations and Trends in Communications and Information Theory. 1 (4): 417–528. doi:10.1561/0100000004. Retrieved 2009-04-08.
  • Liese, F.; Vajda, I. (2006). "On divergences and informations in statistics and information theory". IEEE Transactions on Information Theory. 52 (10): 4394–4412. doi:10.1109/TIT.2006.881731.
  • Nielsen, F.; Nock, R. (2013). "On the Chi square and higher-order Chi distances for approximating f-divergences". IEEE Signal Processing Letters. 21: 10–13. arXiv:1309.3029. Bibcode:2014ISPL...21...10N. doi:10.1109/LSP.2013.2288355.
  • Coeurjolly, J-F.; Drouilhet, R. (2006). "Normalized information-based divergences". arXiv:math/0604246.
  1. ^ Rényi, Alfréd (1961). On measures of entropy and information (PDF). The 4th Berkeley Symposium on Mathematics, Statistics and Probability, 1960. Berkeley, CA: University of California Press. pp. 547–561. Eq. (4.20)
  2. ^ Gorban, Pavel A. (15 October 2003). "Monotonically equivalent entropies and solution of additivity equation". Physica A. 328 (3–4): 380–390. arXiv:cond-mat/0304131. doi:10.1016/S0378-4371(03)00578-8.
  3. ^ Amari, Shun'ichi (2009). Leung, C.S.; Lee, M.; Chan, J.H. (eds.). Divergence, Optimization, Geometry. The 16th International Conference on Neural Information Processing (ICONIP 2009), Bangkok, Thailand, 1–5 December 2009. Lecture Notes in Computer Science, vol 5863. Berlin, Heidelberg: Springer. pp. 185–193. doi:10.1007/978-3-642-10677-4_21.
  4. ^ Gorban, Alexander N. (29 April 2014). "General H-theorem and Entropies that Violate the Second Law". Entropy. 16 (5): 2408–2432. arXiv:1212.6767. doi:10.3390/e16052408.