# Pinsker's inequality

In information theory, Pinsker's inequality, named after its inventor Mark Semenovich Pinsker, is an inequality that bounds the total variation distance (or statistical distance) in terms of the Kullback–Leibler divergence. The inequality is tight up to constant factors.[1]

## Formal statement

Pinsker's inequality states that, if ${\displaystyle P}$ and ${\displaystyle Q}$ are two probability distributions on a measurable space ${\displaystyle (X,\Sigma )}$, then

${\displaystyle \delta (P,Q)\leq {\sqrt {{\frac {1}{2}}D_{\mathrm {KL} }(P\|Q)}},}$

where

${\displaystyle \delta (P,Q)=\sup {\bigl \{}|P(A)-Q(A)|{\big |}A\in \Sigma {\text{ is a measurable event}}{\bigr \}}}$

is the total variation distance (or statistical distance) between ${\displaystyle P}$ and ${\displaystyle Q}$ and

${\displaystyle D_{\mathrm {KL} }(P\|Q)=\operatorname {E} _{P}\left(\log {\frac {\mathrm {d} P}{\mathrm {d} Q}}\right)=\int _{X}\left(\log {\frac {\mathrm {d} P}{\mathrm {d} Q}}\right)\,\mathrm {d} P}$

is the Kullback–Leibler divergence in nats. When the sample space ${\displaystyle X}$ is a finite set, the Kullback–Leibler divergence is given by

${\displaystyle D_{\mathrm {KL} }(P\|Q)=\sum _{i\in X}\left(\log {\frac {P(i)}{Q(i)}}\right)P(i)\!}$

Note that, since ${\displaystyle \|P-Q\|=2\,\delta (P,Q)}$ for the total variation norm ${\displaystyle \|P-Q\|}$ of the signed measure ${\displaystyle P-Q}$, Pinsker's inequality expressed in this norm differs from the one given above by a factor of two:

${\displaystyle \|P-Q\|\leq {\sqrt {2D_{\mathrm {KL} }(P\|Q)}}.}$
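On a finite sample space both quantities can be computed directly from the formulas above. The following sketch (the function names `kl_divergence` and `total_variation` are illustrative, not standard API) checks Pinsker's bound on a small example:

```python
import math

def kl_divergence(p, q):
    # D_KL(P || Q) in nats for finite distributions given as probability vectors;
    # terms with P(i) = 0 contribute nothing, by the convention 0 * log 0 = 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    # delta(P, Q) = (1/2) * sum_i |P(i) - Q(i)| on a finite space.
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.3, 0.5, 0.2]
q = [0.4, 0.4, 0.2]

delta = total_variation(p, q)
bound = math.sqrt(0.5 * kl_divergence(p, q))

# Pinsker's inequality: delta(P, Q) <= sqrt(D_KL(P || Q) / 2).
assert delta <= bound
```

Here ${\displaystyle \delta (P,Q)=0.1}$ while the right-hand side is roughly ${\displaystyle 0.112}$, so the bound holds with some slack, consistent with the inequality being tight only up to constant factors.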

A proof of Pinsker's inequality uses the partition inequality for f-divergences.

## History

Pinsker first proved the inequality with a worse constant. The inequality in the above form was proved independently by Kullback, Csiszár, and Kemperman.[2]

## Inverse problem

A precise inverse of the inequality cannot hold: for every ${\displaystyle \varepsilon >0}$, there are distributions ${\displaystyle P_{\varepsilon },Q}$ with ${\displaystyle \delta (P_{\varepsilon },Q)\leq \varepsilon }$ but ${\displaystyle D_{\mathrm {KL} }(P_{\varepsilon }\|Q)=\infty }$. An easy example is given by the two-point space ${\displaystyle \{0,1\}}$ with ${\displaystyle Q(0)=0,Q(1)=1}$ and ${\displaystyle P_{\varepsilon }(0)=\varepsilon ,P_{\varepsilon }(1)=1-\varepsilon }$.[3]
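The two-point counterexample can be checked numerically. In the sketch below (helper names are illustrative), the divergence is infinite because ${\displaystyle P_{\varepsilon }}$ puts mass on a point where ${\displaystyle Q}$ puts none, while the total variation distance stays at ${\displaystyle \varepsilon }$:

```python
import math

def kl_divergence(p, q):
    # D_KL(P || Q) in nats; returns infinity as soon as P assigns positive
    # probability to a point where Q assigns zero.
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

eps = 1e-6
q = [0.0, 1.0]                 # Q(0) = 0, Q(1) = 1
p_eps = [eps, 1.0 - eps]       # P_eps(0) = eps, P_eps(1) = 1 - eps

tv = 0.5 * sum(abs(pi - qi) for pi, qi in zip(p_eps, q))  # equals eps

assert math.isclose(tv, eps)
assert kl_divergence(p_eps, q) == math.inf
```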

However, an inverse inequality holds on finite spaces ${\displaystyle X}$ with a constant depending on ${\displaystyle Q}$.[4] More specifically, it can be shown that, with the definition ${\displaystyle \alpha _{Q}:=\min _{x\in X:Q(x)>0}Q(x)}$, we have for any measure ${\displaystyle P}$ which is absolutely continuous with respect to ${\displaystyle Q}$

${\displaystyle {\frac {1}{2}}D_{\mathrm {KL} }(P\|Q)\leq {\frac {1}{\alpha _{Q}}}\delta (P,Q)^{2}.}$

As a consequence, if ${\displaystyle Q}$ has full support (i.e. ${\displaystyle Q(x)>0}$ for all ${\displaystyle x\in X}$), then

${\displaystyle \delta (P,Q)^{2}\leq {\frac {1}{2}}D_{\mathrm {KL} }(P\|Q)\leq {\frac {1}{\alpha _{Q}}}\delta (P,Q)^{2}.}$
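The two-sided bound can likewise be verified on a small finite example. The sketch below (function names are illustrative; the finite-space formulas for ${\displaystyle \delta }$ and ${\displaystyle D_{\mathrm {KL} }}$ are as in the formal statement) computes all three quantities for a fully supported ${\displaystyle Q}$:

```python
import math

def kl_divergence(p, q):
    # D_KL(P || Q) in nats for finite probability vectors.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    # delta(P, Q) = (1/2) * sum_i |P(i) - Q(i)|.
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.3, 0.5, 0.2]
q = [0.4, 0.4, 0.2]          # full support, so the sandwich bound applies

alpha_q = min(qi for qi in q if qi > 0)   # smallest mass Q places on its support

lhs = total_variation(p, q) ** 2          # delta(P, Q)^2
mid = 0.5 * kl_divergence(p, q)           # (1/2) D_KL(P || Q)
rhs = total_variation(p, q) ** 2 / alpha_q

# delta^2 <= (1/2) D_KL <= delta^2 / alpha_Q
assert lhs <= mid <= rhs
```

Here the chain reads approximately ${\displaystyle 0.01\leq 0.0126\leq 0.05}$, with the gap between the outer terms governed by ${\displaystyle \alpha _{Q}=0.2}$.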

## References

1. ^ Csiszár, Imre; Körner, János (2011). Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press. p. 44. ISBN 9781139499989.
2. ^ Tsybakov, Alexandre (2009). Introduction to Nonparametric Estimation. Springer. p. 132. ISBN 9780387790527.
3. ^ The divergence becomes infinite whenever one of the two distributions assigns probability zero to an event while the other assigns it a nonzero probability (no matter how small); see e.g. Basu, Mitra; Ho, Tin Kam (2006). Data Complexity in Pattern Recognition. Springer. p. 161. ISBN 9781846281723.
4. ^ see Lemma 4.1 in Götze, Friedrich; Sambale, Holger; Sinulis, Arthur. "Higher order concentration for functions of weakly dependent random variables". arXiv:1801.06348.