# Kullback's inequality

In information theory and statistics, Kullback's inequality is a lower bound on the Kullback–Leibler divergence expressed in terms of the large deviations rate function.[1] If P and Q are probability distributions on the real line whose first moments exist, and P is absolutely continuous with respect to Q (written P ≪ Q), then

${\displaystyle D_{KL}(P\|Q)\geq \Psi _{Q}^{*}(\mu '_{1}(P)),}$

where ${\displaystyle \Psi _{Q}^{*}}$ is the rate function, i.e. the convex conjugate of the cumulant-generating function, of ${\displaystyle Q}$, and ${\displaystyle \mu '_{1}(P)}$ is the first moment of ${\displaystyle P.}$
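
As a quick illustration (a worked example under stated assumptions, not part of the cited statement), take Q to be the standard normal distribution and P a normal distribution with mean m and standard deviation s; the symbols m and s are introduced only for this example. Then ${\displaystyle \Psi _{Q}(\theta )=\theta ^{2}/2}$, so ${\displaystyle \Psi _{Q}^{*}(\mu )=\mu ^{2}/2}$, and the inequality reads

${\displaystyle D_{KL}(P\|Q)={\tfrac {1}{2}}\left(s^{2}+m^{2}-1\right)-\log s\;\geq \;{\tfrac {1}{2}}m^{2}=\Psi _{Q}^{*}(\mu '_{1}(P)).}$

This reduces to ${\displaystyle {\tfrac {1}{2}}(s^{2}-1)\geq \log s}$, which holds for every s > 0, with equality exactly when s = 1, that is, when P is a shifted copy of Q.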

The Cramér–Rao bound is a corollary of this result.

## Proof

Let P and Q be probability distributions (measures) on the real line, whose first moments exist, and such that P ≪ Q. Consider the natural exponential family of Q given by

${\displaystyle Q_{\theta }(A)={\frac {\int _{A}e^{\theta x}Q(dx)}{\int _{-\infty }^{\infty }e^{\theta x}Q(dx)}}={\frac {1}{M_{Q}(\theta )}}\int _{A}e^{\theta x}Q(dx)}$

for every measurable set A, where ${\displaystyle M_{Q}}$ is the moment-generating function of Q. (Note that ${\displaystyle Q_{0}=Q}$.) Then

${\displaystyle D_{KL}(P\|Q)=D_{KL}(P\|Q_{\theta })+\int _{\mathrm {supp} P}\left(\log {\frac {\mathrm {d} Q_{\theta }}{\mathrm {d} Q}}\right)\mathrm {d} P.}$

By Gibbs' inequality we have ${\displaystyle D_{KL}(P\|Q_{\theta })\geq 0}$ so that

${\displaystyle D_{KL}(P\|Q)\geq \int _{\mathrm {supp} P}\left(\log {\frac {\mathrm {d} Q_{\theta }}{\mathrm {d} Q}}\right)\mathrm {d} P=\int _{\mathrm {supp} P}\left(\log {\frac {e^{\theta x}}{M_{Q}(\theta )}}\right)P(dx).}$

Simplifying the right side, we have, for every real θ where ${\displaystyle M_{Q}(\theta )<\infty :}$

${\displaystyle D_{KL}(P\|Q)\geq \mu '_{1}(P)\theta -\Psi _{Q}(\theta ),}$

where ${\displaystyle \mu '_{1}(P)}$ is the first moment, or mean, of P, and ${\displaystyle \Psi _{Q}=\log M_{Q}}$ is called the cumulant-generating function. Since the bound holds for every such θ, it also holds for the supremum over θ, which is by definition the convex conjugate of ${\displaystyle \Psi _{Q}}$, i.e. the rate function:

${\displaystyle D_{KL}(P\|Q)\geq \sup _{\theta }\left\{\mu '_{1}(P)\theta -\Psi _{Q}(\theta )\right\}=\Psi _{Q}^{*}(\mu '_{1}(P)).}$
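
The proof can be checked numerically on a small discrete example. The sketch below is illustrative only: the support `xs`, the probability vectors `p` and `q`, and all function names are assumptions made for this example, and the supremum is located by bisection on the assumption that the maximising θ lies in the interval [-50, 50].

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for discrete distributions on a shared support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cgf(xs, q, theta):
    """Psi_Q(theta) = log sum_i q_i exp(theta * x_i), computed via log-sum-exp."""
    terms = [math.log(qi) + theta * x for x, qi in zip(xs, q) if qi > 0]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def tilted(xs, q, theta):
    """Probabilities of the exponentially tilted distribution Q_theta."""
    log_m = cgf(xs, q, theta)
    return [qi * math.exp(theta * x - log_m) if qi > 0 else 0.0
            for x, qi in zip(xs, q)]

def rate_function(xs, q, mu, lo=-50.0, hi=50.0, iters=200):
    """Psi_Q^*(mu) = sup_theta {mu * theta - Psi_Q(theta)}.

    The objective is concave in theta with derivative mu - mean(Q_theta),
    so bisection on the sign of that derivative finds the maximiser,
    assuming it lies in [lo, hi].
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mean_mid = sum(x * w for x, w in zip(xs, tilted(xs, q, mid)))
        if mean_mid < mu:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    return mu * theta - cgf(xs, q, theta)

# Hypothetical example data: two distributions on {0, 1, 2, 3}.
xs = [0.0, 1.0, 2.0, 3.0]
q = [0.4, 0.3, 0.2, 0.1]
p = [0.1, 0.2, 0.3, 0.4]
mean_p = sum(x * pi for x, pi in zip(xs, p))

kl = kl_divergence(p, q)
bound = rate_function(xs, q, mean_p)
print(f"D_KL(P||Q) = {kl:.6f} >= Psi*_Q(mean of P) = {bound:.6f}")
```

When P is itself a member of the exponential family of Q, the term ${\displaystyle D_{KL}(P\|Q_{\theta })}$ vanishes at the matching θ, so the bound should hold with equality; replacing `p` with `tilted(xs, q, 0.7)` is one way to observe this.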

## Corollary: the Cramér–Rao bound

Let ${\displaystyle X_{\theta }}$ be a family of probability distributions on the real line indexed by the real parameter θ, and satisfying certain regularity conditions. Then

${\displaystyle \lim _{h\rightarrow 0}{\frac {D_{KL}(X_{\theta +h}\|X_{\theta })}{h^{2}}}\geq \lim _{h\rightarrow 0}{\frac {\Psi _{\theta }^{*}(\mu _{\theta +h})}{h^{2}}},}$

where ${\displaystyle \Psi _{\theta }^{*}}$ is the convex conjugate of the cumulant-generating function of ${\displaystyle X_{\theta }}$ and ${\displaystyle \mu _{\theta +h}}$ is the first moment of ${\displaystyle X_{\theta +h}.}$

### Left side

The left side of this inequality can be simplified as follows:

${\displaystyle {\begin{aligned}\lim _{h\to 0}{\frac {D_{KL}(X_{\theta +h}\|X_{\theta })}{h^{2}}}&=\lim _{h\to 0}{\frac {1}{h^{2}}}\int _{-\infty }^{\infty }\log \left({\frac {\mathrm {d} X_{\theta +h}}{\mathrm {d} X_{\theta }}}\right)\mathrm {d} X_{\theta +h}\\&=-\lim _{h\to 0}{\frac {1}{h^{2}}}\int _{-\infty }^{\infty }\log \left(1-\left(1-{\frac {\mathrm {d} X_{\theta }}{\mathrm {d} X_{\theta +h}}}\right)\right)\mathrm {d} X_{\theta +h}\\&=\lim _{h\to 0}{\frac {1}{h^{2}}}\int _{-\infty }^{\infty }\left[\left(1-{\frac {\mathrm {d} X_{\theta }}{\mathrm {d} X_{\theta +h}}}\right)+{\frac {1}{2}}\left(1-{\frac {\mathrm {d} X_{\theta }}{\mathrm {d} X_{\theta +h}}}\right)^{2}+o\left(\left(1-{\frac {\mathrm {d} X_{\theta }}{\mathrm {d} X_{\theta +h}}}\right)^{2}\right)\right]\mathrm {d} X_{\theta +h}&&{\text{Taylor series for }}-\log(1-t)\\&=\lim _{h\to 0}{\frac {1}{h^{2}}}\int _{-\infty }^{\infty }\left[{\frac {1}{2}}\left(1-{\frac {\mathrm {d} X_{\theta }}{\mathrm {d} X_{\theta +h}}}\right)^{2}\right]\mathrm {d} X_{\theta +h}\\&=\lim _{h\to 0}{\frac {1}{h^{2}}}\int _{-\infty }^{\infty }\left[{\frac {1}{2}}\left({\frac {\mathrm {d} X_{\theta +h}-\mathrm {d} X_{\theta }}{\mathrm {d} X_{\theta +h}}}\right)^{2}\right]\mathrm {d} X_{\theta +h}\\&={\frac {1}{2}}{\mathcal {I}}_{X}(\theta )\end{aligned}}}$

which is half the Fisher information of the parameter θ. The first-order term drops out in the fourth line because both measures have total mass one, so ${\displaystyle \int _{-\infty }^{\infty }\left(1-{\frac {\mathrm {d} X_{\theta }}{\mathrm {d} X_{\theta +h}}}\right)\mathrm {d} X_{\theta +h}=1-1=0.}$
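
The limit on the left side can be illustrated with a family for which the divergence has a closed form. The sketch below assumes the Poisson family with mean θ, for which ${\displaystyle D_{KL}(\mathrm {Poisson} (a)\|\mathrm {Poisson} (b))=a\log(a/b)-a+b}$ and the Fisher information is 1/θ; the choice of family, the value θ = 2 and the function name are assumptions made for illustration.

```python
import math

def kl_poisson(a, b):
    """D_KL(Poisson(a) || Poisson(b)) in nats: a*log(a/b) - a + b."""
    return a * math.log(a / b) - a + b

theta = 2.0
fisher_information = 1.0 / theta  # Fisher information of the Poisson mean

# The ratio D_KL / h^2 should approach I(theta) / 2 as h shrinks.
ratios = {h: kl_poisson(theta + h, theta) / h**2 for h in (1e-1, 1e-2, 1e-3)}
for h, ratio in ratios.items():
    print(f"h={h:g}  D_KL/h^2={ratio:.6f}  I/2={fisher_information / 2:.6f}")
```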

### Right side

The right side of the inequality can be developed as follows:

${\displaystyle \lim _{h\rightarrow 0}{\frac {\Psi _{\theta }^{*}(\mu _{\theta +h})}{h^{2}}}=\lim _{h\rightarrow 0}{\frac {1}{h^{2}}}{\sup _{t}\{\mu _{\theta +h}t-\Psi _{\theta }(t)\}}.}$

The supremum is attained at t = τ, where the first derivative of the cumulant-generating function satisfies ${\displaystyle \Psi '_{\theta }(\tau )=\mu _{\theta +h}.}$ Since ${\displaystyle \Psi '_{\theta }(0)=\mu _{\theta },}$ subtracting the two and letting h → 0 (so that τ → 0) gives

${\displaystyle \Psi ''_{\theta }(0)={\frac {d\mu _{\theta }}{d\theta }}\lim _{h\rightarrow 0}{\frac {h}{\tau }}.}$

Moreover,

${\displaystyle \lim _{h\rightarrow 0}{\frac {\Psi _{\theta }^{*}(\mu _{\theta +h})}{h^{2}}}={\frac {1}{2\Psi ''_{\theta }(0)}}\left({\frac {d\mu _{\theta }}{d\theta }}\right)^{2}={\frac {1}{2\mathrm {Var} (X_{\theta })}}\left({\frac {d\mu _{\theta }}{d\theta }}\right)^{2}.}$
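
The step from the supremum to this limit can be filled in with a second-order expansion. The following is a sketch, assuming that ${\displaystyle \Psi _{\theta }}$ is twice continuously differentiable near 0 (taken here to be part of the regularity conditions). Using ${\displaystyle \Psi _{\theta }(0)=0}$ and ${\displaystyle \Psi '_{\theta }(0)=\mu _{\theta }}$,

${\displaystyle \Psi _{\theta }^{*}(\mu _{\theta +h})=\mu _{\theta +h}\tau -\Psi _{\theta }(\tau )=(\mu _{\theta +h}-\mu _{\theta })\tau -{\frac {1}{2}}\Psi ''_{\theta }(0)\tau ^{2}+o(\tau ^{2}).}$

Substituting ${\displaystyle \mu _{\theta +h}-\mu _{\theta }={\frac {d\mu _{\theta }}{d\theta }}h+o(h)}$ and ${\displaystyle \tau ={\frac {1}{\Psi ''_{\theta }(0)}}{\frac {d\mu _{\theta }}{d\theta }}h+o(h)}$ gives

${\displaystyle \Psi _{\theta }^{*}(\mu _{\theta +h})={\frac {h^{2}}{2\Psi ''_{\theta }(0)}}\left({\frac {d\mu _{\theta }}{d\theta }}\right)^{2}+o(h^{2}),}$

and ${\displaystyle \Psi ''_{\theta }(0)}$ is the second cumulant of ${\displaystyle X_{\theta }}$, i.e. its variance.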

### Putting both sides back together

We have:

${\displaystyle {\frac {1}{2}}{\mathcal {I}}_{X}(\theta )\geq {\frac {1}{2\mathrm {Var} (X_{\theta })}}\left({\frac {d\mu _{\theta }}{d\theta }}\right)^{2},}$

which can be rearranged as:

${\displaystyle \mathrm {Var} (X_{\theta })\geq {\frac {(d\mu _{\theta }/d\theta )^{2}}{{\mathcal {I}}_{X}(\theta )}}.}$
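
For a case where the bound is strict, one can look at a location family that is not an exponential family. The sketch below assumes the Laplace family with location θ and scale b, using the standard closed forms: variance 2b², Fisher information 1/b² for the location, and ${\displaystyle D_{KL}=d+e^{-d}-1}$ with d = |h|/b between two members whose locations differ by h. The choice of family and the function name are assumptions made for illustration, and the Laplace density has a kink at its location, so the regularity conditions are treated loosely here.

```python
import math

def kl_laplace_shift(h, b=1.0):
    """D_KL(Laplace(theta + h, b) || Laplace(theta, b)); depends only on |h| / b."""
    d = abs(h) / b
    return d + math.expm1(-d)  # d + exp(-d) - 1, computed stably for small d

b = 1.0
variance = 2.0 * b**2            # Var(X_theta) for a Laplace distribution
fisher_information = 1.0 / b**2  # Fisher information for the location theta
dmu_dtheta = 1.0                 # the mean equals the location parameter

h = 1e-4
left = kl_laplace_shift(h, b) / h**2          # approaches I / 2
right = dmu_dtheta**2 / (2.0 * variance)      # (d mu / d theta)^2 / (2 Var)

print(f"left  (D_KL / h^2)      = {left:.6f}")
print(f"right (rate-function)   = {right:.6f}")
print(f"Var = {variance:.3f} >= (dmu/dtheta)^2 / I = {dmu_dtheta**2 / fisher_information:.3f}")
```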