# Gibbs' inequality

In information theory, Gibbs' inequality is a statement about the information entropy of a discrete probability distribution. Several other bounds on the entropy of probability distributions are derived from Gibbs' inequality, including Fano's inequality. It was first presented by J. Willard Gibbs in the 19th century.

## Gibbs' inequality

Suppose that

${\displaystyle P=\{p_{1},\ldots ,p_{n}\}}$

is a probability distribution. Then for any other probability distribution

${\displaystyle Q=\{q_{1},\ldots ,q_{n}\}}$

the following inequality between non-negative quantities (since the ${\displaystyle p_{i}}$ and ${\displaystyle q_{i}}$ lie between zero and one) holds:[1]:68

${\displaystyle -\sum _{i=1}^{n}p_{i}\log p_{i}\leq -\sum _{i=1}^{n}p_{i}\log q_{i}}$

with equality if and only if

${\displaystyle p_{i}=q_{i}}$

for all i. Put in words, the information entropy of a distribution P is less than or equal to its cross entropy with any other distribution Q.

The difference between the two quantities is the Kullback–Leibler divergence or relative entropy, so the inequality can also be written:[2]:34

${\displaystyle D_{\mathrm {KL} }(P\|Q)\equiv \sum _{i=1}^{n}p_{i}\log {\frac {p_{i}}{q_{i}}}\geq 0.}$

Note that the use of base-2 logarithms is optional, and allows one to refer to the quantity on each side of the inequality as an "average surprisal" measured in bits.
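
As a quick sanity check, the inequality can be verified numerically. The sketch below (plain Python; the distributions P and Q are arbitrary illustrative examples, not taken from the text) computes the entropy, the cross entropy, and their difference, the Kullback–Leibler divergence, in bits.

```python
import math

# Two illustrative probability distributions on the same three outcomes.
P = [0.5, 0.25, 0.25]
Q = [0.4, 0.4, 0.2]

# H(P): entropy of P (base-2 logarithms, i.e. bits).
entropy = -sum(p * math.log2(p) for p in P if p > 0)

# H(P, Q): cross entropy of P with Q.
cross_entropy = -sum(p * math.log2(q) for p, q in zip(P, Q) if p > 0)

# D_KL(P || Q): their difference, which Gibbs' inequality says is >= 0.
kl_divergence = cross_entropy - entropy

print(f"H(P)        = {entropy:.4f} bits")
print(f"H(P, Q)     = {cross_entropy:.4f} bits")
print(f"D_KL(P||Q)  = {kl_divergence:.4f} bits (non-negative)")
```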

## Proof

For simplicity, we prove the statement using the natural logarithm (ln), since

${\displaystyle \log a={\frac {\ln a}{\ln 2}},}$

the particular logarithm we choose only scales the relationship.

Let ${\displaystyle I}$ denote the set of all ${\displaystyle i}$ for which ${\displaystyle p_{i}}$ is non-zero. Then, since ${\displaystyle \ln x\leq x-1}$ for all ${\displaystyle x>0}$, with equality if and only if ${\displaystyle x=1}$, we have:

${\displaystyle -\sum _{i\in I}p_{i}\ln {\frac {q_{i}}{p_{i}}}\geq -\sum _{i\in I}p_{i}\left({\frac {q_{i}}{p_{i}}}-1\right)=-\sum _{i\in I}q_{i}+\sum _{i\in I}p_{i}=-\sum _{i\in I}q_{i}+1\geq 0}$

The last inequality is a consequence of the ${\displaystyle p_{i}}$ and ${\displaystyle q_{i}}$ being probability distributions: the ${\displaystyle p_{i}}$ with ${\displaystyle i\in I}$ sum to 1, whereas some non-zero ${\displaystyle q_{i}}$ may have been excluded, since the index set ${\displaystyle I}$ keeps only those ${\displaystyle i}$ for which ${\displaystyle p_{i}}$ is non-zero. Therefore the ${\displaystyle q_{i}}$ with ${\displaystyle i\in I}$ may sum to less than 1.

So far, over the index set ${\displaystyle I}$, we have:

${\displaystyle -\sum _{i\in I}p_{i}\ln {\frac {q_{i}}{p_{i}}}\geq 0}$,

or equivalently

${\displaystyle -\sum _{i\in I}p_{i}\ln q_{i}\geq -\sum _{i\in I}p_{i}\ln p_{i}}$.

Both sums can be extended to all ${\displaystyle i=1,\ldots ,n}$, i.e. including ${\displaystyle p_{i}=0}$, by recalling that the expression ${\displaystyle p\ln p}$ tends to 0 as ${\displaystyle p}$ tends to 0, and ${\displaystyle (-\ln q)}$ tends to ${\displaystyle \infty }$ as ${\displaystyle q}$ tends to 0. We arrive at

${\displaystyle -\sum _{i=1}^{n}p_{i}\ln q_{i}\geq -\sum _{i=1}^{n}p_{i}\ln p_{i}}$

For equality to hold, we require

1. ${\displaystyle {\frac {q_{i}}{p_{i}}}=1}$ for all ${\displaystyle i\in I}$ so that the equality ${\displaystyle \ln {\frac {q_{i}}{p_{i}}}={\frac {q_{i}}{p_{i}}}-1}$ holds,
2. and ${\displaystyle \sum _{i\in I}q_{i}=1}$ which means ${\displaystyle q_{i}=0}$ if ${\displaystyle i\notin I}$, that is, ${\displaystyle q_{i}=0}$ if ${\displaystyle p_{i}=0}$.

This can happen if and only if ${\displaystyle p_{i}=q_{i}}$ for ${\displaystyle i=1,\ldots ,n}$.
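
The proof also lends itself to a direct numerical illustration. A minimal sketch (plain Python; the random distributions are illustrative, and one ${\displaystyle p_{i}}$ is forced to zero) checks the elementary bound ${\displaystyle \ln x\leq x-1}$ on each term and the resulting inequality, showing in passing that the ${\displaystyle q_{i}}$ summed over ${\displaystyle I}$ may fall short of 1.

```python
import math
import random

random.seed(0)

for _ in range(5):
    # Random 4-outcome distributions; index 0 of p is forced to zero so that
    # the index set I excludes it and sum_{i in I} q_i can be less than 1.
    p = [random.random() for _ in range(4)]
    q = [random.random() for _ in range(4)]
    p[0] = 0.0
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]

    I = [i for i in range(4) if p[i] > 0]  # indices with non-zero p_i

    # Step 1: ln(q_i / p_i) <= q_i / p_i - 1 for every i in I.
    assert all(math.log(q[i] / p[i]) <= q[i] / p[i] - 1 for i in I)

    # Step 2: cross entropy dominates entropy (natural-log units).
    cross_entropy = -sum(p[i] * math.log(q[i]) for i in I)
    entropy = -sum(p[i] * math.log(p[i]) for i in I)
    assert cross_entropy >= entropy

    print(f"sum of q_i over I = {sum(q[i] for i in I):.3f} <= 1, "
          f"excess = {cross_entropy - entropy:.3f} >= 0")
```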

## Alternative proofs

The result can alternatively be proved using Jensen's inequality, the log sum inequality, or the fact that the Kullback–Leibler divergence is a form of Bregman divergence. Below we give a proof based on Jensen's inequality:

Because log is a concave function, we have that:

${\displaystyle \sum _{i}p_{i}\log {\frac {q_{i}}{p_{i}}}\leq \log \sum _{i}p_{i}{\frac {q_{i}}{p_{i}}}=\log \sum _{i}q_{i}=0}$

Here the first inequality is due to Jensen's inequality, and the last equality holds because ${\displaystyle q}$ is a probability distribution.

Furthermore, since ${\displaystyle \log }$ is strictly concave, the equality condition of Jensen's inequality gives equality if and only if

${\displaystyle {\frac {q_{1}}{p_{1}}}={\frac {q_{2}}{p_{2}}}=\cdots ={\frac {q_{n}}{p_{n}}}}$

Suppose that this common ratio is ${\displaystyle \sigma }$. Then

${\displaystyle 1=\sum _{i}q_{i}=\sum _{i}\sigma p_{i}=\sigma }$

Here we use the fact that ${\displaystyle p}$ and ${\displaystyle q}$ are probability distributions. Therefore equality holds if and only if ${\displaystyle p=q}$.
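
For concreteness, the Jensen step can also be checked numerically. A minimal sketch (plain Python; the distributions are arbitrary illustrative examples) compares the ${\displaystyle p}$-weighted average of ${\displaystyle \log(q_{i}/p_{i})}$ with the logarithm of the corresponding weighted average, which equals ${\displaystyle \log 1=0}$.

```python
import math

# Illustrative distributions (not taken from the text).
p = [0.2, 0.5, 0.3]
q = [0.3, 0.3, 0.4]

# Jensen's inequality for the concave log:
# sum_i p_i * log(q_i / p_i)  <=  log( sum_i p_i * (q_i / p_i) ) = log(1) = 0.
lhs = sum(pi * math.log(qi / pi) for pi, qi in zip(p, q))
rhs = math.log(sum(pi * (qi / pi) for pi, qi in zip(p, q)))

print(f"{lhs:.4f} <= {rhs:.4f}")  # left side is strictly negative unless p == q
```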

## Corollary

The entropy of ${\displaystyle P}$ is bounded by:[1]:68

${\displaystyle H(p_{1},\ldots ,p_{n})\leq \log n.}$

The proof is immediate: set ${\displaystyle q_{i}=1/n}$ for all ${\displaystyle i}$ in Gibbs' inequality; the right-hand side then becomes ${\displaystyle -\sum _{i=1}^{n}p_{i}\log(1/n)=\log n}$.
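
A brief numerical illustration of the corollary (plain Python; the example distributions are arbitrary): the uniform distribution on ${\displaystyle n}$ outcomes attains the bound ${\displaystyle \log n}$, while more concentrated distributions fall below it.

```python
import math

def entropy_bits(p):
    """Entropy of a discrete distribution in bits, skipping zero entries."""
    return -sum(x * math.log2(x) for x in p if x > 0)

n = 4
examples = [
    [0.25, 0.25, 0.25, 0.25],  # uniform: entropy equals log2(4) = 2 bits
    [0.7, 0.1, 0.1, 0.1],      # concentrated: entropy below the bound
    [1.0, 0.0, 0.0, 0.0],      # degenerate: entropy 0
]
for p in examples:
    print(f"H = {entropy_bits(p):.4f} bits <= log2({n}) = {math.log2(n):.4f}")
```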

## References

1. Pierre Bremaud (6 December 2012). An Introduction to Probabilistic Modeling. Springer Science & Business Media. ISBN 978-1-4612-1046-7.
2. David J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press. ISBN 978-0-521-64298-9.