# Inequalities in information theory

Inequalities play a central role in information theory, and they appear in several distinct contexts.

## Shannon-type inequalities

Consider a finite collection of finitely (or at most countably) supported random variables on the same probability space. A collection of n random variables has ${\displaystyle 2^{n}-1}$ non-empty subsets, and a joint entropy can be defined for each of them. For example, when n = 2, we may consider the entropies ${\displaystyle H(X_{1}),}$ ${\displaystyle H(X_{2}),}$ and ${\displaystyle H(X_{1},X_{2}),}$ and express the following inequalities (which together characterize the range of the marginal and joint entropies of two random variables):

• ${\displaystyle H(X_{1})\geq 0}$
• ${\displaystyle H(X_{2})\geq 0}$
• ${\displaystyle H(X_{1})\leq H(X_{1},X_{2})}$
• ${\displaystyle H(X_{2})\leq H(X_{1},X_{2})}$
• ${\displaystyle H(X_{1},X_{2})\leq H(X_{1})+H(X_{2}).}$

In fact, these can all be expressed as special cases of a single inequality involving the conditional mutual information, namely

${\displaystyle I(A;B|C)\geq 0,}$

where ${\displaystyle A}$, ${\displaystyle B}$, and ${\displaystyle C}$ each denote the joint distribution of some arbitrary (possibly empty) subset of our collection of random variables. Inequalities that can be derived from this inequality are known as Shannon-type inequalities. More formally (following the notation of Yeung [1]), define ${\displaystyle \Gamma _{n}^{*}}$ to be the set of all constructible points in ${\displaystyle \mathbb {R} ^{2^{n}-1},}$ where a point is said to be constructible if and only if there is a joint, discrete distribution of n random variables such that each coordinate of that point, indexed by a non-empty subset of {1, 2, ..., n}, is equal to the joint entropy of the corresponding subset of the n random variables. The closure of ${\displaystyle \Gamma _{n}^{*}}$ is denoted ${\displaystyle {\overline {\Gamma _{n}^{*}}}.}$ The cone in ${\displaystyle \mathbb {R} ^{2^{n}-1}}$ characterized by all Shannon-type inequalities among n random variables is denoted ${\displaystyle \Gamma _{n}.}$ In general

${\displaystyle \Gamma _{n}^{*}\subseteq {\overline {\Gamma _{n}^{*}}}\subseteq \Gamma _{n}.}$

Software has been developed to automate the task of proving such inequalities.[2][3] Given an inequality, such software is able to determine whether the half-space defined by the inequality contains the cone ${\displaystyle \Gamma _{n},}$ in which case the inequality holds for every entropy vector, since ${\displaystyle \Gamma _{n}^{*}\subseteq \Gamma _{n}.}$
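As a numerical illustration of the two-variable case, the following minimal sketch computes the entropy vector of a small joint distribution and checks the five inequalities listed earlier in this section. The helper names `joint_entropy` and `entropy_vector` and the example distribution are assumptions made for illustration; variables are indexed from 0 in the code.

```python
import itertools
import math
from collections import defaultdict


def joint_entropy(pmf, subset):
    """Entropy in bits of the variables whose (0-based) indices are in `subset`.

    `pmf` maps outcome tuples (x_1, ..., x_n) to probabilities.
    """
    marginal = defaultdict(float)
    for outcome, p in pmf.items():
        marginal[tuple(outcome[i] for i in subset)] += p
    return -sum(p * math.log2(p) for p in marginal.values() if p > 0)


def entropy_vector(pmf, n):
    """Joint entropies of all 2^n - 1 non-empty subsets of n variables."""
    return {
        subset: joint_entropy(pmf, subset)
        for size in range(1, n + 1)
        for subset in itertools.combinations(range(n), size)
    }


# Hypothetical example: X1 is a fair bit and X2 is a noisy copy of X1.
pmf = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
h = entropy_vector(pmf, 2)
h1, h2, h12 = h[(0,)], h[(1,)], h[(0, 1)]

eps = 1e-12  # tolerance for floating-point rounding
checks = [
    h1 >= 0,
    h2 >= 0,
    h1 <= h12 + eps,
    h2 <= h12 + eps,
    h12 <= h1 + h2 + eps,
]
print(len(h), all(checks))
```

A check of this kind only confirms that one particular distribution satisfies the inequalities; it does not prove them.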

## Non-Shannon-type inequalities

Other, less trivial inequalities have been discovered among the entropies and joint entropies of four or more random variables, which cannot be derived from Shannon's basic inequalities. These are known as non-Shannon-type inequalities. In 1997 and 1998, Zhang and Yeung reported two non-Shannon-type inequalities.[4][5] The latter implies that

${\displaystyle {\overline {\Gamma _{n}^{*}}}\subset \Gamma _{n},}$

where the inclusion is proper for ${\displaystyle n\geq 4.}$ Both sets are, in fact, convex cones.

Further non-Shannon-type inequalities were subsequently reported.[6][7][8] Dougherty et al.[9] found a number of non-Shannon-type inequalities by computer search. Matus[10] proved the existence of infinitely many linear non-Shannon-type inequalities.

## Lower bounds for the Kullback–Leibler divergence

A great many important inequalities in information theory are actually lower bounds for the Kullback–Leibler divergence. Even the Shannon-type inequalities can be considered part of this category, since the bivariate mutual information can be expressed as the Kullback–Leibler divergence of the joint distribution with respect to the product of the marginals, and thus these inequalities can be seen as a special case of Gibbs' inequality.
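The identity between mutual information and Kullback–Leibler divergence can be checked numerically. The sketch below, which uses an arbitrary example distribution and hypothetical helper names, computes the mutual information of two binary variables once as the divergence of the joint distribution from the product of the marginals and once from entropies.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in nats for pmfs given as dicts over the same outcomes."""
    return sum(pi * math.log(pi / q[x]) for x, pi in p.items() if pi > 0)


def entropy(p):
    """Shannon entropy in nats of a pmf given as a dict."""
    return -sum(pi * math.log(pi) for pi in p.values() if pi > 0)


# Hypothetical joint distribution of two binary variables (X, Y).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal distributions of X and Y.
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

# Product of the marginals, defined on the same outcomes as the joint.
product = {(x, y): px[x] * py[y] for x in px for y in py}

mi_as_kl = kl_divergence(joint, product)
mi_from_entropies = entropy(px) + entropy(py) - entropy(joint)
print(mi_as_kl, mi_from_entropies)
```

The two values agree up to rounding, and both are non-negative, as Gibbs' inequality requires.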

On the other hand, it seems to be much more difficult to derive useful upper bounds for the Kullback–Leibler divergence. This is because the Kullback–Leibler divergence ${\displaystyle D_{KL}(P\|Q)}$ depends very sensitively on events that are very rare in the reference distribution Q. ${\displaystyle D_{KL}(P\|Q)}$ increases without bound as an event of finite non-zero probability in the distribution P becomes exceedingly rare in the reference distribution Q, and in fact ${\displaystyle D_{KL}(P\|Q)}$ is not even defined if an event of non-zero probability in P has zero probability in Q. (Hence the requirement that P be absolutely continuous with respect to Q.)
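This sensitivity can be seen in a small numerical sketch. The example below is illustrative only: P is a fair coin and Q assigns probability ε to one of the two outcomes. The function name and the choice to raise an error when absolute continuity fails are assumptions made here.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in nats for pmfs given as equal-length lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # outcomes with zero probability under P contribute nothing
        if qi == 0:
            # P is not absolutely continuous with respect to Q.
            raise ValueError("divergence undefined: P(x) > 0 but Q(x) = 0")
        total += pi * math.log(pi / qi)
    return total


p = [0.5, 0.5]
epsilons = (1e-1, 1e-3, 1e-6, 1e-9)
divergences = [kl_divergence(p, [1 - eps, eps]) for eps in epsilons]

# The divergence keeps growing as the second outcome becomes rarer under Q.
for eps, d in zip(epsilons, divergences):
    print(eps, d)
```

In this example the divergence grows roughly like half of ln(1/ε), so no finite upper bound holds uniformly over Q.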

### Gibbs' inequality

This fundamental inequality states that the Kullback–Leibler divergence is non-negative.
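For discrete distributions, one standard argument uses the elementary bound ln t ≤ t − 1. The following is a sketch under the assumption that P and Q have probability mass functions p and q (symbols introduced here for illustration) and that the divergence is measured in nats:

```latex
\begin{aligned}
-D_{KL}(P\|Q)
  &= \sum_{x:\,p(x)>0} p(x)\,\ln\frac{q(x)}{p(x)} \\
  &\leq \sum_{x:\,p(x)>0} p(x)\left(\frac{q(x)}{p(x)}-1\right) \\
  &= \sum_{x:\,p(x)>0} q(x) \;-\; 1 \;\leq\; 0 .
\end{aligned}
```

Equality in ln t ≤ t − 1 holds only at t = 1, so the divergence vanishes only when p and q agree.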

### Kullback's inequality

Another inequality concerning the Kullback–Leibler divergence is known as Kullback's inequality.[11] If P and Q are probability distributions on the real line with P absolutely continuous with respect to Q, and whose first moments exist, then

${\displaystyle D_{KL}(P\|Q)\geq \Psi _{Q}^{*}(\mu '_{1}(P)),}$

where ${\displaystyle \Psi _{Q}^{*}}$ is the large deviations rate function, i.e. the convex conjugate of the cumulant-generating function, of Q, and ${\displaystyle \mu '_{1}(P)}$ is the first moment of P.

The Cramér–Rao bound is a corollary of this result.
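As a worked illustration of Kullback's inequality, take Q to be the standard normal distribution and P to be normal with mean μ and variance σ² (an example chosen here, not taken from the cited source). With the divergence measured in nats:

```latex
\begin{aligned}
\Psi_Q(\theta) &= \ln \mathbb{E}_Q\!\left[e^{\theta X}\right] = \tfrac{\theta^{2}}{2},
\qquad
\Psi_Q^{*}(\mu) = \sup_{\theta}\left(\theta\mu - \tfrac{\theta^{2}}{2}\right) = \tfrac{\mu^{2}}{2}, \\
D_{KL}(P\|Q) &= \tfrac{1}{2}\left(\sigma^{2} + \mu^{2} - 1 - \ln\sigma^{2}\right)
 = \Psi_Q^{*}(\mu) + \tfrac{1}{2}\left(\sigma^{2} - 1 - \ln\sigma^{2}\right)
 \;\geq\; \Psi_Q^{*}(\mu).
\end{aligned}
```

The last step uses σ² − 1 − ln σ² ≥ 0, with equality exactly when σ² = 1, so in this example the bound is attained when P is a pure shift of Q.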

### Pinsker's inequality

Pinsker's inequality relates Kullback–Leibler divergence and total variation distance. It states that if P, Q are two probability distributions, then

${\displaystyle {\sqrt {{\frac {1}{2}}D_{KL}^{(e)}(P\|Q)}}\geq \sup\{|P(A)-Q(A)|:A{\text{ is an event to which probabilities are assigned}}\},}$

where

${\displaystyle D_{KL}^{(e)}(P||Q)}$

is the Kullback–Leibler divergence in nats and

${\displaystyle \sup _{A}|P(A)-Q(A)|}$

is the total variation distance.
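For distributions on a small finite set, Pinsker's inequality can be checked by brute force. The sketch below uses example distributions and helper names chosen for illustration. It computes the supremum over all events directly and compares it with the divergence in nats.

```python
import itertools
import math


def kl_nats(p, q):
    """D_KL(P || Q) in nats for pmfs given as equal-length lists (q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def total_variation(p, q):
    """sup_A |P(A) - Q(A)|, by enumerating every event A of a finite sample space."""
    n = len(p)
    best = 0.0
    for size in range(n + 1):
        for event in itertools.combinations(range(n), size):
            gap = abs(sum(p[i] for i in event) - sum(q[i] for i in event))
            best = max(best, gap)
    return best


# Hypothetical example distributions on three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

tv = total_variation(p, q)
bound = math.sqrt(0.5 * kl_nats(p, q))
print(tv, bound, tv <= bound)
```

On a finite space the supremum equals half the sum of the absolute differences |p(x) − q(x)|, which avoids the exponential enumeration used here.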

## Other inequalities

### Hirschman uncertainty

In 1957,[12] Hirschman showed that for a (reasonably well-behaved) function ${\displaystyle f:\mathbb {R} \rightarrow \mathbb {C} }$ such that ${\displaystyle \int _{-\infty }^{\infty }|f(x)|^{2}\,dx=1,}$ and its Fourier transform ${\displaystyle g(y)=\int _{-\infty }^{\infty }f(x)e^{-2\pi ixy}\,dx,}$ the sum of the differential entropies of ${\displaystyle |f|^{2}}$ and ${\displaystyle |g|^{2}}$ is non-negative, i.e.

${\displaystyle -\int _{-\infty }^{\infty }|f(x)|^{2}\log |f(x)|^{2}\,dx-\int _{-\infty }^{\infty }|g(y)|^{2}\log |g(y)|^{2}\,dy\geq 0.}$

Hirschman conjectured, and it was later proved,[13] that a sharper bound of ${\displaystyle \log(e/2),}$ which is attained in the case of a Gaussian distribution, could replace the right-hand side of this inequality. This is especially significant since it implies, and is stronger than, Weyl's formulation of Heisenberg's uncertainty principle.
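The Gaussian case can be worked out explicitly. The following sketch uses a particular normalized Gaussian, chosen here for illustration, with the Fourier convention given above and natural logarithms:

```latex
\begin{aligned}
f(x) &= 2^{1/4} e^{-\pi x^{2}}, \qquad g(y) = 2^{1/4} e^{-\pi y^{2}}, \\
|f(x)|^{2} &= \sqrt{2}\, e^{-2\pi x^{2}}
  \quad\text{(the density of a normal distribution with variance } \sigma^{2} = \tfrac{1}{4\pi}\text{)}, \\
h\!\left(|f|^{2}\right) &= \tfrac{1}{2}\ln\!\left(2\pi e\,\sigma^{2}\right) = \tfrac{1}{2}\ln\frac{e}{2}, \\
h\!\left(|f|^{2}\right) + h\!\left(|g|^{2}\right) &= \ln\frac{e}{2} \approx 0.307 .
\end{aligned}
```

Since this f is its own Fourier transform under the stated convention, both densities have the same entropy and the sum meets the sharper bound with equality.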

### Tao's inequality

Given discrete random variables ${\displaystyle X}$, ${\displaystyle Y}$, and ${\displaystyle Y'}$, such that ${\displaystyle X}$ takes values only in the interval [−1, 1] and ${\displaystyle Y'}$ is determined by ${\displaystyle Y}$ (so that ${\displaystyle H(Y'|Y)=0}$), we have[14][15]

${\displaystyle \mathbb {E} {\big (}{\big |}\mathbb {E} (X|Y')-\mathbb {E} (X|Y){\big |}{\big )}\leq {\sqrt {I(X;Y|Y')\,2\log 2}},}$

relating the conditional expectation to the conditional mutual information. This is a simple consequence of Pinsker's inequality. (Note: the correction factor log 2 inside the radical arises because we are measuring the conditional mutual information in bits rather than nats.)
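Tao's inequality can be checked numerically on a small example. In the sketch below, which is an illustration with an arbitrary joint distribution and hypothetical helper names, X takes values in {−1, 1}, Y takes values in {0, 1, 2, 3}, and Y′ is the coarsening of Y obtained by integer division by 2. The conditional mutual information is computed in bits as H(X, Y′) − H(Y′) − H(X, Y) + H(Y), which is valid because Y′ is a function of Y.

```python
import math
from collections import defaultdict


def entropy_bits(pmf):
    """Shannon entropy in bits of a pmf given as a dict."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)


def marginal(joint, key):
    """Push the joint pmf of (x, y) forward through the function `key`."""
    out = defaultdict(float)
    for (x, y), p in joint.items():
        out[key(x, y)] += p
    return out


def coarsen(y):
    """Y' as a deterministic function of Y (an assumption made for this example)."""
    return y // 2


# Hypothetical joint pmf of (X, Y).
joint = {
    (-1, 0): 0.20, (1, 0): 0.05,
    (-1, 1): 0.05, (1, 1): 0.20,
    (-1, 2): 0.10, (1, 2): 0.15,
    (-1, 3): 0.15, (1, 3): 0.10,
}

p_y = marginal(joint, lambda x, y: y)
p_c = marginal(joint, lambda x, y: coarsen(y))
p_xc = marginal(joint, lambda x, y: (x, coarsen(y)))

# Unnormalized conditional means: sums of x * p(x, y) grouped by Y and by Y'.
sum_y, sum_c = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    sum_y[y] += x * p
    sum_c[coarsen(y)] += x * p

# Left-hand side: E | E(X | Y') - E(X | Y) |.
lhs = sum(
    p_y[y] * abs(sum_c[coarsen(y)] / p_c[coarsen(y)] - sum_y[y] / p_y[y])
    for y in p_y
)

# Right-hand side: sqrt(2 log 2 * I(X; Y | Y')) with I measured in bits.
cmi_bits = entropy_bits(p_xc) - entropy_bits(p_c) - entropy_bits(joint) + entropy_bits(p_y)
rhs = math.sqrt(2 * math.log(2) * cmi_bits)
print(lhs, rhs, lhs <= rhs)
```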

## References

1. Yeung, R.W. (1997). "A framework for linear information inequalities". IEEE Transactions on Information Theory. New York. 43 (6): 1924–1934. doi:10.1109/18.641556.
2. Yeung, R.W.; Yan, Y.O. (1996). "ITIP - Information Theoretic Inequality Prover".
3. Pulikkoonattu, R.; Perron, E.; Diggavi, S. (2007). "Xitip - Information Theoretic Inequalities Prover".
4. Zhang, Z.; Yeung, R. W. (1997). "A non-Shannon-type conditional inequality of information quantities". IEEE Transactions on Information Theory. New York. 43 (6): 1982–1986. doi:10.1109/18.641561.
5. Zhang, Z.; Yeung, R. W. (1998). "On characterization of entropy function via information inequalities". IEEE Transactions on Information Theory. New York. 44 (4): 1440–1452. doi:10.1109/18.681320.
6. Matus, F. (1999). "Conditional independences among four random variables III: Final conclusion". Combinatorics, Probability and Computing. 8 (3): 269–276. doi:10.1017/s0963548399003740.
7. Makarychev, K.; et al. (2002). "A new class of non-Shannon-type inequalities for entropies" (PDF). Communications in Information and Systems. 2 (2): 147–166. doi:10.4310/cis.2002.v2.n2.a3.
8. Zhang, Z. (2003). "On a new non-Shannon-type information inequality" (PDF). Communications in Information and Systems. 3 (1): 47–60. doi:10.4310/cis.2003.v3.n1.a4.
9. Dougherty, R.; et al. (2006). Six new non-Shannon information inequalities. 2006 IEEE International Symposium on Information Theory.
10. Matus, F. (2007). Infinitely many information inequalities. 2007 IEEE International Symposium on Information Theory.
11. Fuchs, Aimé; Letta, Giorgio (1970). "L'inégalité de Kullback. Application à la théorie de l'estimation". Séminaire de probabilités. Strasbourg. 4: 108–131. doi:10.1007/bfb0059338. MR 0267669.
12. Hirschman, I. I. (1957). "A Note on Entropy". American Journal of Mathematics. 79 (1): 152–156. doi:10.2307/2372390. JSTOR 2372390.
13. Beckner, W. (1975). "Inequalities in Fourier Analysis". Annals of Mathematics. 102 (6): 159–182. doi:10.2307/1970980. JSTOR 1970980.
14. Tao, T. (2006). "Szemerédi's regularity lemma revisited". Contrib. Discrete Math. 1: 8–28. arXiv:.
15. Ahlswede, Rudolf (2007). "The final form of Tao's inequality relating conditional expectation and conditional mutual information". Advances in Mathematics of Communications. 1 (2): 239–242. doi:10.3934/amc.2007.1.239.