# Information projection

In information theory, the information projection or I-projection of a probability distribution q onto a set of distributions P is

${\displaystyle p^{*}={\underset {p\in P}{\arg \min }}\operatorname {D} _{\mathrm {KL} }(p||q)}$

where ${\displaystyle D_{\mathrm {KL} }}$ is the Kullback–Leibler divergence from q to p. Viewing the Kullback–Leibler divergence as a measure of distance, the I-projection ${\displaystyle p^{*}}$ is the "closest" distribution to q of all the distributions in P.
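As a minimal sketch of the definition, the following computes an approximate I-projection by brute force on a three-point sample space. The target q, the choice of P as the Binomial(2, θ) family, and the helper names `kl` and `binomial2` are assumptions made purely for illustration. A grid search stands in for a proper optimizer.

```python
import math

def kl(p, q):
    """D_KL(p || q) for discrete distributions given as lists (0 log 0 := 0)."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total

def binomial2(theta):
    """Binomial(2, theta) probabilities on the outcomes 0, 1, 2."""
    return [(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2]

q = [0.5, 0.3, 0.2]  # illustrative target distribution (assumed)

# Approximate argmin over theta of D_KL(p_theta || q) on a uniform grid.
grid = [i / 1000 for i in range(1001)]
theta_star = min(grid, key=lambda t: kl(binomial2(t), q))
p_star = binomial2(theta_star)
```

Here `p_star` approximates the member of the family closest to q in the sense above. The resolution of the grid limits the accuracy of `theta_star`.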

The I-projection is useful in setting up information geometry, notably because of the following inequality, which holds for every p in P when P is convex:[1]

${\displaystyle \operatorname {D} _{\mathrm {KL} }(p||q)\geq \operatorname {D} _{\mathrm {KL} }(p||p^{*})+\operatorname {D} _{\mathrm {KL} }(p^{*}||q)}$

This inequality can be interpreted as an information-geometric version of the Pythagorean theorem, in which the KL divergence plays the role of squared distance in a Euclidean space.
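The inequality can be checked numerically on a small example. The sketch below assumes a convex set of the form P = {p : E_p[f] ≥ c} on a three-point space. For such a set, when q itself lies outside P, the I-projection is an exponentially tilted version of q whose mean equals c. The function names, the values of q, f and c, and the test distributions are all illustrative assumptions. The code also assumes that c is strictly below the largest value of f on the support of q, so that the bisection terminates.

```python
import math

def kl(p, q):
    """D_KL(p || q) for discrete distributions given as lists (0 log 0 := 0)."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def mean(p, f):
    return sum(pi * fi for pi, fi in zip(p, f))

def tilt(q, f, lam):
    """Exponential tilt of q: proportional to q(x) * exp(lam * f(x))."""
    w = [qi * math.exp(lam * fi) for qi, fi in zip(q, f)]
    z = sum(w)
    return [wi / z for wi in w]

def i_projection_mean_at_least(q, f, c, iters=200):
    """I-projection of q onto {p : E_p[f] >= c}, by bisection on the tilt."""
    if mean(q, f) >= c:
        return list(q)  # q already belongs to P
    lo, hi = 0.0, 1.0
    while mean(tilt(q, f, hi), f) < c:  # bracket the tilt parameter
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(tilt(q, f, mid), f) < c:
            lo = mid
        else:
            hi = mid
    return tilt(q, f, hi)

q = [0.5, 0.3, 0.2]   # illustrative target, with E_q[f] = 0.7
f = [0.0, 1.0, 2.0]
c = 1.2
p_star = i_projection_mean_at_least(q, f, c)

# A few members of P (each has mean at least c).
members = [[0.2, 0.2, 0.6], [0.1, 0.6, 0.3], [0.0, 0.0, 1.0]]
gaps = [kl(p, q) - kl(p, p_star) - kl(p_star, q) for p in members]
```

Each entry of `gaps` should be non-negative up to rounding error. For members whose mean is exactly c, the gap should be approximately zero, which corresponds to the equality case of the Pythagorean relation.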

Because ${\displaystyle \operatorname {D} _{\mathrm {KL} }(p||q)\geq 0}$ and the divergence is continuous in p, at least one minimizer of the optimization problem above exists whenever P is closed and non-empty. Furthermore, if P is convex, the minimizer is unique, because the divergence is strictly convex in p.

The reverse I-projection, also known as the moment projection or M-projection, is

${\displaystyle p^{*}={\underset {p\in P}{\arg \min }}\operatorname {D} _{\mathrm {KL} }(q||p)}$

Since the KL divergence is not symmetric in its arguments, the I-projection and the M-projection behave differently. For the I-projection, ${\displaystyle p(x)}$ will typically under-estimate the support of ${\displaystyle q(x)}$ and lock onto one of its modes. This is because ${\displaystyle p(x)=0}$ is required whenever ${\displaystyle q(x)=0}$ for the KL divergence to stay finite. For the M-projection, ${\displaystyle p(x)}$ will typically over-estimate the support of ${\displaystyle q(x)}$. This is because ${\displaystyle p(x)>0}$ is required whenever ${\displaystyle q(x)>0}$ for the KL divergence to stay finite.
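The contrast can be illustrated with a rough numerical sketch. Here q is a bimodal mixture discretized on a grid, and P is a family of discretized single Gaussians indexed by a mean and a standard deviation. The grid, the mixture parameters, the candidate ranges and the helper names are all assumptions chosen for illustration. Both projections are approximated by exhaustive search over the candidates.

```python
import math

def kl(p, q):
    """D_KL(p || q) for discrete distributions given as lists (0 log 0 := 0)."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def gaussian_on_grid(xs, mu, sigma):
    """Gaussian shape evaluated on the grid xs and normalized to sum to 1."""
    w = [math.exp(-0.5 * ((x - mu) / sigma) ** 2) for x in xs]
    z = sum(w)
    return [wi / z for wi in w]

xs = [-6.0 + 0.1 * i for i in range(121)]

# Bimodal target: equal mixture of two narrow bumps at -2.5 and +2.5.
left = gaussian_on_grid(xs, -2.5, 0.5)
right = gaussian_on_grid(xs, 2.5, 0.5)
q = [0.5 * a + 0.5 * b for a, b in zip(left, right)]

# Candidate set P: single Gaussians over a coarse (mu, sigma) grid.
candidates = [(-4.0 + 0.25 * i, 0.3 + 0.1 * j)
              for i in range(33) for j in range(38)]

def project(objective):
    """Return the (mu, sigma) pair minimizing objective(p) over the candidates."""
    return min(candidates,
               key=lambda ms: objective(gaussian_on_grid(xs, ms[0], ms[1])))

mu_i, sigma_i = project(lambda p: kl(p, q))  # I-projection: min D_KL(p || q)
mu_m, sigma_m = project(lambda p: kl(q, p))  # M-projection: min D_KL(q || p)
```

In this setup the I-projection should select a narrow Gaussian centred near one of the two modes, while the M-projection should select a broad Gaussian centred between them that covers both. By symmetry either mode is an equally good I-projection, so which one the search returns depends on candidate ordering and rounding.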

The concept of information projection can be extended to arbitrary statistical f-divergences and other divergences.[2]