# User:Gala.martin/sandbox

## Proofs

A proof of Jensen's inequality can be given in several ways. Three different proofs are presented here, one for each of the three statements above (the finite form, the inequality in measure-theoretic terminology, and the general inequality in probabilistic notation). The first proof establishes the finite form of the inequality and then uses a density argument; it should make clear how the inequality is derived. The second is the most common proof of Jensen's inequality and uses some basic ideas of nonsmooth analysis. The third generalizes the second and proves the general statement for vector-valued random variables. This last proof is the most compact, although it requires a more advanced mathematical background.

### Proof 1 (using the finite form)

If ${\displaystyle \lambda _{1},\,\lambda _{2}}$ are two arbitrary positive real numbers such that ${\displaystyle \lambda _{1}+\lambda _{2}=1}$, then convexity of ${\displaystyle \varphi }$ implies ${\displaystyle \varphi (\lambda _{1}x_{1}+\lambda _{2}x_{2})\leq \lambda _{1}\,\varphi (x_{1})+\lambda _{2}\,\varphi (x_{2})}$ for any ${\displaystyle x_{1},\,x_{2}}$. This can be easily generalized: if ${\displaystyle \lambda _{1},\,\lambda _{2},\ldots ,\lambda _{n}}$ are n positive real numbers such that ${\displaystyle \lambda _{1}+\lambda _{2}+\ldots +\lambda _{n}=1}$, then

${\displaystyle \varphi (\lambda _{1}x_{1}+\lambda _{2}x_{2}+\ldots \lambda _{n}x_{n})\leq \lambda _{1}\,\varphi (x_{1})+\lambda _{2}\,\varphi (x_{2})+\ldots +\lambda _{n}\,\varphi (x_{n})}$,

for any ${\displaystyle x_{1},\,x_{2},\ldots ,\,x_{n}}$. This finite form of Jensen's inequality can be proved by induction: by the convexity hypothesis, the statement is true for ${\displaystyle n=2}$. Suppose it is true for some n; one needs to prove it for n+1. At least one of the ${\displaystyle \lambda _{i}}$ is strictly positive, say ${\displaystyle \lambda _{1}}$; therefore, by the convexity inequality:

${\displaystyle \varphi \left(\sum _{i=1}^{n+1}\lambda _{i}x_{i}\right)=\varphi \left(\lambda _{1}x_{1}+(1-\lambda _{1})\sum _{i=2}^{n+1}{\frac {\lambda _{i}}{1-\lambda _{1}}}x_{i}\right)\leq \lambda _{1}\,\varphi (x_{1})+(1-\lambda _{1})\,\varphi \left(\sum _{i=2}^{n+1}{\frac {\lambda _{i}}{1-\lambda _{1}}}x_{i}\right)}$.

Since ${\displaystyle \sum _{i=2}^{n+1}{\frac {\lambda _{i}}{1-\lambda _{1}}}=1}$, one can apply the induction hypothesis to the last term in the previous formula to obtain the result, namely the finite form of Jensen's inequality.
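The finite form just proved is easy to check numerically. The following sketch evaluates both sides for the convex function ${\displaystyle \varphi (x)=e^{x}}$ with randomly chosen points and weights; the choice of ${\displaystyle \varphi }$ and the sample points are illustrative, not part of the proof.

```python
# Numerical check of the finite form of Jensen's inequality for the
# convex function phi(x) = exp(x), with random positive weights
# summing to 1.  (Illustrative sketch only.)
import math
import random

def finite_jensen_gap(phi, xs, lambdas):
    """Return RHS - LHS of the finite Jensen inequality.

    Nonnegative whenever phi is convex and the lambdas are a
    convex combination (positive, summing to 1).
    """
    lhs = phi(sum(l * x for l, x in zip(lambdas, xs)))
    rhs = sum(l * phi(x) for l, x in zip(lambdas, xs))
    return rhs - lhs

random.seed(0)
xs = [random.uniform(-2.0, 2.0) for _ in range(5)]
raw = [random.random() for _ in range(5)]
total = sum(raw)
lambdas = [w / total for w in raw]     # positive weights summing to 1

gap = finite_jensen_gap(math.exp, xs, lambdas)
print(gap >= 0)   # True: the inequality holds
```

Since the sample points are distinct, the gap is in fact strictly positive, reflecting the strict convexity of the exponential.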

In order to obtain the general inequality from this finite form, one needs to use a density argument. The finite form can be re-written as:

${\displaystyle \varphi \left(\int x\,d\mu _{n}(x)\right)\leq \int \varphi (x)\,d\mu _{n}(x)}$,

where ${\displaystyle \mu _{n}}$ is a measure given by an arbitrary convex combination of Dirac deltas:

${\displaystyle \mu _{n}=\sum _{i=1}^{n}\lambda _{i}\delta _{x_{i}}}$.

Since convex functions are continuous, and since convex combinations of Dirac deltas are weakly dense in the set of probability measures (as can easily be verified), the general statement is obtained simply by a limiting procedure.
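The density argument can be illustrated with empirical measures: the uniform measure on [0, 1] is approximated weakly by ${\displaystyle \mu _{n}={\tfrac {1}{n}}\sum _{i=1}^{n}\delta _{x_{i}}}$ with sampled points ${\displaystyle x_{i}}$. For each n the finite form applies, and both sides converge to the corresponding integrals for the limit measure. The function ${\displaystyle \varphi (x)=x^{2}}$ below is an arbitrary illustrative choice.

```python
# Sketch of the density argument: for the empirical measure built from
# n samples of U[0, 1], the finite form gives phi(mean) <= mean of phi,
# and both sides approach the values for the limit (uniform) measure:
# phi(E x) = (1/2)**2 = 1/4 and E phi(x) = 1/3.
import random

def both_sides(phi, points):
    """Both sides of Jensen's inequality for the empirical measure."""
    mean = sum(points) / len(points)
    return phi(mean), sum(phi(x) for x in points) / len(points)

phi = lambda x: x * x
random.seed(1)
samples = [random.random() for _ in range(100_000)]  # draws from U[0, 1]

lhs, rhs = both_sides(phi, samples)
print(lhs <= rhs)   # finite form holds for the empirical measure
print(abs(lhs - 0.25) < 0.01 and abs(rhs - 1 / 3) < 0.01)
```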

### Proof 2 (measure theoretic notation)

Let g be a real-valued μ-integrable function on a probability space Ω, and let φ be a convex function on the real numbers. Define the right-hand derivative of φ at x as

${\displaystyle \varphi ^{\prime }(x):=\lim _{t\to 0^{+}}{\frac {\varphi (x+t)-\varphi (x)}{t}}.}$

Since φ is convex, the difference quotient on the right-hand side is decreasing as t approaches 0 from the right, and is bounded below by any quotient of the form

${\displaystyle {\frac {\varphi (x+t)-\varphi (x)}{t}}}$

where t < 0; therefore, the limit always exists.
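The monotonicity of the difference quotient, on which the existence of this limit rests, can be observed directly. The sketch below uses ${\displaystyle \varphi (x)=x^{2}}$ at ${\displaystyle x_{0}=1}$ as an arbitrary example; the quotient shrinks as t decreases to 0 and stays above any quotient taken with t < 0.

```python
# For convex phi the difference quotient (phi(x + t) - phi(x)) / t is
# monotone in t: it decreases as t -> 0+ and is bounded below by any
# quotient with t < 0, so the right-hand derivative exists.
def quotient(phi, x, t):
    """Difference quotient of phi at x with increment t."""
    return (phi(x + t) - phi(x)) / t

phi = lambda x: x * x
x0 = 1.0

ts = [1.0, 0.1, 0.01, 0.001]
qs = [quotient(phi, x0, t) for t in ts]           # 3.0, 2.1, 2.01, 2.001
decreasing = all(a >= b for a, b in zip(qs, qs[1:]))
lower_bound = quotient(phi, x0, -0.5)             # a quotient with t < 0

print(decreasing)                          # quotient shrinks as t -> 0+
print(all(q >= lower_bound for q in qs))   # bounded below
```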

Now, let us define the following:

${\displaystyle x_{0}:=\int _{\Omega }g\,d\mu ,}$
${\displaystyle a:=\varphi ^{\prime }(x_{0}),}$
${\displaystyle b:=\varphi (x_{0})-x_{0}\varphi ^{\prime }(x_{0}).}$

Then for all x, ${\displaystyle ax+b\leq \varphi (x)}$. To see this, take x > x0 and set t = x − x0 > 0. Then,

${\displaystyle \varphi ^{\prime }(x_{0})\leq {\frac {\varphi (x_{0}+t)-\varphi (x_{0})}{t}}.}$

Therefore,

${\displaystyle \varphi ^{\prime }(x_{0})(x-x_{0})+\varphi (x_{0})\leq \varphi (x)}$

as desired. The case for x < x0 is proven similarly, and clearly ${\displaystyle ax_{0}+b=\varphi (x_{0})}$.

φ(x0) can then be rewritten as

${\displaystyle ax_{0}+b=a\left(\int _{\Omega }g\,d\mu \right)+b.}$

But since μ(Ω) = 1, for every real number k we have

${\displaystyle \int _{\Omega }k\,d\mu =k.}$

In particular,

${\displaystyle a\left(\int _{\Omega }g\,d\mu \right)+b=\int _{\Omega }(ag+b)\,d\mu \leq \int _{\Omega }\varphi \circ g\,d\mu .}$
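The whole argument of this proof can be checked numerically: the supporting line ${\displaystyle ax+b}$ touches φ at ${\displaystyle x_{0}}$ and lies below it everywhere, and integrating the pointwise bound yields the inequality. In the sketch below, μ is taken to be a finite discrete probability measure and ${\displaystyle \varphi (x)=e^{x}}$; both choices are illustrative.

```python
# Supporting-line argument: with x0 = integral of g, a = phi'(x0) and
# b = phi(x0) - x0 * phi'(x0), the line a*x + b lies below phi, so
# integrating gives phi(x0) <= integral of phi∘g.
import math

phi = math.exp
phi_prime = math.exp          # the (right) derivative of exp is exp

# A discrete probability space: values of g with their probabilities.
g_values = [-1.0, 0.0, 0.5, 2.0]
probs = [0.1, 0.4, 0.3, 0.2]

x0 = sum(p * g for p, g in zip(probs, g_values))   # x0 = integral of g
a = phi_prime(x0)
b = phi(x0) - x0 * phi_prime(x0)

# Pointwise: the supporting line never exceeds phi.
line_below = all(a * x + b <= phi(x) + 1e-12 for x in g_values)

lhs = phi(x0)                                            # phi(∫ g dmu)
rhs = sum(p * phi(g) for p, g in zip(probs, g_values))   # ∫ phi∘g dmu
print(line_below and lhs <= rhs)   # Jensen's inequality
```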

### Proof 3 (general inequality in probabilistic notation)

Let ${\displaystyle X}$ be a random variable that takes values in a real topological vector space T. Since ${\displaystyle \varphi :T\to \mathbb {R} }$ is convex, for any ${\displaystyle x,y\in T}$, the quantity

${\displaystyle {\frac {\varphi (x+\theta \,y)-\varphi (x)}{\theta }}}$,

is decreasing as θ decreases to 0. In particular, the subdifferential of ${\displaystyle \varphi }$ evaluated at ${\displaystyle x}$ in the direction ${\displaystyle y}$ is well defined by:

${\displaystyle (D\varphi )(x)\cdot y:=\lim _{\theta \to 0^{+}}{\frac {\varphi (x+\theta \,y)-\varphi (x)}{\theta }}=\inf _{\theta >0}{\frac {\varphi (x+\theta \,y)-\varphi (x)}{\theta }}}$.

It is easily seen that the subdifferential is linear in ${\displaystyle y}$; since the infimum on the right-hand side of the previous formula is smaller than the value of the same quotient at ${\displaystyle \theta =1}$, one gets:

${\displaystyle \varphi (x)\leq \varphi (x+y)-(D\varphi )(x)\cdot y}$.

In particular, for an arbitrary sub-σ-algebra ${\displaystyle {\mathfrak {G}}}$ we can evaluate the last inequality when ${\displaystyle x=\mathbb {E} \{X|{\mathfrak {G}}\},\,y=X-\mathbb {E} \{X|{\mathfrak {G}}\}}$ to obtain:

${\displaystyle \varphi (\mathbb {E} \{X|{\mathfrak {G}}\})\leq \varphi (X)-(D\varphi )(\mathbb {E} \{X|{\mathfrak {G}}\})\cdot (X-\mathbb {E} \{X|{\mathfrak {G}}\})}$.

Now, taking the expectation conditioned on ${\displaystyle {\mathfrak {G}}}$ on both sides of the previous expression, we obtain the result, since:

${\displaystyle \mathbb {E} \{\left[(D\varphi )(\mathbb {E} \{X|{\mathfrak {G}}\})\cdot (X-\mathbb {E} \{X|{\mathfrak {G}}\})\right]|{\mathfrak {G}}\}=(D\varphi )(\mathbb {E} \{X|{\mathfrak {G}}\})\cdot \mathbb {E} \{\left(X-\mathbb {E} \{X|{\mathfrak {G}}\}\right)|{\mathfrak {G}}\}=0}$,

by the linearity of the subdifferential in the ${\displaystyle y}$ variable, and well-known properties of the conditional expectation.
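The conditional form of the inequality can be illustrated in the simplest setting, where ${\displaystyle {\mathfrak {G}}}$ is generated by a finite partition of a discrete sample space: the conditional expectation is then the cell-wise average, and ${\displaystyle \varphi (\mathbb {E} \{X|{\mathfrak {G}}\})\leq \mathbb {E} \{\varphi (X)|{\mathfrak {G}}\}}$ should hold on every cell. The sample space, the partition, and ${\displaystyle \varphi (x)=x^{2}}$ below are illustrative choices, with all outcomes taken equally likely.

```python
# Conditional Jensen on a finite sample space: G is generated by a
# partition, E{X|G} is the partition-wise average of X, and
# phi(E{X|G}) <= E{phi(X)|G} holds cell by cell.
X = [1.0, -2.0, 0.5, 3.0, -1.0, 2.5, 0.0, 4.0]    # values of X on {0,...,7}
partition = [[0, 1, 2], [3, 4], [5, 6, 7]]        # cells generating G
phi = lambda x: x * x

ok = True
for cell in partition:
    cond_e_x = sum(X[w] for w in cell) / len(cell)        # E{X|G} on the cell
    cond_e_phi = sum(phi(X[w]) for w in cell) / len(cell) # E{phi(X)|G}
    ok = ok and (phi(cond_e_x) <= cond_e_phi)

print(ok)   # True: conditional Jensen holds on each cell
```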