# Concentration inequality

In probability theory, concentration inequalities provide bounds on how a random variable deviates from some value (typically, its expected value). The law of large numbers of classical probability theory states that sums of independent random variables are, under very mild conditions, close to their expectation with high probability. Such sums are the most basic examples of random variables concentrated around their mean. More recent results show that such behavior is shared by other functions of independent random variables.

Concentration inequalities can be sorted according to how much information about the random variable is needed in order to use them.

## Markov's inequality

Let ${\displaystyle X}$ be a random variable that is non-negative (almost surely). Then, for every constant ${\displaystyle a>0}$,

${\displaystyle \Pr(X\geq a)\leq {\frac {\operatorname {E} (X)}{a}}.}$

Note the following extension to Markov's inequality: if ${\displaystyle \Phi }$ is a strictly increasing and non-negative function, then

${\displaystyle \Pr(X\geq a)=\Pr(\Phi (X)\geq \Phi (a))\leq {\frac {\operatorname {E} (\Phi (X))}{\Phi (a)}}.}$
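As a quick sanity check, Markov's inequality can be verified by simulation. The sketch below is purely illustrative: the exponential distribution, sample size, and threshold are arbitrary choices, not part of the statement of the inequality.

```python
import random

random.seed(0)

# Draw samples from a non-negative distribution (Exponential with mean 1).
n = 100_000
samples = [random.expovariate(1.0) for _ in range(n)]
mean = sum(samples) / n

a = 3.0
empirical = sum(1 for x in samples if x >= a) / n
markov_bound = mean / a  # Pr(X >= a) <= E[X] / a

# The empirical tail probability should not exceed Markov's bound.
assert empirical <= markov_bound
```

For heavy thresholds the bound is loose (here the true tail is about $e^{-3}\approx 0.05$, while the bound is about $1/3$), which is why the sharper inequalities below are useful when more information about $X$ is available.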

## Chebyshev's inequality

Chebyshev's inequality requires the following information on a random variable ${\displaystyle X}$:

• The expected value ${\displaystyle \operatorname {E} [X]}$ is finite.
• The variance ${\displaystyle \operatorname {Var} [X]=\operatorname {E} [(X-\operatorname {E} [X])^{2}]}$ is finite.

Then, for every constant ${\displaystyle a>0}$,

${\displaystyle \Pr(|X-\operatorname {E} [X]|\geq a)\leq {\frac {\operatorname {Var} [X]}{a^{2}}},}$

or equivalently,

${\displaystyle \Pr(|X-\operatorname {E} [X]|\geq a\cdot \operatorname {Std} [X])\leq {\frac {1}{a^{2}}},}$

where ${\displaystyle \operatorname {Std} [X]}$ is the standard deviation of ${\displaystyle X}$.

Chebyshev's inequality can be seen as a special case of the generalized Markov's inequality applied to the random variable ${\displaystyle X-\operatorname {E} [X]}$ with ${\displaystyle \Phi (x)=x^{2}}$.
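The same kind of empirical check works for Chebyshev's inequality; again the uniform distribution, sample size, and deviation threshold below are arbitrary choices for the sketch.

```python
import random

random.seed(0)

# Uniform(0, 1): E[X] = 0.5, Var[X] = 1/12.
n = 100_000
samples = [random.random() for _ in range(n)]
mean = sum(samples) / n
var = sum((x - mean) ** 2 for x in samples) / n

a = 0.4
empirical = sum(1 for x in samples if abs(x - mean) >= a) / n
chebyshev_bound = var / a**2  # Pr(|X - E[X]| >= a) <= Var[X] / a^2

# The empirical deviation probability should not exceed Chebyshev's bound.
assert empirical <= chebyshev_bound
```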

## Chernoff bounds

The generic Chernoff bound[1]:63–65 requires only the moment generating function of ${\displaystyle X}$, defined as: ${\displaystyle M_{X}(t):=\operatorname {E} \!\left[e^{tX}\right]}$, provided it exists. Based on Markov's inequality, for every ${\displaystyle t>0}$:

${\displaystyle \Pr(X\geq a)\leq {\frac {\operatorname {E} [e^{t\cdot X}]}{e^{t\cdot a}}},}$

and for every ${\displaystyle t<0}$:

${\displaystyle \Pr(X\leq a)\leq {\frac {\operatorname {E} [e^{t\cdot X}]}{e^{t\cdot a}}}.}$

There are various Chernoff bounds for different distributions and different values of the parameter ${\displaystyle t}$. See [2]:5-7 for a compilation of more concentration inequalities.
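As an illustration, the generic bound can be evaluated numerically by minimizing over $t>0$. The sketch below uses a binomial sum, whose moment generating function factors by independence; the parameters, threshold, and grid of $t$ values are arbitrary choices.

```python
import math

# Generic Chernoff bound for S ~ Binomial(n, p): minimize over t > 0 on a grid.
n, p = 100, 0.5
a = 70  # threshold well above the mean n*p = 50

def mgf(t):
    # E[e^{tS}] for a Binomial(n, p) sum, using independence:
    # E[e^{tS}] = (E[e^{tX_1}])^n = (1 - p + p*e^t)^n
    return (1 - p + p * math.exp(t)) ** n

# Pr(S >= a) <= min_{t > 0} E[e^{tS}] / e^{ta}
bound = min(mgf(t) * math.exp(-t * a) for t in [i / 1000 for i in range(1, 3001)])
```

The minimizing $t$ balances the growth of the moment generating function against the decay of $e^{-ta}$; for this example the optimized bound is several orders of magnitude smaller than what Markov's or Chebyshev's inequality would give.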

## Bounds on sums of independent variables

Let ${\displaystyle X_{1},X_{2},\dots ,X_{n}}$ be independent random variables such that, for all i:

${\displaystyle a_{i}\leq X_{i}\leq b_{i}}$ almost surely.
${\displaystyle c_{i}:=b_{i}-a_{i}}$
${\displaystyle \forall i:c_{i}\leq C}$

Let ${\displaystyle S_{n}}$ be their sum, ${\displaystyle E_{n}}$ its expected value and ${\displaystyle V_{n}}$ its variance:

${\displaystyle S_{n}:=\sum _{i=1}^{n}X_{i}}$
${\displaystyle E_{n}:=\operatorname {E} [S_{n}]=\sum _{i=1}^{n}\operatorname {E} [X_{i}]}$
${\displaystyle V_{n}:=\operatorname {Var} [S_{n}]=\sum _{i=1}^{n}\operatorname {Var} [X_{i}]}$

It is often interesting to bound the difference between the sum and its expected value. Several inequalities can be used.

1. Hoeffding's inequality says that:

${\displaystyle \Pr \left[|S_{n}-E_{n}|>t\right]<2\exp \left(-{\frac {2t^{2}}{\sum _{i=1}^{n}c_{i}^{2}}}\right)<2\exp \left(-{\frac {2t^{2}}{nC^{2}}}\right)}$

2. The random variable ${\displaystyle S_{n}-E_{n}}$ is a special case of a martingale, and ${\displaystyle S_{0}-E_{0}=0}$. Hence, Azuma's inequality can also be used and it yields a similar bound:

${\displaystyle \Pr \left[|S_{n}-E_{n}|>t\right]<2\exp \left(-{\frac {t^{2}}{2\sum _{i=1}^{n}c_{i}^{2}}}\right)<2\exp \left(-{\frac {t^{2}}{2nC^{2}}}\right)}$

This is a generalization of Hoeffding's since it can handle other types of martingales, as well as supermartingales and submartingales.

3. The sum function, ${\displaystyle S_{n}=f(X_{1},\dots ,X_{n})}$, is a special case of a function of n variables. This function changes in a bounded way: if variable i is changed, the value of f changes by at most ${\displaystyle b_{i}-a_{i}=c_{i}}$. Hence, McDiarmid's inequality can also be used and it yields a similar bound:

${\displaystyle \Pr \left[|S_{n}-E_{n}|>t\right]<2\exp \left(-{\frac {2t^{2}}{\sum _{i=1}^{n}c_{i}^{2}}}\right)<2\exp \left(-{\frac {2t^{2}}{nC^{2}}}\right)}$

This is a different generalization of Hoeffding's since it can handle other functions besides the sum function, as long as they change in a bounded way.

4. Bennett's inequality offers some improvement over Hoeffding's when the variances of the summands are small compared to their almost-sure bounds C. It says that:

${\displaystyle \Pr \left[|S_{n}-E_{n}|>t\right]\leq 2\exp \left[-{\frac {V_{n}}{C^{2}}}h\left({\frac {Ct}{V_{n}}}\right)\right],}$ where ${\displaystyle h(u)=(1+u)\log(1+u)-u}$

5. The first of Bernstein's inequalities says that:

${\displaystyle \Pr \left[|S_{n}-E_{n}|>t\right]<2\exp \left(-{\frac {t^{2}/2}{V_{n}+C\cdot t/3}}\right)}$

This is a generalization of Hoeffding's since it can handle not only independent variables but also weakly-dependent variables.

6. Chernoff bounds have a particularly simple form in the case of sum of independent variables, since ${\displaystyle \operatorname {E} [e^{t\cdot S_{n}}]=\prod _{i=1}^{n}{\operatorname {E} [e^{t\cdot X_{i}}]}}$.

For example,[3] suppose the variables ${\displaystyle X_{i}}$ satisfy ${\displaystyle X_{i}\geq E(X_{i})-a_{i}-M}$, for ${\displaystyle 1\leq i\leq n}$. Then we have lower tail inequality:

${\displaystyle \Pr[S_{n}-E_{n}<-\lambda ]\leq \exp \left(-{\frac {\lambda ^{2}}{2(V_{n}+\sum _{i=1}^{n}a_{i}^{2}+M\lambda /3)}}\right)}$

If ${\displaystyle X_{i}}$ satisfies ${\displaystyle X_{i}\leq E(X_{i})+a_{i}+M}$, we have upper tail inequality:

${\displaystyle \Pr[S_{n}-E_{n}>\lambda ]\leq \exp \left(-{\frac {\lambda ^{2}}{2(V_{n}+\sum _{i=1}^{n}a_{i}^{2}+M\lambda /3)}}\right)}$

If ${\displaystyle X_{i}}$ are i.i.d., ${\displaystyle |X_{i}|\leq 1}$ and ${\displaystyle \sigma ^{2}}$ is the variance of ${\displaystyle X_{i}}$, a typical version of the Chernoff inequality is:

${\displaystyle \Pr[|S_{n}|\geq k\sigma ]\leq 2e^{-k^{2}/4n}{\text{ for }}0\leq k\leq 2\sigma .}$

7. Similar bounds can be found in the article on the Rademacher distribution, in the section on bounds on sums.
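As a rough numerical comparison of the bounds above, the sketch below evaluates Hoeffding's and Bernstein's inequalities side by side for variables bounded in $[0,1]$ (so $c_i = C = 1$). The number of variables, the assumed per-variable variance, and the deviation $t$ are arbitrary choices for illustration.

```python
import math

# n i.i.d. variables bounded in [0, 1], so c_i = C = 1.
n = 1000
var_i = 0.01          # assumed variance of each summand (small vs. C = 1)
V_n = n * var_i       # variance of the sum S_n
C = 1.0
t = 50.0              # deviation |S_n - E_n| > t

# Hoeffding:  2 * exp(-2 t^2 / (n C^2))
hoeffding = 2 * math.exp(-2 * t**2 / (n * C**2))

# Bernstein:  2 * exp(-(t^2 / 2) / (V_n + C t / 3))
bernstein = 2 * math.exp(-(t**2 / 2) / (V_n + C * t / 3))

# With small per-variable variances, Bernstein's bound is far tighter,
# matching the remark about Bennett's and Bernstein's inequalities above.
assert bernstein < hoeffding
```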

## Efron–Stein inequality

The Efron–Stein inequality (or influence inequality, or MG bound on variance) bounds the variance of a general function.

Suppose that ${\displaystyle X_{1}\dots X_{n}}$, ${\displaystyle X_{1}'\dots X_{n}'}$ are independent with ${\displaystyle X_{i}'}$ and ${\displaystyle X_{i}}$ having the same distribution for all ${\displaystyle i}$.

Let ${\displaystyle X=(X_{1},\dots ,X_{n}),X^{(i)}=(X_{1},\dots ,X_{i-1},X_{i}',X_{i+1},\dots ,X_{n}).}$ Then

${\displaystyle \mathrm {Var} (f(X))\leq {\frac {1}{2}}\sum _{i=1}^{n}E[(f(X)-f(X^{(i)}))^{2}].}$
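Both sides of the Efron–Stein inequality can be estimated by Monte Carlo: draw $X$, resample each coordinate independently to form $X^{(i)}$, and average the squared differences. The choice of $f$ (the maximum), the uniform distribution, and the sample sizes below are arbitrary illustrative choices.

```python
import random

random.seed(0)

n = 10            # number of coordinates
trials = 20_000   # Monte Carlo repetitions

def f(x):
    return max(x)  # an illustrative function of n independent variables

var_sum = 0.0     # accumulates sum_i (f(X) - f(X^(i)))^2 across trials
f_vals = []

for _ in range(trials):
    x = [random.random() for _ in range(n)]
    f_vals.append(f(x))
    for i in range(n):
        x_i = x.copy()
        x_i[i] = random.random()      # independent copy of coordinate i
        var_sum += (f(x) - f(x_i)) ** 2

# Right-hand side: (1/2) sum_i E[(f(X) - f(X^(i)))^2], estimated empirically.
efron_stein_bound = var_sum / (2 * trials)

# Left-hand side: Var(f(X)), estimated empirically.
mean_f = sum(f_vals) / trials
var_f = sum((v - mean_f) ** 2 for v in f_vals) / trials

assert var_f <= efron_stein_bound
```

For the maximum of independent uniforms the bound is not tight (roughly a factor of two here), which reflects the general fact that equality holds when $f$ is a sum.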

## Dvoretzky–Kiefer–Wolfowitz inequality

The Dvoretzky–Kiefer–Wolfowitz inequality bounds the difference between the real and the empirical cumulative distribution function.

Given a natural number ${\displaystyle n}$, let ${\displaystyle X_{1},X_{2},\dots ,X_{n}}$ be real-valued independent and identically distributed random variables with cumulative distribution function F(·). Let ${\displaystyle F_{n}}$ denote the associated empirical distribution function defined by

${\displaystyle F_{n}(x)={\frac {1}{n}}\sum _{i=1}^{n}\mathbf {1} _{\{X_{i}\leq x\}},\qquad x\in \mathbb {R} .}$

So ${\displaystyle F(x)}$ is the probability that a single random variable ${\displaystyle X}$ is smaller than or equal to ${\displaystyle x}$, and ${\displaystyle F_{n}(x)}$ is the fraction of the random variables that are smaller than or equal to ${\displaystyle x}$.

Then

${\displaystyle \Pr \left(\sup _{x\in \mathbb {R} }{\bigl (}F_{n}(x)-F(x){\bigr )}>\varepsilon \right)\leq e^{-2n\varepsilon ^{2}}{\text{ for every }}\varepsilon \geq {\sqrt {{\tfrac {1}{2n}}\ln 2}}.}$
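The supremum of $|F_n(x)-F(x)|$ is easy to compute for a sorted sample, since the empirical CDF only jumps at the data points. The sketch below checks a sample against the two-sided form of the bound, $\Pr(\sup_x |F_n(x)-F(x)|>\varepsilon)\leq 2e^{-2n\varepsilon^2}$; the exponential distribution, sample size, and $\varepsilon$ are arbitrary choices.

```python
import math
import random

random.seed(0)

n = 1000
# Samples from Exponential(1), whose true CDF is F(x) = 1 - e^{-x}.
xs = sorted(random.expovariate(1.0) for _ in range(n))

# sup_x |F_n(x) - F(x)|: at the i-th order statistic, F_n jumps from i/n
# to (i+1)/n, so check the deviation on both sides of each jump.
ks = max(
    max((i + 1) / n - (1 - math.exp(-x)), (1 - math.exp(-x)) - i / n)
    for i, x in enumerate(xs)
)

eps = 0.08
dkw_bound = 2 * math.exp(-2 * n * eps**2)  # two-sided DKW bound

# The bound says {ks > eps} has probability at most dkw_bound (tiny here),
# so for this sample the deviation should stay below eps.
assert ks <= eps
```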

## References

1. ^ Mitzenmacher, Michael; Upfal, Eli (2005). Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press. ISBN 0-521-83540-2.
2. ^ Slagle, N.P. (2012). "One Hundred Statistics and Probability Inequalities" (PDF).
3. ^ Chung, Fan; Lu, Linyuan (2010). "Old and new concentration inequalities" (PDF). Complex Graphs and Networks. American Mathematical Society. Retrieved August 14, 2018.