Hoeffding's inequality

In probability theory, Hoeffding's inequality provides an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount. Hoeffding's inequality was proved by Wassily Hoeffding in 1963.[1]

Hoeffding's inequality is a special case of both the Azuma–Hoeffding inequality and McDiarmid's inequality. It is similar to, but in general incomparable with, the Bernstein inequality, proved by Sergei Bernstein in 1923.

Special case of Bernoulli random variables

Hoeffding's inequality can be applied to the important special case of identically distributed Bernoulli random variables, and this is how the inequality is often used in combinatorics and computer science. We consider a coin that shows heads with probability p and tails with probability 1 − p. We toss the coin n times. The expected number of times the coin comes up heads is pn. Furthermore, the probability that the coin comes up heads at most k times can be exactly quantified by the following expression:

\Pr( H(n) \leq k)= \sum_{i=0}^{k} \binom{n}{i} p^i (1-p)^{n-i},

where H(n) is the number of heads in n coin tosses.

When k = (p − ε)n for some ε > 0, Hoeffding's inequality bounds this probability by a term that is exponentially small in ε²n:

\Pr( H(n) \leq (p-\varepsilon) n)\leq\exp\left(-2\varepsilon^2 n\right).

Similarly, when k = (p + ε)n for some ε > 0, Hoeffding's inequality bounds the probability that we see at least εn more heads than we expect:

\Pr(H(n) \geq (p+\varepsilon)n)\leq\exp\left(-2\varepsilon^2 n\right).

Hence Hoeffding's inequality implies that the number of heads we see is concentrated around its mean, with an exponentially small tail:

\Pr((p-\varepsilon)n \leq H(n) \leq (p+\varepsilon)n)\geq 1-2\exp\left(-2\varepsilon^2 n\right).
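
As an illustration (this numerical comparison is an editor's sketch, not part of the original article), the exact binomial tail from above can be checked against the bound in a few lines of Python; the helper name binom_tail and the parameters n = 100, p = 0.5, ε = 0.1 are arbitrary choices:

    import math

    def binom_tail(n, k, p):
        """Exact Pr(H(n) <= k) for the number of heads in n tosses of a p-coin."""
        return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

    n, p, eps = 100, 0.5, 0.1
    k = int((p - eps) * n)                                 # k = (p - eps)n = 40
    print("exact tail:     ", binom_tail(n, k, p))         # roughly 0.028
    print("Hoeffding bound:", math.exp(-2 * eps**2 * n))   # exp(-2), roughly 0.135

The bound is loose here, as expected: Hoeffding's inequality uses only the range of each variable, not its distribution.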

General case

Let X1, ..., Xn be independent random variables. Assume that the Xi are almost surely bounded; that is:

\Pr(X_i \in [a_i, b_i]) = 1,  \qquad 1 \leq i \leq n.

We define the empirical mean of these variables

\overline X = \frac{1}{n}(X_1 + \cdots + X_n).

Theorem 2 of Hoeffding (1963) proves the inequalities

\begin{align}
\Pr \left(\overline X - \mathrm{E}\left [\overline X \right] \geq t \right) &\leq \exp \left(-\frac{2n^2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right) \\
\Pr \left(\left |\overline X - \mathrm{E}\left [\overline X \right] \right | \geq t \right) &\leq 2\exp \left(-\frac{2n^2t^2}{\sum_{i=1}^n(b_i - a_i)^2} \right)
\end{align}

which are valid for positive values of t. Here E[X̄] is the expected value of X̄. The inequalities can also be stated in terms of the sum

S_{n} = X_1 + \cdots + X_n

of the random variables:

\Pr(S_{n} - \mathrm{E}[S_{n}] \geq t) \leq \exp \left( - \frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right),
\Pr(|S_{n} - \mathrm{E}[S_{n}]| \geq t) \leq 2\exp \left( - \frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right).

Note that the inequalities also hold when the Xi have been obtained by sampling without replacement; in this case the random variables are no longer independent. A proof of this statement can be found in Hoeffding's paper. For slightly better bounds in the case of sampling without replacement, see for instance the paper by Serfling (1974).
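
The two-sided inequality for the sum translates directly into code. The following Python sketch is illustrative only; the function name hoeffding_bound and its interface are assumptions of this example, not something defined by Hoeffding:

    import math

    def hoeffding_bound(t, ranges):
        """Upper bound on Pr(|S_n - E[S_n]| >= t) for independent X_i in [a_i, b_i].

        `ranges` is a sequence of (a_i, b_i) pairs and `t` must be positive.
        """
        denom = sum((b - a) ** 2 for a, b in ranges)
        return min(1.0, 2 * math.exp(-2 * t ** 2 / denom))

    # Example: 50 independent variables, each bounded in [0, 1], and deviation t = 10.
    print(hoeffding_bound(10, [(0.0, 1.0)] * 50))   # 2 * exp(-4), roughly 0.037

Dropping the factor of 2 and the absolute value gives the corresponding one-sided bound.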

Proof

In this section, we give a proof of Hoeffding's inequality.[2] For the proof, we require the following:

Hoeffding's Lemma. Suppose X is a real random variable with mean zero such that Pr(X ∈ [a, b]) = 1. Then
\mathrm{E} \left [e^{sX} \right ]\leq \exp\left(\tfrac{1}{8} s^2 (b-a)^2\right).

Proof of Hoeffding's Lemma

First note that if one of a or b is zero, then Pr(X = 0) = 1 and the inequality follows trivially. If both are nonzero, then a must be negative and b must be positive, since E[X] = 0 and X lies in [a, b] almost surely.

Next, recall that e^{sx} is a convex function of x, so on [a, b] it lies below the chord joining (a, e^{sa}) and (b, e^{sb}):

\forall x \in [a, b]: \qquad e^{sx}\leq \frac{b-x}{b-a}e^{sa}+\frac{x-a}{b-a}e^{sb}.

Applying E[ ⋅ ] to both sides of the above inequality gives us:

\begin{align}
\mathrm{E} \left [e^{sX} \right ] &\leq \frac{b-\mathrm{E}[X]}{b-a} e^{sa} + \frac{\mathrm{E}[X]-a}{b-a}e^{sb} \\
&= \frac{b}{b-a} e^{sa} + \frac{-a}{b-a}e^{sb} && \mathrm{E}[X] = 0\\
&= \left (-\frac{a}{b-a} \right ) e^{sa} \left (-\frac{b}{a}+e^{sb-sa} \right ) \\
&= \left (-\frac{a}{b-a} \right ) e^{sa} \left (-\frac{b-a+a}{a}+e^{s(b-a)} \right ) \\
&= \left (-\frac{a}{b-a} \right ) e^{sa} \left (-\frac{b-a}{a}-1+e^{s(b-a)} \right ) \\
&= \left (1-\theta+\theta e^{s(b-a)} \right ) e^{-s\theta(b-a)} && \theta=-\frac{a}{b-a}>0
\end{align}

Let u = s(b − a) and define:

\begin{cases} \varphi:\mathbf{R}\to\mathbf{R} \\ \varphi(u)=-\theta u+\log \left(1-\theta+\theta e^u \right)\end{cases}

φ is well defined on R; to see this, we check that the argument of the logarithm is positive:

\begin{align}
1-\theta+\theta e^u &= \theta \left (\frac{1}{\theta} - 1 + e^u \right) \\
& = \theta \left ( -\frac{b}{a} + e^u \right ) \\
& > 0 && \theta > 0, \quad \frac{b}{a} <0
\end{align}

The definition of φ implies

\mathrm{E} \left [e^{sX} \right ] \leq e^{\varphi(u)}.

By Taylor's theorem, for every real u there exists a v between 0 and u such that

\varphi(u)=\varphi(0)+u\varphi'(0)+\tfrac{1}{2} u^2\varphi''(v).

Note that:

\begin{align}
\varphi(0)  &= 0 \\
\varphi'(0) &= -\theta+ \left.\frac{\theta e^u}{1-\theta +\theta e^u}\right|_{u=0} \\
&=0 \\[6pt]
\varphi''(v) &= \frac{\theta e^v \left (1-\theta+\theta e^v \right)-\theta^{2}e^{2v}}{\left (1-\theta+\theta e^v \right)^2 }\\[6pt]
&=\frac{\theta e^v}{1-\theta+\theta e^v}\left(1-\frac{\theta e^v}{1-\theta+\theta e^v}\right)\\[6pt]
&= t(1-t) && t=\frac{\theta e^v}{1-\theta+\theta e^v} \\
&\leq \tfrac{1}{4} && t(1-t)\leq\tfrac{1}{4} \text{ for all real } t
\end{align}

Therefore,

\varphi (u)\leq 0 + u \cdot 0 + \tfrac{1}{2}u^2 \cdot \tfrac{1}{4} = \tfrac{1}{8} u^2 = \tfrac{1}{8}s^2(b-a)^2.

This implies

\mathrm{E} \left [e^{sX} \right ] \leq \exp\left(\tfrac{1}{8}s^2(b-a)^2\right).
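
Hoeffding's Lemma can also be checked numerically. The snippet below is an editor's sketch (the two-point distribution and the endpoints a = −1, b = 2 are arbitrary choices): it constructs a mean-zero variable supported on {a, b} and compares E[e^{sX}] with exp(s²(b − a)²/8) for a few values of s.

    import math

    a, b = -1.0, 2.0                       # support endpoints with a < 0 < b
    p_a, p_b = b / (b - a), -a / (b - a)   # probabilities chosen so that E[X] = 0

    for s in (0.1, 0.5, 1.0, 2.0):
        mgf = p_a * math.exp(s * a) + p_b * math.exp(s * b)   # E[exp(sX)]
        bound = math.exp(s ** 2 * (b - a) ** 2 / 8)           # bound from Hoeffding's Lemma
        assert mgf <= bound
        print(f"s = {s}: E[e^(sX)] = {mgf:.4f} <= {bound:.4f}")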

Proof of Inequality

Using this lemma, we can prove Hoeffding's inequality. Suppose X1, ..., Xn are n independent random variables such that

\Pr\left (X_{i}\in [a_{i},b_{i}] \right)=1, \qquad 1\leq i\leq n.

Let

S_n = X_1 + \cdots + X_n.

Then for s, t > 0, Markov's inequality and the independence of the Xi imply:

\begin{align}
\Pr\left(S_{n}-\mathrm{E}\left [S_n \right ]\geq t \right) &= \Pr \left (e^{s(S_n-\mathrm{E}\left [S_n \right ])} \geq e^{st} \right)\\
&\leq e^{-st} \mathrm{E} \left [e^{s(S_{n}-\mathrm{E}\left [S_n \right ])} \right ]\\
&= e^{-st} \prod_{i=1}^{n}\mathrm{E} \left [e^{s(X_i-\mathrm{E}\left [X_{i}\right])} \right ]\\
&\leq e^{-st} \prod_{i=1}^{n} e^{\frac{s^2 (b_i-a_i)^2}{8} } \\
&= \exp\left(-st+\tfrac{1}{8} s^2 \sum_{i=1}^{n}(b_{i}-a_{i})^{2}\right)
\end{align}

To get the best possible upper bound, we find the minimum of the right hand side of the last inequality as a function of s. Define

\begin{cases} g: \mathbf{R_{+}} \to \mathbf{R} \\ g(s)=-st+\frac{s^{2}}{8}\sum_{i=1}^{n}(b_{i}-a_{i})^2 \end{cases}

Note that g is a quadratic function of s with a positive leading coefficient, so it attains its minimum at

s=\frac{4t}{\sum_{i=1}^{n}(b_{i}-a_{i})^{2}}.
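
Substituting this value of s back into g gives

g\left(\frac{4t}{\sum_{i=1}^{n}(b_{i}-a_{i})^{2}}\right)=-\frac{4t^{2}}{\sum_{i=1}^{n}(b_{i}-a_{i})^{2}}+\frac{2t^{2}}{\sum_{i=1}^{n}(b_{i}-a_{i})^{2}}=-\frac{2t^{2}}{\sum_{i=1}^{n}(b_{i}-a_{i})^{2}}.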

Thus we get

\Pr \left(S_{n}-\mathrm{E}\left [S_{n} \right ]\geq t \right)\leq \exp\left(-\frac{2t^{2}}{\sum_{i=1}^{n}(b_{i}-a_{i})^{2}}\right).
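
Since the empirical mean is Sn/n, applying this bound with t replaced by nt recovers the first inequality stated for the empirical mean above:

\Pr \left(\overline X - \mathrm{E}\left [\overline X \right ] \geq t \right) = \Pr \left(S_{n}-\mathrm{E}\left [S_{n} \right ]\geq nt \right)\leq \exp\left(-\frac{2n^{2}t^{2}}{\sum_{i=1}^{n}(b_{i}-a_{i})^{2}}\right).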

References

  • Serfling, Robert J. (1974). "Probability Inequalities for the Sum in Sampling without Replacement". The Annals of Statistics 2 (1): 39–48. doi:10.1214/aos/1176342611. 
  • Hoeffding, Wassily (March 1963). "Probability inequalities for sums of bounded random variables". Journal of the American Statistical Association 58 (301): 13–30. JSTOR 2282952. 
  • Nowak, Robert (2009). "Lecture 7: Chernoff's Bound and Hoeffding's Inequality". ECE 901 (Summer '09): Statistical Learning Theory Lecture Notes. University of Wisconsin-Madison. Retrieved May 16, 2014.