Empirical distribution function

In statistics, the empirical distribution function, or empirical cdf, is the cumulative distribution function associated with the empirical measure of the sample. This cdf is a step function that jumps up by 1/n at each of the n data points. The empirical distribution function estimates the true underlying cdf of the points in the sample. A number of results quantify the rate at which the empirical cdf converges to the underlying cdf.

Definition

Let (x1, …, xn) be independent and identically distributed (iid) real random variables with common cdf F(t). Then the empirical distribution function is defined as [1]


    \hat F_n(t) = \frac{\mbox{number of elements in the sample} \leq t}{n} = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{x_i \le t\},

where 1{A} is the indicator of event A. For a fixed t, the indicator 1{xi ≤ t} is a Bernoulli random variable with parameter p = F(t), hence \scriptstyle n \hat F_n(t) is a binomial random variable with mean nF(t) and variance nF(t)(1 − F(t)). This implies that \scriptstyle \hat F_n(t) is an unbiased estimator for F(t).
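
The definition translates directly into code. The following is a minimal sketch in Python using only the standard library; the function name ecdf and the example sample are illustrative choices, not part of any particular library.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical cdf t -> (number of sample values <= t) / n."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must be non-empty")

    def F_hat(t):
        # In a sorted list, bisect_right returns how many elements are <= t.
        return bisect_right(xs, t) / n

    return F_hat


# Illustrative sample with a repeated value: the jump at 2.0 is 2/n.
F_hat = ecdf([3.0, 1.0, 2.0, 2.0])
print(F_hat(0.5), F_hat(1.0), F_hat(2.0), F_hat(3.0))
```

Sorting once and using binary search makes each evaluation cost O(log n) instead of the O(n) of a direct sum of indicators.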

Asymptotic properties

By the strong law of large numbers, the estimator \scriptstyle\hat{F}_n(t) converges to F(t) as n → ∞ almost surely, for every value of t: [2]


    \hat F_n(t)\ \xrightarrow{a.s.}\ F(t),

thus the estimator \scriptstyle\hat{F}_n(t) is consistent. This expression asserts the pointwise convergence of the empirical distribution function to the true cdf. There is a stronger result, called the Glivenko–Cantelli theorem, which states that the convergence in fact happens uniformly over t: [3]


    \|\hat F_n-F\|_\infty \equiv 
    \sup_{t\in\mathbb{R}} \big|\hat F_n(t)-F(t)\big|\ \xrightarrow{a.s.}\ 0.

The sup-norm in this expression is called the Kolmogorov–Smirnov statistic for testing the goodness-of-fit between the empirical distribution \scriptstyle\hat{F}_n(t) and the assumed true cdf F. Other norms may reasonably be used here instead of the sup-norm. For example, the L²-norm gives rise to the Cramér–von Mises statistic.
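
As an illustration, the sup-norm distance can be computed exactly from the order statistics. The sketch below assumes that F is continuous, in which case the supremum is attained at, or just to the left of, one of the data points; the names ks_statistic and uniform_cdf are hypothetical helpers introduced for this example.

```python
def ks_statistic(sample, cdf):
    """Sup-norm distance between the empirical cdf of `sample` and `cdf`.

    Assumes `cdf` is continuous, so only the data points need checking.
    """
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical cdf equals i/n at x and (i-1)/n just to the left of x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def uniform_cdf(t):
    # cdf of the Uniform(0, 1) distribution, used here as the assumed true F
    return min(1.0, max(0.0, t))


d = ks_statistic([0.1, 0.4, 0.7], uniform_cdf)
print(d)  # approximately 0.3, attained at the largest data point
```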

The asymptotic distribution can be further characterized in several different ways. First, the central limit theorem states that pointwise, \scriptstyle\hat{F}_n(t) has an asymptotically normal distribution with the standard √n rate of convergence: [2]


    \sqrt{n}\big(\hat F_n(t) - F(t)\big)\ \ \xrightarrow{d}\ \ \mathcal{N}\Big( 0, F(t)\big(1-F(t)\big) \Big).
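
The limiting variance F(t)(1 − F(t)) can be checked by simulation. The following sketch assumes a Uniform(0, 1) sample, so that F(t) = t; the sample size, the point t, the number of replications and the seed are arbitrary illustrative choices.

```python
import math
import random


def scaled_error(n, t, rng):
    """One draw of sqrt(n) * (F_hat_n(t) - F(t)) for a Uniform(0, 1) sample."""
    count = sum(1 for _ in range(n) if rng.random() <= t)
    return math.sqrt(n) * (count / n - t)


rng = random.Random(0)  # fixed seed so the run is reproducible
n, t, reps = 200, 0.3, 2000
draws = [scaled_error(n, t, rng) for _ in range(reps)]

mean = sum(draws) / reps
var = sum((d - mean) ** 2 for d in draws) / (reps - 1)

# The sample mean should be near 0 and the sample variance near t * (1 - t).
print(mean, var, t * (1 - t))
```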

The pointwise central limit theorem is extended by Donsker’s theorem, which asserts that the empirical process \scriptstyle\sqrt{n}(\hat{F}_n - F), viewed as a function indexed by t ∈ R, converges in distribution in the Skorokhod space D[−∞, +∞] to the mean-zero Gaussian process G_F = B∘F, where B is the standard Brownian bridge.[3] The covariance structure of this Gaussian process is


    \mathrm{E}[\,G_F(t_1)G_F(t_2)\,] = F(t_1\wedge t_2) - F(t_1)F(t_2).

The uniform rate of convergence in Donsker’s theorem can be quantified by the result known as the Hungarian embedding: [4]


    \limsup_{n\to\infty} \frac{\sqrt{n}}{\ln^2 n} \big\| \sqrt{n}(\hat F_n-F) - G_{F,n}\big\|_\infty < \infty, \quad \text{a.s.}


Alternatively, the rate of convergence of \scriptstyle\sqrt{n}(\hat{F}_n-F) can also be quantified in terms of the asymptotic behavior of the sup-norm of this expression. A number of results exist in this vein; for example, the Dvoretzky–Kiefer–Wolfowitz inequality provides a bound on the tail probabilities of \scriptstyle\sqrt{n}\|\hat{F}_n-F\|_\infty: [4]


    \Pr\!\Big( \sqrt{n}\|\hat{F}_n-F\|_\infty > z \Big) \leq 2e^{-2z^2}.
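
One common use of this inequality is a simultaneous confidence band for F. Setting the right-hand side equal to a level α and solving for z gives z = √(ln(2/α)/2), so the empirical cdf lies within ε = z/√n of F uniformly in t with probability at least 1 − α. The sketch below illustrates this; the names dkw_epsilon and dkw_band, the sample and the level α = 0.05 are assumptions made for the example.

```python
import math
from bisect import bisect_right


def dkw_epsilon(n, alpha):
    """Half-width eps solving 2 * exp(-2 * n * eps**2) = alpha."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))


def dkw_band(sample, t, alpha=0.05):
    """Lower and upper limits of the DKW band for F(t), clipped to [0, 1]."""
    xs = sorted(sample)
    n = len(xs)
    f_hat = bisect_right(xs, t) / n  # empirical cdf evaluated at t
    eps = dkw_epsilon(n, alpha)
    return max(f_hat - eps, 0.0), min(f_hat + eps, 1.0)


eps = dkw_epsilon(100, 0.05)
print(eps)  # about 0.136 for n = 100 and alpha = 0.05
print(dkw_band([0.2, 0.5, 0.9, 1.4], t=0.6))
```

The half-width does not depend on t or on F, which is what makes the band simultaneous rather than pointwise.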

In fact, Kolmogorov has shown that if the cdf F is continuous, then the expression \scriptstyle\sqrt{n}\|\hat{F}_n-F\|_\infty converges in distribution to ||B||, which has the Kolmogorov distribution that does not depend on the form of F.

Another result, which follows from the law of the iterated logarithm, is that [4]


    \limsup_{n\to\infty} \frac{\sqrt{n}\|\hat{F}_n-F\|_\infty}{\sqrt{2\ln\ln n}} \leq \frac12, \quad \text{a.s.}

and


    \liminf_{n\to\infty} \sqrt{2n\ln\ln n} \|\hat{F}_n-F\|_\infty = \frac{\pi}{2}, \quad \text{a.s.}

References

  • Shorack, G.R.; Wellner, J.A. (1986). Empirical processes with applications to statistics. New York: Wiley. 
  • van der Vaart, A.W. (1998). Asymptotic statistics. Cambridge University Press. ISBN 978-0-521-78450-4. 

Notes

  1. ^ van der Vaart (1998, page 265), PlanetMath
  2. ^ a b van der Vaart (1998, page 265)
  3. ^ a b van der Vaart (1998, page 266)
  4. ^ a b c van der Vaart (1998, page 268)