In information theory, Sanov's theorem gives a bound on the probability of observing an atypical sequence of samples from a given probability distribution.
Let A be a set of probability distributions over an alphabet X, and let q be an arbitrary distribution over X (where q may or may not be in A). Suppose we draw n i.i.d. samples from q, represented by the vector $x^n = x_1, x_2, \ldots, x_n$. Then
$$q^n(\{x^n : {\hat{p}}_{x^n} \in A\}) \leq (n+1)^{|X|}\, 2^{-n D_{\mathrm{KL}}(p^* \| q)},$$
where $q^n(x^n)$ is shorthand for $q(x_1) q(x_2) \cdots q(x_n)$ and $q^n(S)$ is shorthand for $\sum_{x^n \in S} q^n(x^n)$, ${\hat{p}}_{x^n}$ is the empirical distribution of the sample $x^n$, and $p^*$ is the information projection of q onto A.
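The bound can be checked numerically on a small example (an illustrative sketch, not part of the source): take $q$ to be a fair coin, so $|X| = 2$, and let $A = \{p : p(\text{heads}) \ge 0.7\}$. Since the KL divergence from Bernoulli(t) to the fair coin is increasing in t for t > 1/2, the information projection here is $p^* = \mathrm{Bernoulli}(0.7)$, and the event $\{\hat{p}_{x^n} \in A\}$ is just "at least $\lceil 0.7n \rceil$ heads". All parameter choices below are hypothetical.

```python
from math import ceil, comb, log2

def kl_bernoulli(p: float, q: float) -> float:
    """D_KL(Bernoulli(p) || Bernoulli(q)) in bits."""
    out = 0.0
    for pi, qi in ((p, q), (1.0 - p, 1.0 - q)):
        if pi > 0.0:
            out += pi * log2(pi / qi)
    return out

q = 0.5          # sampling distribution: fair coin
threshold = 0.7  # A = {p : p(heads) >= 0.7}; information projection p* = Bernoulli(0.7)
n = 500

# Exact probability that the empirical distribution lands in A:
# at least ceil(0.7 * n) heads among n flips of the fair coin.
k_min = ceil(threshold * n)
exact = sum(comb(n, k) * q**k * (1 - q) ** (n - k) for k in range(k_min, n + 1))

# Sanov bound: (n + 1)^|X| * 2^(-n * D_KL(p* || q)), with |X| = 2 here.
bound = (n + 1) ** 2 * 2.0 ** (-n * kl_bernoulli(threshold, q))

print(f"exact = {exact:.3e}, bound = {bound:.3e}")
```

For n = 500 the exact tail probability is on the order of $10^{-19}$ while the bound is on the order of $10^{-13}$: the bound holds, with the polynomial factor $(n+1)^{|X|}$ accounting for the slack.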
Furthermore, if A is a closed set,
$$\lim_{n \to \infty} \frac{1}{n} \log q^n(\{x^n : {\hat{p}}_{x^n} \in A\}) = -D_{\mathrm{KL}}(p^* \| q).$$
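The convergence of the exponent can also be observed numerically in a fair-coin example (an illustrative sketch with hypothetical parameters, not from the source): with $A = \{p : p(\text{heads}) \ge 0.7\}$ and $p^* = \mathrm{Bernoulli}(0.7)$, the normalized log-probability should approach $-D_{\mathrm{KL}}(p^* \| q) \approx -0.1187$ bits per sample. Because q is the fair coin, every length-n sequence has probability $2^{-n}$, so the probability can be computed exactly by counting sequences with integer arithmetic (avoiding floating-point underflow for large n).

```python
from math import ceil, comb, log2

def kl_bernoulli(p: float, q: float) -> float:
    """D_KL(Bernoulli(p) || Bernoulli(q)) in bits."""
    out = 0.0
    for pi, qi in ((p, q), (1.0 - p, 1.0 - q)):
        if pi > 0.0:
            out += pi * log2(pi / qi)
    return out

threshold = 0.7                        # A = {p : p(heads) >= 0.7}
limit = -kl_bernoulli(threshold, 0.5)  # limiting exponent, about -0.1187 bits/sample

rates = []
for n in (100, 400, 1600):
    # Under the fair coin, q^n({x^n : p_hat in A}) equals
    # (# length-n sequences with at least ceil(0.7 n) heads) / 2^n.
    count = sum(comb(n, k) for k in range(ceil(threshold * n), n + 1))
    rates.append((log2(count) - n) / n)  # (1/n) log2 of that probability
    print(f"n = {n:4d}: rate = {rates[-1]:+.5f}   (limit {limit:+.5f})")
```

The printed rates climb toward the limit from below; the remaining gap shrinks roughly like $(\log n)/n$, consistent with the polynomial prefactor in the finite-n bound.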