# Uniform convergence (combinatorics)

For a class of predicates $H\,\!$ defined on a set $X\,\!$ and a set of samples $x=(x_{1},x_{2},\dots,x_{m})\,\!$, where $x_{i}\in X\,\!$, the empirical frequency of $h\in H\,\!$ on $x\,\!$ is $\widehat{Q_{x}}(h)=\frac{1}{m}|\{i:1\leq i\leq m,h(x_{i})=1\}|\,\!$. The Uniform Convergence Theorem states, roughly, that if $H\,\!$ is "simple" and we draw samples independently (with replacement) from $X\,\!$ according to a distribution $P\,\!$, then with high probability every empirical frequency will be close to its expectation, where the expectation is given by $Q_{P}(h)=P\{y\in X:h(y)=1\}\,\!$. Here "simple" means that the Vapnik–Chervonenkis dimension of the class $H\,\!$ is small relative to the size of the sample.
In other words, a sufficiently simple collection of functions behaves roughly the same on a small random sample as it does on the distribution as a whole.
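As a concrete sketch of these quantities (under assumptions not fixed by the text: $X=[0,1]\,\!$ with $P\,\!$ uniform, and $H\,\!$ the class of threshold predicates $h_{t}(x)=1\,\!$ iff $x\geq t\,\!$, for which $Q_{P}(h_{t})=1-t\,\!$), one can compare empirical frequencies against their expectations:

```python
import random

# A sketch of the quantities defined above, under assumed choices not fixed
# by the text: X = [0, 1], P uniform, and H the class of threshold
# predicates h_t(x) = 1 iff x >= t, for which Q_P(h_t) = 1 - t.

def empirical_frequency(h, sample):
    """Q_hat_x(h): the fraction of sample points classified 1 by h."""
    return sum(h(xi) for xi in sample) / len(sample)

random.seed(0)
m = 10_000
sample = [random.random() for _ in range(m)]

# The empirical frequency tracks the expectation uniformly over a grid of
# thresholds: the largest deviation shrinks as m grows.
max_dev = max(
    abs(empirical_frequency(lambda x, t=t: x >= t, sample) - (1 - t))
    for t in [i / 100 for i in range(101)]
)
print(max_dev)
```

With $m=10{,}000\,\!$ the largest deviation over the grid is on the order of $1/\sqrt{m}\,\!$, matching the qualitative statement above.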

## Uniform convergence theorem statement[1]

If $H\,\!$ is a set of $\{0,1\}\,\!$-valued functions defined on a set $X\,\!$ and $P\,\!$ is a probability distribution on $X\,\!$ then for $\epsilon>0\,\!$ and $m\,\!$ a positive integer, we have,

$P^{m}\{|Q_{P}(h)-\widehat{Q_{x}}(h)|\geq\epsilon\,\!$ for some $h\in H\}\leq 4\Pi_{H}(2m)e^{-\frac{\epsilon^{2}m}{8}}.\,\!$

where, for any $x\in X^{m}\,\!$,

$Q_{P}(h)=P\{y\in X:h(y)=1\}\,\!$,
$\widehat{Q_{x}}(h)=\frac{1}{m}|\{i:1\leq i\leq m,h(x_{i})=1\}|\,\!$ and $|x|=m\,\!$. $P^{m}\,\!$ indicates that the probability is taken over $x\,\!$ consisting of $m\,\!$ i.i.d. draws from the distribution $P\,\!$.

$\Pi_{H}\,\!$ is defined as follows: for any set $H\,\!$ of $\{0,1\}\,\!$-valued functions over $X\,\!$ (identifying each $h\,\!$ with the subset $\{y\in X:h(y)=1\}\,\!$) and any $D\subseteq X\,\!$,

$\Pi_{H}(D)=\{h\cap D:h\in H\}\,\!$.

And for any natural number $m\,\!$, the shattering number $\Pi_{H}(m)\,\!$ is defined as:

$\Pi_{H}(m)=\max_{D\subseteq X,|D|=m}|\{h\cap D:h\in H\}|\,\!$.
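To make the definition concrete, here is a small illustrative computation (thresholds are an assumed example, not from the text): for threshold predicates on the real line, the restriction of $H\,\!$ to any $m\,\!$ distinct points consists of exactly the suffixes of the sorted points, so $\Pi_{H}(m)=m+1\,\!$.

```python
# An illustrative computation of Pi_H(m) for an assumed example class:
# H = {h_t : h_t(x) = 1 iff x >= t} on the real line.  Restricting H to m
# distinct points yields exactly the suffixes of the sorted points.

def restrictions(points):
    """All dichotomies {x in D : h(x) = 1} realised by threshold predicates."""
    pts = sorted(points)
    return {frozenset(pts[i:]) for i in range(len(pts) + 1)}

def shattering_number(m):
    # For thresholds the maximum over all D with |D| = m is attained by any
    # m distinct points, e.g. 0, 1, ..., m - 1.
    return len(restrictions(range(m)))

print([shattering_number(m) for m in (1, 2, 3, 10)])  # → [2, 3, 4, 11]
```

Since $\Pi_{H}(m)=m+1\,\!$ grows only linearly, this class is "simple" in the sense of the theorem; its VC dimension is 1.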

From the point of view of learning theory one can consider $H\,\!$ to be the concept/hypothesis class defined over the instance set $X\,\!$. Before getting into the details of the proof of the theorem we state Sauer's lemma, which we will need in the proof.

## Sauer–Shelah lemma

The Sauer–Shelah lemma[2] relates the shattering number $\Pi_{H}(m)\,\!$ to the VC dimension.

Lemma: $\Pi_{H}(m)\leq\left( \frac{em}{d}\right)^{d}\,\!$, where $d\,\!$ is the VC Dimension of the concept class $H\,\!$.

Corollary: $\Pi_{H}(m)\leq m^{d}\,\!$.
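A quick numeric sanity check of the lemma, on an assumed example class not discussed in the text: intervals $[a,b]\,\!$ on the real line have VC dimension $d=2\,\!$, and on $m\,\!$ points they realise every contiguous run of points plus the empty set, i.e. $\Pi_{H}(m)=m(m+1)/2+1\,\!$.

```python
import math

# Numeric sanity check of the Sauer-Shelah bound on an assumed example
# class: intervals [a, b] on the real line, with VC dimension d = 2 and
# Pi_H(m) = m(m+1)/2 + 1 (every contiguous run of points, plus the empty set).

def pi_intervals(m):
    return m * (m + 1) // 2 + 1

for m in (2, 5, 20, 100):
    sauer_bound = (math.e * m / 2) ** 2  # (em/d)^d with d = 2
    assert pi_intervals(m) <= sauer_bound
print("Sauer-Shelah bound holds on the tested values")
```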

## Proof of uniform convergence theorem[1]

Before we get into the details of the proof of the Uniform Convergence Theorem we present a high-level overview of the proof.

1. Symmetrization: We transform the problem of analyzing $|Q_{P}(h)-\widehat{Q}_{x}(h)|\geq\epsilon\,\!$ into the problem of analyzing $|\widehat{Q}_{r}(h)-\widehat{Q}_{s}(h)|\geq\epsilon/2\,\!$, where $r\,\!$ and $s\,\!$ are i.i.d. samples of size $m\,\!$ drawn according to the distribution $P\,\!$. One can view $r\,\!$ as the original randomly drawn sample of length $m\,\!$, while $s\,\!$ may be thought of as the testing sample which is used to estimate $Q_{P}(h)\,\!$.
2. Permutation: Since $r\,\!$ and $s\,\!$ are picked identically and independently, swapping elements between them does not change the probability distribution on $r\,\!$ and $s\,\!$. So, we will try to bound the probability of $|\widehat{Q}_{r}(h)-\widehat{Q}_{s}(h)|\geq\epsilon/2\,\!$ for some $h \in H\,\!$ by considering the effect of a specific collection of permutations of the joint sample $x=r||s\,\!$. Specifically, we consider permutations $\sigma\,\!$ which swap $x_i\,\!$ and $x_{m+i}\,\!$ for $i\,\!$ in some subset of $\{1,2,\dots,m\}\,\!$. The symbol $r||s\,\!$ means the concatenation of $r\,\!$ and $s\,\!$.
3. Reduction to a finite class: We can now restrict the function class $H\,\!$ to a fixed joint sample; hence, if $H\,\!$ has finite VC dimension, this reduces the problem to one involving a finite function class.
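The first two steps can be previewed with a Monte-Carlo sketch, under assumptions not fixed by the text (a threshold class $h_{t}(x)=1\,\!$ iff $x\geq t\,\!$ under the uniform distribution on $[0,1]\,\!$): estimating by simulation how often the true-deviation event and the symmetrized two-sample event occur.

```python
import random

# Monte-Carlo preview of the symmetrization step, under assumed choices:
# H = thresholds h_t(x) = 1 iff x >= t, P uniform on [0, 1], so
# Q_P(h_t) = 1 - t.  We estimate P^m(V) and P^{2m}(R) by simulation.

def sup_dev_true(r, grid):
    """sup over thresholds of |Q_P(h_t) - Qhat_r(h_t)|."""
    m = len(r)
    return max(abs((1 - t) - sum(x >= t for x in r) / m) for t in grid)

def sup_dev_double(r, s, grid):
    """sup over thresholds of |Qhat_r(h_t) - Qhat_s(h_t)|."""
    m = len(r)
    return max(abs(sum(x >= t for x in r) - sum(x >= t for x in s)) / m
               for t in grid)

random.seed(1)
m, eps, trials = 200, 0.15, 400
grid = [i / 50 for i in range(51)]
in_V = in_R = 0
for _ in range(trials):
    r = [random.random() for _ in range(m)]
    s = [random.random() for _ in range(m)]
    in_V += sup_dev_true(r, grid) >= eps
    in_R += sup_dev_double(r, s, grid) >= eps / 2

# Symmetrization predicts P^m(V) <= 2 P^{2m}(R); compare the two estimates.
print(in_V / trials, in_R / trials)
```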

We present the technical details of the proof.

### Symmetrization

Lemma: Let $V=\{x\in X^{m}:|Q_{P}(h)-\widehat{Q_{x}}(h)|\geq\epsilon\,\!$ for some $h\in H\}\,\!$ and

$R=\{(r,s)\in X^{m}\times X^{m}:|\widehat{Q_{r}}(h)-\widehat{Q_{s}}(h)|\geq\epsilon /2\,\!$ for some $h\in H\}\,\!$.

Then for $m\geq\frac{2}{\epsilon^{2}}\,\!$, $P^{m}(V)\leq 2P^{2m}(R)\,\!$.

Proof: By the triangle inequality,
if $|Q_{P}(h)-\widehat{Q_{r}}(h)|\geq\epsilon\,\!$ and $|Q_{P}(h)-\widehat{Q_{s}}(h)|\leq\epsilon /2\,\!$ then $|\widehat{Q_{r}}(h)-\widehat{Q_{s}}(h)|\geq\epsilon /2\,\!$.
Therefore,

$P^{2m}(R)\geq P^{2m}\{\exists h\in H,|Q_{P}(h)-\widehat{Q_{r}}(h)|\geq\epsilon\,\!$ and $|Q_{P}(h)-\widehat{Q_{s}}(h)|\leq\epsilon /2\}\,\!$
$=\int_{V}P^{m}\{s:\exists h\in H,|Q_{P}(h)-\widehat{Q_{r}}(h)|\geq\epsilon\,\!$ and $|Q_{P}(h)-\widehat{Q_{s}}(h)|\leq\epsilon /2\}dP^{m}(r)=A\,\!$ [since $r\,\!$ and $s\,\!$ are independent].

Now for $r\in V\,\!$ fix an $h\in H\,\!$ such that $|Q_{P}(h)-\widehat{Q_{r}}(h)|\geq\epsilon\,\!$. For this $h\,\!$, we shall show that

$P^{m}\{|Q_{P}(h)-\widehat{Q_{s}}(h)|\leq\frac{\epsilon}{2}\}\geq\frac{1}{2}\,\!$.

Granting this claim, the inner probability in $A\,\!$ is at least $\frac{1}{2}\,\!$ for every $r\in V\,\!$, so $A\geq\frac{P^{m}(V)}{2}\,\!$ and hence $P^{2m}(R)\geq\frac{P^{m}(V)}{2}\,\!$, which completes the first step of the high-level outline.
To prove the claim, notice that $m\cdot \widehat{Q_{s}}(h)\,\!$ is a binomial random variable with expectation $m\cdot Q_{P}(h)\,\!$ and variance $m\cdot Q_{P}(h)(1-Q_{P}(h))\,\!$. By Chebyshev's inequality we get,

$P^{m}\{|Q_{P}(h)-\widehat{Q_{s}}(h)|>\frac{\epsilon}{2}\}\leq\frac{m\cdot Q_{P}(h)(1-Q_{P}(h))}{(\epsilon m/2)^{2}}\leq\frac{1}{\epsilon^{2}m}\leq\frac{1}{2}\,\!$ for the assumed bound $m\geq\frac{2}{\epsilon^{2}}\,\!$. Here we use the fact that $x(1-x)\leq 1/4\,\!$ for all $x\in[0,1]\,\!$.
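The Chebyshev step can be verified exactly for small parameters; the values of $m\,\!$, $\epsilon\,\!$ and $p=Q_{P}(h)\,\!$ below are illustrative assumptions:

```python
import math
from math import comb

# Exact verification of the Chebyshev step for illustrative parameters:
# m * Qhat_s(h) is Binomial(m, p) with p = Q_P(h), and once m >= 2/eps^2
# the tail P(|Qhat_s(h) - p| > eps/2) is at most 1/(eps^2 m) <= 1/2.

def binom_tail(m, p, eps):
    """P(|K/m - p| > eps/2) for K ~ Binomial(m, p), computed exactly."""
    return sum(comb(m, k) * p**k * (1 - p)**(m - k)
               for k in range(m + 1) if abs(k / m - p) > eps / 2)

eps = 0.4
m = math.ceil(2 / eps**2)  # the smallest m allowed by the lemma; 13 here
for p in (0.1, 0.3, 0.5, 0.9):
    assert binom_tail(m, p, eps) <= 1 / (eps**2 * m) <= 0.5
print("Chebyshev bound verified for the tested parameters")
```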

### Permutations

Let $\Gamma_{m}\,\!$ be the set of all permutations of $\{1,2,3,\dots,2m\}\,\!$ that swap $i\,\!$ and $m+i\,\!$ for $i\,\!$ in some subset of $\{1,2,3,\dots,m\}\,\!$.

Lemma: Let $R\,\!$ be any subset of $X^{2m}\,\!$ and $P\,\!$ any probability distribution on $X\,\!$. Then,

$P^{2m}(R)=E[Pr[\sigma(x)\in R]]\leq\max_{x\in X^{2m}}(Pr[\sigma(x)\in R]),\,\!$

where the expectation is over $x\,\!$ chosen according to $P^{2m}\,\!$, and the probability is over $\sigma\,\!$ chosen uniformly from $\Gamma_{m}\,\!$.

Proof: For any $\sigma\in\Gamma_{m},\,\!$

$P^{2m}(R)=P^{2m}\{x:\sigma(x)\in R\}\,\!$ [since coordinate permutations preserve the product distribution $P^{2m}\,\!$].
$\therefore P^{2m}(R)=\int_{X^{2m}}1_{R}(x)dP^{2m}(x)\,\!$
$=\frac{1}{|\Gamma_{m}|}\sum_{\sigma\in\Gamma_{m}}\int_{X^{2m}}1_{R}(\sigma(x))dP^{2m}(x)\,\!$
$=\int_{X^{2m}}\frac{1}{|\Gamma_{m}|}\sum_{\sigma\in\Gamma_{m}}1_{R}(\sigma(x))dP^{2m}(x)\,\!$ [Because $|\Gamma_{m}|\,\!$ is finite]
$=\int_{X^{2m}}Pr[\sigma(x)\in R]dP^{2m}(x)$ [this is the expectation over $x\,\!$]
$\leq\max_{x\in X^{2m}}(Pr[\sigma(x)\in R])\,\!$.

The maximum is guaranteed to exist since there is only a finite set of values that probability under a random permutation can take.
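The invariance used in the first line of the proof can also be checked by simulation; the event $R\,\!$ below is an arbitrary assumed example.

```python
import random

# Monte-Carlo check that a uniformly random sigma from Gamma_m leaves the
# distribution of an i.i.d. sample invariant.  The event R here is an
# arbitrary assumed example, chosen only for illustration.

def random_swap(x, m, rng):
    """Apply a uniform sigma in Gamma_m: swap x[i], x[m+i] on a random subset."""
    y = list(x)
    for i in range(m):
        if rng.random() < 0.5:
            y[i], y[m + i] = y[m + i], y[i]
    return y

def in_R(x, m):
    # Example event R on X^{2m}: the first-half sample mean exceeds 0.55.
    return sum(x[:m]) / m > 0.55

rng = random.Random(2)
m, trials = 20, 20_000
plain = permuted = 0
for _ in range(trials):
    x = [rng.random() for _ in range(2 * m)]
    plain += in_R(x, m)
    permuted += in_R(random_swap(x, m, rng), m)

# Both frequencies estimate P^{2m}(R) and agree up to sampling noise.
print(plain / trials, permuted / trials)
```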

### Reduction to a finite class

Lemma: Continuing from the previous lemma,

$\max_{x\in X^{2m}}(Pr[\sigma(x)\in R])\leq 4\Pi_{H}(2m)e^{-\frac{\epsilon^{2}m}{8}}\,\!$.

Proof: Let us define $x=(x_{1},x_{2},\dots,x_{2m})\,\!$ and $t=|H|_{x}|\,\!$, the number of distinct restrictions of $H\,\!$ to $x\,\!$, which is at most $\Pi_{H}(2m)\,\!$. This means there are functions $h_{1},h_{2},\dots,h_{t}\in H\,\!$ such that for any $h\in H\,\!$ there exists $i\,\!$ between $1\,\!$ and $t\,\!$ with $h_{i}(x_{k})=h(x_{k})\,\!$ for $1\leq k\leq 2m\,\!$.
We see that $\sigma(x)\in R\,\!$ iff some $h\in H\,\!$ satisfies $|\frac{1}{m}|\{1\leq i\leq m:h(x_{\sigma(i)})=1\}|-\frac{1}{m}|\{m+1\leq i\leq 2m:h(x_{\sigma(i)})=1\}||\geq\frac{\epsilon}{2}\,\!$. Hence define $w^{j}_{i}=1\,\!$ if $h_{j}(x_{i})=1\,\!$ and $w^{j}_{i}=0\,\!$ otherwise, for $1\leq i\leq 2m\,\!$ and $1\leq j\leq t\,\!$. Then $\sigma(x)\in R\,\!$ iff some $j\in\{1,\dots,t\}\,\!$ satisfies $|\frac{1}{m}\left(\sum_{i}w^{j}_{\sigma(i)}-\sum_{i}w^{j}_{\sigma(m+i)}\right)|\geq\frac{\epsilon}{2}\,\!$. By the union bound we get,

$Pr[\sigma(x)\in R]\leq t\cdot\max_{1\leq j\leq t}\left(Pr[|\frac{1}{m}\left(\sum_{i}w^{j}_{\sigma(i)}-\sum_{i}w^{j}_{\sigma(m+i)}\right)|\geq\frac{\epsilon}{2}]\right)$
$\leq \Pi_{H}(2m)\cdot\max_{1\leq j\leq t}\left(Pr[|\frac{1}{m}\left(\sum_{i}w^{j}_{\sigma(i)}-\sum_{i}w^{j}_{\sigma(m+i)}\right)|\geq\frac{\epsilon}{2}]\right)\,\!$.

Since the distribution over the permutations $\sigma\,\!$ is uniform, each swap occurs independently with probability $\frac{1}{2}\,\!$, so $w^{j}_{\sigma(i)}-w^{j}_{\sigma(m+i)}\,\!$ equals $\pm|w^{j}_{i}-w^{j}_{m+i}|\,\!$ with equal probability.
Thus,

$Pr[|\frac{1}{m}\left(\sum_{i}\left(w^{j}_{\sigma(i)}-w^{j}_{\sigma(m+i)}\right)\right)|\geq\frac{\epsilon}{2}]=Pr[|\frac{1}{m}\left(\sum_{i}|w^{j}_{i}-w^{j}_{m+i}|\beta_{i}\right)|\geq\frac{\epsilon}{2}]\,\!$,

where the probability on the right is over the independent random signs $\beta_{i}\in\{-1,1\}\,\!$, both possibilities equally likely. By Hoeffding's inequality, this is at most $2e^{-\frac{m\epsilon^{2}}{8}}\,\!$.
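The Hoeffding step can likewise be checked by simulation; the coefficients $|w^{j}_{i}-w^{j}_{m+i}|\in\{0,1\}\,\!$ below are an assumed example.

```python
import math
import random

# Monte-Carlo check of the final Hoeffding step: for fixed coefficients
# a_i = |w_i - w_{m+i}| in {0, 1} (an assumed example) and independent
# uniform signs beta_i in {-1, +1},
#     Pr[ |(1/m) sum_i a_i beta_i| >= eps/2 ] <= 2 exp(-eps^2 m / 8).

rng = random.Random(3)
m, eps, trials = 100, 0.5, 50_000
a = [rng.randint(0, 1) for _ in range(m)]  # fixed coefficients

hits = 0
for _ in range(trials):
    total = sum(ai * rng.choice((-1, 1)) for ai in a)
    hits += abs(total) / m >= eps / 2

hoeffding_bound = 2 * math.exp(-eps**2 * m / 8)
print(hits / trials, hoeffding_bound)  # empirical tail vs. the bound
```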

Finally, combining all three parts of the proof we get the Uniform Convergence Theorem.