In computational learning theory (machine learning and theory of computation), Rademacher complexity, named after Hans Rademacher, measures the richness of a class of real-valued functions with respect to a probability distribution.

## Definitions

### Rademacher complexity of a set

Given a set ${\displaystyle A\subseteq \mathbb {R} ^{m}}$, the Rademacher complexity of A is defined as follows:[1][2]:326

${\displaystyle \operatorname {Rad} (A):={\frac {1}{m}}\mathbb {E} _{\sigma }\left[\sup _{a\in A}\sum _{i=1}^{m}\sigma _{i}a_{i}\right]}$

where ${\displaystyle \sigma _{1},\sigma _{2},\dots ,\sigma _{m}}$ are independent random variables drawn from the Rademacher distribution, i.e. ${\displaystyle \Pr(\sigma _{i}=+1)=\Pr(\sigma _{i}=-1)=1/2}$ for ${\displaystyle i=1,2,\dots ,m}$.
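For a finite set, this expectation can be computed exactly by enumerating all ${\displaystyle 2^{m}}$ sign vectors. A minimal Python sketch (the helper name `rademacher_complexity` is ours, not a standard API):

```python
import itertools

import numpy as np

def rademacher_complexity(A):
    """Exact Rad(A) for a finite set A of vectors in R^m,
    computed by enumerating all 2^m Rademacher sign vectors."""
    A = np.asarray(A, dtype=float)            # shape (|A|, m)
    m = A.shape[1]
    total = 0.0
    for sigma in itertools.product((-1.0, 1.0), repeat=m):
        total += np.max(A @ np.array(sigma))  # sup over a in A of <sigma, a>
    return total / 2 ** m / m                 # average over sigma, scale by 1/m
```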

### Rademacher complexity of a function class

Given a sample ${\displaystyle S=(z_{1},z_{2},\dots ,z_{m})\in Z^{m}}$, and a class ${\displaystyle F}$ of real-valued functions defined on a domain space ${\displaystyle Z}$, the empirical Rademacher complexity of ${\displaystyle F}$ given ${\displaystyle S}$ is defined as:

${\displaystyle \operatorname {Rad} _{S}(F)={\frac {1}{m}}\mathbb {E} _{\sigma }\left[\sup _{f\in F}\sum _{i=1}^{m}\sigma _{i}f(z_{i})\right]}$

This can also be written using the previous definition:[2]:326

${\displaystyle \operatorname {Rad} _{S}(F)=\operatorname {Rad} (F\circ S)}$

where ${\displaystyle F\circ S}$ denotes the set of vectors obtained by evaluating each function in ${\displaystyle F}$ on the sample, i.e.:

${\displaystyle F\circ S:=\{(f(z_{1}),\ldots ,f(z_{m}))|f\in F\}}$

Let ${\displaystyle P}$ be a probability distribution over ${\displaystyle Z}$. The Rademacher complexity of the function class ${\displaystyle F}$ with respect to ${\displaystyle P}$ for sample size ${\displaystyle m}$ is:

${\displaystyle \operatorname {Rad} _{P,m}(F):=\mathbb {E} _{S\sim P^{m}}\left[\operatorname {Rad} _{S}(F)\right]}$

where the above expectation is taken over an independent and identically distributed (i.i.d.) sample ${\displaystyle S=(z_{1},z_{2},\dots ,z_{m})}$ generated according to ${\displaystyle P}$.
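When the class ${\displaystyle F}$ is finite, the empirical Rademacher complexity ${\displaystyle \operatorname {Rad} _{S}(F)}$ can be estimated by Monte Carlo sampling of the signs. A hedged sketch (function and variable names are illustrative):

```python
import numpy as np

def empirical_rademacher(F, S, n_draws=10_000, seed=0):
    """Monte Carlo estimate of Rad_S(F) for a finite list F of
    real-valued functions and a sample S = (z_1, ..., z_m)."""
    rng = np.random.default_rng(seed)
    values = np.array([[f(z) for z in S] for f in F])   # F o S, shape (|F|, m)
    m = len(S)
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))  # Rademacher signs
    # For each draw of signs, take the sup over f, then average over draws.
    return np.mean(np.max(values @ sigma.T, axis=0)) / m
```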

## Examples

1. ${\displaystyle A}$ contains a single vector, e.g., ${\displaystyle A=\{(a,b)\}\subset \mathbb {R} ^{2}}$. Then:

${\displaystyle \operatorname {Rad} (A)={1 \over 2}\cdot \left({1 \over 4}\cdot (a+b)+{1 \over 4}\cdot (a-b)+{1 \over 4}\cdot (-a+b)+{1 \over 4}\cdot (-a-b)\right)=0}$

The same is true for every singleton hypothesis class.[3]:56

2. ${\displaystyle A}$ contains two vectors, e.g., ${\displaystyle A=\{(1,1),(1,2)\}\subset \mathbb {R} ^{2}}$. Then:

${\displaystyle \operatorname {Rad} (A)={1 \over 2}\cdot \left({1 \over 4}\cdot \max(1+1,1+2)+{1 \over 4}\cdot \max(1-1,1-2)+{1 \over 4}\cdot \max(-1+1,-1+2)+{1 \over 4}\cdot \max(-1-1,-1-2)\right)}$
${\displaystyle ={1 \over 8}(3+0+1-2)={1 \over 4}}$
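Both examples can be reproduced with the exact enumeration sketch given earlier (recall that `rademacher_complexity` is our illustrative helper, not part of the source):

```python
print(rademacher_complexity([(0.7, -1.3)]))     # singleton: 0.0
print(rademacher_complexity([(1, 1), (1, 2)]))  # two vectors: 0.25
```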

The Rademacher complexity can be used to derive data-dependent upper bounds on the learnability of function classes. Intuitively, a function class with smaller Rademacher complexity is easier to learn.

### Bounding the representativeness

In machine learning, it is desirable to have a training set that represents the true distribution of samples. This can be quantified using the notion of representativeness. Denote by ${\displaystyle P}$ the probability distribution from which the samples are drawn. Denote by ${\displaystyle H}$ the set of hypotheses (potential classifiers) and by ${\displaystyle F}$ the corresponding set of error functions: for every ${\displaystyle h\in H}$, there is a function ${\displaystyle f_{h}\in F}$ that maps each training sample (features, label) to the error of the classifier ${\displaystyle h}$ on that sample. For example, in binary classification with the simple 0-1 loss, ${\displaystyle f_{h}}$ is a function that returns 0 for each training sample on which ${\displaystyle h}$ is correct and 1 for each training sample on which ${\displaystyle h}$ is wrong. Define:

${\displaystyle L_{P}(f):=E_{z\sim P}[f(z)]}$ - the expected error on the real distribution;
${\displaystyle L_{S}(f):={1 \over m}\sum _{i=1}^{m}f(z_{i})}$ - the estimated error on the sample.

The representativeness of the sample ${\displaystyle S}$, with respect to ${\displaystyle P}$ and ${\displaystyle F}$, is defined as:

${\displaystyle Rep_{P}(F,S):=\sup _{f\in F}(L_{P}(f)-L_{S}(f))}$

Smaller representativeness is better, since it provides a way to avoid overfitting: it means that the true error of a classifier is not much higher than its estimated error, and so selecting a classifier that has low estimated error will ensure that the true error is also low.

The expected representativeness of a sample can be bounded by the Rademacher complexity of the function class:[2]:326

${\displaystyle E_{S\sim P^{m}}[Rep_{P}(F,S)]\leq 2\cdot E_{S\sim P^{m}}[\operatorname {Rad} (F\circ S)]}$
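This symmetrization bound can be checked numerically on a toy problem. The setup below (threshold classifiers on the unit interval with 0-1 loss) is our own illustration, not part of the source:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: x ~ Uniform[0,1], label y = 1[x > 0.5];
# H = threshold classifiers h_t(x) = 1[x > t]; F = their 0-1 losses.
thresholds = np.linspace(0.0, 1.0, 21)

def losses(x, y):
    """0-1 loss of every threshold classifier on a sample, shape (|H|, m)."""
    preds = (x[None, :] > thresholds[:, None]).astype(float)
    return (preds != y[None, :]).astype(float)

# L_P(f_h), estimated once on a very large sample.
x_big = rng.random(200_000)
L_P = losses(x_big, (x_big > 0.5).astype(float)).mean(axis=1)

m, n_trials = 50, 2_000
rep, rad = [], []
for _ in range(n_trials):
    x = rng.random(m)
    V = losses(x, (x > 0.5).astype(float))        # F o S
    rep.append(np.max(L_P - V.mean(axis=1)))      # Rep_P(F, S)
    sigma = rng.choice([-1.0, 1.0], size=m)
    rad.append(np.max(V @ sigma) / m)             # one-draw estimate of Rad(F o S)

print(np.mean(rep), "<=", 2 * np.mean(rad))       # the symmetrization bound
```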

### Bounding the generalization error

When the Rademacher complexity is small, it is possible to learn the hypothesis class ${\displaystyle H}$ using empirical risk minimization.

For example, with the binary (0-1) error function,[2]:328 for every ${\displaystyle \delta >0}$, with probability at least ${\displaystyle 1-\delta }$, for every hypothesis ${\displaystyle h\in H}$:

${\displaystyle L_{P}(h)-L_{S}(h)\leq 2\operatorname {Rad} (F\circ S)+4{\sqrt {2\ln(4/\delta ) \over m}}}$
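A one-line helper evaluating the right-hand side of this bound (a sketch; `rad_FS` stands for an estimate of ${\displaystyle \operatorname {Rad} (F\circ S)}$, obtained e.g. with the Monte Carlo sketch above):

```python
import numpy as np

def generalization_gap_bound(rad_FS, m, delta=0.05):
    """RHS of the bound above: holds with probability >= 1 - delta."""
    return 2 * rad_FS + 4 * np.sqrt(2 * np.log(4 / delta) / m)
```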

Since smaller Rademacher complexity is better, it is useful to have upper bounds on the Rademacher complexity of various function sets. The following rules can be used to upper-bound the Rademacher complexity of a set ${\displaystyle A\subset \mathbb {R} ^{m}}$.[2]:329–330

1. If all vectors in ${\displaystyle A}$ are translated by a constant vector ${\displaystyle a_{0}\in \mathbb {R} ^{m}}$, then Rad(A) does not change.

2. If all vectors in ${\displaystyle A}$ are multiplied by a scalar ${\displaystyle c\in \mathbb {R} }$, then Rad(A) is multiplied by ${\displaystyle |c|}$.

3. (Kakade & Tewari Lemma) If a Lipschitz function is applied coordinatewise to all vectors in ${\displaystyle A}$, then Rad(A) is (at most) multiplied by the Lipschitz constant of the function. In particular, if all vectors in ${\displaystyle A}$ are operated on by a contraction mapping, then Rad(A) strictly decreases.

4. The Rademacher complexity of the convex hull of ${\displaystyle A}$ equals Rad(A).

5. (Massart Lemma) The Rademacher complexity of a finite set grows logarithmically with the set size. Formally, let ${\displaystyle A}$ be a set of ${\displaystyle N}$ vectors in ${\displaystyle \mathbb {R} ^{m}}$, and let ${\displaystyle {\bar {a}}}$ be the mean of the vectors in ${\displaystyle A}$. Then:

${\displaystyle \operatorname {Rad} (A)\leq \max _{a\in A}||a-{\bar {a}}||\cdot {{\sqrt {2\log {N}}} \over m}}$

In particular, if ${\displaystyle A}$ is a set of binary vectors, the norm is at most ${\displaystyle {\sqrt {m}}}$, so:

${\displaystyle \operatorname {Rad} (A)\leq {\sqrt {2\log {N} \over m}}}$
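Massart's lemma is easy to check numerically on small random sets, reusing the exact `rademacher_complexity` helper sketched earlier:

```python
import numpy as np

rng = np.random.default_rng(2)

N, m = 40, 10                      # small m keeps the 2^m enumeration cheap
A = rng.normal(size=(N, m))
a_bar = A.mean(axis=0)
lhs = rademacher_complexity(A)     # exact Rad(A)
rhs = np.max(np.linalg.norm(A - a_bar, axis=1)) * np.sqrt(2 * np.log(N)) / m
assert lhs <= rhs
```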

### Bounds related to the VC dimension

Let ${\displaystyle H}$ be a set family whose VC dimension is ${\displaystyle d}$. It is known that the growth function of ${\displaystyle H}$ is bounded as:

for all ${\displaystyle m>d+1}$: ${\displaystyle Growth(H,m)\leq (em/d)^{d}}$

This means that, for every set ${\displaystyle h}$ with at most ${\displaystyle m}$ elements, ${\displaystyle |H\cap h|\leq (em/d)^{d}}$. The set family ${\displaystyle H\cap h}$ can be considered as a set of binary vectors in ${\displaystyle \mathbb {R} ^{m}}$. Substituting this in Massart's lemma gives:

${\displaystyle \operatorname {Rad} (H\cap h)\leq {\sqrt {2d\log {(em/d)} \over m}}}$

With more advanced techniques (Dudley's entropy bound and Haussler's upper bound[4]) one can show, for example, that there exists a constant ${\displaystyle C}$ such that any class of ${\displaystyle \{0,1\}}$-indicator functions with Vapnik–Chervonenkis dimension ${\displaystyle d}$ has Rademacher complexity upper-bounded by ${\displaystyle C{\sqrt {\frac {d}{m}}}}$.

### Bounds related to linear classes

The following bounds are related to linear operations on ${\displaystyle S}$, a fixed set of ${\displaystyle m}$ vectors in ${\displaystyle \mathbb {R} ^{n}}$.[2]:332–333

1. Define ${\displaystyle A_{2}=\{(w\cdot x_{1},\ldots ,w\cdot x_{m})\mid \|w\|_{2}\leq 1\}}$, the set of dot products of the vectors in ${\displaystyle S}$ with vectors in the Euclidean unit ball (a numerical check of this bound appears after the list). Then:

${\displaystyle \operatorname {Rad} (A_{2})\leq {\max _{i}||x_{i}||_{2} \over {\sqrt {m}}}}$

2. Define ${\displaystyle A_{1}=\{(w\cdot x_{1},\ldots ,w\cdot x_{m})\mid \|w\|_{1}\leq 1\}}$, the set of dot products of the vectors in ${\displaystyle S}$ with vectors in the unit ball of the 1-norm. Then:

${\displaystyle \operatorname {Rad} (A_{1})\leq \max _{i}||x_{i}||_{\infty }\cdot {\sqrt {2\log(2n) \over m}}}$
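For ${\displaystyle A_{2}}$ the supremum has a closed form, ${\displaystyle \sup _{\|w\|_{2}\leq 1}\sum _{i}\sigma _{i}w\cdot x_{i}=\|\sum _{i}\sigma _{i}x_{i}\|_{2}}$, which makes the first bound easy to verify by Monte Carlo (an illustrative sketch, with a randomly generated sample standing in for ${\displaystyle S}$):

```python
import numpy as np

rng = np.random.default_rng(3)

m, n = 100, 5
X = rng.normal(size=(m, n))                        # the fixed sample S
sigma = rng.choice([-1.0, 1.0], size=(20_000, m))  # Rademacher sign draws
rad_A2 = np.mean(np.linalg.norm(sigma @ X, axis=1)) / m
bound = np.max(np.linalg.norm(X, axis=1)) / np.sqrt(m)
print(rad_A2, "<=", bound)
```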

### Bounds related to covering numbers

The following bound relates the Rademacher complexity of a set ${\displaystyle A}$ to its external covering number ${\displaystyle N_{r}^{\text{ext}}(A)}$: the minimum number of balls of a given radius ${\displaystyle r}$ whose union contains ${\displaystyle A}$. The bound is attributed to Dudley.[2]:338

Suppose ${\displaystyle A\subset \mathbb {R} ^{m}}$ is a set of vectors whose length (norm) is at most ${\displaystyle c}$. Then, for every integer ${\displaystyle M>0}$:

${\displaystyle \operatorname {Rad} (A)\leq {c\cdot 2^{-M} \over {\sqrt {m}}}+{6c \over m}\cdot \sum _{i=1}^{M}2^{-i}{\sqrt {\log(N_{c\cdot 2^{-i}}^{\text{ext}}(A))}}}$

In particular, if ${\displaystyle A}$ lies in a d-dimensional subspace of ${\displaystyle \mathbb {R} ^{m}}$, then:

${\displaystyle \forall r>0:N_{r}^{\text{ext}}(A)\leq (2c{\sqrt {d}}/r)^{d}}$

Substituting this in the previous bound gives the following bound on the Rademacher complexity:

${\displaystyle \operatorname {Rad} (A)\leq {6c \over m}\cdot {\bigg (}{\sqrt {d\log(2{\sqrt {d}})}}+2{\sqrt {d}}{\bigg )}=O{\bigg (}{c{\sqrt {d\log(d)}} \over m}{\bigg )}}$

## Gaussian complexity

Gaussian complexity is a similar measure of complexity with a similar meaning, obtained from the Rademacher complexity by using random variables ${\displaystyle g_{i}}$ in place of ${\displaystyle \sigma _{i}}$, where the ${\displaystyle g_{i}}$ are i.i.d. Gaussian random variables with zero mean and unit variance, i.e. ${\displaystyle g_{i}\sim {\mathcal {N}}\left(0,1\right)}$. Gaussian and Rademacher complexities are known to be equivalent up to logarithmic factors.
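The change amounts to swapping the sign distribution for a standard normal in any of the sketches above, e.g. (names illustrative):

```python
import numpy as np

def gaussian_complexity(A, n_draws=20_000, seed=4):
    """Monte Carlo estimate of the Gaussian complexity of a finite set A:
    same as Rad(A), but with g_i ~ N(0, 1) in place of the signs."""
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float)                # shape (|A|, m)
    g = rng.normal(size=(n_draws, A.shape[1]))    # Gaussian draws
    return np.mean(np.max(A @ g.T, axis=0)) / A.shape[1]
```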

## References

1. Balcan, Maria-Florina (November 15–17, 2011). "Machine Learning Theory - Rademacher Complexity" (PDF). Retrieved 10 December 2016.
2. Chapter 26 in Shalev-Shwartz, Shai; Ben-David, Shai (2014). Understanding Machine Learning - from Theory to Algorithms. Cambridge University Press. ISBN 9781107057135.
3. Mohri, Mehryar; Rostamizadeh, Afshin; Talwalkar, Ameet (2012). Foundations of Machine Learning. Cambridge, Massachusetts: MIT Press. ISBN 9780262018258.
4. Bousquet, O. (2004). "Introduction to Statistical Learning Theory". Advanced Lectures on Machine Learning. Lecture Notes in Computer Science, vol. 3176, pp. 169–207. https://doi.org/10.1007/978-3-540-28650-9_8
• Bartlett, Peter L.; Mendelson, Shahar (2002). "Rademacher and Gaussian Complexities: Risk Bounds and Structural Results". Journal of Machine Learning Research. 3: 463–482.
• Gnecco, Giorgio; Sanguineti, Marcello (2008). "Approximation Error Bounds via Rademacher's Complexity". Applied Mathematical Sciences. 2 (4): 153–176.