Rademacher complexity

In computational learning theory (machine learning and theory of computation), Rademacher complexity, named after Hans Rademacher, measures the richness of a class of real-valued functions with respect to a probability distribution.

Definitions

Rademacher complexity of a set

Given a set $A \subseteq \mathbb{R}^m$, the Rademacher complexity of $A$ is defined as follows:[1][2]:326

$$\operatorname{Rad}(A) := \frac{1}{m} \mathbb{E}_\sigma \left[ \sup_{a \in A} \sum_{i=1}^m \sigma_i a_i \right]$$

where $\sigma_1, \sigma_2, \dots, \sigma_m$ are independent random variables drawn from the Rademacher distribution, i.e. $\Pr(\sigma_i = +1) = \Pr(\sigma_i = -1) = 1/2$ for $i = 1, 2, \dots, m$, and $a = (a_1, \dots, a_m)$.
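For a finite set $A$, the expectation over $\sigma$ can be approximated by Monte Carlo sampling. A minimal Python sketch of this definition, assuming $A$ is given as an array whose rows are the vectors $a$ (the function name and example values are hypothetical, chosen only for illustration):

```python
import numpy as np

def rad_of_set(A, n_draws=100_000, seed=0):
    """Monte Carlo estimate of Rad(A) for a finite set A of vectors in R^m.

    A has shape (k, m): each row is one vector a in A.
    Rad(A) = (1/m) * E_sigma[ sup_{a in A} sum_i sigma_i * a_i ].
    """
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float)
    m = A.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))   # i.i.d. Rademacher signs
    sups = (sigma @ A.T).max(axis=1)                     # sup_{a in A} <sigma, a> for each draw
    return sups.mean() / m

# A singleton set has Rademacher complexity 0 (up to Monte Carlo error),
# since the expectation of a single zero-mean linear form is 0.
print(rad_of_set([[3.0, -4.0]]))
```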

Rademacher complexity of a function class

Given a sample $S = (z_1, z_2, \dots, z_m) \in Z^m$, and a class $\mathcal{F}$ of real-valued functions defined on a domain space $Z$, the empirical Rademacher complexity of $\mathcal{F}$ given $S$ is defined as:

$$\operatorname{Rad}_S(\mathcal{F}) = \frac{1}{m} \mathbb{E}_\sigma \left[ \sup_{f \in \mathcal{F}} \sum_{i=1}^m \sigma_i f(z_i) \right]$$

This can also be written using the previous definition:[2]:326

$$\operatorname{Rad}_S(\mathcal{F}) = \operatorname{Rad}(\mathcal{F} \circ S)$$

where $\mathcal{F} \circ S$ denotes function composition, i.e.:

$$\mathcal{F} \circ S := \{ (f(z_1), \dots, f(z_m)) \mid f \in \mathcal{F} \}$$

Let $P$ be a probability distribution over $Z$. The Rademacher complexity of the function class $\mathcal{F}$ with respect to $P$ for sample size $m$ is:

$$\operatorname{Rad}_{P,m}(\mathcal{F}) := \mathbb{E}_{S \sim P^m} \left[ \operatorname{Rad}_S(\mathcal{F}) \right]$$

where the above expectation is taken over an independent and identically distributed (i.i.d.) sample $S = (z_1, z_2, \dots, z_m)$ generated according to $P$.
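For a finite function class, the empirical Rademacher complexity can be estimated the same way once the values $f(z_i)$ have been tabulated. A minimal Python sketch, assuming $\mathcal{F}$ is given as a list of callables (the threshold class and sample below are hypothetical examples):

```python
import numpy as np

def empirical_rademacher(sample, functions, n_draws=100_000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity Rad_S(F)
    for a finite function class F (a list of callables) on a sample S."""
    rng = np.random.default_rng(seed)
    m = len(sample)
    # Row j of `values` is (f_j(z_1), ..., f_j(z_m)), i.e. one element of F o S.
    values = np.array([[f(z) for z in sample] for f in functions], dtype=float)
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))
    # Average over sigma of sup_{f in F} sum_i sigma_i * f(z_i), divided by m.
    return (sigma @ values.T).max(axis=1).mean() / m

# Hypothetical example: threshold indicator functions x -> 1[x >= t] on a 1-D sample.
sample = [0.1, 0.35, 0.4, 0.8, 0.95]
functions = [lambda x, t=t: float(x >= t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
print(empirical_rademacher(sample, functions))
```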

Examples

1. $A$ contains a single vector, e.g., $A = \{(a, b)\} \subset \mathbb{R}^2$. Then:

$$\operatorname{Rad}(A) = \frac{1}{2} \mathbb{E}[\sigma_1 a + \sigma_2 b] = 0$$

since $\mathbb{E}[\sigma_1] = \mathbb{E}[\sigma_2] = 0$. The same is true for every singleton hypothesis class.[3]:56

2. $A$ contains two vectors, e.g., $A = \{(1, 1), (1, 2)\} \subset \mathbb{R}^2$. Then:

$$\operatorname{Rad}(A) = \frac{1}{2} \mathbb{E}\left[\max(\sigma_1 + \sigma_2,\ \sigma_1 + 2\sigma_2)\right] = \frac{1}{2} \cdot \frac{3 + 0 + 1 - 2}{4} = \frac{1}{4}$$

where the four terms in the numerator correspond to the four equally likely sign patterns $(\sigma_1, \sigma_2) \in \{\pm 1\}^2$.
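The second example can be verified by brute-force enumeration of the four equally likely sign patterns; a short Python check (variable names are illustrative):

```python
from itertools import product

# Exact value of Rad(A) for A = {(1, 1), (1, 2)}, enumerating all sign
# vectors sigma in {-1, +1}^2, each with probability 1/4.
A = [(1, 1), (1, 2)]
m = 2
total = 0.0
for sigma in product([-1, 1], repeat=m):
    total += max(sum(s * a for s, a in zip(sigma, vec)) for vec in A)
print(total / 2 ** m / m)  # 0.25
```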

Using the Rademacher complexity

The Rademacher complexity can be used to derive data-dependent upper bounds on the learnability of function classes. Intuitively, a function class with smaller Rademacher complexity is easier to learn.

Bounding the representativeness

In machine learning, it is desirable to have a training set that represents the true distribution of samples. This can be quantified using the notion of representativeness. Denote by $P$ the probability distribution from which the samples are drawn. Denote by $H$ the set of hypotheses (potential classifiers) and denote by $F$ the corresponding set of error functions, i.e., for every hypothesis $h \in H$, there is a function $f_h \in F$ that maps each training sample (features, label) to the error of the classifier $h$ on that sample. For example, if we do binary classification and the error function is the simple 0-1 loss, then $f_h$ is a function that returns 0 for each training sample on which $h$ is correct and 1 for each training sample on which $h$ is wrong. Define:

- $L_P(h) := \mathbb{E}_{z \sim P}[f_h(z)]$ is the expected error of $h$ on the real distribution;
- $L_S(h) := \frac{1}{m} \sum_{i=1}^m f_h(z_i)$ is the estimated error of $h$ on the sample $S = (z_1, \dots, z_m)$.

The representativeness of the sample $S$, with respect to $P$ and $F$, is defined as:

$$\operatorname{Rep}_P(F, S) := \sup_{h \in H} \left( L_P(h) - L_S(h) \right)$$

Smaller representativeness is better, since it means that the true error of a classifier is not much higher than its empirical error on the training set.

The expected representativeness of a sample can be bounded by the expected Rademacher complexity of the function class:[2]:326

$$\mathbb{E}_{S \sim P^m} \left[ \operatorname{Rep}_P(F, S) \right] \le 2 \operatorname{Rad}_{P,m}(F)$$
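As an illustration of this bound, the following Python sketch simulates both sides for a small, hypothetical finite class of 0-1 error functions under a uniform distribution; the estimated expected representativeness should not exceed twice the estimated expected Rademacher complexity, up to Monte Carlo error (all sizes and seeds are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_funcs, m = 10, 8, 15
F = rng.integers(0, 2, size=(n_funcs, n_points)).astype(float)  # F[j, z] = f_j(z), a 0-1 error
true_means = F.mean(axis=1)                      # L_P(f_j) under the uniform distribution

reps, rads = [], []
for _ in range(2000):
    S = rng.integers(0, n_points, size=m)        # i.i.d. sample z_1, ..., z_m from P
    values = F[:, S]                             # f_j(z_i) for every function and sample point
    emp_means = values.mean(axis=1)              # L_S(f_j)
    reps.append((true_means - emp_means).max())  # Rep_P(F, S) = sup_f (L_P(f) - L_S(f))
    sigma = rng.choice([-1.0, 1.0], size=(500, m))
    rads.append((sigma @ values.T).max(axis=1).mean() / m)   # Monte Carlo Rad_S(F)

print(np.mean(reps), 2 * np.mean(rads))   # the first value should not exceed the second
```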

Bounding the generalization error

When the Rademacher complexity is small, it is possible to learn the hypothesis class H using empirical risk minimization.

For example (with the binary error function),[2]:328 for every $\delta > 0$, with probability at least $1 - \delta$, simultaneously for every hypothesis $h \in H$, the true error $L_P(h)$ exceeds the empirical error $L_S(h)$ by at most $2 \operatorname{Rad}_{P,m}(F)$ plus a deviation term of order $\sqrt{\ln(1/\delta)/m}$.

Bounding the Rademacher complexity

Since smaller Rademacher complexity is better, it is useful to have upper bounds on the Rademacher complexity of various function sets. The following rules can be used to upper-bound the Rademacher complexity of a set $A \subseteq \mathbb{R}^m$.[2]:329-330

1. If all vectors in $A$ are translated by a constant vector $a_0 \in \mathbb{R}^m$, then Rad(A) does not change.

2. If all vectors in $A$ are multiplied by a scalar $c \in \mathbb{R}$, then Rad(A) is multiplied by $|c|$.

3. $\operatorname{Rad}(A + B) = \operatorname{Rad}(A) + \operatorname{Rad}(B)$, where $A + B := \{a + b \mid a \in A,\ b \in B\}$.[3]:56

4. (Kakade & Tewari lemma) If all vectors in $A$ are operated on by a Lipschitz function, then Rad(A) is (at most) multiplied by the Lipschitz constant of the function. In particular, if all vectors in $A$ are operated on by a contraction mapping, then Rad(A) does not increase.

5. The Rademacher complexity of the convex hull of $A$ equals Rad(A).

6. (Massart's lemma) The Rademacher complexity of a finite set grows logarithmically with the set size. Formally, let $A$ be a set of $N$ vectors in $\mathbb{R}^m$, and let $\bar{a}$ be the mean of the vectors in $A$. Then:

$$\operatorname{Rad}(A) \le \max_{a \in A} \|a - \bar{a}\|_2 \cdot \frac{\sqrt{2 \ln N}}{m}$$

In particular, if $A$ is a set of binary vectors, the norm is at most $\sqrt{m}$, so:

$$\operatorname{Rad}(A) \le \sqrt{\frac{2 \ln N}{m}}$$
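A quick numerical sanity check of Massart's lemma for a random set of binary vectors (all sizes and seeds are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 20, 50
A = rng.integers(0, 2, size=(N, m)).astype(float)    # N binary vectors in {0,1}^m

sigma = rng.choice([-1.0, 1.0], size=(200_000, m))    # Monte Carlo Rademacher signs
rad_estimate = (sigma @ A.T).max(axis=1).mean() / m
massart_bound = np.sqrt(2 * np.log(N) / m)            # binary-vector form of Massart's lemma
print(rad_estimate, massart_bound)                    # the estimate should stay below the bound
```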

Bounds related to the VC dimension

Let $H$ be a set family whose VC dimension is $d$. It is known that the growth function of $H$ is bounded as:

for all $m > d+1$: $\quad \mathrm{Growth}(H, m) \le (em/d)^d$

This means that, for every set $h$ with at most $m$ elements, $|H \cap h| \le (em/d)^d$. The set family $H \cap h$ can be considered as a set of binary vectors over $\mathbb{R}^m$. Substituting this in Massart's lemma gives:

$$\operatorname{Rad}(H \cap h) \le \sqrt{\frac{2 d \ln(em/d)}{m}}$$

With more advanced techniques (Dudley's entropy bound and Haussler's upper bound[4]) one can show, for example, that there exists a constant $C$ such that any class of $\{0,1\}$-indicator functions with Vapnik-Chervonenkis dimension $d$ has Rademacher complexity upper-bounded by $C\sqrt{d/m}$.
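The improvement over the Massart-based bound is only a logarithmic factor, as the following Python sketch illustrates (the constant $C = 1$ is a hypothetical placeholder, since the statement only asserts the existence of some constant):

```python
import numpy as np

# Compare the Massart-based VC bound sqrt(2 d ln(e m / d) / m) with the
# improved O(sqrt(d / m)) rate, for a fixed VC dimension d and growing m.
d, C = 10, 1.0   # C is an illustrative placeholder, not a known sharp constant
for m in [100, 1_000, 10_000, 100_000]:
    massart_vc = np.sqrt(2 * d * np.log(np.e * m / d) / m)
    dudley_haussler = C * np.sqrt(d / m)
    print(m, round(massart_vc, 4), round(dudley_haussler, 4))
```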

Bounds related to linear classes

The following bounds are related to linear operations on $S = (x_1, \dots, x_m)$, a constant set of $m$ vectors in $\mathbb{R}^n$.[2]:332-333

1. Define $A_2 = \{ (w \cdot x_1, \dots, w \cdot x_m) \mid \|w\|_2 \le 1 \}$, the set of dot-products of the vectors in $S$ with vectors in the unit ball of the 2-norm. Then:

$$\operatorname{Rad}(A_2) \le \frac{\max_i \|x_i\|_2}{\sqrt{m}}$$

2. Define $A_1 = \{ (w \cdot x_1, \dots, w \cdot x_m) \mid \|w\|_1 \le 1 \}$, the set of dot-products of the vectors in $S$ with vectors in the unit ball of the 1-norm. Then:

$$\operatorname{Rad}(A_1) \le \max_i \|x_i\|_\infty \cdot \sqrt{\frac{2 \ln(2n)}{m}}$$
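For the 2-norm case, the supremum over the unit ball has the closed form $\sup_{\|w\|_2 \le 1} \sum_i \sigma_i (w \cdot x_i) = \|\sum_i \sigma_i x_i\|_2$, which makes a Monte Carlo check straightforward; a Python sketch with arbitrary illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 40, 5
X = rng.normal(size=(m, n))                    # the fixed vectors x_1, ..., x_m in R^n

sigma = rng.choice([-1.0, 1.0], size=(100_000, m))
# sup_{||w||_2 <= 1} sum_i sigma_i (w . x_i) = || sum_i sigma_i x_i ||_2
sups = np.linalg.norm(sigma @ X, axis=1)
rad_estimate = sups.mean() / m
bound = np.linalg.norm(X, axis=1).max() / np.sqrt(m)
print(rad_estimate, bound)                     # the estimate should not exceed the bound
```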

Bounds related to covering numbers

The following bound relates the Rademacher complexity of a set $A$ to its external covering number, the number of balls of a given radius $r$ whose union contains $A$. The bound is attributed to Dudley.[2]:338

Suppose $A$ is a set of vectors whose length (norm) is at most $c$. Then, for every integer $M > 0$:

In particular, if $A$ lies in a $d$-dimensional subspace of $\mathbb{R}^m$, then:

Substituting this in the previous bound gives the following bound on the Rademacher complexity:

Gaussian complexity

Gaussian complexity is a similar complexity measure with a similar meaning, and can be obtained from the Rademacher complexity by using the random variables $g_i$ instead of $\sigma_i$, where the $g_i$ are i.i.d. Gaussian random variables with zero mean and variance 1, i.e. $g_i \sim \mathcal{N}(0, 1)$.
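Estimating the Gaussian complexity therefore only requires replacing the sign draws by standard normal draws; a minimal Python sketch, reusing the two-vector set from the examples above:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 1.0], [1.0, 2.0]])               # the two-vector set from the examples above
n_draws, m = 200_000, A.shape[1]

sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))    # Rademacher signs
g = rng.normal(size=(n_draws, m))                     # standard normal weights
rad = (sigma @ A.T).max(axis=1).mean() / m            # Rademacher complexity estimate (~0.25)
gauss = (g @ A.T).max(axis=1).mean() / m              # Gaussian complexity estimate
print(rad, gauss)
```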

References

  1. Balcan, Maria-Florina (November 15–17, 2011). "Machine Learning Theory - Rademacher Complexity" (PDF). Retrieved 10 December 2016.
  2. Shalev-Shwartz, Shai; Ben-David, Shai (2014). Understanding Machine Learning: From Theory to Algorithms. Chapter 26. Cambridge University Press. ISBN 9781107057135.
  3. Mohri, Mehryar; Rostamizadeh, Afshin; Talwalkar, Ameet (2012). Foundations of Machine Learning. Cambridge, Massachusetts: MIT Press. ISBN 9780262018258.
  4. Bousquet, O. (2004). "Introduction to Statistical Learning Theory". In Advanced Lectures on Machine Learning, Lecture Notes in Computer Science, vol. 3176, pp. 169–207. Springer. https://doi.org/10.1007/978-3-540-28650-9_8
  • Bartlett, Peter L.; Mendelson, Shahar (2002). "Rademacher and Gaussian Complexities: Risk Bounds and Structural Results". Journal of Machine Learning Research 3: 463–482.
  • Gnecco, Giorgio; Sanguineti, Marcello (2008). "Approximation Error Bounds via Rademacher's Complexity". Applied Mathematical Sciences, Vol. 2, no. 4, pp. 153–176.