User:Agkonings/sandbox

From Wikipedia, the free encyclopedia

In supervised learning applications in machine learning and statistical learning theory, the generalization error (also known as the out-of-sample error[1]) is the error incurred by training a function on a finite sample instead of on the true underlying probability distribution. By training on a finite (and possibly noisy) sample, the function found by a learning algorithm may be sensitive to the particular noise in the sample or to small differences between the empirical probability distribution of the sample and the underlying true probability distribution. In this case, applying the function to other inputs with different instances of noise or different sampling effects may cause large errors in the prediction. The generalization error can be minimized by avoiding overfitting in the learning algorithm.

Definition and relation to statistical learning theory[edit]

In learning problems, the aim is to predict output values $y$ based on some input data $x$. The pairs $(x, y)$ are generally assumed to be independent and identically distributed (i.i.d.) draws from an underlying joint probability distribution $\rho(x, y)$. Note that both $x$ and $y$ may be multi-dimensional. In order to determine the quality of a prediction $f(x)$, it is necessary to define a loss function $V(f(x), y)$. Common loss functions include the square loss or $L_2$ loss, $V(f(x), y) = (f(x) - y)^2$.

Other common loss functions include the 0-1 loss and the hinge loss.

Given a loss function, the goal of the learning problem is to minimize the expected risk (also known as the true error or expected error):

$I[f] = \int_{X \times Y} V(f(x), y) \, \rho(x, y) \, dx \, dy,$

where $f$ is a map from inputs to outputs, $f: X \to Y$. More specifically, the goal is to find the function $f$ within some class of functions $\mathcal{H}$ (known as the hypothesis space) that minimizes the expected risk,

$f = \underset{h \in \mathcal{H}}{\operatorname{arg\,min}} \, I[h].$

In practice, it is often difficult to estimate the expected risk because the true probability density function $\rho(x, y)$ is unknown. For many problems, however, a (possibly noisy) sample of $n$ input-output pairs $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ is available, and the problem becomes a supervised learning problem. In that case, a natural proxy for the expected risk is the empirical risk or empirical error,

$I_S[f] = \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i).$

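As a concrete illustration, the empirical risk under the square loss can be computed directly from a sample. The sketch below is illustrative only; the names `empirical_risk`, `f`, `xs`, and `ys` do not come from the literature cited here.

```python
def empirical_risk(f, xs, ys):
    """Average square loss V(f(x), y) = (f(x) - y)^2 of predictor f
    over the sample (xs, ys)."""
    n = len(xs)
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / n

# Example: a simple linear predictor evaluated on a tiny sample.
f = lambda x: 2.0 * x
xs = [0.0, 1.0, 2.0]
ys = [0.0, 2.5, 3.5]
risk = empirical_risk(f, xs, ys)  # ((0)^2 + (-0.5)^2 + (0.5)^2) / 3
```

Unlike the expected risk, this quantity requires no knowledge of $\rho(x, y)$; it only uses the observed sample.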
For a finite and noisy sample, the minimizer of the empirical risk (the sample function $f_n$) may not be the same as the target function $f$ that minimizes the expected risk. This would result in a larger-than-optimal expected loss if $f_n$ is applied to produce predictions, such that $I[f_n] \geq I[f]$. The generalization error is the difference between the expected and empirical error in this scenario,

$G = I[f_n] - I_S[f_n].$

Since $I[f_n]$ cannot be computed in practice for an unknown probability distribution, the generalization error cannot be computed either. Instead, the aim of many problems in statistical learning theory is to bound or characterize the generalization error in probability,

$P_G = P\big( I[f_n] - I_S[f_n] \leq \epsilon \big) \geq 1 - \delta_n.$

That is, the goal is to characterize the probability $1 - \delta_n$ that the generalization error is less than some error bound $\epsilon$ (known as the learning rate, and generally dependent on $\delta$ and $n$). Alternatively, one can think of the problem as characterizing the sample complexity, that is, the number of samples $n$ needed to achieve a given bound on the generalization error with some probability. A learning algorithm is said to generalize if the generalization error approaches zero as the number of samples goes to infinity.
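The shrinking of the generalization error with growing sample size can be seen in a toy simulation. In this sketch (not from the cited sources), the learned function is simply the constant that minimizes the empirical square loss on the training sample, i.e. the sample mean; the expected risk is estimated on a large fresh sample from the same distribution, and the gap is averaged over many trials:

```python
import random

random.seed(0)

def avg_gen_error(n, trials=200, m=2000):
    """Monte Carlo estimate of the average generalization error
    I[f_n] - I_S[f_n] for the sample-mean predictor on N(0, 1) data."""
    total = 0.0
    for _ in range(trials):
        train = [random.gauss(0.0, 1.0) for _ in range(n)]
        c = sum(train) / n                            # learned constant f_n
        emp = sum((c - y) ** 2 for y in train) / n    # empirical risk I_S[f_n]
        test = [random.gauss(0.0, 1.0) for _ in range(m)]
        exp = sum((c - y) ** 2 for y in test) / m     # estimate of I[f_n]
        total += exp - emp
    return total / trials

gap_small_n = avg_gen_error(10)    # analytically ~ 2*sigma^2/n = 0.2
gap_large_n = avg_gen_error(1000)  # analytically ~ 0.002
```

The average gap for $n = 1000$ is far smaller than for $n = 10$, consistent with the generalization error approaching zero as $n \to \infty$.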

Analogous definitions of the generalization error can be given for other types of learning, such as unsupervised learning.[2]

Relation to overfitting[edit]

This figure illustrates the relationship between overfitting and the generalization error $I[f_n] - I_S[f_n]$. In the left column, a set of training points is shown in black, while the black points in the right column show a second sample drawn i.i.d. from the same distribution. Both samples are drawn from a noisy version of the relationship $y = x^2$. In the top row, a quadratic polynomial is fit to the training sample, while in the bottom row a fifth-order polynomial is fit. Overfitting occurs with the fifth-order polynomial: while its empirical error on the training set is low, its expected error is high, as evaluation on the second sample shows.

The concepts of generalization error and overfitting are closely related. Overfitting occurs when the learned function becomes sensitive to the noise in the sample. As a result, the function will not perform well when acting upon a different sample with different noise. Thus, the more overfitting occurs, the larger the generalization error.
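The polynomial experiment described in the figure caption can be sketched in a few lines. This version assumes numpy is available and uses `np.polyfit`/`np.polyval`; the sample sizes, noise level, and seed are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy training sample and an independent test sample from y = x^2.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = x_train**2 + rng.normal(0.0, 0.1, size=10)
x_test = np.linspace(-1.0, 1.0, 200)
y_test = x_test**2 + rng.normal(0.0, 0.1, size=200)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial fit on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

fit2 = np.polyfit(x_train, y_train, 2)  # matches the true model class
fit5 = np.polyfit(x_train, y_train, 5)  # flexible enough to chase noise

train2, train5 = mse(fit2, x_train, y_train), mse(fit5, x_train, y_train)
test2, test5 = mse(fit2, x_test, y_test), mse(fit5, x_test, y_test)
```

Because the degree-5 model class contains the degree-2 class, its training error can only be lower or equal; the interesting quantity is how much worse it does on the fresh test sample.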

One can test whether overfitting has occurred by using a cross-validation method, which splits the sample (potentially in a variety of different ways) into simulated training and testing samples in order to explicitly test the performance of the learned function on data it was not trained on; this amounts to approximating a particular form of the generalization error. Beyond testing for its occurrence, mechanisms also exist to reduce the chance that overfitting is present: the minimization algorithm can penalize more complex functions (known as Tikhonov regularization), or the hypothesis space can be constrained, either explicitly in the form of the functions or by adding constraints to the minimization problem (Ivanov regularization).
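The splitting idea behind cross-validation can be made concrete with a minimal k-fold sketch. The helper names (`k_fold_cv`, `fit`, `loss`) are illustrative, and `fit` stands in for any learning algorithm:

```python
def k_fold_cv(xs, ys, k, fit, loss):
    """Average held-out loss over k contiguous folds; this approximates
    the expected error of the learned function on unseen data."""
    n = len(xs)
    fold = n // k
    scores = []
    for i in range(k):
        lo, hi = i * fold, (i + 1) * fold
        x_tr = xs[:lo] + xs[hi:]          # train on everything but fold i
        y_tr = ys[:lo] + ys[hi:]
        f = fit(x_tr, y_tr)
        held = [loss(f(x), y) for x, y in zip(xs[lo:hi], ys[lo:hi])]
        scores.append(sum(held) / len(held))
    return sum(scores) / k

# Example: "learning" a constant (the training mean) under square loss.
fit_mean = lambda xs, ys: (lambda x, c=sum(ys) / len(ys): c)
sq = lambda p, y: (p - y) ** 2
xs = list(range(6))
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
cv_score = k_fold_cv(xs, ys, 3, fit_mean, sq)
```

In practice the folds are usually shuffled first; contiguous folds are used here only to keep the sketch short and deterministic.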

Note that the goal of finding a function that does not overfit is somewhat at odds with the goal of finding a function that is sufficiently complex to capture the particular characteristics of the data. This is known as the bias-variance tradeoff. Keeping a function simple to avoid overfitting may introduce a bias in the resulting predictions, while allowing it to be more complex leads to overfitting and a higher variance in the predictions. It is impossible to minimize both simultaneously.

Generalization error and algorithm stability[edit]

A learning algorithm is stable if the learned function does not change much when there is a small change in the training sample. For example, the (strong) requirement of uniform stability or beta-stability implies that the change in loss when the $i$-th sample is replaced with a new point is bounded. Mathematically, for all samples $S$, all indices $i$, and all points $z$,

$\big| V(f_S, z) - V(f_{S^i}, z) \big| \leq \beta,$

where $f_S$ is the function learned from the sample $S$ and $S^i$ denotes $S$ with its $i$-th sample replaced.
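For intuition, the replace-one-sample bound can be checked numerically for a very simple algorithm. In this sketch (an illustration, not from the cited sources), the algorithm learns the sample mean of points in $[0, 1]$, so replacing one of $n$ points moves the output by at most $1/n$, and the square loss on any $z \in [0, 1]$ changes by at most about $2/n$:

```python
def learn_mean(sample):
    """A trivially stable learning algorithm: output the sample mean."""
    return sum(sample) / len(sample)

n = 100
S = [i / (n - 1) for i in range(n)]   # training sample in [0, 1]
S_repl = [1.0] + S[1:]                # replace the first point

c, c_repl = learn_mean(S), learn_mean(S_repl)

# Largest observed change in square loss over a grid of z in [0, 1]:
beta_hat = max(abs((c - z) ** 2 - (c_repl - z) ** 2)
               for z in [i / 200 for i in range(201)])
# The mean moves by at most 1/n, so beta_hat is bounded by ~2/n here.
```

Algorithms whose output depends this weakly on any single sample are exactly the ones for which stability-based generalization bounds apply.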

Note that if the solution of a problem is not only stable but also unique and certain to exist, that problem is said to be well-posed.

It is known that, for a certain definition of stability, the stability of an algorithm not only ensures its generalization, but is also necessary for it [3]. That is, stability and generalization are equivalent; one implies the other.

According to the Glivenko-Cantelli theorem, empirical risk minimization (ERM) leads to generalization if the hypothesis space $\mathcal{H}$ is appropriately chosen such that it is a uGC (or uniform Glivenko-Cantelli) class, that is, if it has the property

$\lim_{n \to \infty} \sup_{f \in \mathcal{H}} \big| I[f] - I_S[f] \big| = 0 \quad \text{in probability}.$

A class $\mathcal{H}$ is uGC if and only if it has a finite VC dimension. If $\mathcal{H}$ is a uGC class, then ERM has the additional nice property that it is uniformly consistent. That is,

$I[f_n] \to \inf_{f \in \mathcal{H}} I[f] \quad \text{in probability as } n \to \infty.$

Note that the error term that goes to zero for consistent classes is not the generalization error. The generalization error is the difference between the expected risk and the empirical risk, each on the learned function. By contrast, consistency concerns the difference between the expected risk on the target function and the expected risk on the learned function.

References[edit]

  1. ^ Abu-Mostafa, Y. S., Magdon-Ismail, M., and Lin, H.-T. (2012). Learning from Data. AMLBook Press. ISBN 978-1600490064.
  2. ^ Hansen, L. K. and Larsen, J. (1996). Unsupervised Learning and Generalization. IEEE International Conference on Neural Networks, pp. 25-30.
  3. ^ Mukherjee, S., Niyogi, P., Poggio, T., and Rifkin, R. (2006). Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, Vol. 25, pp. 161-193.

Additional literature[edit]

  • Bousquet, O., S. Boucheron and G. Lugosi. Introduction to Statistical Learning Theory. Advanced Lectures on Machine Learning Lecture Notes in Artificial Intelligence 3176, 169-207. (Eds.) Bousquet, O., U. von Luxburg and G. Ratsch, Springer, Heidelberg, Germany (2004)
  • Bousquet, O. and A. Elisseeff (2002), Stability and Generalization, Journal of Machine Learning Research, 499-526.
  • Devroye L. , L. Gyorfi, and G. Lugosi (1996). A Probabilistic Theory of Pattern Recognition. Springer-Verlag. ISBN 978-0387946184.
  • Poggio T. and S. Smale. The Mathematics of Learning: Dealing with Data. Notices of the AMS, 2003
  • Vapnik, V. (2000). The Nature of Statistical Learning Theory. Information Science and Statistics. Springer-Verlag. ISBN 978-0-387-98780-4.