Overfitting

From Wikipedia, the free encyclopedia

Noisy, roughly linear data is fitted with both a linear and a polynomial function. Although the polynomial passes through every data point and the line passes through only a few, the line is the better fit. If the two regression curves were used to extrapolate beyond the data, the overfitted polynomial would do much worse.

In statistics, overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. It generally occurs when a model is excessively complex relative to the amount of data available, for example when it has too many degrees of freedom. An overfitted model will generally have poor predictive performance, because it can exaggerate minor fluctuations in the data.
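The contrast between a simple and an excessively complex model can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the true relationship y = 2x + 1, the noise values, and the function names are all invented here. It fits a least-squares line and a degree-9 interpolating (Lagrange) polynomial to ten noisy points, then compares their mean squared error (MSE) on the training points and on new inputs.

```python
# Noisy samples of an assumed true relationship y = 2x + 1.
# The noise values are made up for illustration.
xs = list(range(10))
noise = [0.5, -0.8, 0.3, 1.1, -0.6, 0.2, -1.0, 0.7, -0.4, 0.9]
ys = [2 * x + 1 + e for x, e in zip(xs, noise)]


def fit_line(xs, ys):
    """Ordinary least-squares line; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial of degree n-1 through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


line = fit_line(xs, ys)
poly = fit_interpolating_polynomial(xs, ys)

# New inputs: midpoints between training points plus one extrapolated point.
new_xs = [x + 0.5 for x in range(9)] + [10.0]
new_ys = [2 * x + 1 for x in new_xs]  # noise-free values of the assumed truth

line_train, poly_train = mse(line, xs, ys), mse(poly, xs, ys)
line_new, poly_new = mse(line, new_xs, new_ys), mse(poly, new_xs, new_ys)
print(f"line: training MSE {line_train:.3g}, new-data MSE {line_new:.3g}")
print(f"poly: training MSE {poly_train:.3g}, new-data MSE {poly_new:.3g}")
```

The polynomial reproduces every training point, so its training error is zero, yet its error on the new inputs is far larger than that of the line, most dramatically at the extrapolated point.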

The potential for overfitting depends not only on the number of parameters and the amount of data, but also on how well the model structure conforms to the shape of the data, and on the magnitude of model error compared with the expected level of noise or error in the data.

Even when the fitted model does not have unusually many degrees of freedom, the fitted relationship can be expected to perform less well on a new data set than on the data set used for fitting.[1] In particular, the value of the coefficient of determination will tend to shrink relative to its value on the original training data.
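This shrinkage can be observed in a small simulation. The sketch below is an illustration only: the data-generating process (y = x plus Gaussian noise), the sample sizes, and the function names are assumptions made here. It repeatedly fits a straight line to ten points and compares the coefficient of determination (R²) on those points with R² on fresh data from the same process, averaged over many trials.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares line; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def r_squared(model, xs, ys):
    """Coefficient of determination of `model` on the data (xs, ys)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - model(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def sample(rng, n):
    """Draw n points from the assumed process y = x + noise."""
    xs = [rng.uniform(0, 3) for _ in range(n)]
    ys = [x + rng.gauss(0, 1) for x in xs]
    return xs, ys


rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 500
train_scores, new_scores = [], []
for _ in range(trials):
    train_x, train_y = sample(rng, 10)   # small data set used for fitting
    new_x, new_y = sample(rng, 200)      # fresh data from the same process
    model = fit_line(train_x, train_y)
    train_scores.append(r_squared(model, train_x, train_y))
    new_scores.append(r_squared(model, new_x, new_y))

mean_train = sum(train_scores) / trials
mean_new = sum(new_scores) / trials
print(f"mean R^2 on fitting data: {mean_train:.3f}")
print(f"mean R^2 on new data:     {mean_new:.3f}")
```

In any single trial the new-data R² can happen to exceed the training value; the shrinkage is a statement about what to expect on average, which is why the sketch averages over many trials.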

To avoid overfitting, it is necessary to use additional techniques (e.g. cross-validation, regularization, early stopping, Bayesian priors on parameters, or model comparison) that can indicate when further training is not resulting in better generalization.
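As one illustration of these techniques, the following sketch uses 5-fold cross-validation to choose the complexity of a model. Everything specific in it is an assumption made for the example: the model is k-nearest-neighbour regression (where k = 1 is the most flexible setting), the data are simulated from a sine curve with Gaussian noise, and the function names are invented here.

```python
import math
import random


def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mse(train, test, k):
    """Mean squared error on `test` of a k-NN model built from `train`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in test) / len(test)


def cross_val_mse(data, k, folds=5):
    """Mean held-out MSE over `folds` disjoint validation folds."""
    scores = []
    for f in range(folds):
        valid = data[f::folds]  # every folds-th point, starting at index f
        train = [p for i, p in enumerate(data) if i % folds != f]
        scores.append(mse(train, valid, k))
    return sum(scores) / folds


rng = random.Random(0)  # fixed seed so the run is repeatable
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    data.append((x, math.sin(x) + rng.gauss(0, 0.5)))

candidates = [1, 3, 5, 9, 15]
train_err = {k: mse(data, data, k) for k in candidates}
cv_err = {k: cross_val_mse(data, k) for k in candidates}
best_k = min(candidates, key=cv_err.get)

for k in candidates:
    print(f"k={k:2d}  training MSE {train_err[k]:.3f}  cross-validated MSE {cv_err[k]:.3f}")
print("k chosen by cross-validation:", best_k)
```

Training error alone would favour k = 1, because each training point is its own nearest neighbour and is therefore predicted perfectly. The cross-validated error is measured on points the model has not seen, so it penalizes that memorization.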

Machine learning

Overfitting/overtraining in supervised learning (e.g. a neural network). Training error is shown in blue, validation error in red. If the validation error increases while the training error steadily decreases, overfitting may have occurred.

The concept of overfitting is important in machine learning. Usually a learning algorithm is trained using some set of training examples, i.e. exemplary situations for which the desired output is known. The learner is assumed to reach a state where it will also be able to predict the correct output for other examples, thus generalizing to situations not presented during training (based on its inductive bias). However, especially in cases where learning was performed for too long or where training examples are scarce, the learner may adjust to very specific random features of the training data that have no causal relation to the target function. In this process of overfitting, performance on the training examples continues to improve while performance on unseen data becomes worse.
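Early stopping acts on exactly this pattern: training is halted once the error on held-out validation data stops improving. The helper below is a minimal sketch, not a standard API; the function name, the `patience` parameter, and the sequence of validation errors are all invented for illustration.

```python
def early_stopping_epoch(validation_errors, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation errors.

    Training stops at the first epoch that is `patience` or more epochs
    past the best validation error seen so far. `stop_epoch` is None if
    that never happens. In practice the model parameters from
    `best_epoch` would be the ones kept.
    """
    best_error = None
    best_epoch = None
    for epoch, error in enumerate(validation_errors):
        if best_error is None or error < best_error:
            best_error, best_epoch = error, epoch  # new best: reset the wait
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch               # waited long enough: stop
    return best_epoch, None


# Made-up validation errors, one per epoch: they fall, then rise again
# as the learner starts to fit noise in the training data.
validation_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]

best_epoch, stop_epoch = early_stopping_epoch(validation_errors, patience=2)
print("best epoch:", best_epoch)     # epoch 3 (validation error 0.50)
print("stopped at epoch:", stop_epoch)  # epoch 5
```

A larger `patience` tolerates longer plateaus in the validation error at the cost of extra training epochs.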

As a simple example, consider a database of retail purchases that includes the item bought, the purchaser, and the date and time of purchase. It's easy to construct a model that will fit the training set perfectly by using the date and time of purchase to predict the other attributes; but this model will not generalize at all to new data, because those past times will never occur again.
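The retail example can be made concrete with a toy sketch. The purchase records, the function names, and the most-common-item baseline used for comparison are all invented for illustration.

```python
from collections import Counter

# Hypothetical purchase records (timestamp, item), invented for illustration.
training = [
    ("2011-03-01 09:15", "milk"),
    ("2011-03-01 12:40", "bread"),
    ("2011-03-02 08:05", "coffee"),
    ("2011-03-02 17:30", "milk"),
]
new_data = [
    ("2011-03-03 09:15", "milk"),
    ("2011-03-03 13:10", "coffee"),
    ("2011-03-04 17:30", "milk"),
]

# Model 1: memorize which item was bought at each exact timestamp.
memorized = dict(training)


def lookup_model(timestamp):
    return memorized.get(timestamp)  # None for a timestamp never seen


# Model 2: ignore the timestamp and always predict the most common item.
most_common_item = Counter(item for _, item in training).most_common(1)[0][0]


def baseline_model(timestamp):
    return most_common_item


def accuracy(model, records):
    """Fraction of records whose item the model predicts correctly."""
    return sum(model(t) == item for t, item in records) / len(records)


for name, model in [("lookup", lookup_model), ("baseline", baseline_model)]:
    print(f"{name}: training accuracy {accuracy(model, training):.2f}, "
          f"new-data accuracy {accuracy(model, new_data):.2f}")
```

The lookup model is perfect on the training set and useless on new purchases, because none of the new timestamps has been seen before. The much cruder baseline fits the training set worse but still makes some correct predictions on new data.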

Generally, a learning algorithm is said to overfit relative to a simpler one if it is more accurate in fitting known data (hindsight) but less accurate in predicting new data (foresight). One can intuitively understand overfitting from the fact that information from all past experience can be divided into two groups: information that is relevant for the future and irrelevant information (“noise”). Everything else being equal, the more difficult a criterion is to predict (i.e., the higher its uncertainty), the more noise exists in past information that needs to be ignored. The problem is determining which part to ignore. A learning algorithm that can reduce the chance of fitting noise is called robust.

References

  1. Everitt, B.S. (2002). Cambridge Dictionary of Statistics. CUP. ISBN 0-521-81099-X (entry for "Shrinkage").
