User:Avalenzu/sandbox
In machine learning, early stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent. Such methods update the learner so as to make it better fit the training data with each iteration. Up to a point, this improves the learner's performance on data outside of the training set. Past that point, however, improving the learner's fit to the training data comes at the expense of increased generalization error. Early stopping rules provide guidance as to how many iterations can be run before the learner begins to over-fit. Early stopping rules have been employed in many different machine learning methods, with varying amounts of theoretical foundation.
Background
[edit]This section presents some of the basic machine-learning concepts required for a description of early stopping methods.
Overfitting
[edit]Machine learning algorithms train a model based on a finite set of training data. During this training, the model is evaluated based on how well it predicts the observations contained in the training set. In general, however, the goal of a machine learning scheme is to produce a model that generalizes, that is, that predicts previously unseen observations. Overfitting occurs when a model fits the data in the training set well, while incurring larger generalization error.
Regularization
[edit]Regularization, in the context of machine learning, refers to the process of modifying a learning algorithm so as to prevent overfitting. This generally involves imposing some sort of smoothness constraint on the learned model. [1] This smoothness may be enforced explicitly, by fixing the number of parameters in the model, or by augmenting the cost function as in Tikhonov regularization. Tikhonov regularization, along with principle component regression and many other regularization schemes, fall under the umbrella of spectral regularization, regularization characterized by the application of a filter. Early stopping also belongs to this class of methods.
Gradient Descent Methods
[edit]Gradient descent methods are first-order, iterative, optimization methods. Each iteration updates an approximate solution to the optimization problem by taking a step in the direction of the negative of the gradient of the objective function. By choosing the step-size appropriately, such a method can be made to converge to a local minimum of the objective function. Gradient descent is used in machine-learning by defining a loss function that reflects the error of learner on the training set and then minimizing that function.
Definition
[edit]Early stopping refers to any regularization (machine-learning) technique wherein an iterative machine-learning scheme is stopped prior to convergence so as to prevent overfitting.
Early stopping based on analytical results
[edit]Early stopping in statistical learning theory
[edit]Early-stopping can be used to regularize non-parametric regression problems encountered in machine learning. For a given input space, , output space, , and samples drawn from an unknown probability measure, , on , the goal of such problems is to approximate a regression function, , given by
- ,
where is the conditional distribution at induced by .[2] One common choice for approximating the regression function is to use functions from a reproducing kernel Hilbert space.[2] These spaces can be infinite dimensional, in which they can supply solutions that overfit training sets of arbitrary size. Regularization is, therefore, especially important for these methods. One way to regularize non-parametric regression problems is to apply an early stopping rule to an iterative procedure such as gradient descent.
The early stopping rules proposed for these problems are based on analysis of upper bounds on the generalization error as a function of the iteration number. They yield prescriptions for the number of iterations to run that can be computed prior to starting the solution process.[3] [4]
Example: Least-squares loss
[edit](Adapted from Yao, Rosasco and Caponnetto, 2007[3])
Let and . Given a set of samples
- ,
drawn independently from , minimize the functional
where, is a member of the reproducing kernel Hilbert space . That is, minimize the expected risk for a Least-squares loss function. Since depends on the unknown probability measure , it cannot be used for computation. Instead, consider the following empirical risk
Let and be the t-th iterates of gradient descent applied to the expected and empirical risks, respectively, where both iterations are initialized at the origin, and both use the step size . The form the population iteration, which converges to , but cannot be used in computation, while the form the sample iteration which usually converges to an overfitting solution.
We want to control the difference between the expected risk of the sample iteration and the minimum expected risk, that is, the expected risk of the regression function:
This difference can be rewritten as the sum of two terms: the difference in expected risk between the sample and population iterations and that between the population iteration and the regression function:
This equation presents a bias-variance tradeoff, which is then solved to give an optimal stopping rule that may depend on the unknown probability distribution. That rule has associated probabilistic bounds on the generalization error. For the analysis leading to the early stopping rule and bounds, the reader is referred to the original article. [3] In practice, data-driven methods, e.g. cross-validation can be used to obtain an adaptive stopping rule.
Early stopping in Boosting
[edit]Boosting refers to a family of algorithms in which a set of weak learners (learners that are only slightly correlated with the true process) are combined to produce a strong learner. It has been shown, for several boosting algorithms (including AdaBoost), that regularization via early stopping can provide guarantees of consistency, that is, that the result of the algorithm approaches the true solution as the number of samples goes to infinity.[5] [6] [7]
L2-Boosting
[edit]Boosting methods have close ties to the gradient descent methods described above can be regarded as a boosting method based on the loss: L2Boost[3].
Early stopping based on cross-validation
[edit]These early stopping rules work by splitting the original training set into a new training set and a validation set. The error on the validation set is used as a proxy for the generalization error in determining when overfitting has begun. These methods are most commonly employed in the training of neural networks. Prechelt gives the folowing summary of a naive implementation of cross-validation based early stopping as follows:[8]
- Split the training data into a training set and a validation set, e.g. in a 2-to-1 proportion.
- Train only on the training set and evaluate the per-example error on the validation set once in a while, e.g. after every fifth epoch.
- Stop training as soon as the error on the validation set is higher than it was the last time it was checked.
- Use the weights the network had in that previous step as the result of the training run.
— Lutz Prechelt, Early Stopping - But When?
This simple procedure is complicated in practice by the fact that the validation error may fluctuate during training, producing multiple local minima. This complication has led to the creation of many ad-hoc rules for deciding when overfitting has truly begun. [8]
References
[edit]- ^ Girosi, Federico; Jones, Michael; Poggio, Tomaso (1995-03-01). "Regularization Theory and Neural Networks Architectures". Neural Computation. 7 (2): 219–269. doi:10.1162/neco.1995.7.2.219. ISSN 0899-7667. S2CID 49743910. Retrieved 2013-12-14.
- ^ a b Smale, Steve; Zhou, Ding-Xuan (2007-08-01). "Learning Theory Estimates via Integral Operators and Their Approximations". Constructive Approximation. 26 (2): 153–172. doi:10.1007/s00365-006-0659-y. ISSN 0176-4276. S2CID 5977083. Retrieved 2013-12-15.
- ^ a b c d Yao, Yuan; Rosasco, Lorenzo; Caponnetto, Andrea (2007-08-01). "On Early Stopping in Gradient Descent Learning". Constructive Approximation. 26 (2): 289–315. doi:10.1007/s00365-006-0663-2. ISSN 0176-4276. S2CID 8323954. Retrieved 2013-12-05.
- ^ Raskutti, G. (2011). "Early stopping for non-parametric regression: An optimal data-dependent stopping rule". 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton). 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton). pp. 1318–1325. doi:10.1109/Allerton.2011.6120320. ISBN 978-1-4577-1817-5.
{{cite conference}}
: Cite has empty unknown parameters:|1=
,|2=
,|3=
,|4=
,|5=
,|6=
,|7=
,|8=
, and|9=
(help); Unknown parameter|coauthors=
ignored (|author=
suggested) (help) - ^ Wenxin Jiang (February 2004). "Process consistency for AdaBoost". The Annals of Statistics. 32 (1): 13–29. doi:10.1214/aos/1079120128. ISSN 0090-5364. Retrieved 2013-12-05.
- ^ Bühlmann, Peter; Yu, Bin (2003-06-01). "Boosting with the L₂ Loss: Regression and Classification". Journal of the American Statistical Association. 98 (462): 324–339. doi:10.1198/016214503000125. ISSN 0162-1459. JSTOR 30045243. S2CID 123059267. Retrieved 2013-12-15.
- ^ Zhang, Tong; Yu, Bin (2005-08-01). "Boosting with Early Stopping: Convergence and Consistency". The Annals of Statistics. 33 (4): 1538–1579. doi:10.1214/009053605000000255. ISSN 0090-5364. JSTOR 3448617. S2CID 13158356. Retrieved 2013-12-05.
- ^ a b Prechelt, Lutz (2012-01-01). "Early Stopping — But When?". In Grégoire Montavon, Klaus-Robert Müller (eds.) (ed.). Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science. Vol. 7700. Springer Berlin Heidelberg. pp. 53–67. doi:10.1007/978-3-642-35289-8_5. ISBN 978-3-642-35289-8. Retrieved 2013-12-15.
{{cite book}}
:|editor=
has generic name (help); Unknown parameter|coauthors=
ignored (|author=
suggested) (help)
ORIGINAL ARTICLE
[edit]This article needs additional citations for verification. (August 2010) |
In machine learning, early stopping is a form of regularization used when a machine learning model (such as a neural network) is trained by on-line gradient descent. In early stopping, the training set is split into a new training set and a validation set. Gradient descent is applied to the new training set. After each sweep through the new training set, the network is evaluated on the validation set. When the performance with the validation test stops improving, the algorithm halts. The network with the best performance on the validation set is then used for actual testing, with a separate set of data (the validation set is used in learning to decide when to stop).
This technique is a simple but efficient hack to deal with the problem of overfitting. Overfitting is a phenomenon in which a learning system, such as a neural network gets very good at dealing with one data set at the expense of becoming very bad at dealing with other data sets. Early stopping is effectively limiting the used weights in the network and thus imposes a regularization, effectively lowering the VC dimension.
Early stopping is a very common practice in neural network training and often produces networks that generalize well. However, while often improving the generalization it does not do so in a mathematically well-defined way.
Method
[edit]- Divide the available data into training and validation sets.
- Use a large number of hidden units.
- Use very small random initial values.
- Use a slow learning rate.
- Compute the validation error rate periodically during training.
- Stop training when the validation error rate "starts to go up".
It is crucial to realize that the validation error is not a good estimate of the generalization error. One method for getting an unbiased estimate of the generalization error is to run the net on a third set of data, the test set, that is not used at all during the training process. The error on the test set gives estimate on generalization; to have the outputs of the net approximate target values given inputs that are not in the training set.
Advantages
[edit]Early stopping has several advantages:
- It is fast.
- It can be applied successfully to networks in which the number of weights far exceeds the sample size.
- It requires only one major decision by the user: what proportion of validation cases to use.
Issues
[edit]- It's not clear on how many cases to assign to the training and validation sets
- The result might highly depend on the algorithm which is used to split the data into training and validation set
- Notion of "increasing validation error" is ambiguous; it may go up and down numerous times during training. The safest approach is to train to convergence, then determine which iteration had the lowest validation error. This impairs fast training, one of the advantages of early stopping.
See also
[edit]- Overfitting, early stopping is one of methods used to prevent overfitting
- Cross-validation, in particular using a "Validation Set"
- Generalization error
References
[edit]External links
[edit]- What is early stopping? Internet FAQ Archives