Online machine learning
|Machine learning and
Online machine learning is used in the case where the data becomes available in a sequential fashion, in order to determine a mapping from the dataset to the corresponding labels. The key difference between online learning and batch learning (or "offline" learning) techniques, is that in online learning the mapping is updated after the arrival of every new datapoint in a scalable fashion, whereas batch techniques are used when one has access to the entire training dataset at once. Online learning could be used in the case of a process occurring in time, for example the value of a stock given its history and other external factors, in which case the mapping updates as time goes on and we get more and more samples.
Ideally in online learning, the memory needed to store the function remains constant even with added datapoints, since the solution computed at one step is updated when a new datapoint becomes available, after which that datapoint can then be discarded. For many formulations, for example nonlinear kernel methods, true online learning is not possible, though a form of hybrid online learning with recursive algorithms can be used. In this case, the space requirements are no longer guaranteed to be constant since it requires storing all previous datapoints, but the solution may take less time to compute with the addition of a new datapoint, as compared to batch learning techniques.
As in all machine learning problems, the goal of the algorithm is to minimize some performance criteria using a loss function. For example, with stock market prediction the algorithm may attempt to minimize the mean squared error between the predicted and true value of a stock. Another popular performance criterion is to minimize the number of mistakes when dealing with classification problems. In addition to applications of a sequential nature, online learning algorithms are also relevant in applications with huge amounts of data such that traditional learning approaches that use the entire data set in aggregate are computationally infeasible.
- 1 A prototypical online supervised learning algorithm
- 2 Example: Complexity in the Case of Linear Least Squares
- 3 Books with substantial treatment of online machine learning
- 4 See also
- 5 References
- 6 External links
A prototypical online supervised learning algorithm
In the setting of supervised learning, or learning from examples, we are interested in learning a function , where is thought of as a space of inputs and as a space of outputs, that predicts well on instances that are drawn from a joint probability distribution on . In this setting, we are given a loss function , such that measures the difference between the predicted value and the true value . The ideal goal is to select a function , where is a space of functions called a hypothesis space, so as to minimize the expected risk:
In reality, the learner never knows the true distribution over instances. Instead, the learner usually has access to a training set of examples that are assumed to have been drawn i.i.d. from the true distribution . A common paradigm in this situation is to estimate a function through empirical risk minimization or regularized empirical risk minimization (usually Tikhonov regularization). The choice of loss function here gives rise to several well-known learning algorithms such as regularized least squares and support vector machines.
The above paradigm is not well-suited to the online learning setting though, as it requires complete a priori knowledge of the entire training set. In the pure online learning approach, the learning algorithm should update a sequence of functions in a way such that the function depends only on the previous function and the next data point . This approach has low memory requirements in the sense that it only requires storage of a representation of the current function and the next data point . A related approach that has larger memory requirements allows to depend on and all previous data points . We focus solely on the former approach here, and we consider both the case where the data is coming from an infinite stream and the case where the data is coming from a finite training set , in which case the online learning algorithm may make multiple passes through the data.
The algorithm and its interpretations
Here we outline a prototypical online learning algorithm in the supervised learning setting and we discuss several interpretations of this algorithm. For simplicity, consider the case where , , and is the set of all linear functionals from into , i.e. we are working with a linear kernel and functions can be identified with vectors . Furthermore, assume that is a convex, differentiable loss function. An online learning algorithm satisfying the low memory property discussed above consists of the following iteration:
where , is the gradient of the loss for the next data point evaluated at the current linear functional , and is a step-size parameter. In the case of an infinite stream of data, one can run this iteration, in principle, forever, and in the case of a finite but large set of data, one can consider a single pass or multiple passes (epochs) through the data.
Interestingly enough, the above simple iterative online learning algorithm has three distinct interpretations, each of which has distinct implications about the predictive quality of the sequence of functions . The first interpretation considers the above iteration as an instance of the stochastic gradient descent method applied to the problem of minimizing the expected risk defined above. Indeed, in the case of an infinite stream of data, since the examples are assumed to be drawn i.i.d. from the distribution , the sequence of gradients of in the above iteration are an i.i.d. sample of stochastic estimates of the gradient of the expected risk and therefore one can apply complexity results for the stochastic gradient descent method to bound the deviation , where is the minimizer of . This interpretation is also valid in the case of a finite training set; although with multiple passes through the data the gradients are no longer independent, still complexity results can be obtained in special cases.
The second interpretation applies to the case of a finite training set and considers the above recursion as an instance of the incremental gradient descent method to minimize the empirical risk:
Since the gradients of in the above iteration are also stochastic estimates of the gradient of , this interpretation is also related to the stochastic gradient descent method, but applied to minimize the empirical risk as opposed to the expected risk. Since this interpretation concerns the empirical risk and not the expected risk, multiple passes through the data are readily allowed and actually lead to tighter bounds on the deviations , where is the minimizer of .
The third interpretation of the above recursion is distinctly different from the first two and concerns the case of sequential trials discussed above, where the data are potentially not i.i.d. and can perhaps be selected in an adversarial manner. At each step of this process, the learner is given an input and makes a prediction based on the current linear function . Only after making this prediction does the learner see the true label , at which point the learner is allowed to update to . Since we are not making any distributional assumptions about the data, the goal here is to perform as well as if we could view the entire sequence of examples ahead of time; that is, we would like the sequence of functions to have low regret relative to any vector :
In this setting, the above recursion can be considered as an instance of the online gradient descent method for which there are complexity bounds that guarantee regret.
It should be noted that although the three interpretations of this algorithm yield complexity bounds in three distinct settings, each bound depends on the choice of step-size sequence in a different way, and thus we cannot simultaneously apply the consequences of all three interpretations; we must instead select the step-size sequence in a way that is tailored for the interpretation that is most relevant. Furthermore, the above algorithm and these interpretations can be extended to the case of a nonlinear kernel by simply considering to be the feature space associated with the kernel. Although in this case the memory requirements at each iteration are no longer , but are rather on the order of the number of data points considered so far.
Example: Complexity in the Case of Linear Least Squares
Let us consider the setting of supervised learning with the square loss function , (, , ). The solution after the arrival of every datapoint is given by where and is built from the data points, with being -by- and being -by-. The solution of linear least squares problem is roughly .
If we have total points in the dataset and we have to recompute the solution after the arrival of every datapoint , we have a total complexity . Here we assume that the matrix is invertible, otherwise we can proceed in a similar fashion with Tikhonov regularization.
The recursive least squares algorithm considers an online approach to the least squares problem. It can be shown that for suitable initializations of and , the solution of the linear least squares problem given in the previous section can be computed by the following iteration:
For the proof, see RLS.
The complexity for steps of this algorithm is , which is an order of magnitude faster than the corresponding batch learning complexity. The storage requirements at every step here are constant at , i.e. that of storing the matrix .
Stochastic Gradient Descent
If we now replace by (i.e. replacing by ), we have a stochastic gradient descent algorithm. In this case, the complexity for steps of this algorithm reduces to . The storage requirements at every step are constant at .
However, the stepsize needs to be chosen carefully to solve the expected risk minimization problem, as detailed above.
Books with substantial treatment of online machine learning
- Algorithmic Learning in a Random World by Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Published by Springer Science+Business Media, Inc. 2005 ISBN 0-387-00152-2
- Prediction, learning, and games by Nicolò Cesa-Bianchi and Gábor Lugosi. Cambridge University Press, 2006 ISBN 0-521-84108-9
- Hierarchical temporal memory
- k-nearest neighbor algorithm
- Lazy learning
- Learning Vector Quantization
- Offline learning, the opposite model
- Online algorithm
- Streaming Algorithm
- Stochastic gradient descent
- Supervised learning
- Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations". Online Learning and Neural Networks. Cambridge University Press. ISBN 978-0-521-65263-6
- Stochastic Approximation Algorithms and Applications, Harold J. Kushner and G. George Yin, New York: Springer-Verlag, 1997. ISBN 0-387-94916-X; 2nd ed., titled Stochastic Approximation and Recursive Algorithms and Applications, 2003, ISBN 0-387-00894-2.
- Bertsekas, D. P. (2011). Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. Optimization for Machine Learning, 85.
- Shalev-Shwartz, S. (2011). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 107-194.
- http://onlineprediction.net/, Wiki for On-Line Prediction.