# Online machine learning

Online machine learning is used in the case where the data becomes available in a sequential fashion, in order to determine a mapping from the dataset to the corresponding labels. The key difference between online learning and batch learning (or "offline" learning) techniques, is that in online learning the mapping is updated after the arrival of every new datapoint in a scalable fashion, whereas batch techniques are used when one has access to the entire training dataset at once. Online learning could be used in the case of a process occurring in time, for example the value of a stock given its history and other external factors, in which case the mapping updates as time goes on and we get more and more samples.

Ideally in online learning, the memory needed to store the function remains constant even with added datapoints, since the solution computed at one step is updated when a new datapoint becomes available, after which that datapoint can then be discarded. For many formulations, for example nonlinear kernel methods, true online learning is not possible, though a form of hybrid online learning with recursive algorithms can be used. In this case, the space requirements are no longer guaranteed to be constant since it requires storing all previous datapoints, but the solution may take less time to compute with the addition of a new datapoint, as compared to batch learning techniques.

As in all machine learning problems, the goal of the algorithm is to minimize some performance criteria using a loss function. For example, with stock market prediction the algorithm may attempt to minimize the mean squared error between the predicted and true value of a stock. Another popular performance criterion is to minimize the number of mistakes when dealing with classification problems. In addition to applications of a sequential nature, online learning algorithms are also relevant in applications with huge amounts of data such that traditional learning approaches that use the entire data set in aggregate are computationally infeasible.

## A prototypical online supervised learning algorithm

In the setting of supervised learning, or learning from examples, we are interested in learning a function $f : X \to Y$, where $X$ is thought of as a space of inputs and $Y$ as a space of outputs, that predicts well on instances that are drawn from a joint probability distribution $p(x,y)$ on $X \times Y$. In this setting, we are given a loss function $V : Y \times Y \to \mathbb{R}$, such that $V(f(x), y)$ measures the difference between the predicted value $f(x)$ and the true value $y$. The ideal goal is to select a function $f \in \mathcal{H}$, where $\mathcal{H}$ is a space of functions called a hypothesis space, so as to minimize the expected risk:

$I[f] = \mathbb{E}[V(f(x), y)] = \int V(f(x), y)\,dp(x, y) \ .$

In reality, the learner never knows the true distribution $p(x,y)$ over instances. Instead, the learner usually has access to a training set of examples $(x_1, y_1), \ldots, (x_n, y_n)$ that are assumed to have been drawn i.i.d. from the true distribution $p(x,y)$. A common paradigm in this situation is to estimate a function $\hat{f}$ through empirical risk minimization or regularized empirical risk minimization (usually Tikhonov regularization). The choice of loss function here gives rise to several well-known learning algorithms such as regularized least squares and support vector machines.

The above paradigm is not well-suited to the online learning setting though, as it requires complete a priori knowledge of the entire training set. In the pure online learning approach, the learning algorithm should update a sequence of functions $f_1, f_2, \ldots$ in a way such that the function $f_{t+1}$ depends only on the previous function $f_t$ and the next data point $(x_t, y_t)$. This approach has low memory requirements in the sense that it only requires storage of a representation of the current function $f_t$ and the next data point $(x_t, y_t)$. A related approach that has larger memory requirements allows $f_{t+1}$ to depend on $f_t$ and all previous data points $(x_1, y_1), \ldots, (x_t, y_t)$. We focus solely on the former approach here, and we consider both the case where the data is coming from an infinite stream $(x_1, y_1), (x_2, y_2), \ldots$ and the case where the data is coming from a finite training set $(x_1, y_1), \ldots, (x_n, y_n)$, in which case the online learning algorithm may make multiple passes through the data.

### The algorithm and its interpretations

Here we outline a prototypical online learning algorithm in the supervised learning setting and we discuss several interpretations of this algorithm. For simplicity, consider the case where $X = \mathbb{R}^d$, $Y \subseteq \mathbb{R}$, and $\mathcal{H} = \{\langle w, \cdot\rangle : w \in \mathbb{R}^d\}$ is the set of all linear functionals from $X$ into $\mathbb{R}$, i.e. we are working with a linear kernel and functions $f \in \mathcal{H}$ can be identified with vectors $w \in \mathbb{R}^d$. Furthermore, assume that $V(\cdot, \cdot)$ is a convex, differentiable loss function. An online learning algorithm satisfying the low memory property discussed above consists of the following iteration:

$w_{t+1} \gets w_t - \gamma_t\nabla V(\langle w_t, x_t \rangle, y_t) \ ,$

where $w_1 \gets 0$, $\nabla V(\langle w_t, x_t \rangle, y_t)$ is the gradient of the loss for the next data point $(x_t, y_t)$ evaluated at the current linear functional $w_t$, and $\gamma_t > 0$ is a step-size parameter. In the case of an infinite stream of data, one can run this iteration, in principle, forever, and in the case of a finite but large set of data, one can consider a single pass or multiple passes (epochs) through the data.

Interestingly enough, the above simple iterative online learning algorithm has three distinct interpretations, each of which has distinct implications about the predictive quality of the sequence of functions $w_1, w_2, \ldots$. The first interpretation considers the above iteration as an instance of the stochastic gradient descent method applied to the problem of minimizing the expected risk $I[w]$ defined above.[1] Indeed, in the case of an infinite stream of data, since the examples $(x_1, y_1), (x_2, y_2), \ldots$ are assumed to be drawn i.i.d. from the distribution $p(x,y)$, the sequence of gradients of $V(\cdot, \cdot)$ in the above iteration are an i.i.d. sample of stochastic estimates of the gradient of the expected risk $I[w]$ and therefore one can apply complexity results for the stochastic gradient descent method to bound the deviation $I[w_t] - I[w^\ast]$, where $w^\ast$ is the minimizer of $I[w]$.[2] This interpretation is also valid in the case of a finite training set; although with multiple passes through the data the gradients are no longer independent, still complexity results can be obtained in special cases.

The second interpretation applies to the case of a finite training set and considers the above recursion as an instance of the incremental gradient descent method[3] to minimize the empirical risk:

$I_n[w] = \frac{1}{n}\sum_{i = 1}^nV(\langle w,x_i \rangle, y_i) \ .$

Since the gradients of $V(\cdot, \cdot)$ in the above iteration are also stochastic estimates of the gradient of $I_n[w]$, this interpretation is also related to the stochastic gradient descent method, but applied to minimize the empirical risk as opposed to the expected risk. Since this interpretation concerns the empirical risk and not the expected risk, multiple passes through the data are readily allowed and actually lead to tighter bounds on the deviations $I_n[w_t] - I_n[w^\ast_n]$, where $w^\ast_n$ is the minimizer of $I_n[w]$.

The third interpretation of the above recursion is distinctly different from the first two and concerns the case of sequential trials discussed above, where the data are potentially not i.i.d. and can perhaps be selected in an adversarial manner. At each step of this process, the learner is given an input $x_t$ and makes a prediction based on the current linear function $w_t$. Only after making this prediction does the learner see the true label $y_t$, at which point the learner is allowed to update $w_t$ to $w_{t+1}$. Since we are not making any distributional assumptions about the data, the goal here is to perform as well as if we could view the entire sequence of examples ahead of time; that is, we would like the sequence of functions $w_1, w_2, \ldots$ to have low regret relative to any vector $w^\ast$:

$R_T(w^\ast) = \sum_{t = 1}^TV(\langle w_t, x_t \rangle, y_t) - \sum_{t = 1}^TV(\langle w^\ast, x_t \rangle, y_t) \ .$

In this setting, the above recursion can be considered as an instance of the online gradient descent method for which there are complexity bounds that guarantee $O(\sqrt{T})$ regret.[4]

It should be noted that although the three interpretations of this algorithm yield complexity bounds in three distinct settings, each bound depends on the choice of step-size sequence $\{\gamma_t\}$ in a different way, and thus we cannot simultaneously apply the consequences of all three interpretations; we must instead select the step-size sequence in a way that is tailored for the interpretation that is most relevant. Furthermore, the above algorithm and these interpretations can be extended to the case of a nonlinear kernel by simply considering $X$ to be the feature space associated with the kernel. Although in this case the memory requirements at each iteration are no longer $O(d)$, but are rather on the order of the number of data points considered so far.

## Example: Complexity in the Case of Linear Least Squares

### Batch Learning

Let us consider the setting of supervised learning with the square loss function $V(\langle w,x_i\rangle,y_i)=(x_i^Tw-y_i)^2$, ($x_i \in \mathbb{R}^d$, $w_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$). The solution after the arrival of every datapoint $\{x_i,y_i\}$ is given by $w^*=(X^T X)^{-1}X^T Y$ where $X$ and $Y$ is built from the $i$ data points, with $X$ being $i$-by-$d$ and $Y$ being $i$-by-$1$. The solution of linear least squares problem is roughly $O(id^2)$.

If we have $n$ total points in the dataset and we have to recompute the solution after the arrival of every datapoint $i=1, \ldots, n$, we have a total complexity $O(n^2d^2)$. Here we assume that the matrix $X^T X$ is invertible, otherwise we can proceed in a similar fashion with Tikhonov regularization.

### Online Learning

The recursive least squares algorithm considers an online approach to the least squares problem. It can be shown that for suitable initializations of $w_0 \in \mathbb{R}^d$ and $\Gamma_0 \in \mathbb{R}^{dxd}$, the solution of the linear least squares problem given in the previous section can be computed by the following iteration:

$\Gamma_i=\Gamma_{i-1}-\frac{\Gamma_{i-1}x_i x_i^T \Gamma_{i-1}}{1+x_i^T\Gamma_{i-1}x_i}$
$w_i = w_{i-1}-\Gamma_ix_i(x_i^T w_{i-1}-y_i)$

For the proof, see RLS.

The complexity for $n$ steps of this algorithm is $O(nd^2)$, which is an order of magnitude faster than the corresponding batch learning complexity. The storage requirements at every step $i$ here are constant at $O(d^2)$, i.e. that of storing the matrix $\Gamma_i$.

If we now replace $w_i = w_{i-1}-\Gamma_ix_n(x_i^T w_{i-1}-y_i)$ by $w_i = w_{i-1}-\gamma_i x_i(x_i^T w_{i-1}-y_i)$ (i.e. replacing $\Gamma_i \in \mathbb{R}^{d\times d}$ by $\gamma_i \in \mathbb{R}$), we have a stochastic gradient descent algorithm. In this case, the complexity for $n$ steps of this algorithm reduces to $O(nd)$. The storage requirements at every step $i$ are constant at $O(d)$.

However, the stepsize $\gamma_i$ needs to be chosen carefully to solve the expected risk minimization problem, as detailed above.

## Books with substantial treatment of online machine learning

• Algorithmic Learning in a Random World by Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Published by Springer Science+Business Media, Inc. 2005 ISBN 0-387-00152-2

## References

1. ^ Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations". Online Learning and Neural Networks. Cambridge University Press. ISBN 978-0-521-65263-6
2. ^ Stochastic Approximation Algorithms and Applications, Harold J. Kushner and G. George Yin, New York: Springer-Verlag, 1997. ISBN 0-387-94916-X; 2nd ed., titled Stochastic Approximation and Recursive Algorithms and Applications, 2003, ISBN 0-387-00894-2.
3. ^ Bertsekas, D. P. (2011). Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. Optimization for Machine Learning, 85.
4. ^ Shalev-Shwartz, S. (2011). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 107-194.