Online machine learning

From Wikipedia, the free encyclopedia

In computer science, online machine learning is a method of machine learning in which data becomes available in sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques, which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique in areas of machine learning where it is computationally infeasible to train over the entire dataset, necessitating out-of-core algorithms. It is also used in situations where the algorithm must dynamically adapt to new patterns in the data, or where the data itself is generated as a function of time, e.g. stock price prediction.

Two general modelling strategies exist for online learning models: statistical learning models and adversarial models. In statistical learning models (e.g. stochastic gradient descent, perceptrons), the data samples are assumed to be independent and identically distributed random variables (i.e. they do not change with time), and the algorithm merely has limited access to the data. In adversarial models, we instead view the learning problem as a game between two players (the learner vs. the data generator), and we try to minimize our losses regardless of the move played by the other player. In this model, the opponent is allowed to dynamically adapt the generated data based on the output of the learning algorithm. Spam filtering falls in this category, as the adversary will dynamically generate new spam based on the current behavior of the spam detector. Examples of algorithms in this model include follow the leader, follow the regularised leader, etc.

Introduction

In the setting of supervised learning, we are interested in learning a function  f : X \to Y, where X is thought of as a space of inputs and Y as a space of outputs, that predicts well on instances that are drawn from a joint probability distribution p(x,y) on X \times Y. In reality, the learner never knows the true distribution p(x,y) over instances. Instead, the learner usually has access to a training set of examples (x_1, y_1), \ldots, (x_n, y_n). In this setting, we are given a loss function V : Y \times Y \to \mathbb{R}, such that  V(f(x), y) measures the difference between the predicted value f(x) and the true value y. The ideal goal is to select a function f \in \mathcal{H}, where \mathcal{H} is a space of functions called a hypothesis space, so that we minimize some notion of total loss. Depending on the type of model (statistical or adversarial), one can devise different notions of loss, which lead to different learning algorithms.

Statistical Learning Models

In statistical learning models, the training samples  (x_i,y_i) are assumed to have been drawn i.i.d. from the true distribution p(x,y), and we try to minimize the expected "risk"

I[f] = \mathbb{E}[V(f(x), y)] = \int V(f(x), y)\,dp(x, y) \ .

A common paradigm in this situation is to estimate a function \hat{f} through empirical risk minimization or regularized empirical risk minimization (usually Tikhonov regularization). The choice of loss function here gives rise to several well-known learning algorithms such as regularized least squares and support vector machines. For the case of online learning, the data is still assumed to be i.i.d., but not all of it is accessible at once. A purely online model in this category would learn based on just the new input (x_{t+1},y_{t+1}), the current best predictor  f_{t} and some extra stored information (which is usually expected to have storage requirements independent of training data size). For many formulations, for example nonlinear kernel methods, true online learning is not possible, though a form of hybrid online learning with recursive algorithms can be used, where f_{t+1} is allowed to depend on f_t and all previous data points (x_1, y_1), \ldots, (x_t, y_t). In this case, the space requirements are no longer guaranteed to be constant, since all previous data points must be stored, but the solution may take less time to compute with the addition of a new data point than batch learning techniques.

An important generalisation of these techniques is the mini-batch technique, which processes a small batch of  b \ge 1 data points at a time, but can be considered an online algorithm for  b much smaller than the total number of training points. Mini-batch techniques are used with repeated passes over the training data (so-called incremental methods) to obtain optimised out-of-core versions of machine learning algorithms, e.g. stochastic gradient descent. When combined with backpropagation, this is currently the de facto method for training artificial neural networks.
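As an illustrative sketch (not taken from the cited references), mini-batch SGD for the square loss can be written as follows; the function name, step size, and batch size are our own choices:

```python
import numpy as np

def minibatch_sgd(X, y, b=10, lr=0.01, epochs=5):
    """Mini-batch SGD for the square loss; b is the batch size."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):                    # repeated passes over the data
        for start in range(0, n, b):
            Xb, yb = X[start:start + b], y[start:start + b]
            # gradient of the mean squared loss over the batch
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)
            w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                                 # noiseless targets for illustration
w = minibatch_sgd(X, y, b=20, lr=0.05, epochs=50)
```

With b equal to the full dataset size this reduces to batch gradient descent, and with b = 1 to purely online updates.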

Example: Linear Least Squares

We use the simple example of linear least squares to explain a variety of ideas in online learning. The ideas are general enough to apply to other settings, e.g. with other convex loss functions.

Batch Learning

Let us consider the setting of supervised learning with the square loss function. Thus, we are trying to minimise the empirical loss

 I_n[w] = \sum_{j=1}^{n} V(\langle w,x_j\rangle,y_j) = \sum_{j=1}^{n} (x_j^Tw-y_j)^2 where
 x_j \in \mathbb{R}^d, w \in \mathbb{R}^d, y_j \in \mathbb{R}.

Let X be the  i \times d data matrix and Y the  i \times 1 matrix of target values after the arrival of the first i data points. Assuming that the covariance matrix  \Sigma_i = X^T X is invertible (otherwise we can proceed in a similar fashion with Tikhonov regularization), the best solution  f^*(x) = \langle w^*, x \rangle to the linear least squares problem is given by

 w^* = (X^TX)^{-1}X^TY = \Sigma_i^{-1} \sum_{j=1}^{i} x_j y_j .

Now, calculating the covariance matrix  \Sigma_i = \sum_{j=1}^{i} x_j x_j^T takes time  O(id^2) , inverting the d \times d matrix takes time O(d^3), while the remaining multiplications take time O(d^2), giving a total time of O(id^2 + d^3). If we have n total points in the dataset and we have to recompute the solution after the arrival of every data point i=1, \ldots, n, the naive approach has a total complexity of O(n^2d^2 + nd^3). Note that if we store the matrix  \Sigma_i , then updating it at each step requires only adding  x_{i+1}x_{i+1}^T , which takes  O(d^2) time, reducing the total time to O(nd^2 + nd^3) = O(nd^3), at the cost of an additional  O(d^2) of storage space for  \Sigma_i .[1]
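The bookkeeping described above can be sketched as follows (an illustrative example, with dimensions and data of our own choosing): the running matrices \Sigma_i and X^T Y are updated incrementally instead of being recomputed from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
Sigma = np.zeros((d, d))       # running covariance matrix X^T X
b = np.zeros(d)                # running X^T Y
w_true = rng.normal(size=d)
for i in range(50):
    x = rng.normal(size=d)
    y = x @ w_true
    Sigma += np.outer(x, x)    # rank-one update, O(d^2)
    b += x * y                 # O(d)
    if i + 1 > d:              # wait until Sigma is (generically) invertible
        # O(d^3) solve per step instead of rebuilding X^T X in O(id^2)
        w = np.linalg.solve(Sigma, b)
```

Each arrival costs O(d^2) for the update plus O(d^3) for the solve, matching the O(nd^3) total above.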

Online Learning: Recursive Least Squares

The recursive least squares algorithm considers an online approach to the least squares problem. It can be shown that by initialising  \textstyle w_0 = 0 \in \mathbb{R}^d and \textstyle \Gamma_0 = I \in \mathbb{R}^{d \times d}, the solution of the linear least squares problem given in the previous section can be computed by the following iteration:

 \Gamma_i=\Gamma_{i-1}-\frac{\Gamma_{i-1}x_i x_i^T \Gamma_{i-1}}{1+x_i^T\Gamma_{i-1}x_i}
w_i = w_{i-1}-\Gamma_ix_i(x_i^T w_{i-1}-y_i)

The above iteration algorithm can be proved by induction on  i .[2] The proof also shows that  \Gamma_i = \Sigma_i^{-1} . One can also view RLS in the context of adaptive filters (see RLS).

The complexity for n steps of this algorithm is O(nd^2), which is an order of magnitude faster than the corresponding batch learning complexity. The storage requirement at every step i is that of the matrix \Gamma_i, which is constant at O(d^2). For the case when  \Sigma_i is not invertible, we consider the regularised version of the loss function  \sum_{j=1}^{n} (x_j^Tw - y_j)^2 + \lambda || w ||_2^2 . Then, it is easy to show that the same algorithm works with  \Gamma_0 = (I + \lambda I)^{-1} , and the iterations give  \Gamma_i = (\Sigma_i + \lambda I)^{-1} .[1]
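A minimal sketch of the recursion above (our own illustrative data; the rank-one update is an instance of the Sherman–Morrison formula, so each step costs O(d^2)):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
w_true = rng.normal(size=d)
Gamma = np.eye(d)                  # Γ_0 = I
w = np.zeros(d)                    # w_0 = 0
xs, ys = [], []
for _ in range(100):
    x = rng.normal(size=d)
    y = x @ w_true
    xs.append(x)
    ys.append(y)
    # Γ_i = Γ_{i-1} - Γ_{i-1} x x^T Γ_{i-1} / (1 + x^T Γ_{i-1} x), O(d^2)
    Gx = Gamma @ x
    Gamma = Gamma - np.outer(Gx, Gx) / (1.0 + x @ Gx)
    w = w - Gamma @ x * (x @ w - y)

# Note: with Γ_0 = I the recursion maintains Γ_i = (I + Σ_i)^{-1}, so the
# iterate matches the λ = 1 ridge solution (a mild regularisation of the
# exact least squares solution).
X, Y = np.array(xs), np.array(ys)
w_batch = np.linalg.solve(np.eye(d) + X.T @ X, X.T @ Y)
```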

Stochastic gradient descent

If we now replace

\textstyle w_i = w_{i-1}-\Gamma_ix_i(x_i^T w_{i-1}-y_i)

by

 \textstyle w_i = w_{i-1}-\gamma_i x_i(x_i^T w_{i-1}-y_i) = w_{i-1} - \gamma_i \nabla V(\langle w_{i-1}, x_i \rangle, y_i)

i.e. we replace \Gamma_i \in \mathbb{R}^{d\times d} by \gamma_i \in \mathbb{R}, we obtain the stochastic gradient descent algorithm. In this case, the complexity for n steps of this algorithm reduces to O(nd). The storage requirement at every step i is constant at O(d).

However, the stepsize \gamma_i needs to be chosen carefully to solve the expected risk minimization problem, as detailed above. By choosing a decaying step size  \gamma_i \approx \frac{1}{\sqrt{i}}, one can prove the convergence of the average iterate  \overline{w}_n = \frac{1}{n} \sum_{i=1}^{n} w_i . This setting is a special case of stochastic optimization, a well-known problem in optimization.[1]
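A sketch of the SGD recursion with a decaying step size and iterate averaging (illustrative constants of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
w_true = np.array([0.5, -1.0, 2.0])
w = np.zeros(d)
w_sum = np.zeros(d)
n = 20000
for i in range(1, n + 1):
    x = rng.normal(size=d)
    y = x @ w_true + 0.1 * rng.normal()    # noisy i.i.d. data stream
    gamma = 0.1 / np.sqrt(i)               # decaying step size γ_i ≈ 1/sqrt(i)
    w = w - gamma * x * (x @ w - y)        # O(d) work and storage per step
    w_sum += w
w_bar = w_sum / n                          # averaged iterate
```

Note how nothing but the current iterate (and the running sum for averaging) is stored, in contrast to RLS.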

Incremental SGD

In practice, one can perform multiple stochastic gradient passes (also called cycles or epochs) over the data. The algorithm thus obtained is called the incremental gradient method and corresponds to the iteration

 \textstyle w_i = w_{i-1} - \gamma_i \nabla V(\langle w_{i-1}, x_{t_i} \rangle, y_{t_i})

The main difference from the stochastic gradient method is that here a sequence  t_i is chosen to decide which training point is visited at the  i -th step. Such a sequence can be stochastic or deterministic. The number of iterations is then decoupled from the number of points (each point can be considered more than once). The incremental gradient method can be shown to provide a minimizer to the empirical risk.[3] Incremental techniques can be advantageous when considering objective functions made up of a sum of many terms, e.g. an empirical error corresponding to a very large dataset.[1]
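The incremental iteration above can be sketched with a deterministic cyclic sequence t_i over a fixed finite training set (an illustrative example; the data and step sizes are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, 0.0, -1.0])
y = X @ w_true
w = np.zeros(d)
step = 0
for epoch in range(20):                    # multiple passes over the same data
    for t in range(n):                     # deterministic sequence t_i cycles 0..n-1
        step += 1
        gamma = 0.1 / np.sqrt(step)
        w = w - gamma * X[t] * (X[t] @ w - y[t])
```

The number of updates (20 epochs × n points) is decoupled from the dataset size, and the iterate approaches the empirical risk minimizer.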

Kernel Methods

See also: Kernel method

Kernels can be used to extend the above algorithms to non-parametric models (or models where the parameters form an infinite dimensional space). The corresponding procedure will no longer be truly online and instead involve storing all the data points, but is still faster than the brute force method. We restrict our discussion to the case of the square loss, though it can be extended to any convex loss. It can be shown by an easy induction [1] that if  X_i is the data matrix and  w_i is the output after  i steps of the SGD algorithm, then we have

 w_i = X_i^T c_i

where  \textstyle c_i = ((c_i)_1, (c_i)_2, ..., (c_i)_i) \in \mathbb{R}^i and the sequence  c_i satisfies the recursion:

 c_0 = 0
 (c_i)_j = (c_{i-1})_j, j=1,2,...,i-1 and
 (c_i)_i = \gamma_i \Big(y_i - \sum_{j=1}^{i-1} (c_{i-1})_j\langle x_j, x_i \rangle\Big)

Notice that here  \langle x_j, x_i \rangle is just the standard kernel on  \mathbb{R}^d , and the predictor is of the form

 f_i(x) = \langle w_{i-1},x \rangle = \sum_{j=1}^{i-1} (c_{i-1})_j \langle x_j,x \rangle .

Now, if we instead introduce a general kernel  K and let our predictor be

 f_i(x) = \sum_{j=1}^{i-1} (c_{i-1})_j K(x_j,x)

then the same proof will also show that the predictor minimising the least squares loss is obtained by changing the above recursion to

 (c_i)_i = \gamma_i \Big(y_i - \sum_{j=1}^{i-1}(c_{i-1})_j K(x_j,x_i)\Big)

We can see that the above expression requires storing all the data for updating  c_i . The total time complexity for the recursion when evaluating for the  n -th datapoint is  O(n^2 d k) , where  k is the cost of evaluating the kernel on a single pair of points.[1] Thus, the use of the kernel has allowed us to move from a finite dimensional parameter space  \textstyle w_{i} \in \mathbb{R}^d to a possibly infinite dimensional feature space represented by the kernel  K , by instead performing the recursion on the space of parameters  \textstyle c_{i} \in \mathbb{R}^i , whose dimension is the same as the size of the training dataset. In general, this is a consequence of the Representer theorem.[1]
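The kernelised recursion can be sketched as follows. This is an illustrative example: the Gaussian kernel and the nonlinear target are our own choices; any positive-definite kernel works.

```python
import numpy as np

def K(a, b):
    """Gaussian kernel — an illustrative choice of p.d. kernel."""
    return np.exp(-np.sum((a - b) ** 2))

rng = np.random.default_rng(0)
xs, c = [], []
for i in range(1, 51):
    x = rng.normal(size=2)
    # current predictor f_i(x) = sum_j (c_{i-1})_j K(x_j, x)
    pred = sum(cj * K(xj, x) for cj, xj in zip(c, xs))
    y = np.sin(x[0])                       # some nonlinear target (our choice)
    gamma = 0.5 / np.sqrt(i)
    c.append(gamma * (y - pred))           # new coefficient (c_i)_i; older ones unchanged
    xs.append(x)                           # all data points must be stored
```

The cost of step i is dominated by the i - 1 kernel evaluations in the prediction, giving the quadratic total cost noted above.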

Adversarial Models: Sequential Learning

See also: Sequential game

In sequential learning, we can think of our learning problem as a game between two players (the learner vs nature), and we are trying to minimize our losses regardless of the move played by the other player. The game proceeds as follows.

For  t= 1, 2, ..., T

  • Learner receives an input  x_t \in X
  • Learner outputs prediction  p_t = f_t(x_t) \in Y
  • Nature looks at the output  p_t and sends the learner the true label  y_t \in Y
  • Learner suffers loss V(p_t,y_t) and updates its model.

Since we are not making any distributional assumptions about the data, the goal here is to perform as well as if we could view the entire sequence of examples ahead of time. Let  f^\ast \in H be the hypothesis that achieves the least loss for this sequence, i.e. the minimiser of  \sum_{t = 1}^T V(f(x_t), y_t) . We can think of this as the benchmark to beat, and thus we would like the sequence of functions f_1, f_2, \ldots to have a low loss relative to it. It is customary to call this the regret on the hypothesis set  H . Thus, for sequential learning, the learner is trying to minimise the regret

 R_T(H) = \sum_{t = 1}^TV(p_t, y_t) - \min_{f \in H} \sum_{t = 1}^TV(f(x_t), y_t)

We thus require the learner to be competitive with the best fixed predictor from H. In adversarial models, the members of the hypothesis set are also called experts.

If we impose no additional constraints, then one can prove Cover's impossibility result, which states that there is a hypothesis set H such that, for any online learning algorithm, the regret is at least linear in  T .[4] For learning to be feasible, however, we would like to obtain a sublinear bound on the regret, so that the average regret goes to  0 as  T \rightarrow \infty . One way to do so is to add the realisability constraint, which states that there exists a fixed hypothesis in H generating the target values. In this case, one can show that the regret R_T is bounded by  \log_2 |H| .[5] However, realisability is usually too strong an assumption. Another way to bound the regret is to move to the setup of online convex optimisation, which we look at next.

Online convex optimisation

In OCO, we force the hypothesis set and the loss functions to be convex to obtain stronger learning bounds. The modified sequential game is now as follows:

For  t = 1,2,...,T

  • Learner receives input  x_t
  • Learner outputs  w_t from a fixed convex set  S
  • Nature sends back a convex loss function  v_t : S \rightarrow \mathbb{R} .
  • Learner suffers loss v_t(w_t) and updates its model

Thus, when we minimise regret, we are now competing against the best fixed weight vector  u \in S . As an example, consider the case of online least squares linear regression. Here, the weight vectors come from the convex set  S = \mathbb{R}^d , and nature sends back the convex loss function  v_t(w) = ( \langle w,x_t \rangle - y_t )^2 . Note here that  y_t is implicitly sent with  v_t .

Some online prediction problems, however, cannot fit into the framework of OCO. For example, in online classification, the prediction domain and the loss functions are not convex. In such scenarios, two simple techniques for convexification are convexification by randomisation and convexification by use of surrogate loss functions.[4]

Let us now consider some simple online convex optimisation algorithms:

Follow the leader (FTL)

The simplest learning rule to try is to select (at the current step) the hypothesis that has the least loss over all past rounds. This algorithm is called Follow the leader, and is simply given by:

In round  t , set

 w_t = \operatorname*{arg\,min}_{w \in S} \sum_{i=1}^{t-1} v_i(w)

Here, ties are broken arbitrarily. This method can thus be viewed as a greedy algorithm. For the case of online quadratic optimization (where the loss function is  v_t(w) = || w - x_t ||_2^2 ), one can show a regret bound that grows as  \log(T) .[4] However, similar bounds cannot be obtained for the FTL algorithm for other important families of models like online linear optimization. To do so, one modifies FTL by adding regularisation.
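For online quadratic optimization, the FTL argmin has a closed form: the minimiser of \sum_i || w - x_i ||_2^2 is the mean of the past points, so FTL simply tracks the running average. A sketch (the data distribution is our own illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
w = np.zeros(d)                # arbitrary play in round 1 (no history yet)
total = np.zeros(d)
cum_loss = 0.0
points = []
for t in range(1, 101):
    x_t = rng.normal(size=d) + np.array([1.0, -1.0])   # nature's point
    cum_loss += np.sum((w - x_t) ** 2)                 # learner suffers loss
    points.append(x_t)
    total += x_t
    w = total / t              # FTL: argmin of past losses = running mean
best = np.mean(points, axis=0)                         # best fixed w in hindsight
best_loss = sum(np.sum((best - x) ** 2) for x in points)
regret = cum_loss - best_loss
```

The regret here stays small relative to T = 100, consistent with the logarithmic bound mentioned above.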

Follow the regularised leader (FTRL)

This is a natural modification of FTL that is used to stabilise the FTL solutions and obtain better regret bounds. We choose a regularisation function  R : S \rightarrow \mathbb{R} and then perform our learning as follows:

In round  t , set

  w_t = \operatorname*{arg\,min}_{w \in S} \sum_{i=1}^{t-1}v_i(w) + R(w)

As a special example, consider the case of online linear optimisation, i.e. where nature sends back loss functions of the form  v_t(w) = \langle w,z_t \rangle . Also, let  S = \mathbb{R}^d . Suppose we choose the regularisation function  R(w) = \frac{1}{2 \eta} ||w||_2^2 for some positive number  \eta . Then, one can show that the regret-minimising iteration becomes[4]

 w_{t+1} = - \eta \sum_{i=1}^{t} z_i = w_t - \eta z_t

Note that this can be rewritten as  w_{t+1} = w_t - \eta \nabla v_t(w_t) , which looks exactly like online gradient descent. If  S is instead some convex subspace of  \mathbb{R}^d , we would need to project onto  S , leading to the modified update rule

 w_{t+1} \in \Pi_S(- \eta \sum_{i=1}^{t} z_i) = \Pi_S(\eta \theta_{t+1})

This algorithm is known as lazy projection, as the vector  \theta_{t+1} accumulates the gradients. It is also known as Nesterov’s dual averaging algorithm. In this scenario of linear loss functions and quadratic regularisation, the regret is bounded by  O(\sqrt{T}) , and thus the average regret goes to  0 as desired.[5]
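The lazy projection update can be sketched as follows. This is an illustrative example: we take S to be the unit Euclidean ball (our choice), for which the projection is a simple rescaling, and feed a fixed linear loss each round.

```python
import numpy as np

def project_ball(v):
    """Euclidean projection onto the unit ball (our choice of S)."""
    n = np.linalg.norm(v)
    return v if n <= 1.0 else v / n

eta = 0.1
theta = np.zeros(2)                        # accumulated gradients
w = np.zeros(2)
for t in range(100):
    z_t = np.array([1.0, -0.5])            # nature's linear loss v_t(w) = <w, z_t>
    theta += z_t
    w = project_ball(-eta * theta)         # lazy projection / dual averaging
```

The iterate converges to the point of S that minimises \langle w, z \rangle, i.e. the unit vector opposite z.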

Online subgradient descent (OSD)

The above proved a regret bound for linear loss functions  v_t(w) = \langle w, z_t \rangle . To generalise the algorithm to any convex loss function, the subgradient  \partial v_t(w_t) of  v_t is used as a linear approximation to  v_t near  w_t , leading to the online subgradient descent algorithm:

Initialise parameter  \eta, w_1 = 0

For  t = 1,2,...,T

  • Predict using  w_t , receive v_t from nature.
  • Choose z_t \in  \partial v_t(w_t)
  • If  S = \mathbb{R}^d , update as  w_{t+1} = w_t - \eta z_t
  • If  S \subset \mathbb{R}^d , project cumulative gradients onto  S , i.e.  w_{t+1} = \Pi_S(\eta\theta_{t+1}) , \theta_{t+1} = \theta_t + z_t

One can use the OSD algorithm to derive  O(\sqrt{T}) regret bounds for the online version of SVMs for classification, which use the hinge loss  v_t(w) = \max \{ 0, 1 - y_t(w \cdot x_t) \} .[5]
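A sketch of OSD with the hinge loss on a linearly separable stream (the data-generating rule and step size are our own illustrative choices). A valid subgradient of \max\{0, 1 - y \langle w, x \rangle\} is -y x when the margin constraint is violated and 0 otherwise.

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.1
w = np.zeros(2)
mistakes = 0
for t in range(500):
    x = rng.normal(size=2)
    y = 1.0 if x[0] + x[1] > 0 else -1.0   # linearly separable labels
    if y * (w @ x) <= 0:                   # count prediction errors
        mistakes += 1
    # subgradient of the hinge loss max(0, 1 - y<w, x>)
    z = -y * x if y * (w @ x) < 1 else np.zeros(2)
    w = w - eta * z                        # S = R^2, so no projection needed
```

After a short burn-in the iterate aligns with the separating direction and mistakes become rare.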

Other sequential algorithms

Quadratically regularised FTRL algorithms lead to lazily projected gradient algorithms as described above. To use the above for arbitrary convex functions and regularisers, one uses online mirror descent. Another algorithm is prediction with expert advice. In this case, the hypothesis set consists of  d functions. We maintain a distribution  w_t \in \Delta_d over the  d experts, and predict by sampling an expert from this distribution. For Euclidean regularisation, one can show a regret bound of  O(\sqrt{T}) , which can be improved further to an  O(\sqrt{T \log d}) bound by using a better regulariser. For further reading on these algorithms, see.[4][5]
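Prediction with expert advice can be sketched with the multiplicative-weights (exponential weights) update, which corresponds to mirror descent with an entropic regulariser. This is an illustrative example; the loss distribution and learning rate are our own choices, and we track the learner's expected loss under w_t rather than sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                          # number of experts
eta = 0.5
w = np.ones(d) / d             # distribution over the experts
cum = np.zeros(d)              # cumulative loss of each expert
learner_loss = 0.0
for t in range(200):
    losses = rng.uniform(size=d)           # losses in [0, 1]
    losses[2] *= 0.1                       # expert 2 is consistently better
    learner_loss += w @ losses             # expected loss when sampling from w
    cum += losses
    w = w * np.exp(-eta * losses)          # entropic / multiplicative update
    w = w / w.sum()
regret = learner_loss - cum.min()
```

The weight distribution concentrates on the best expert, and the regret stays within the \ln(d)/\eta + \eta T / 8 bound for exponential weights.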

Comparison of the models

The paradigm of online learning has three distinct interpretations depending on the choice of learning model, each of which has distinct implications for the predictive quality of the sequence of functions f_1, f_2, \ldots, f_n. We use the prototypical stochastic gradient descent algorithm for this discussion. As noted above, its recursion is given by

 \textstyle w_t = w_{t-1} - \gamma_t \nabla V(\langle w_{t-1}, x_t \rangle, y_t)

Statistical Learning Model

The first interpretation considers the stochastic gradient descent method as applied to the problem of minimizing the expected risk I[w] defined above.[6] Indeed, in the case of an infinite stream of data, since the examples (x_1, y_1), (x_2, y_2), \ldots are assumed to be drawn i.i.d. from the distribution p(x,y), the sequence of gradients of V(\cdot, \cdot) in the above iteration is an i.i.d. sample of stochastic estimates of the gradient of the expected risk I[w], and therefore one can apply complexity results for the stochastic gradient descent method to bound the deviation I[w_t] - I[w^\ast], where w^\ast is the minimizer of I[w].[7] This interpretation is also valid in the case of a finite training set; although with multiple passes through the data the gradients are no longer independent, complexity results can still be obtained in special cases.

The second interpretation applies to the case of a finite training set and considers the SGD algorithm as an instance of incremental gradient descent method.[3] In this case, one instead looks at the empirical risk:

I_n[w] = \frac{1}{n}\sum_{i = 1}^nV(\langle w,x_i \rangle, y_i) \ .

Since the gradients of V(\cdot, \cdot) in the incremental gradient descent iterations are also stochastic estimates of the gradient of I_n[w], this interpretation is also related to the stochastic gradient descent method, but applied to minimize the empirical risk as opposed to the expected risk. Since this interpretation concerns the empirical risk and not the expected risk, multiple passes through the data are readily allowed and actually lead to tighter bounds on the deviations I_n[w_t] - I_n[w^\ast_n], where w^\ast_n is the minimizer of I_n[w].

Adversarial Model

The third interpretation of the above recursion is distinctly different from the first two and concerns the case of sequential trials where the data are potentially not i.i.d. and can perhaps be selected in an adversarial manner. Since we are not making any distributional assumptions about the data, the goal here is to perform as well as if we could view the entire sequence of examples ahead of time, and we try to minimise the regret on the hypothesis set  \mathcal{H}

R_T(\mathcal{H}) = \sum_{t = 1}^TV(\langle w_t, x_t \rangle, y_t) - \min_{w \in H} \sum_{t = 1}^TV(\langle w, x_t \rangle, y_t) \ .

In this setting, the above recursion can be considered as an instance of the online subgradient descent method for which there are complexity bounds that guarantee O(\sqrt{T}) regret.[4]

It should be noted that although the three interpretations of this algorithm yield complexity bounds in three distinct settings, each bound depends on the choice of step-size sequence \{\gamma_t\} in a different way, and thus we cannot simultaneously apply the consequences of all three interpretations; we must instead select the step-size sequence in a way that is tailored to the most relevant interpretation. Furthermore, the above algorithm and these interpretations can be extended to the case of a nonlinear kernel by simply considering X to be the feature space associated with the kernel, although in this case the memory requirements at each iteration are no longer O(d), but rather on the order of the number of data points considered so far.

Implementations

  • Vowpal Wabbit: Open-source fast out-of-core online learning system, notable for supporting a number of machine learning reductions, importance weighting, and a selection of different loss functions and optimisation algorithms. It uses the hashing trick to bound the size of the feature set independently of the amount of training data.
  • Scikit-learn: Provides out-of-core implementations of algorithms for classification, regression, clustering and feature extraction through the partial_fit interface of its estimators.
  • GURLS: Least squares library in C++ and Matlab for efficient supervised learning. Contains an implementation of the Recursive Least Squares algorithm.[8]

Books with substantial treatment of online machine learning

  • Algorithmic Learning in a Random World by Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Published by Springer Science+Business Media, Inc. 2005 ISBN 0-387-00152-2
  • Prediction, learning, and games by Nicolò Cesa-Bianchi and Gábor Lugosi. Cambridge University Press, 2006 ISBN 0-521-84108-9

References

  1. ^ a b c d e f g L. Rosasco, T. Poggio, Machine Learning: a Regularization Approach, MIT-9.520 Lecture Notes, Manuscript, Dec. 2015. Chapter 7 - Online Learning
  2. ^ Kushner, Harold J.; Yin, G. George (2003). Stochastic Approximation and Recursive Algorithms and Applications (2nd ed.). New York: Springer. pp. 8–12. ISBN 978-0-387-21769-7.
  3. ^ a b Bertsekas, D. P. (2011). Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. Optimization for Machine Learning, 85.
  4. ^ a b c d e f Shalev-Shwartz, Shai (2011). "Online Learning and Online Convex Optimization". Foundations and Trends® in Machine Learning. pp. 107–194. doi:10.1561/2200000018. 
  5. ^ a b c d Liang, Percy. "CS229T/STAT231: Statistical Learning Theory (Winter 2015)" (PDF). 
  6. ^ Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations". Online Learning and Neural Networks. Cambridge University Press. ISBN 978-0-521-65263-6.
  7. ^ Stochastic Approximation Algorithms and Applications, Harold J. Kushner and G. George Yin, New York: Springer-Verlag, 1997. ISBN 0-387-94916-X; 2nd ed., titled Stochastic Approximation and Recursive Algorithms and Applications, 2003, ISBN 0-387-00894-2.
  8. ^ Tacchetti, A., Mallapragada, P. K., Santoro, M., & Rosasco, L. (2013). GURLS: a least squares library for supervised learning. The Journal of Machine Learning Research, 14(1), 3201-3205. http://lcsl.mit.edu/#/downloads/gurls.
