Coordinate descent

Coordinate descent is a derivative-free optimization algorithm. To find a local minimum of a function, it performs a line search along a single coordinate direction at the current point in each iteration, cycling through the different coordinate directions over the course of the procedure.

Description

Coordinate descent is based on the idea that the minimization of a multivariable function F(\mathbf{x}) can be achieved by minimizing it along one direction at a time, i.e., by solving univariate (or at least much simpler) optimization problems in a loop.[1] In the simplest case of cyclic coordinate descent, one iterates through the coordinate directions cyclically, minimizing the objective function with respect to a single coordinate at a time. That is, at iteration k, for each coordinate index i taken in turn, the algorithm solves the subproblem

x^{k+1}_i = \underset{y\in\mathbb{R}}{\operatorname{arg\,min}}\; F(x^{k+1}_1, \dots, x^{k+1}_{i-1}, y, x^k_{i+1}, \dots, x^k_n).

Thus, one begins with an initial guess \mathbf{x}^0 for a local minimum of F and obtains a sequence \mathbf{x}^0, \mathbf{x}^1, \mathbf{x}^2, \dots iteratively.

By doing line search in each iteration, one automatically has

F(\mathbf{x}^0)\ge F(\mathbf{x}^1)\ge F(\mathbf{x}^2)\ge \dots.

It can be shown that this sequence has convergence properties similar to those of steepest descent. For a smooth objective, no improvement after one full cycle of line searches along the coordinate directions implies that a stationary point has been reached.

This process is illustrated below.

[Figure: Coordinate descent.svg]
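
The cyclic scheme above can be written down in a few lines. The following is a minimal sketch rather than a reference implementation: it assumes NumPy and SciPy are available, uses scipy.optimize.minimize_scalar for the one-dimensional subproblems, and the quadratic test function at the end is an arbitrary choice for demonstration.

    import numpy as np
    from scipy.optimize import minimize_scalar  # 1-D line search for each coordinate

    def cyclic_coordinate_descent(F, x0, n_cycles=100, tol=1e-8):
        """Minimize F by exact line search along each coordinate direction in turn."""
        x = np.asarray(x0, dtype=float).copy()
        for _ in range(n_cycles):
            x_old = x.copy()
            for i in range(x.size):
                # One-dimensional subproblem:
                #   min_y F(x_1, ..., x_{i-1}, y, x_{i+1}, ..., x_n)
                def along_coordinate(y, i=i):
                    z = x.copy()
                    z[i] = y
                    return F(z)
                x[i] = minimize_scalar(along_coordinate).x
            if np.linalg.norm(x - x_old) < tol:  # no progress over a full cycle
                break
        return x

    # Example: a strictly convex quadratic whose minimizer is A^{-1} b
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    F = lambda x: 0.5 * x @ A @ x - b @ x
    print(cyclic_coordinate_descent(F, x0=[0.0, 0.0]))  # close to np.linalg.solve(A, b)

Because each inner step is an exact minimization along one coordinate, the objective values F(\mathbf{x}^0), F(\mathbf{x}^1), \dots produced by this sketch are non-increasing, matching the inequality above.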

Differentiable case

In the case of a continuously differentiable function F, a coordinate descent algorithm can be sketched as:[1]

  • Choose an initial parameter vector x.
  • Until convergence is reached, or for some fixed number of iterations:
    • Choose an index i from 1 to n.
    • Choose a step size α.
    • Update x_i to x_i - \alpha\,\partial F/\partial x_i(\mathbf{x}).

The step size \alpha can be chosen in various ways, e.g., by solving for the exact minimizer of the one-dimensional function f(x_i) = F(\mathbf{x}) (i.e., F with all variables except x_i held fixed), or by traditional line search criteria.[1]
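
A minimal sketch of this update rule, assuming the full gradient of F is available as a function (a real implementation would evaluate only the partial derivative \partial F/\partial x_i); the fixed step size and the quadratic example are illustrative choices, not part of the algorithm's specification:

    import numpy as np

    def coordinate_gradient_descent(grad_F, x0, alpha=0.1, n_iters=1000):
        """Cyclic coordinate descent with a fixed step size alpha."""
        x = np.asarray(x0, dtype=float).copy()
        n = x.size
        for k in range(n_iters):
            i = k % n                      # cycle through the indices 1..n
            x[i] -= alpha * grad_F(x)[i]   # x_i <- x_i - alpha * dF/dx_i(x)
        return x

    # Example: F(x) = 0.5 x^T A x - b^T x, so grad F(x) = A x - b
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    print(coordinate_gradient_descent(lambda x: A @ x - b, x0=[0.0, 0.0]))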

Limitations

Coordinate descent has problems with non-smooth functions. The following picture shows that a coordinate descent iteration may get stuck at a non-stationary point if the level curves of the function are not smooth. Suppose that the algorithm is at the point (-2, -2); then there are two axis-aligned directions it can consider for taking a step, indicated by the red arrows. However, every step along these two directions will increase the objective function's value (assuming a minimization problem), so the algorithm will not take any step, even though both steps together would bring it closer to the optimum.

[Figure: Nonsmooth coordinate descent.svg]
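
The function behind the figure is not reproduced here, but the effect is easy to check numerically on a non-smooth function of the same flavour, for example F(x, y) = |x + y| + 2|x - y| (an illustrative choice), whose minimum value 0 is attained at the origin:

    # At (-2, -2), every axis-aligned move increases F, so coordinate descent
    # makes no progress, yet a diagonal move towards the optimum decreases F.
    F = lambda x, y: abs(x + y) + 2.0 * abs(x - y)

    x, y = -2.0, -2.0
    print(F(x, y))                          # 4.0 at the current iterate
    for d in (0.5, -0.5):
        print(F(x + d, y), F(x, y + d))     # each value exceeds 4.0
    print(F(x + 0.5, y + 0.5))              # 3.0: moving along the diagonal helps

For this particular F, exact line search along either axis also leaves the point unchanged, since (-2, -2) minimizes F along both coordinate lines even though it is not the global minimizer.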

Applications

Coordinate descent algorithms are popular with practitioners owing to their simplicity, but the same property has led optimization researchers to largely ignore them in favor of more interesting (complicated) methods.[1] This situation has changed with the advent of large-scale problems in machine learning, where coordinate descent has been shown to be competitive with other methods when applied to such problems as training linear support vector machines[2] (see LIBLINEAR) and non-negative matrix factorization.[3] Coordinate descent methods are also attractive for problems where computing gradients is infeasible, perhaps because the data required to do so are distributed across computer networks.[4]
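
A concrete example in this vein is Lasso-penalized least squares regression (see Wu & Lange in the references below), where each one-dimensional subproblem has a closed-form solution given by soft-thresholding. The sketch below is illustrative only; the function names and the synthetic data are assumptions made for this example.

    import numpy as np

    def soft_threshold(z, t):
        """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_coordinate_descent(X, y, lam, n_cycles=200):
        """Minimize 0.5*||y - X beta||^2 + lam*||beta||_1 by cyclic coordinate descent."""
        n, p = X.shape
        beta = np.zeros(p)
        residual = y.copy()                    # residual y - X beta for beta = 0
        col_sq = (X ** 2).sum(axis=0)          # precomputed ||X_j||^2
        for _ in range(n_cycles):
            for j in range(p):
                # Exact minimization over beta_j with the other coordinates fixed
                rho = X[:, j] @ residual + col_sq[j] * beta[j]
                new_bj = soft_threshold(rho, lam) / col_sq[j]
                residual += X[:, j] * (beta[j] - new_bj)   # keep residual = y - X beta
                beta[j] = new_bj
        return beta

    # Synthetic example: the sparse coefficients [2, 0, -1, 0, 0] are recovered
    # approximately (up to the shrinkage introduced by the penalty).
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 5))
    y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + 0.01 * rng.standard_normal(100)
    print(lasso_coordinate_descent(X, y, lam=1.0))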

References

  1. ^ a b c d Wright, Stephen J. (2015). "Coordinate descent algorithms". Mathematical Programming 151 (1): 3–34. arXiv:1502.04759. doi:10.1007/s10107-015-0892-3. 
  2. ^ Hsieh, C. J.; Chang, K. W.; Lin, C. J.; Keerthi, S. S.; Sundararajan, S. (2008). "A dual coordinate descent method for large-scale linear SVM". Proceedings of the 25th international conference on Machine learning - ICML '08 (PDF). p. 408. doi:10.1145/1390156.1390208. ISBN 9781605582054. 
  3. ^ Hsieh, C. J.; Dhillon, I. S. (2011). Fast coordinate descent methods with variable selection for non-negative matrix factorization (PDF). Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '11. p. 1064. doi:10.1145/2020408.2020577. ISBN 9781450308137. 
  4. ^ Nesterov, Yurii (2012). "Efficiency of coordinate descent methods on huge-scale optimization problems" (PDF). SIAM J. Optimization 22 (2): 341–362. doi:10.1137/100802001. 
  • Bezdek, J. C.; Hathaway, R. J.; Howard, R. E.; Wilson, C. A.; Windham, M. P. (1987). "Local convergence analysis of a grouped variable version of coordinate descent". Journal of Optimization Theory and Applications 54 (3): 471–477. doi:10.1007/BF00940196.
  • Bertsekas, Dimitri P. (1999). Nonlinear Programming, Second Edition. Athena Scientific, Belmont, Massachusetts. ISBN 1-886529-00-0.
  • Canutescu, A. A.; Dunbrack, R. L. (2003). "Cyclic coordinate descent: A robotics algorithm for protein loop closure". Protein Science 12 (5): 963–972. doi:10.1110/ps.0242703. PMID 12717019.
  • Luo, Zhiquan; Tseng, P. (1992). "On the convergence of the coordinate descent method for convex differentiable minimization". Journal of Optimization Theory and Applications 72 (1): 7–35. doi:10.1007/BF00939948.
  • Wu, TongTong; Lange, Kenneth (2008). "Coordinate descent algorithms for Lasso penalized regression". The Annals of Applied Statistics 2 (1): 224–244. doi:10.1214/07-AOAS147.
  • Richtárik, Peter; Takáč, Martin (April 2011). "Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function". Mathematical Programming. doi:10.1007/s10107-012-0614-z.
  • Richtárik, Peter; Takáč, Martin (December 2012). "Parallel coordinate descent methods for big data optimization". arXiv:1212.0873.