Jump to content

Backtracking line search: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
m →‎References: cite repair;
No edit summary
Line 54: Line 54:
[[Critical point (mathematics)]] are points where the gradient of the objective function is 0. Local minima are critical points, but there are critical points which are not local minima. An example is saddle points. [[Saddle point]] are critical points, at which there are at least one direction where the function is (local) maximum. Therefore, these points are far from being local minima. For example, if a function has at least one saddle point, then it cannot be [[Convex function]]. The relevance of saddle points to optimisation algorithms is that in large scale (i.e. high-dimensional) optimisation, one likely sees more saddle points than minima, see {{harvtxt|Bray|Dean|2007}}. Hence, a good optimisation algorithm should be able to avoid saddle points. In the setting of [[Deep learning]], saddle points are also prevalent, see {{harvtxt|Dauphin|Pascanu|Gulcehre|Cho|Ganguli|Bengjo|2014}}. Thus, to apply in [[Deep learning]], one needs results for non-convex functions.
[[Critical point (mathematics)]] are points where the gradient of the objective function is 0. Local minima are critical points, but there are critical points which are not local minima. An example is saddle points. [[Saddle point]] are critical points, at which there are at least one direction where the function is (local) maximum. Therefore, these points are far from being local minima. For example, if a function has at least one saddle point, then it cannot be [[Convex function]]. The relevance of saddle points to optimisation algorithms is that in large scale (i.e. high-dimensional) optimisation, one likely sees more saddle points than minima, see {{harvtxt|Bray|Dean|2007}}. Hence, a good optimisation algorithm should be able to avoid saddle points. In the setting of [[Deep learning]], saddle points are also prevalent, see {{harvtxt|Dauphin|Pascanu|Gulcehre|Cho|Ganguli|Bengjo|2014}}. Thus, to apply in [[Deep learning]], one needs results for non-convex functions.


For convergence to critical points: For example, if the cost function is a [[Analytic function|real analytic function]], then it is shown in {{harvtxt|Absil|Mahony|Andrews|2005}} that convergence is guaranteed. In {{harvtxt|Bertsekas|2016}}, there is a proof that for every sequence constructed by backtracking line search, a cluster point (i.e. the [[Limit (mathematics)|limit]] of one [[subsequence]], if the subsequence converges) is a critical point.
For convergence to critical points: For example, if the cost function is a [[Analytic function|real analytic function]], then it is shown in {{harvtxt|Absil|Mahony|Andrews|2005}} that convergence is guaranteed. In {{harvtxt|Bertsekas|2016}}, there is a proof that for every sequence constructed by backtracking line search, a cluster point (i.e. the [[Limit (mathematics)|limit]] of one [[subsequence]], if the subsequence converges) is a critical point. For the case of a function with at most countably many critical points and compact sublevels (see [[Level set]]), as well as with Lipschitz continuous gradient where one uses Standard GD with learning rate <1/L (see the section about Stochastic gradient descent), then convergence is guaranteed, see for example Chapter 12 in {{harvtxt|Lange|2013}}.

For avoidance of saddle points: For example, if the gradient of the cost function is Lipschitz continuous and one chooses Standard GD with learning rate <1/L, then with a random choice of initial point <math>\mathbf{x}_0</math> (more precisely, outside a set of [[Lebesgue measure]] zero), the sequence constructed will not converge to a non-degenerate (see [[Critical point]]) saddle point (proven in {{harvtxt|Lee|Simchowitz|Jordan|Recht|2016}}), and more generally it is also true that the sequence constructed will not converge to a degenerate saddle point (proven in {{harvtxt|Panageas|Piliouras|2017}}).


None of these results have been proven for any other optimization algorithm so far.{{citation needed|date=October 2020}}
None of these results have been proven for any other optimization algorithm so far.{{citation needed|date=October 2020}}
Line 77: Line 80:
* {{cite journal | last1 = Bray | first1 = A. J.|last2=Dean|first2=D. S.|| year = 2007 | title = Statistics of critical points of gaussian fields on large-dimensional spaces| journal = [[Physical Review Letters]] | volume = 98 | issue = | pages = 150201 | url = https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.98.150201 | doi=10.1103/PhysRevLett.98.150201| doi-access = free }}
* {{cite journal | last1 = Bray | first1 = A. J.|last2=Dean|first2=D. S.|| year = 2007 | title = Statistics of critical points of gaussian fields on large-dimensional spaces| journal = [[Physical Review Letters]] | volume = 98 | issue = | pages = 150201 | url = https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.98.150201 | doi=10.1103/PhysRevLett.98.150201| doi-access = free }}
* {{cite journal | last1 = Dauphin | first1 = Y. N.|last2=Pascanu|first2=R.| last3=Gulcehre|first3=C.| last4=Cho|first4=K.| last5=Ganguli|first5=S.| last6=Bengio|first6=Y. |year = 2014 | title = Identifying and attacking the saddle point problem in high-dimensional non-convex optimization| journal = [[NeurIPS]] | volume = 14 | issue = | pages = 2933-2941 | url = https://dl.acm.org/doi/10.5555/2969033.2969154}}
* {{cite journal | last1 = Dauphin | first1 = Y. N.|last2=Pascanu|first2=R.| last3=Gulcehre|first3=C.| last4=Cho|first4=K.| last5=Ganguli|first5=S.| last6=Bengio|first6=Y. |year = 2014 | title = Identifying and attacking the saddle point problem in high-dimensional non-convex optimization| journal = [[NeurIPS]] | volume = 14 | issue = | pages = 2933-2941 | url = https://dl.acm.org/doi/10.5555/2969033.2969154}}
* {{cite book |last1=Lange| first1= K. |title = Optimization|publisher=[[Springer-Verlag]] Publications|location= New York|year= 2013| isbn= 978-1-4614-5838-8}}
* {{cite book | first1= J. E. |last1= Dennis |author-link1=John E. Dennis |first2= R. B.|last2= Schnabel |author-link2=Robert B. Schnabel |title = Numerical Methods for Unconstrained Optimization and Nonlinear Equations|publisher=[[Society for Industrial and Applied Mathematics|SIAM]] Publications|location= Philadelphia|year= 1996| isbn= 978-0-898713-64-0}}
* {{cite book | first1= J. E. |last1= Dennis |author-link1=John E. Dennis |first2= R. B.|last2= Schnabel |author-link2=Robert B. Schnabel |title = Numerical Methods for Unconstrained Optimization and Nonlinear Equations|publisher=[[Society for Industrial and Applied Mathematics|SIAM]] Publications|location= Philadelphia|year= 1996| isbn= 978-0-898713-64-0}}
* {{cite journal | last1 = Lee | first1 = J. D.|last2=Simchowitz|first2=M.| last3=Jordan|first3=M. I.| last4=Recht|first4=B.| | year = 2016 | title = Gradient descent only converges to minimizers| journal = [[Proceedings of Machine Learning Research]] | volume = 49 | issue = | pages = 1246-1257 | url = http://proceedings.mlr.press/v49/lee16.html}}
* {{cite journal | last1 = Lee | first1 = J. D.|last2=Simchowitz|first2=M.| last3=Jordan|first3=M. I.| last4=Recht|first4=B.| | year = 2016 | title = Gradient descent only converges to minimizers| journal = [[Proceedings of Machine Learning Research]] | volume = 49 | issue = | pages = 1246-1257 | url = http://proceedings.mlr.press/v49/lee16.html}}

Revision as of 20:06, 8 October 2020

In (unconstrained) minimization, a backtracking line search, a search scheme based on the Armijo–Goldstein condition, is a line search method to determine the amount to move along a given search direction. It involves starting with a relatively large estimate of the step size for movement along the search direction, and iteratively shrinking the step size (i.e., "backtracking") until a decrease of the objective function is observed that adequately corresponds to the decrease that is expected, based on the local gradient of the objective function.

Backtracking line search is typically used for gradient descent, but it can also used in other contexts. For example, it can be used with Newton's method if the Hessian matrix is positive definite.

Motivation

Given a starting position and a search direction , the task of a line search is to determine a step size that adequately reduces the objective function (assumed smooth), i.e., to find a value of that reduces relative to . However, it is usually undesirable to devote substantial resources to finding a value of to precisely minimize . This is because the computing resources needed to find a more precise minimum along one particular direction could instead be employed to identify a better search direction. Once an improved starting point has been identified by the line search, another subsequent line search will ordinarily be performed in a new direction. The goal, then, is just to identify a value of that provides a reasonable amount of improvement in the objective function, rather than to find the actual minimizing value of .

The backtracking line search starts with a large estimate of and iteratively shrinks it. The shrinking continues until a value is found that is small enough to provide a decrease in the objective function that adequately matches the decrease that is expected to be achieved, based on the local function gradient

Define the local slope of the function of along the search direction as (where denotes the dot product). It is assumed that is a vector for which some local decrease is possible, i.e., it is assumed that .

Based on a selected control parameter , the Armijo–Goldstein condition tests whether a step-wise movement from a current position to a modified position achieves an adequately corresponding decrease in the objective function. The condition is fulfilled, see Armijo (1966), if

This condition, when used appropriately as part of a line search, can ensure that the step size is not excessively large. However, this condition is not sufficient on its own to ensure that the step size is nearly optimal, since any value of that is sufficiently small will satisfy the condition.

Thus, the backtracking line search strategy starts with a relatively large step size, and repeatedly shrinks it by a factor until the Armijo–Goldstein condition is fulfilled.

The search will terminate after a finite number of steps for any positive values of and that are less than 1. For example, Armijo used 12 for both and in Armijo (1966).

Algorithm

This condition is from Armijo (1966). Starting with a maximum candidate step size value , using search control parameters and , the backtracking line search algorithm can be expressed as follows:

  1. Set and iteration counter .
  2. Until the condition is satisfied that repeatedly increment and set
  3. Return as the solution.

In other words, reduce by a factor of in each iteration until the Armijo–Goldstein condition is fulfilled.

Function minimization using backtracking line search in practice

In practice, the above algorithm is typically iterated to produce a sequence , , to converge to a minimum, provided such a minimum exists and is selected appropriately in each step. For gradient descent, is selected as .

The value of for the that fulfills the Armijo–Goldstein condition depends on and , and is thus denoted below by . It also depends on , , and of course, although these dependencies can be left implicit if are assumed to be fixed with respect to the optimization problem.

The detailed steps are thus, see Armijo (1966), Bertsekas (2016):

  1. Choose an initial starting point and set the iteration counter .
  2. Until some stopping condition is satisfied, choose a descent direction , increment , and update the position to .
  3. Return as the minimizing position and as the function minimum.

To assure good behavior, it is necessary that some conditions must be satisfied by . Roughly speaking should not be too far away from . A precise version is as follows (see e.g. Bertsekas (2016)). There are constants so that the following two conditions are satisfied:

  1. For all n, . Here, is the Euclidean norm of . (This assures that if , then also . More generally, if , then also .) A more strict version requires also the reverse inequality: for a positive constant .
  2. For all n, . (This condition ensures that the directions of and are similar.)

Lower bound for gradient descent

A lower bound for learning rates satisfying Armijo's condition, where , is in the order of , where is a local Lipschitz constant for the gradient near the point (see Lipschitz continuity). See Armijo (1966) for more detail.

Theoretical guarantee (for gradient descent)

Compared with Wolfe's conditions, which is more complicated, Armijo's condition has a better theoretical guarantee. Indeed, so far backtracking line search and its modifications are the most theoretically guaranteed methods among all numerical optimization algorithms concerning convergence to Critical point (mathematics) and avoidance of Saddle point, see below.

Critical point (mathematics) are points where the gradient of the objective function is 0. Local minima are critical points, but there are critical points which are not local minima. An example is saddle points. Saddle point are critical points, at which there are at least one direction where the function is (local) maximum. Therefore, these points are far from being local minima. For example, if a function has at least one saddle point, then it cannot be Convex function. The relevance of saddle points to optimisation algorithms is that in large scale (i.e. high-dimensional) optimisation, one likely sees more saddle points than minima, see Bray & Dean (2007). Hence, a good optimisation algorithm should be able to avoid saddle points. In the setting of Deep learning, saddle points are also prevalent, see Dauphin et al. (2014). Thus, to apply in Deep learning, one needs results for non-convex functions.

For convergence to critical points: For example, if the cost function is a real analytic function, then it is shown in Absil, Mahony & Andrews (2005) that convergence is guaranteed. In Bertsekas (2016), there is a proof that for every sequence constructed by backtracking line search, a cluster point (i.e. the limit of one subsequence, if the subsequence converges) is a critical point. For the case of a function with at most countably many critical points and compact sublevels (see Level set), as well as with Lipschitz continuous gradient where one uses Standard GD with learning rate <1/L (see the section about Stochastic gradient descent), then convergence is guaranteed, see for example Chapter 12 in Lange (2013).

For avoidance of saddle points: For example, if the gradient of the cost function is Lipschitz continuous and one chooses Standard GD with learning rate <1/L, then with a random choice of initial point (more precisely, outside a set of Lebesgue measure zero), the sequence constructed will not converge to a non-degenerate (see Critical point) saddle point (proven in Lee et al. (2016)), and more generally it is also true that the sequence constructed will not converge to a degenerate saddle point (proven in Panageas & Piliouras (2017)).


None of these results have been proven for any other optimization algorithm so far.[citation needed]

A special case: (Standard) Stochastic gradient descent

While it is trivial to mention, if the gradient of a cost function is Lipschitz continuous, with Lipschitz constant L, then with choosing learning rate to be constant and of the size 1/L, one has a special case of Backtracking line search (for gradient descent). This has been used at least in Armijo (1966). This scheme however requires that one needs to have a good estimate for L, otherwise if learning rate is too big (relative to 1/L) then the scheme has no convergence guarantee. One can see what will go wrong if the cost function is a smoothing (near the point 0) of the function f(t)=|t|. Such a good estimate is, however, difficult and laborious in large dimensions. Also, if the gradient of the function is not globally Lipschitz continuous, then this scheme has no convergence guarantee. For example, this is similar to an exercise in Bertsekas (2016), for the cost function and for whatever constant learning rate one chooses, with a random initial point the sequence constructed by this special scheme does not converge to the global minimum 0.

If one does not care about the condition that learning rate must be bounded by 1/L, then this special scheme has been used much older, at least since 1847 by Cauchy Gradient descent, which can be called Standard GD (to distinguish with SGD). In the Stochastic setting (such as in the mini-batch setting in Deep learning Deep learning), Standard GD is called Stochastic gradient descent Stochastic gradient descent, or SGD.

Even if the cost function has globally continuous gradient, good estimate of the Lipschitz constant for the cost functions in Deep Learning may not be feasible or desirable, given the very high dimensions of Deep neural networks Deep learning. Hence, there is a technique of fine-tuning of learning rates in applying Standard GD or SGD. One way is to choose many learning rates from a grid search, with the hope that some of the learning rates can give good results. (However, if the loss function does not have global Lipschitz continuous gradient, then the example with above shows that grid search cannot help.) Another way is the so-called adaptive Standard GD or SGD, some representatives are Adam, Adadelta, RMSProp and so on, see Stochastic gradient descent. In adaptive Standard GD or SGD, learning rates are allowed to vary at each iterate step n, but in a different manner from Backtracking line search for gradient descent. Apparently, it would be more expensive to use Backtracking line search for gradient descent, since one needs to do a loop search until Armijo's condition is satisfied, while for adaptive Standard GD or SGD no loop search is needed. Most of these adaptive Standard GD or SGD do not have the descent property , for all n, as Backtracking line search for gradient descent. Only a few has this property, and which have good theoretical properties, but they turn out to be special cases of Backtracking line search or more generally Armijo's condition Armijo (1966). The first one is when one chooses learning rate to be a constant <1/L, as mentioned above, if one can have a good estimate of L. The second is so called Diminshing learning rate, used in the well known paper by Robbins & Monro (1951), if again the function has globally Lipschitz continuous gradient (but the Lipschitz constant may be unknown) and the learning rates converge to 0.

See also

References