# Limited-memory BFGS

Limited-memory BFGS (L-BFGS or LM-BFGS) is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm using a limited amount of computer memory. It is a popular algorithm for parameter estimation in machine learning.[1][2]

Like the original BFGS, L-BFGS uses an estimation to the inverse Hessian matrix to steer its search through variable space, but where BFGS stores a dense n×n approximation to the inverse Hessian (n being the number of variables in the problem), L-BFGS stores only a few vectors that represent the approximation implicitly. Due to its resulting linear memory requirement, the L-BFGS method is particularly well suited for optimization problems with a large number of variables. Instead of the inverse Hessian Hk, L-BFGS maintains a history of the past m updates of the position x and gradient ∇f(x), where generally the history size m can be small (often m<10). These updates are used to implicitly do operations requiring the Hk-vector product.

## Algorithm

L-BFGS shares many features with other quasi-Newton algorithms, but is very different in how the matrix-vector multiplication for finding the search direction is carried out ${\displaystyle d_{k}=-H_{k}g_{k}\,\!}$, where ${\displaystyle g_{k}}$ is the current derivative and ${\displaystyle H_{k}}$ is the inverse of the Hessian matrix. There are multiple published approaches using a history of updates to form this direction vector. Here, we give a common approach, the so-called "two loop recursion."[3][4]

We'll take as given ${\displaystyle x_{k}\,\!}$, the position at the ${\displaystyle k\,\!}$-th iteration, and ${\displaystyle g_{k}\equiv \nabla f(x_{k})}$ where ${\displaystyle f\,\!}$ is the function being minimized, and all vectors are column vectors. We also assume that we have stored the last ${\displaystyle m}$ updates of the form ${\displaystyle s_{k}=x_{k+1}-x_{k}\,\!}$ and ${\displaystyle y_{k}=g_{k+1}-g_{k}\,\!}$. We'll define ${\displaystyle \rho _{k}={\frac {1}{y_{k}^{\rm {T}}s_{k}}}}$, and ${\displaystyle H_{k}^{0}\,\!}$ will be the 'initial' approximate of the inverse Hessian that our estimate at iteration ${\displaystyle k\,\!}$ begins with. Then we can compute the (uphill) direction as follows:

 ${\displaystyle q=g_{k}\,\!}$
For ${\displaystyle i=k-1,k-2,\ldots ,k-m}$
${\displaystyle \alpha _{i}=\rho _{i}s_{i}^{\rm {T}}q\,\!}$
${\displaystyle q=q-\alpha _{i}y_{i}\,\!}$
${\displaystyle H_{k}^{0}=y_{k-1}s_{k-1}^{\rm {T}}/y_{k-1}y_{k-1}^{\rm {T}}}$
${\displaystyle z=H_{k}^{0}q}$
For ${\displaystyle i=k-m,k-m+1,\ldots ,k-1}$
${\displaystyle \beta _{i}=\rho _{i}y_{i}^{\rm {T}}z\,\!}$
${\displaystyle z=z+s_{i}(\alpha _{i}-\beta _{i})\,\!}$
Stop with ${\displaystyle H_{k}g_{k}=z\,\!}$


This formulation is valid whether we are minimizing or maximizing. Note that if we are minimizing, the search direction would be the negative of z (since z is "uphill"), and if we are maximizing, ${\displaystyle H_{k}^{0}}$ should be negative definite rather than positive definite. We would typically do a backtracking line search in the search direction (any line search would be valid, but L-BFGS does not require exact line searches in order to converge).

Commonly, the inverse Hessian ${\displaystyle H_{k}^{0}\,\!}$ is represented as a diagonal matrix, so that initially setting ${\displaystyle z\,\!}$ requires only an element-by-element multiplication.

This two loop update only works for the inverse Hessian. Approaches to implementing L-BFGS using the direct approximate Hessian ${\displaystyle B_{k}\,\!}$ have also been developed, as have other means of approximating the inverse Hessian.[5]

## Applications

L-BFGS has been called "the algorithm of choice" for fitting log-linear (MaxEnt) models and conditional random fields with ${\displaystyle \ell _{2}}$-regularization.[2]

## Variants

Since BFGS (and hence L-BFGS) is designed to minimize smooth functions without constraints, the L-BFGS algorithm must be modified to handle functions that include non-differentiable components or constraints. A popular class of modifications are called active-set methods, based on the concept of the active set. The idea is that when restricted to a small neighborhood of the current iterate, the function and constraints can be simplified.

### L-BFGS-B

The L-BFGS-B algorithm extends L-BFGS to handle simple box constraints (aka bound constraints) on variables; that is, constraints of the form lixiui where li and ui are per-variable constant lower and upper bounds, respectively (for each xi, either or both bounds may be omitted).[6][7] The method works by identifying fixed and free variables at every step (using a simple gradient method), and then using the L-BFGS method on the free variables only to get higher accuracy, and then repeating the process.

### OWL-QN

Orthant-wise limited-memory quasi-Newton (OWL-QN) is an L-BFGS variant for fitting ${\displaystyle \ell _{1}}$-regularized models, exploiting the inherent sparsity of such models.[2] It minimizes functions of the form

${\displaystyle f({\vec {x}})=g({\vec {x}})+C\|{\vec {x}}\|_{1}}$

where ${\displaystyle g}$ is a differentiable convex loss function. The method is an active-set type method: at each iterate, it estimates the sign of each component of the variable, and restricts the subsequent step to have the same sign. Once the sign is fixed, the non-differentiable ${\displaystyle \|{\vec {x}}\|_{1}}$ term becomes a smooth linear term which can be handled by L-BFGS. After an L-BFGS step, the method allows some variables to change sign, and repeats the process.

### O-LBFGS

Schraudolph et al. present an online approximation to both BFGS and L-BFGS.[8] Similar to stochastic gradient descent, this can be used to reduce the computational complexity by evaluating the error function and gradient on a randomly drawn subset of the overall dataset in each iteration. It has been shown that O-LBFGS has a global almost sure convergence [9] while the online approximation of BFGS (O-BFGS) is not necessarily convergent.[10]

## Implementations

An early, open source implementation of L-BFGS in Fortran exists in Netlib as a shar archive [1]. Multiple other open source implementations have been produced as translations of this Fortran code (e.g. java, and python via SciPy). Other implementations exist:

### Implementations of variants

The L-BFGS-B variant also exists as ACM TOMS algorithm 778.[7] In February 2011, some of the authors of the original L-BFGS-B code posted a major update (version 3.0).

A reference implementation[11] is available in Fortran 77 (and with a Fortran 90 interface) at the author's website. This version, as well as older versions, has been converted to many other languages, including a Java wrapper for v3.0; Matlab interfaces for v3.0, v2.4, and v2.1; a C++ interface for v2.1; a Python interface for v3.0 as part of scipy.optimize.minimize; an OCaml interface for v2.1 and v3.0; version 2.3 has been converted to C by f2c and is available at this website; and R's optim general-purpose optimizer routine includes L-BFGS-B by using method="L-BFGS-B".[12]

There exists a complete C++11 rewrite of the L-BFGS-B solver using Eigen3.

OWL-QN implementations are available in:

## Works cited

1. ^ Malouf, Robert (2002). A comparison of algorithms for maximum entropy parameter estimation (PDF). Proc. Sixth Conf. on Natural Language Learning (CoNLL). pp. 49–55.
2. ^ a b c d Andrew, Galen; Gao, Jianfeng (2007). "Scalable training of L₁-regularized log-linear models". Proceedings of the 24th International Conference on Machine Learning. ISBN 9781595937933. doi:10.1145/1273496.1273501.
3. ^ Matthies, H.; Strang, G. (1979). "The solution of non linear finite element equations". International Journal for Numerical Methods in Engineering. 14 (11): 1613–1626. doi:10.1002/nme.1620141104.
4. ^ Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage". Mathematics of Computation. 35 (151): 773–782. doi:10.1090/S0025-5718-1980-0572855-7.
5. ^ Byrd, R. H.; Nocedal, J.; Schnabel, R. B. (1994). "Representations of Quasi-Newton Matrices and their use in Limited Memory Methods". Mathematical Programming. 63 (4): 129–156. doi:10.1007/BF01582063.
6. ^ Byrd, R. H.; Lu, P.; Nocedal, J.; Zhu, C. (1995). "A Limited Memory Algorithm for Bound Constrained Optimization". SIAM J. Sci. Comput. 16 (5): 1190–1208. doi:10.1137/0916069.
7. ^ a b Zhu, C.; Byrd, Richard H.; Lu, Peihuang; Nocedal, Jorge (1997). "L-BFGS-B: Algorithm 778: L-BFGS-B, FORTRAN routines for large scale bound constrained optimization". ACM Transactions on Mathematical Software. 23 (4): 550–560. doi:10.1145/279232.279236.
8. ^ N. Schraudolph, J. Yu, and S. Günter (2007). A stochastic quasi-Newton method for online convex optimization. AISTATS.
9. ^ A. Mokhtari and A. Ribeiro (2015). "Global convergence of online limited memory BFGS" (PDF). Journal of Machine Learning Research. 16: 3151–3181.
10. ^ A. Mokhtari and A. Ribeiro (2014). "RES: Regularized Stochastic BFGS Algorithm". IEEE Transactions on Signal Processing. 62 (23): 6089–6104. doi:10.1109/TSP.2014.2357775.
11. ^ Morales, J. L.; Nocedal, J. (2011). "Remark on "algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization"". ACM Transactions on Mathematical Software. 38: 1. doi:10.1145/2049662.2049669.
12. ^ "General-purpose Optimization". R documentation. Comprehensive R Archive Network.