# Nonlinear conjugate gradient method

In numerical optimization, the nonlinear conjugate gradient method generalizes the conjugate gradient method to nonlinear optimization. For a quadratic function ${\displaystyle \displaystyle f(x)}$:

${\displaystyle \displaystyle f(x)=\|Ax-b\|^{2}}$

The minimum of ${\displaystyle f}$ is obtained when the gradient is 0:

${\displaystyle \nabla _{x}f=2A^{T}(Ax-b)=0}$.

Whereas linear conjugate gradient seeks a solution to the linear equation ${\displaystyle \displaystyle A^{T}Ax=A^{T}b}$, the nonlinear conjugate gradient method is generally used to find the local minimum of a nonlinear function using its gradient ${\displaystyle \nabla _{x}f}$ alone. It works when the function is approximately quadratic near the minimum, which is the case when the function is twice differentiable at the minimum.

Given a function ${\displaystyle \displaystyle f(x)}$ of ${\displaystyle N}$ variables to minimize, its gradient ${\displaystyle \nabla _{x}f}$ indicates the direction of maximum increase. One simply starts in the opposite (steepest descent) direction:

${\displaystyle \Delta x_{0}=-\nabla _{x}f(x_{0})}$

with an adjustable step length ${\displaystyle \displaystyle \alpha }$ and performs a line search in this direction until it reaches the minimum of ${\displaystyle \displaystyle f}$:

${\displaystyle \displaystyle \alpha _{0}:=\arg \min _{\alpha }f(x_{0}+\alpha \Delta x_{0})}$,
${\displaystyle \displaystyle x_{1}=x_{0}+\alpha _{0}\Delta x_{0}}$

After this first iteration in the steepest direction ${\displaystyle \displaystyle \Delta x_{0}}$, the following steps constitute one iteration of moving along a subsequent conjugate direction ${\displaystyle \displaystyle s_{n}}$, where ${\displaystyle \displaystyle s_{0}=\Delta x_{0}}$:

1. Calculate the steepest direction: ${\displaystyle \Delta x_{n}=-\nabla _{x}f(x_{n})}$,
2. Compute ${\displaystyle \displaystyle \beta _{n}}$ according to one of the formulas below,
3. Update the conjugate direction: ${\displaystyle \displaystyle s_{n}=\Delta x_{n}+\beta _{n}s_{n-1}}$
4. Perform a line search: optimize ${\displaystyle \displaystyle \alpha _{n}=\arg \min _{\alpha }f(x_{n}+\alpha s_{n})}$,
5. Update the position: ${\displaystyle \displaystyle x_{n+1}=x_{n}+\alpha _{n}s_{n}}$,

With a pure quadratic function the minimum is reached within N iterations (excepting roundoff error), but a non-quadratic function will make slower progress. Subsequent search directions lose conjugacy requiring the search direction to be reset to the steepest descent direction at least every N iterations, or sooner if progress stops. However, resetting every iteration turns the method into steepest descent. The algorithm stops when it finds the minimum, determined when no progress is made after a direction reset (i.e. in the steepest descent direction), or when some tolerance criterion is reached.

Within a linear approximation, the parameters ${\displaystyle \displaystyle \alpha }$ and ${\displaystyle \displaystyle \beta }$ are the same as in the linear conjugate gradient method but have been obtained with line searches. The conjugate gradient method can follow narrow (ill-conditioned) valleys where the steepest descent method slows down and follows a criss-cross pattern.

Four of the best known formulas for ${\displaystyle \displaystyle \beta _{n}}$ are named after their developers and are given by the following formulas:

• Fletcher–Reeves:[1]
${\displaystyle \beta _{n}^{FR}={\frac {\Delta x_{n}^{T}\Delta x_{n}}{\Delta x_{n-1}^{T}\Delta x_{n-1}}}}$
• Polak–Ribière:[2]
${\displaystyle \beta _{n}^{PR}={\frac {\Delta x_{n}^{T}(\Delta x_{n}-\Delta x_{n-1})}{\Delta x_{n-1}^{T}\Delta x_{n-1}}}}$
• Hestenes-Stiefel:[3]
${\displaystyle \beta _{n}^{HS}=-{\frac {\Delta x_{n}^{T}(\Delta x_{n}-\Delta x_{n-1})}{s_{n-1}^{T}(\Delta x_{n}-\Delta x_{n-1})}}}$
• Dai–Yuan:[4]
${\displaystyle \beta _{n}^{DY}=-{\frac {\Delta x_{n}^{T}\Delta x_{n}}{s_{n-1}^{T}(\Delta x_{n}-\Delta x_{n-1})}}}$.

These formulas are equivalent for a quadratic function, but for nonlinear optimization the preferred formula is a matter of heuristics or taste. A popular choice is ${\displaystyle \displaystyle \beta =\max\{0,\,\beta ^{PR}\}}$ which provides a direction reset automatically[citation needed].

Newton based methods - Newton-Raphson Algorithm, Quasi-Newton methods (e.g., BFGS method) - tend to converge in fewer iterations, although each iteration typically requires more computation than a conjugate gradient iteration as Newton-like methods require computing the Hessian (matrix of second derivatives) in addition to the gradient. Quasi-Newton methods also require more memory to operate (see also the limited memory L-BFGS method).