Derivation of the conjugate gradient method

In numerical linear algebra, the conjugate gradient method is an iterative method for numerically solving the linear system

${\displaystyle {\boldsymbol {Ax}}={\boldsymbol {b}}}$

where ${\displaystyle {\boldsymbol {A}}}$ is symmetric positive-definite. The conjugate gradient method can be derived from several different perspectives, including specialization of the conjugate direction method for optimization, and variation of the Arnoldi/Lanczos iteration for eigenvalue problems.

Derivation from the conjugate direction method

The conjugate gradient method can be seen as a special case of the conjugate direction method applied to minimization of the quadratic function

${\displaystyle f({\boldsymbol {x}})={\boldsymbol {x}}^{\mathrm {T} }{\boldsymbol {A}}{\boldsymbol {x}}-2{\boldsymbol {b}}^{\mathrm {T} }{\boldsymbol {x}}{\text{.}}}$

The conjugate direction method

In the conjugate direction method for minimizing

${\displaystyle f({\boldsymbol {x}})={\boldsymbol {x}}^{\mathrm {T} }{\boldsymbol {A}}{\boldsymbol {x}}-2{\boldsymbol {b}}^{\mathrm {T} }{\boldsymbol {x}}{\text{.}}}$

one starts with an initial guess ${\displaystyle {\boldsymbol {x}}_{0}}$ and the corresponding residual ${\displaystyle {\boldsymbol {r}}_{0}={\boldsymbol {b}}-{\boldsymbol {Ax}}_{0}}$, and computes the iterate and residual by the formulae

{\displaystyle {\begin{aligned}\alpha _{i}&={\frac {{\boldsymbol {p}}_{i}^{\mathrm {T} }{\boldsymbol {r}}_{i}}{{\boldsymbol {p}}_{i}^{\mathrm {T} }{\boldsymbol {Ap}}_{i}}}{\text{,}}\\{\boldsymbol {x}}_{i+1}&={\boldsymbol {x}}_{i}+\alpha _{i}{\boldsymbol {p}}_{i}{\text{,}}\\{\boldsymbol {r}}_{i+1}&={\boldsymbol {r}}_{i}-\alpha _{i}{\boldsymbol {Ap}}_{i}\end{aligned}}}

where ${\displaystyle {\boldsymbol {p}}_{0},{\boldsymbol {p}}_{1},{\boldsymbol {p}}_{2},\ldots }$ are a series of mutually conjugate directions, i.e.,

${\displaystyle {\boldsymbol {p}}_{i}^{\mathrm {T} }{\boldsymbol {Ap}}_{j}=0}$

for any ${\displaystyle i\neq j}$.

The conjugate direction method is imprecise in the sense that no formulae are given for selection of the directions ${\displaystyle {\boldsymbol {p}}_{0},{\boldsymbol {p}}_{1},{\boldsymbol {p}}_{2},\ldots }$. Specific choices lead to various methods including the conjugate gradient method and Gaussian elimination.

Derivation from the Arnoldi/Lanczos iteration

The conjugate gradient method can also be seen as a variant of the Arnoldi/Lanczos iteration applied to solving linear systems.

The general Arnoldi method

In the Arnoldi iteration, one starts with a vector ${\displaystyle {\boldsymbol {r}}_{0}}$ and gradually builds an orthonormal basis ${\displaystyle \{{\boldsymbol {v}}_{1},{\boldsymbol {v}}_{2},{\boldsymbol {v}}_{3},\ldots \}}$ of the Krylov subspace

${\displaystyle {\mathcal {K}}({\boldsymbol {A}},{\boldsymbol {r}}_{0})=\{{\boldsymbol {r}}_{0},{\boldsymbol {Ar}}_{0},{\boldsymbol {A}}^{2}{\boldsymbol {r}}_{0},\ldots \}}$

by defining ${\displaystyle {\boldsymbol {v}}_{i}={\boldsymbol {w}}_{i}/\lVert {\boldsymbol {w}}_{i}\rVert _{2}}$ where

${\displaystyle {\boldsymbol {w}}_{i}={\begin{cases}{\boldsymbol {r}}_{0}&{\text{if }}i=1{\text{,}}\\{\boldsymbol {Av}}_{i-1}-\sum _{j=1}^{i-1}({\boldsymbol {v}}_{j}^{\mathrm {T} }{\boldsymbol {Av}}_{i-1}){\boldsymbol {v}}_{j}&{\text{if }}i>1{\text{.}}\end{cases}}}$

In other words, for ${\displaystyle i>1}$, ${\displaystyle {\boldsymbol {v}}_{i}}$ is found by Gram-Schmidt orthogonalizing ${\displaystyle {\boldsymbol {Av}}_{i-1}}$ against ${\displaystyle \{{\boldsymbol {v}}_{1},{\boldsymbol {v}}_{2},\ldots ,{\boldsymbol {v}}_{i-1}\}}$ followed by normalization.

Put in matrix form, the iteration is captured by the equation

${\displaystyle {\boldsymbol {AV}}_{i}={\boldsymbol {V}}_{i+1}{\boldsymbol {\tilde {H}}}_{i}}$

where

{\displaystyle {\begin{aligned}{\boldsymbol {V}}_{i}&={\begin{bmatrix}{\boldsymbol {v}}_{1}&{\boldsymbol {v}}_{2}&\cdots &{\boldsymbol {v}}_{i}\end{bmatrix}}{\text{,}}\\{\boldsymbol {\tilde {H}}}_{i}&={\begin{bmatrix}h_{11}&h_{12}&h_{13}&\cdots &h_{1,i}\\h_{21}&h_{22}&h_{23}&\cdots &h_{2,i}\\&h_{32}&h_{33}&\cdots &h_{3,i}\\&&\ddots &\ddots &\vdots \\&&&h_{i,i-1}&h_{i,i}\\&&&&h_{i+1,i}\end{bmatrix}}={\begin{bmatrix}{\boldsymbol {H}}_{i}\\h_{i+1,i}{\boldsymbol {e}}_{i}^{\mathrm {T} }\end{bmatrix}}\end{aligned}}}

with

${\displaystyle h_{ji}={\begin{cases}{\boldsymbol {v}}_{j}^{\mathrm {T} }{\boldsymbol {Av}}_{i}&{\text{if }}j\leq i{\text{,}}\\\lVert {\boldsymbol {w}}_{i+1}\rVert _{2}&{\text{if }}j=i+1{\text{,}}\\0&{\text{if }}j>i+1{\text{.}}\end{cases}}}$

When applying the Arnoldi iteration to solving linear systems, one starts with ${\displaystyle {\boldsymbol {r}}_{0}={\boldsymbol {b}}-{\boldsymbol {Ax}}_{0}}$, the residual corresponding to an initial guess ${\displaystyle {\boldsymbol {x}}_{0}}$. After each step of iteration, one computes ${\displaystyle {\boldsymbol {y}}_{i}={\boldsymbol {H}}_{i}^{-1}(\lVert {\boldsymbol {r}}_{0}\rVert _{2}{\boldsymbol {e}}_{1})}$ and the new iterate ${\displaystyle {\boldsymbol {x}}_{i}={\boldsymbol {x}}_{0}+{\boldsymbol {V}}_{i}{\boldsymbol {y}}_{i}}$.

The direct Lanczos method

For the rest of discussion, we assume that ${\displaystyle {\boldsymbol {A}}}$ is symmetric positive-definite. With symmetry of ${\displaystyle {\boldsymbol {A}}}$, the upper Hessenberg matrix ${\displaystyle {\boldsymbol {H}}_{i}={\boldsymbol {V}}_{i}^{\mathrm {T} }{\boldsymbol {AV}}_{i}}$ becomes symmetric and thus tridiagonal. It then can be more clearly denoted by

${\displaystyle {\boldsymbol {H}}_{i}={\begin{bmatrix}a_{1}&b_{2}\\b_{2}&a_{2}&b_{3}\\&\ddots &\ddots &\ddots \\&&b_{i-1}&a_{i-1}&b_{i}\\&&&b_{i}&a_{i}\end{bmatrix}}{\text{.}}}$

This enables a short three-term recurrence for ${\displaystyle {\boldsymbol {v}}_{i}}$ in the iteration, and the Arnoldi iteration is reduced to the Lanczos iteration.

Since ${\displaystyle {\boldsymbol {A}}}$ is symmetric positive-definite, so is ${\displaystyle {\boldsymbol {H}}_{i}}$. Hence, ${\displaystyle {\boldsymbol {H}}_{i}}$ can be LU factorized without partial pivoting into

${\displaystyle {\boldsymbol {H}}_{i}={\boldsymbol {L}}_{i}{\boldsymbol {U}}_{i}={\begin{bmatrix}1\\c_{2}&1\\&\ddots &\ddots \\&&c_{i-1}&1\\&&&c_{i}&1\end{bmatrix}}{\begin{bmatrix}d_{1}&b_{2}\\&d_{2}&b_{3}\\&&\ddots &\ddots \\&&&d_{i-1}&b_{i}\\&&&&d_{i}\end{bmatrix}}}$

with convenient recurrences for ${\displaystyle c_{i}}$ and ${\displaystyle d_{i}}$:

{\displaystyle {\begin{aligned}c_{i}&=b_{i}/d_{i-1}{\text{,}}\\d_{i}&={\begin{cases}a_{1}&{\text{if }}i=1{\text{,}}\\a_{i}-c_{i}b_{i}&{\text{if }}i>1{\text{.}}\end{cases}}\end{aligned}}}

Rewrite ${\displaystyle {\boldsymbol {x}}_{i}={\boldsymbol {x}}_{0}+{\boldsymbol {V}}_{i}{\boldsymbol {y}}_{i}}$ as

{\displaystyle {\begin{aligned}{\boldsymbol {x}}_{i}&={\boldsymbol {x}}_{0}+{\boldsymbol {V}}_{i}{\boldsymbol {H}}_{i}^{-1}(\lVert {\boldsymbol {r}}_{0}\rVert _{2}{\boldsymbol {e}}_{1})\\&={\boldsymbol {x}}_{0}+{\boldsymbol {V}}_{i}{\boldsymbol {U}}_{i}^{-1}{\boldsymbol {L}}_{i}^{-1}(\lVert {\boldsymbol {r}}_{0}\rVert _{2}{\boldsymbol {e}}_{1})\\&={\boldsymbol {x}}_{0}+{\boldsymbol {P}}_{i}{\boldsymbol {z}}_{i}\end{aligned}}}

with

{\displaystyle {\begin{aligned}{\boldsymbol {P}}_{i}&={\boldsymbol {V}}_{i}{\boldsymbol {U}}_{i}^{-1}{\text{,}}\\{\boldsymbol {z}}_{i}&={\boldsymbol {L}}_{i}^{-1}(\lVert {\boldsymbol {r}}_{0}\rVert _{2}{\boldsymbol {e}}_{1}){\text{.}}\end{aligned}}}

It is now important to observe that

{\displaystyle {\begin{aligned}{\boldsymbol {P}}_{i}&={\begin{bmatrix}{\boldsymbol {P}}_{i-1}&{\boldsymbol {p}}_{i}\end{bmatrix}}{\text{,}}\\{\boldsymbol {z}}_{i}&={\begin{bmatrix}{\boldsymbol {z}}_{i-1}\\\zeta _{i}\end{bmatrix}}{\text{.}}\end{aligned}}}

In fact, there are short recurrences for ${\displaystyle {\boldsymbol {p}}_{i}}$ and ${\displaystyle \zeta _{i}}$ as well:

{\displaystyle {\begin{aligned}{\boldsymbol {p}}_{i}&={\frac {1}{d_{i}}}({\boldsymbol {v}}_{i}-b_{i}{\boldsymbol {p}}_{i-1}){\text{,}}\\\zeta _{i}&=-c_{i}\zeta _{i-1}{\text{.}}\end{aligned}}}

With this formulation, we arrive at a simple recurrence for ${\displaystyle {\boldsymbol {x}}_{i}}$:

{\displaystyle {\begin{aligned}{\boldsymbol {x}}_{i}&={\boldsymbol {x}}_{0}+{\boldsymbol {P}}_{i}{\boldsymbol {z}}_{i}\\&={\boldsymbol {x}}_{0}+{\boldsymbol {P}}_{i-1}{\boldsymbol {z}}_{i-1}+\zeta _{i}{\boldsymbol {p}}_{i}\\&={\boldsymbol {x}}_{i-1}+\zeta _{i}{\boldsymbol {p}}_{i}{\text{.}}\end{aligned}}}

The relations above straightforwardly lead to the direct Lanczos method, which turns out to be slightly more complex.

The conjugate gradient method from imposing orthogonality and conjugacy

If we allow ${\displaystyle {\boldsymbol {p}}_{i}}$ to scale and compensate for the scaling in the constant factor, we potentially can have simpler recurrences of the form:

{\displaystyle {\begin{aligned}{\boldsymbol {x}}_{i}&={\boldsymbol {x}}_{i-1}+\alpha _{i-1}{\boldsymbol {p}}_{i-1}{\text{,}}\\{\boldsymbol {r}}_{i}&={\boldsymbol {r}}_{i-1}-\alpha _{i-1}{\boldsymbol {Ap}}_{i-1}{\text{,}}\\{\boldsymbol {p}}_{i}&={\boldsymbol {r}}_{i}+\beta _{i-1}{\boldsymbol {p}}_{i-1}{\text{.}}\end{aligned}}}

As premises for the simplification, we now derive the orthogonality of ${\displaystyle {\boldsymbol {r}}_{i}}$ and conjugacy of ${\displaystyle {\boldsymbol {p}}_{i}}$, i.e., for ${\displaystyle i\neq j}$,

{\displaystyle {\begin{aligned}{\boldsymbol {r}}_{i}^{\mathrm {T} }{\boldsymbol {r}}_{j}&=0{\text{,}}\\{\boldsymbol {p}}_{i}^{\mathrm {T} }{\boldsymbol {Ap}}_{j}&=0{\text{.}}\end{aligned}}}

The residuals are mutually orthogonal because ${\displaystyle {\boldsymbol {r}}_{i}}$ is essentially a multiple of ${\displaystyle {\boldsymbol {v}}_{i+1}}$ since for ${\displaystyle i=0}$, ${\displaystyle {\boldsymbol {r}}_{0}=\lVert {\boldsymbol {r}}_{0}\rVert _{2}{\boldsymbol {v}}_{1}}$, for ${\displaystyle i>0}$,

{\displaystyle {\begin{aligned}{\boldsymbol {r}}_{i}&={\boldsymbol {b}}-{\boldsymbol {Ax}}_{i}\\&={\boldsymbol {b}}-{\boldsymbol {A}}({\boldsymbol {x}}_{0}+{\boldsymbol {V}}_{i}{\boldsymbol {y}}_{i})\\&={\boldsymbol {r}}_{0}-{\boldsymbol {AV}}_{i}{\boldsymbol {y}}_{i}\\&={\boldsymbol {r}}_{0}-{\boldsymbol {V}}_{i+1}{\boldsymbol {\tilde {H}}}_{i}{\boldsymbol {y}}_{i}\\&={\boldsymbol {r}}_{0}-{\boldsymbol {V}}_{i}{\boldsymbol {H}}_{i}{\boldsymbol {y}}_{i}-h_{i+1,i}({\boldsymbol {e}}_{i}^{\mathrm {T} }{\boldsymbol {y}}_{i}){\boldsymbol {v}}_{i+1}\\&=\lVert {\boldsymbol {r}}_{0}\rVert _{2}{\boldsymbol {v}}_{1}-{\boldsymbol {V}}_{i}(\lVert {\boldsymbol {r}}_{0}\rVert _{2}{\boldsymbol {e}}_{1})-h_{i+1,i}({\boldsymbol {e}}_{i}^{\mathrm {T} }{\boldsymbol {y}}_{i}){\boldsymbol {v}}_{i+1}\\&=-h_{i+1,i}({\boldsymbol {e}}_{i}^{\mathrm {T} }{\boldsymbol {y}}_{i}){\boldsymbol {v}}_{i+1}{\text{.}}\end{aligned}}}

To see the conjugacy of ${\displaystyle {\boldsymbol {p}}_{i}}$, it suffices to show that ${\displaystyle {\boldsymbol {P}}_{i}^{\mathrm {T} }{\boldsymbol {AP}}_{i}}$ is diagonal:

{\displaystyle {\begin{aligned}{\boldsymbol {P}}_{i}^{\mathrm {T} }{\boldsymbol {AP}}_{i}&={\boldsymbol {U}}_{i}^{-\mathrm {T} }{\boldsymbol {V}}_{i}^{\mathrm {T} }{\boldsymbol {AV}}_{i}{\boldsymbol {U}}_{i}^{-1}\\&={\boldsymbol {U}}_{i}^{-\mathrm {T} }{\boldsymbol {H}}_{i}{\boldsymbol {U}}_{i}^{-1}\\&={\boldsymbol {U}}_{i}^{-\mathrm {T} }{\boldsymbol {L}}_{i}{\boldsymbol {U}}_{i}{\boldsymbol {U}}_{i}^{-1}\\&={\boldsymbol {U}}_{i}^{-\mathrm {T} }{\boldsymbol {L}}_{i}\end{aligned}}}

is symmetric and lower triangular simultaneously and thus must be diagonal.

Now we can derive the constant factors ${\displaystyle \alpha _{i}}$ and ${\displaystyle \beta _{i}}$ with respect to the scaled ${\displaystyle {\boldsymbol {p}}_{i}}$ by solely imposing the orthogonality of ${\displaystyle {\boldsymbol {r}}_{i}}$ and conjugacy of ${\displaystyle {\boldsymbol {p}}_{i}}$.

Due to the orthogonality of ${\displaystyle {\boldsymbol {r}}_{i}}$, it is necessary that ${\displaystyle {\boldsymbol {r}}_{i+1}^{\mathrm {T} }{\boldsymbol {r}}_{i}=({\boldsymbol {r}}_{i}-\alpha _{i}{\boldsymbol {Ap}}_{i})^{\mathrm {T} }{\boldsymbol {r}}_{i}=0}$. As a result,

{\displaystyle {\begin{aligned}\alpha _{i}&={\frac {{\boldsymbol {r}}_{i}^{\mathrm {T} }{\boldsymbol {r}}_{i}}{{\boldsymbol {r}}_{i}^{\mathrm {T} }{\boldsymbol {Ap}}_{i}}}\\&={\frac {{\boldsymbol {r}}_{i}^{\mathrm {T} }{\boldsymbol {r}}_{i}}{({\boldsymbol {p}}_{i}-\beta _{i-1}{\boldsymbol {p}}_{i-1})^{\mathrm {T} }{\boldsymbol {Ap}}_{i}}}\\&={\frac {{\boldsymbol {r}}_{i}^{\mathrm {T} }{\boldsymbol {r}}_{i}}{{\boldsymbol {p}}_{i}^{\mathrm {T} }{\boldsymbol {Ap}}_{i}}}{\text{.}}\end{aligned}}}

Similarly, due to the conjugacy of ${\displaystyle {\boldsymbol {p}}_{i}}$, it is necessary that ${\displaystyle {\boldsymbol {p}}_{i+1}^{\mathrm {T} }{\boldsymbol {Ap}}_{i}=({\boldsymbol {r}}_{i+1}+\beta _{i}{\boldsymbol {p}}_{i})^{\mathrm {T} }{\boldsymbol {Ap}}_{i}=0}$. As a result,

{\displaystyle {\begin{aligned}\beta _{i}&=-{\frac {{\boldsymbol {r}}_{i+1}^{\mathrm {T} }{\boldsymbol {Ap}}_{i}}{{\boldsymbol {p}}_{i}^{\mathrm {T} }{\boldsymbol {Ap}}_{i}}}\\&=-{\frac {{\boldsymbol {r}}_{i+1}^{\mathrm {T} }({\boldsymbol {r}}_{i}-{\boldsymbol {r}}_{i+1})}{\alpha _{i}{\boldsymbol {p}}_{i}^{\mathrm {T} }{\boldsymbol {Ap}}_{i}}}\\&={\frac {{\boldsymbol {r}}_{i+1}^{\mathrm {T} }{\boldsymbol {r}}_{i+1}}{{\boldsymbol {r}}_{i}^{\mathrm {T} }{\boldsymbol {r}}_{i}}}{\text{.}}\end{aligned}}}

This completes the derivation.

References

1. Hestenes, M. R.; Stiefel, E. (December 1952). "Methods of conjugate gradients for solving linear systems" (PDF). Journal of Research of the National Bureau of Standards. 49 (6).
2. Saad, Y. (2003). "Chapter 6: Krylov Subspace Methods, Part I". Iterative methods for sparse linear systems (2nd ed.). SIAM. ISBN 978-0-89871-534-7.