Derivation of the conjugate gradient method


In numerical linear algebra, the conjugate gradient method is an iterative method for numerically solving the linear system

\boldsymbol{Ax}=\boldsymbol{b}

where \boldsymbol{A} is symmetric positive-definite. The conjugate gradient method can be derived from several different perspectives, including specialization of the conjugate direction method for optimization, and variation of the Arnoldi/Lanczos iteration for eigenvalue problems.

The intent of this article is to document the important steps in these derivations.

Derivation from the conjugate direction method

The conjugate gradient method can be seen as a special case of the conjugate direction method applied to minimization of the quadratic function

f(\boldsymbol{x})=\boldsymbol{x}^\mathrm{T}\boldsymbol{A}\boldsymbol{x}-2\boldsymbol{b}^\mathrm{T}\boldsymbol{x}\text{.}

Since \nabla f(\boldsymbol{x})=2(\boldsymbol{Ax}-\boldsymbol{b}), the unique minimizer of f is exactly the solution of \boldsymbol{Ax}=\boldsymbol{b}.

The conjugate direction method

In the conjugate direction method for minimizing

f(\boldsymbol{x})=\boldsymbol{x}^\mathrm{T}\boldsymbol{A}\boldsymbol{x}-2\boldsymbol{b}^\mathrm{T}\boldsymbol{x}\text{,}

one starts with an initial guess \boldsymbol{x}_0 and the corresponding residual \boldsymbol{r}_0=\boldsymbol{b}-\boldsymbol{Ax}_0, and computes successive iterates and residuals by the formulae

\begin{align}
\alpha_i&=\frac{\boldsymbol{p}_i^\mathrm{T}\boldsymbol{r}_i}{\boldsymbol{p}_i^\mathrm{T}\boldsymbol{Ap}_i}\text{,}\\
\boldsymbol{x}_{i+1}&=\boldsymbol{x}_i+\alpha_i\boldsymbol{p}_i\text{,}\\
\boldsymbol{r}_{i+1}&=\boldsymbol{r}_i-\alpha_i\boldsymbol{Ap}_i
\end{align}

where \boldsymbol{p}_0,\boldsymbol{p}_1,\boldsymbol{p}_2,\ldots are a series of mutually conjugate directions, i.e.,

\boldsymbol{p}_i^\mathrm{T}\boldsymbol{Ap}_j=0

for any i\neq j.
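
As a concrete illustration, the iteration above can be sketched in NumPy as follows, assuming the conjugate directions are supplied in advance (the function name and signature are ours, for illustration only):

    import numpy as np

    def conjugate_direction_solve(A, b, x0, directions):
        """Run the conjugate direction iteration with a given sequence
        of mutually A-conjugate directions p_0, p_1, ..."""
        x = x0.copy()
        r = b - A @ x                     # r_0 = b - A x_0
        for p in directions:
            Ap = A @ p
            alpha = (p @ r) / (p @ Ap)    # alpha_i = p_i^T r_i / p_i^T A p_i
            x = x + alpha * p             # x_{i+1} = x_i + alpha_i p_i
            r = r - alpha * Ap            # r_{i+1} = r_i - alpha_i A p_i
        return x

In exact arithmetic, n linearly independent conjugate directions drive the residual to zero in at most n steps.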

The conjugate direction method is underspecified in the sense that no formulae are given for the selection of the directions \boldsymbol{p}_0,\boldsymbol{p}_1,\boldsymbol{p}_2,\ldots. Specific choices lead to various methods, including the conjugate gradient method and Gaussian elimination.

Derivation from the Arnoldi/Lanczos iteration

Further information: Arnoldi iteration and Lanczos iteration

The conjugate gradient method can also be seen as a variant of the Arnoldi/Lanczos iteration applied to solving linear systems.

The general Arnoldi method

In the Arnoldi iteration, one starts with a vector \boldsymbol{r}_0 and gradually builds an orthonormal basis \{\boldsymbol{v}_1,\boldsymbol{v}_2,\boldsymbol{v}_3,\ldots\} of the Krylov subspace

\mathcal{K}(\boldsymbol{A},\boldsymbol{r}_0)=\operatorname{span}\{\boldsymbol{r}_0,\boldsymbol{Ar}_0,\boldsymbol{A}^2\boldsymbol{r}_0,\ldots\}

by defining \boldsymbol{v}_i=\boldsymbol{w}_i/\lVert\boldsymbol{w}_i\rVert_2 where

\boldsymbol{w}_i=\begin{cases}
\boldsymbol{r}_0 & \text{if }i=1\text{,}\\
\boldsymbol{Av}_{i-1}-\sum_{j=1}^{i-1}(\boldsymbol{v}_j^\mathrm{T}\boldsymbol{Av}_{i-1})\boldsymbol{v}_j & \text{if }i>1\text{.}
\end{cases}

In other words, for i>1, \boldsymbol{v}_i is found by Gram-Schmidt orthogonalizing \boldsymbol{Av}_{i-1} against \{\boldsymbol{v}_1,\boldsymbol{v}_2,\ldots,\boldsymbol{v}_{i-1}\} followed by normalization.
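
In code, this basis-building loop might look as follows; a minimal NumPy sketch using the modified Gram-Schmidt variant (equivalent in exact arithmetic, more stable in floating point) and assuming no breakdown, i.e., h_{i+1,i}\neq 0:

    import numpy as np

    def arnoldi(A, r0, m):
        """Build the orthonormal Krylov basis V_{m+1} and the
        (m+1)-by-m upper Hessenberg matrix given in matrix form below."""
        n = len(r0)
        V = np.zeros((n, m + 1))
        H = np.zeros((m + 1, m))
        V[:, 0] = r0 / np.linalg.norm(r0)       # v_1 = r_0 / ||r_0||_2
        for i in range(m):
            w = A @ V[:, i]                     # expand with A v_i
            for j in range(i + 1):
                H[j, i] = V[:, j] @ w           # h_{ji} = v_j^T A v_i
                w = w - H[j, i] * V[:, j]       # orthogonalize against v_j
            H[i + 1, i] = np.linalg.norm(w)     # h_{i+1,i} = ||w_{i+1}||_2
            V[:, i + 1] = w / H[i + 1, i]       # normalize to get v_{i+1}
        return V, H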

Put in matrix form, the iteration is captured by the equation

\boldsymbol{AV}_i=\boldsymbol{V}_{i+1}\boldsymbol{\tilde{H}}_i

where

\begin{align}
\boldsymbol{V}_i&=\begin{bmatrix}
\boldsymbol{v}_1 & \boldsymbol{v}_2 & \cdots & \boldsymbol{v}_i
\end{bmatrix}\text{,}\\
\boldsymbol{\tilde{H}}_i&=\begin{bmatrix}
h_{11} & h_{12} & h_{13} & \cdots & h_{1,i}\\
h_{21} & h_{22} & h_{23} & \cdots & h_{2,i}\\
& h_{32} & h_{33} & \cdots & h_{3,i}\\
& & \ddots & \ddots & \vdots\\
& & & h_{i,i-1} & h_{i,i}\\
& & & & h_{i+1,i}
\end{bmatrix}=\begin{bmatrix}
\boldsymbol{H}_i\\
h_{i+1,i}\boldsymbol{e}_i^\mathrm{T}
\end{bmatrix}
\end{align}

with

h_{ji}=\begin{cases}
\boldsymbol{v}_j^\mathrm{T}\boldsymbol{Av}_i & \text{if }j\leq i\text{,}\\
\lVert\boldsymbol{w}_{i+1}\rVert_2 & \text{if }j=i+1\text{,}\\
0 & \text{if }j>i+1\text{.}
\end{cases}

When applying the Arnoldi iteration to solving linear systems, one starts with \boldsymbol{r}_0=\boldsymbol{b}-\boldsymbol{Ax}_0, the residual corresponding to an initial guess \boldsymbol{x}_0. After each step of the iteration, one computes \boldsymbol{y}_i=\boldsymbol{H}_i^{-1}(\lVert\boldsymbol{r}_0\rVert_2\boldsymbol{e}_1) and the new iterate \boldsymbol{x}_i=\boldsymbol{x}_0+\boldsymbol{V}_i\boldsymbol{y}_i.
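
This solve step can be sketched by reusing the hypothetical arnoldi routine above (performing a single solve after m steps, and assuming \boldsymbol{H}_m is nonsingular, are simplifications):

    # After m Arnoldi steps, solve H_m y = ||r_0||_2 e_1 and form x_m.
    r0 = b - A @ x0
    V, H = arnoldi(A, r0, m)
    e1 = np.zeros(m)
    e1[0] = 1.0
    y = np.linalg.solve(H[:m, :m], np.linalg.norm(r0) * e1)
    x_m = x0 + V[:, :m] @ y               # x_m = x_0 + V_m y_m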

The direct Lanczos method

For the rest of the discussion, we assume that \boldsymbol{A} is symmetric positive-definite. By the symmetry of \boldsymbol{A}, the upper Hessenberg matrix \boldsymbol{H}_i=\boldsymbol{V}_i^\mathrm{T}\boldsymbol{AV}_i becomes symmetric and thus tridiagonal. It can then be more clearly denoted by

\boldsymbol{H}_i=\begin{bmatrix}
a_1 & b_2\\
b_2 & a_2 & b_3\\
& \ddots & \ddots & \ddots\\
& & b_{i-1} & a_{i-1} & b_i\\
& & & b_i & a_i
\end{bmatrix}\text{.}

This enables a short three-term recurrence for \boldsymbol{v}_i in the iteration, and the Arnoldi iteration is reduced to the Lanczos iteration.
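
A minimal NumPy sketch of the resulting Lanczos loop, storing only the tridiagonal entries (array names and the 0-based storage convention are ours):

    import numpy as np

    def lanczos(A, r0, m):
        """Three-term Lanczos recurrence for symmetric A: returns the
        basis V_m plus the diagonal a and off-diagonal b of H_m."""
        n = len(r0)
        V = np.zeros((n, m))
        a = np.zeros(m)                   # a[k] stores a_{k+1}
        b = np.zeros(max(m - 1, 0))       # b[k] stores b_{k+2}
        V[:, 0] = r0 / np.linalg.norm(r0)
        for i in range(m):
            w = A @ V[:, i]
            if i > 0:
                w = w - b[i - 1] * V[:, i - 1]   # three-term recurrence
            a[i] = V[:, i] @ w                   # diagonal entry v_i^T A v_i
            w = w - a[i] * V[:, i]
            if i + 1 < m:
                b[i] = np.linalg.norm(w)         # next off-diagonal entry
                V[:, i + 1] = w / b[i]
        return V, a, b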

Since \boldsymbol{A} is symmetric positive-definite, so is \boldsymbol{H}_i. Hence, \boldsymbol{H}_i can be LU factorized without partial pivoting into

\boldsymbol{H}_i=\boldsymbol{L}_i\boldsymbol{U}_i=\begin{bmatrix}
1\\
c_2 & 1\\
& \ddots & \ddots\\
& & c_{i-1} & 1\\
& & & c_i & 1
\end{bmatrix}\begin{bmatrix}
d_1 & b_2\\
& d_2 & b_3\\
& & \ddots & \ddots\\
& & & d_{i-1} & b_i\\
& & & & d_i
\end{bmatrix}

with convenient recurrences for c_i and d_i:

\begin{align}
c_i&=b_i/d_{i-1}\quad\text{for }i>1\text{,}\\
d_i&=\begin{cases}
a_1 & \text{if }i=1\text{,}\\
a_i-c_ib_i & \text{if }i>1\text{.}
\end{cases}
\end{align}
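
In code, these recurrences yield the factors in O(i) operations; a sketch under the same conventions as above (positive-definiteness keeps every pivot d_i positive, so no pivoting is required):

    import numpy as np

    def tridiag_lu(a, b):
        """LU factors of the tridiagonal H_i via the recurrences above;
        a holds the diagonal a_1..a_i, b the off-diagonal b_2..b_i."""
        n = len(a)
        d = np.zeros(n)                    # diagonal of U
        c = np.zeros(n)                    # subdiagonal of L; c[0] unused
        d[0] = a[0]                        # d_1 = a_1
        for k in range(1, n):
            c[k] = b[k - 1] / d[k - 1]     # c_i = b_i / d_{i-1}
            d[k] = a[k] - c[k] * b[k - 1]  # d_i = a_i - c_i b_i
        return c, d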

Rewrite \boldsymbol{x}_i=\boldsymbol{x}_0+\boldsymbol{V}_i\boldsymbol{y}_i as

\begin{align}
\boldsymbol{x}_i&=\boldsymbol{x}_0+\boldsymbol{V}_i\boldsymbol{H}_i^{-1}(\lVert\boldsymbol{r}_0\rVert_2\boldsymbol{e}_1)\\
&=\boldsymbol{x}_0+\boldsymbol{V}_i\boldsymbol{U}_i^{-1}\boldsymbol{L}_i^{-1}(\lVert\boldsymbol{r}_0\rVert_2\boldsymbol{e}_1)\\
&=\boldsymbol{x}_0+\boldsymbol{P}_i\boldsymbol{z}_i
\end{align}

with

\begin{align}
\boldsymbol{P}_i&=\boldsymbol{V}_{i}\boldsymbol{U}_i^{-1}\text{,}\\
\boldsymbol{z}_i&=\boldsymbol{L}_i^{-1}(\lVert\boldsymbol{r}_0\rVert_2\boldsymbol{e}_1)\text{.}
\end{align}

It is now important to observe that, since the leading principal blocks of \boldsymbol{L}_i and \boldsymbol{U}_i are exactly \boldsymbol{L}_{i-1} and \boldsymbol{U}_{i-1},

\begin{align}
\boldsymbol{P}_i&=\begin{bmatrix}
\boldsymbol{P}_{i-1} & \boldsymbol{p}_i
\end{bmatrix}\text{,}\\
\boldsymbol{z}_i&=\begin{bmatrix}
\boldsymbol{z}_{i-1}\\
\zeta_i
\end{bmatrix}\text{.}
\end{align}

In fact, there are short recurrences for \boldsymbol{p}_i and \zeta_i as well. With \boldsymbol{p}_1=\boldsymbol{v}_1/d_1 and \zeta_1=\lVert\boldsymbol{r}_0\rVert_2, for i>1,

\begin{align}
\boldsymbol{p}_i&=\frac{1}{d_i}(\boldsymbol{v}_i-b_i\boldsymbol{p}_{i-1})\text{,}\\
\zeta_i&=-c_i\zeta_{i-1}\text{.}
\end{align}

With this formulation, we arrive at a simple recurrence for \boldsymbol{x}_i:

\begin{align}
\boldsymbol{x}_i&=\boldsymbol{x}_0+\boldsymbol{P}_i\boldsymbol{z}_i\\
&=\boldsymbol{x}_0+\boldsymbol{P}_{i-1}\boldsymbol{z}_{i-1}+\zeta_i\boldsymbol{p}_i\\
&=\boldsymbol{x}_{i-1}+\zeta_i\boldsymbol{p}_i\text{.}
\end{align}

The relations above straightforwardly lead to the direct Lanczos method, which turns out to be slightly more complex than the conjugate gradient method derived below.
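
Assembled into code, the direct Lanczos method might look like the following NumPy sketch, which interleaves the Lanczos recurrence with the running updates of c_i, d_i, \zeta_i, \boldsymbol{p}_i and \boldsymbol{x}_i (all names are ours; the breakdown check is a simplification):

    import numpy as np

    def direct_lanczos(A, b, x0, m):
        """Direct Lanczos: three-term recurrence for v_i plus the short
        recurrences for d_i, c_i, zeta_i, p_i and x_i derived above."""
        x = x0.copy()
        r0 = b - A @ x0
        beta = np.linalg.norm(r0)
        v = r0 / beta                      # v_1
        v_prev = np.zeros_like(v)
        p_prev = np.zeros_like(v)
        b_i = 0.0                          # no b_1; zero keeps the formulas uniform
        d_prev = zeta = 0.0
        for i in range(1, m + 1):
            w = A @ v
            a_i = v @ w                    # a_i = v_i^T A v_i
            w = w - a_i * v - b_i * v_prev # three-term recurrence
            if i == 1:
                d_i = a_i                  # d_1 = a_1
                zeta = beta                # zeta_1 = ||r_0||_2
            else:
                c_i = b_i / d_prev         # c_i = b_i / d_{i-1}
                d_i = a_i - c_i * b_i      # d_i = a_i - c_i b_i
                zeta = -c_i * zeta         # zeta_i = -c_i zeta_{i-1}
            p = (v - b_i * p_prev) / d_i   # p_i = (v_i - b_i p_{i-1}) / d_i
            x = x + zeta * p               # x_i = x_{i-1} + zeta_i p_i
            b_next = np.linalg.norm(w)
            if b_next == 0.0:              # Krylov subspace exhausted
                break
            v_prev, v = v, w / b_next      # v_{i+1}
            p_prev, d_prev, b_i = p, d_i, b_next
        return x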

The conjugate gradient method from imposing orthogonality and conjugacy

If we allow \boldsymbol{p}_i to be rescaled and compensate for the scaling in the constant factors, we can potentially obtain simpler recurrences of the form:

\begin{align}
\boldsymbol{x}_i&=\boldsymbol{x}_{i-1}+\alpha_{i-1}\boldsymbol{p}_{i-1}\text{,}\\
\boldsymbol{r}_i&=\boldsymbol{r}_{i-1}-\alpha_{i-1}\boldsymbol{Ap}_{i-1}\text{,}\\
\boldsymbol{p}_i&=\boldsymbol{r}_i+\beta_{i-1}\boldsymbol{p}_{i-1}\text{.}
\end{align}

As premises for the simplification, we now derive the orthogonality of the residuals \boldsymbol{r}_i and the conjugacy of the directions \boldsymbol{p}_i, i.e., for i\neq j,

\begin{align}
\boldsymbol{r}_i^\mathrm{T}\boldsymbol{r}_j&=0\text{,}\\
\boldsymbol{p}_i^\mathrm{T}\boldsymbol{Ap}_j&=0\text{.}
\end{align}

The residuals are mutually orthogonal because \boldsymbol{r}_i is a scalar multiple of \boldsymbol{v}_{i+1}: for i=0, \boldsymbol{r}_0=\lVert\boldsymbol{r}_0\rVert_2\boldsymbol{v}_1, and for i>0,

\begin{align}
\boldsymbol{r}_i&=\boldsymbol{b}-\boldsymbol{Ax}_i\\
&=\boldsymbol{b}-\boldsymbol{A}(\boldsymbol{x}_0+\boldsymbol{V}_i\boldsymbol{y}_i)\\
&=\boldsymbol{r}_0-\boldsymbol{AV}_i\boldsymbol{y}_i\\
&=\boldsymbol{r}_0-\boldsymbol{V}_{i+1}\boldsymbol{\tilde{H}}_i\boldsymbol{y}_i\\
&=\boldsymbol{r}_0-\boldsymbol{V}_i\boldsymbol{H}_i\boldsymbol{y}_i-h_{i+1,i}(\boldsymbol{e}_i^\mathrm{T}\boldsymbol{y}_i)\boldsymbol{v}_{i+1}\\
&=\lVert\boldsymbol{r}_0\rVert_2\boldsymbol{v}_1-\boldsymbol{V}_i(\lVert\boldsymbol{r}_0\rVert_2\boldsymbol{e}_1)-h_{i+1,i}(\boldsymbol{e}_i^\mathrm{T}\boldsymbol{y}_i)\boldsymbol{v}_{i+1}\\
&=-h_{i+1,i}(\boldsymbol{e}_i^\mathrm{T}\boldsymbol{y}_i)\boldsymbol{v}_{i+1}\text{.}\end{align}

To see the conjugacy of \boldsymbol{p}_i, it suffices to show that \boldsymbol{P}_i^\mathrm{T}\boldsymbol{AP}_i is diagonal:

\begin{align}
\boldsymbol{P}_i^\mathrm{T}\boldsymbol{AP}_i&=\boldsymbol{U}_i^{-\mathrm{T}}\boldsymbol{V}_i^\mathrm{T}\boldsymbol{AV}_i\boldsymbol{U}_i^{-1}\\
&=\boldsymbol{U}_i^{-\mathrm{T}}\boldsymbol{H}_i\boldsymbol{U}_i^{-1}\\
&=\boldsymbol{U}_i^{-\mathrm{T}}\boldsymbol{L}_i\boldsymbol{U}_i\boldsymbol{U}_i^{-1}\\
&=\boldsymbol{U}_i^{-\mathrm{T}}\boldsymbol{L}_i
\end{align}

The last expression is lower triangular, while \boldsymbol{P}_i^\mathrm{T}\boldsymbol{AP}_i is symmetric; a matrix that is simultaneously symmetric and lower triangular must be diagonal.

Now we can derive the constant factors \alpha_i and \beta_i for the scaled \boldsymbol{p}_i by imposing only the orthogonality of \boldsymbol{r}_i and the conjugacy of \boldsymbol{p}_i.

Due to the orthogonality of \boldsymbol{r}_i, it is necessary that \boldsymbol{r}_{i+1}^\mathrm{T}\boldsymbol{r}_i=(\boldsymbol{r}_i-\alpha_i\boldsymbol{Ap}_i)^\mathrm{T}\boldsymbol{r}_i=0. As a result,

\begin{align}
\alpha_i&=\frac{\boldsymbol{r}_i^\mathrm{T}\boldsymbol{r}_i}{\boldsymbol{r}_i^\mathrm{T}\boldsymbol{Ap}_i}\\
&=\frac{\boldsymbol{r}_i^\mathrm{T}\boldsymbol{r}_i}{(\boldsymbol{p}_i-\beta_{i-1}\boldsymbol{p}_{i-1})^\mathrm{T}\boldsymbol{Ap}_i}\\
&=\frac{\boldsymbol{r}_i^\mathrm{T}\boldsymbol{r}_i}{\boldsymbol{p}_i^\mathrm{T}\boldsymbol{Ap}_i}\text{.}
\end{align}

Similarly, due to the conjugacy of \boldsymbol{p}_i, it is necessary that \boldsymbol{p}_{i+1}^\mathrm{T}\boldsymbol{Ap}_i=(\boldsymbol{r}_{i+1}+\beta_i\boldsymbol{p}_i)^\mathrm{T}\boldsymbol{Ap}_i=0. As a result,

\begin{align}
\beta_i&=-\frac{\boldsymbol{r}_{i+1}^\mathrm{T}\boldsymbol{Ap}_i}{\boldsymbol{p}_i^\mathrm{T}\boldsymbol{Ap}_i}\\
&=-\frac{\boldsymbol{r}_{i+1}^\mathrm{T}(\boldsymbol{r}_i-\boldsymbol{r}_{i+1})}{\alpha_i\boldsymbol{p}_i^\mathrm{T}\boldsymbol{Ap}_i}\\
&=\frac{\boldsymbol{r}_{i+1}^\mathrm{T}\boldsymbol{r}_{i+1}}{\boldsymbol{r}_i^\mathrm{T}\boldsymbol{r}_i}\text{.}
\end{align}

This completes the derivation.
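
Collecting the recurrences and the two formulas for \alpha_i and \beta_i gives the familiar conjugate gradient iteration; a minimal NumPy sketch (the stopping tolerance and function signature are ours):

    import numpy as np

    def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=1000):
        """Conjugate gradient iteration as derived above."""
        x = x0.copy()
        r = b - A @ x                      # r_0 = b - A x_0
        p = r.copy()                       # p_0 = r_0
        rs = r @ r
        for _ in range(max_iter):
            Ap = A @ p
            alpha = rs / (p @ Ap)          # alpha_i = r_i^T r_i / p_i^T A p_i
            x = x + alpha * p              # x_{i+1} = x_i + alpha_i p_i
            r = r - alpha * Ap             # r_{i+1} = r_i - alpha_i A p_i
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:      # residual small enough
                break
            beta = rs_new / rs             # beta_i = r_{i+1}^T r_{i+1} / r_i^T r_i
            p = r + beta * p               # p_{i+1} = r_{i+1} + beta_i p_i
            rs = rs_new
        return x

For example, with \boldsymbol{A}=\begin{bmatrix}4&1\\1&3\end{bmatrix} and \boldsymbol{b}=(1,2)^\mathrm{T}, the iteration started from \boldsymbol{x}_0=\boldsymbol{0} reaches the exact solution in two steps, up to rounding.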
