# Anderson acceleration

In mathematics, Anderson acceleration, also called Anderson mixing, is a method for accelerating the convergence of fixed-point iterations. Introduced by Donald G. Anderson,[1] this technique can be used to find the solution to fixed-point equations ${\displaystyle f(x)=x}$ often arising in the field of computational science.

## Definition

Given a function ${\displaystyle f:\mathbb {R} ^{n}\to \mathbb {R} ^{n}}$, consider the problem of finding a fixed point of ${\displaystyle f}$, which is a solution to the equation ${\displaystyle f(x)=x}$. A classical approach to the problem is to employ a fixed-point iteration scheme;[2] that is, given an initial guess ${\displaystyle x_{0}}$ for the solution, to compute the sequence ${\displaystyle x_{i+1}=f(x_{i})}$ until some convergence criterion is met. However, the convergence of such a scheme is not guaranteed in general; moreover, the rate of convergence is usually linear, which can become too slow if the evaluation of the function ${\displaystyle f}$ is computationally expensive.[2] Anderson acceleration is a method to accelerate the convergence of the fixed-point sequence.[2]
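As a point of reference, the plain fixed-point scheme above can be sketched in a few lines. This is a minimal Python illustration; the choice ${\displaystyle f(x)=\cos(x)}$ is an arbitrary contraction used as an example, not taken from the original.

```python
import math

def fixed_point(f, x0, tol=1e-10, k_max=10_000):
    """Plain fixed-point iteration: x_{i+1} = f(x_i) until convergence."""
    x = x0
    for k in range(1, k_max + 1):
        x_new = f(x)
        if abs(x_new - x) < tol:  # conventional stopping criterion
            return x_new, k
        x = x_new
    return x, k_max

x_star, iters = fixed_point(math.cos, 1.0)
```

With ${\displaystyle |f'(x^{*})|\approx 0.67}$ here, the error shrinks only by a roughly constant factor per step, which is the linear convergence referred to above.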

Define the residual ${\displaystyle g(x)=f(x)-x}$, and denote ${\displaystyle f_{k}=f(x_{k})}$ and ${\displaystyle g_{k}=g(x_{k})}$ (where ${\displaystyle x_{k}}$ corresponds to the sequence of iterates from the previous paragraph). Given an initial guess ${\displaystyle x_{0}}$ and an integer parameter ${\displaystyle m\geq 1}$, the method can be formulated as follows:[3][note 1]

${\displaystyle x_{1}=f(x_{0})}$
${\displaystyle \forall k=1,2,\dots }$
${\displaystyle m_{k}=\min\{m,k\}}$
${\displaystyle G_{k}={\begin{bmatrix}g_{k-m_{k}}&\dots &g_{k}\end{bmatrix}}}$
${\displaystyle \alpha _{k}=\operatorname {argmin} _{\alpha \in A_{k}}\|G_{k}\alpha \|_{2},\quad {\text{where}}\;A_{k}=\{\alpha =(\alpha _{0},\dots ,\alpha _{m_{k}})\in \mathbb {R} ^{m_{k}+1}:\sum _{i=0}^{m_{k}}\alpha _{i}=1\}}$
${\displaystyle x_{k+1}=\sum _{i=0}^{m_{k}}(\alpha _{k})_{i}f_{k-m_{k}+i}}$

where ${\displaystyle G_{k}\alpha =\sum _{i=0}^{m_{k}}(\alpha )_{i}g_{k-m_{k}+i}}$ denotes the matrix–vector product, and ${\displaystyle (\alpha )_{i}}$ is the ${\displaystyle i}$th element of ${\displaystyle \alpha }$. Conventional stopping criteria can be used to end the iterations of the method. For example, iterations can be stopped when ${\displaystyle \|x_{k+1}-x_{k}\|}$ falls below a prescribed tolerance, or when the norm of the residual ${\displaystyle g(x_{k})}$ falls below a prescribed tolerance.[2]

Compared with the standard fixed-point iteration, the method has been found to converge faster and to be more robust, and in some cases to avoid divergence of the fixed-point sequence.[3][4]
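The whole scheme can be sketched in NumPy as follows. This is a minimal illustration, not the original formulation: the constrained argmin is replaced by the equivalent unconstrained least-squares problem in the increments of residuals and iterates (solved with `numpy.linalg.lstsq`), and ${\displaystyle f(x)=\cos(x)}$ is an arbitrary test function.

```python
import numpy as np

def anderson(f, x0, m=3, k_max=100, tol=1e-10):
    """Anderson acceleration for the fixed-point problem f(x) = x."""
    xs = [np.atleast_1d(np.asarray(x0, dtype=float))]
    xs.append(np.atleast_1d(f(xs[0])))           # x_1 = f(x_0)
    gs = [xs[1] - xs[0]]                         # g_0 = f(x_0) - x_0
    for k in range(1, k_max):
        g_k = np.atleast_1d(f(xs[k])) - xs[k]    # residual g_k = f(x_k) - x_k
        gs.append(g_k)
        if np.linalg.norm(g_k) < tol:
            break
        m_k = min(m, k)
        # Increments in residuals and iterates (oldest to newest columns).
        dG = np.column_stack([gs[j + 1] - gs[j] for j in range(k - m_k, k)])
        dX = np.column_stack([xs[j + 1] - xs[j] for j in range(k - m_k, k)])
        # Unconstrained least squares equivalent to the constrained argmin.
        gamma, *_ = np.linalg.lstsq(dG, g_k, rcond=None)
        xs.append(xs[k] + g_k - (dX + dG) @ gamma)
    return xs[-1]

x_star = anderson(np.cos, 1.0)
```

The update `xs[k] + g_k - (dX + dG) @ gamma` corresponds to one of the equivalent reformulations of the minimization problem discussed later in this article.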

### Derivation

For the solution ${\displaystyle x^{*}}$, we know that ${\displaystyle f(x^{*})=x^{*}}$, which is equivalent to saying that ${\displaystyle g(x^{*})={\vec {0}}}$. We can therefore rephrase the problem as an optimization problem where we want to minimize ${\displaystyle \|g(x)\|_{2}}$.

Instead of going directly from ${\displaystyle x_{k}}$ to ${\displaystyle x_{k+1}}$ by choosing ${\displaystyle x_{k+1}=f(x_{k})}$ as in fixed-point iteration, consider an intermediate point ${\displaystyle x'_{k+1}}$ chosen as the linear combination ${\displaystyle x'_{k+1}=X_{k}\alpha _{k}}$, where the coefficient vector ${\displaystyle \alpha _{k}\in A_{k}}$ and ${\displaystyle X_{k}={\begin{bmatrix}x_{k-m_{k}}&\dots &x_{k}\end{bmatrix}}}$ is the matrix containing the last ${\displaystyle m_{k}+1}$ points, and choose ${\displaystyle x'_{k+1}}$ so that it minimizes ${\displaystyle \|g(x'_{k+1})\|_{2}}$. Since the elements of ${\displaystyle \alpha _{k}}$ sum to one, we can make the first-order approximation ${\displaystyle g(X_{k}\alpha _{k})=g\left(\sum _{i=0}^{m_{k}}(\alpha _{k})_{i}x_{k-m_{k}+i}\right)\approx \sum _{i=0}^{m_{k}}(\alpha _{k})_{i}g(x_{k-m_{k}+i})=G_{k}\alpha _{k}}$, so the problem becomes finding the ${\displaystyle \alpha }$ that minimizes ${\displaystyle \|G_{k}\alpha \|_{2}}$. After having found ${\displaystyle \alpha _{k}}$, we could in principle calculate ${\displaystyle x'_{k+1}}$.

However, since ${\displaystyle f}$ is designed to bring a point closer to ${\displaystyle x^{*}}$, ${\displaystyle f(x'_{k+1})}$ is likely closer to ${\displaystyle x^{*}}$ than ${\displaystyle x'_{k+1}}$ is, so it makes sense to choose ${\displaystyle x_{k+1}=f(x'_{k+1})}$ rather than ${\displaystyle x_{k+1}=x'_{k+1}}$. Furthermore, since the elements of ${\displaystyle \alpha _{k}}$ sum to one, we can make the first-order approximation ${\displaystyle f(x'_{k+1})=f\left(\sum _{i=0}^{m_{k}}(\alpha _{k})_{i}x_{k-m_{k}+i}\right)\approx \sum _{i=0}^{m_{k}}(\alpha _{k})_{i}f(x_{k-m_{k}+i})=\sum _{i=0}^{m_{k}}(\alpha _{k})_{i}f_{k-m_{k}+i}}$. We therefore choose

${\displaystyle x_{k+1}=\sum _{i=0}^{m_{k}}(\alpha _{k})_{i}f_{k-m_{k}+i}}$.

### Solution of the minimization problem

At each iteration of the algorithm, the constrained optimization problem ${\displaystyle \operatorname {argmin} \|G_{k}\alpha \|_{2}}$, subject to ${\displaystyle \alpha \in A_{k}}$, needs to be solved. The problem can be recast in several equivalent formulations,[3] yielding different solution methods which may result in a more convenient implementation:

• defining the matrices ${\displaystyle {\mathcal {G}}_{k}={\begin{bmatrix}g_{k-m_{k}+1}-g_{k-m_{k}}&\dots &g_{k}-g_{k-1}\end{bmatrix}}}$ and ${\displaystyle {\mathcal {X}}_{k}={\begin{bmatrix}x_{k-m_{k}+1}-x_{k-m_{k}}&\dots &x_{k}-x_{k-1}\end{bmatrix}}}$, solve ${\displaystyle \gamma _{k}=\operatorname {argmin} _{\gamma \in \mathbb {R} ^{m_{k}}}\|g_{k}-{\mathcal {G}}_{k}\gamma \|_{2}}$, and set ${\displaystyle x_{k+1}=x_{k}+g_{k}-({\mathcal {X}}_{k}+{\mathcal {G}}_{k})\gamma _{k}}$;[3][4]
• solve ${\displaystyle \theta _{k}=\{(\theta _{k})_{i}\}_{i=1}^{m_{k}}=\operatorname {argmin} _{\theta \in \mathbb {R} ^{m_{k}}}\left\|g_{k}+\sum _{i=1}^{m_{k}}\theta _{i}(g_{k-i}-g_{k})\right\|_{2}}$, then set ${\displaystyle x_{k+1}=x_{k}+g_{k}+\sum _{j=1}^{m_{k}}(\theta _{k})_{j}(x_{k-j}-x_{k}+g_{k-j}-g_{k})}$.[1]

For both choices, the optimization problem is in the form of an unconstrained linear least-squares problem, which can be solved by standard methods including QR decomposition[3] and singular value decomposition,[4] possibly including regularization techniques to deal with rank deficiencies and conditioning issues in the optimization problem. Solving the least-squares problem by solving the normal equations is generally not advisable due to potential numerical instabilities and generally high computational cost.[4]
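For illustration, a QR-based solve of the first formulation's least-squares problem might look like the following NumPy sketch. The matrices are random stand-ins for ${\displaystyle {\mathcal {G}}_{k}}$ and ${\displaystyle g_{k}}$, and the normal-equations variant is included only to show what the text advises against.

```python
import numpy as np

# Random stand-ins for the least-squares data at one Anderson step.
rng = np.random.default_rng(0)
G = rng.standard_normal((50, 4))   # residual-increment matrix (n x m_k)
g = rng.standard_normal(50)        # current residual g_k

# QR-based solve of min_gamma ||g - G @ gamma||_2 (numerically preferred).
Q, R = np.linalg.qr(G)             # reduced QR: Q is 50x4, R is 4x4
gamma_qr = np.linalg.solve(R, Q.T @ g)

# Normal equations G.T @ G @ gamma = G.T @ g square the condition number
# of G; shown here only for comparison on a well-conditioned example.
gamma_ne = np.linalg.solve(G.T @ G, G.T @ g)
```

On a well-conditioned problem like this one the two solutions agree; the difference shows up when ${\displaystyle {\mathcal {G}}_{k}}$ is nearly rank-deficient.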

Stagnation in the method (i.e. subsequent iterations with the same value, ${\displaystyle x_{k+1}=x_{k}}$) causes the method to break down, due to the singularity of the least-squares problem. Similarly, near-stagnation (${\displaystyle x_{k+1}\approx x_{k}}$) results in an ill-conditioned least-squares problem. Moreover, the choice of the parameter ${\displaystyle m}$ can affect the conditioning of the least-squares problem, as discussed below.[3]

### Relaxation

The algorithm can be modified by introducing a variable relaxation parameter (or mixing parameter) ${\displaystyle \beta _{k}>0}$.[1][3][4] At each step, compute the new iterate as

${\displaystyle x_{k+1}=(1-\beta _{k})\sum _{i=0}^{m_{k}}(\alpha _{k})_{i}x_{k-m_{k}+i}+\beta _{k}\sum _{i=0}^{m_{k}}(\alpha _{k})_{i}f(x_{k-m_{k}+i})\;.}$
The choice of ${\displaystyle \beta _{k}}$ is crucial to the convergence properties of the method; in principle, ${\displaystyle \beta _{k}}$ might vary at each iteration, although it is often chosen to be constant.[4]
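As a sketch, the relaxed update can be written as follows. The array shapes and values in this Python example are hypothetical, chosen only to illustrate the formula.

```python
import numpy as np

def relaxed_step(alpha, X, F, beta):
    """Relaxed Anderson update: x_{k+1} = (1 - beta) X alpha + beta F alpha.

    Columns of X are the last m_k + 1 iterates; columns of F hold f
    evaluated at those iterates; alpha sums to one; beta > 0 mixes them.
    """
    return (1.0 - beta) * (X @ alpha) + beta * (F @ alpha)

alpha = np.array([0.25, 0.75])   # hypothetical coefficients, summing to 1
X = np.array([[1.0, 2.0]])       # hypothetical iterates (n = 1, m_k = 1)
F = np.array([[3.0, 4.0]])       # hypothetical f-values at those iterates
x_next = relaxed_step(alpha, X, F, 0.5)
```

Setting ${\displaystyle \beta _{k}=1}$ recovers the unrelaxed update ${\displaystyle x_{k+1}=\sum _{i=0}^{m_{k}}(\alpha _{k})_{i}f_{k-m_{k}+i}}$.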

### Choice of m

The parameter ${\displaystyle m}$ determines how much information from previous iterations is used to compute the new iterate ${\displaystyle x_{k+1}}$. On the one hand, if ${\displaystyle m}$ is chosen to be too small, too little information is used and convergence may be undesirably slow. On the other hand, if ${\displaystyle m}$ is too large, information from old iterations may be retained for too many subsequent iterations, so that again convergence may be slow.[3] Moreover, the choice of ${\displaystyle m}$ affects the size of the optimization problem: too large a value of ${\displaystyle m}$ may worsen the conditioning of the least-squares problem and increase the cost of its solution.[3] In general, the particular problem to be solved determines the best choice of the ${\displaystyle m}$ parameter.[3]

### Choice of m_k

With respect to the algorithm described above, the choice of ${\displaystyle m_{k}}$ at each iteration can be modified. One possibility is to choose ${\displaystyle m_{k}=k}$ for each iteration ${\displaystyle k}$ (sometimes referred to as Anderson acceleration without truncation).[3] This way, every new iteration ${\displaystyle x_{k+1}}$ is computed using all the previously computed iterations. A more sophisticated technique is based on choosing ${\displaystyle m_{k}}$ so as to maintain a small enough conditioning for the least-squares problem.[3]

## Relations to other classes of methods

Newton's method can be applied to the solution of ${\displaystyle f(x)-x=0}$ to compute a fixed point of ${\displaystyle f(x)}$ with quadratic convergence. However, such a method requires the evaluation of the exact derivative of ${\displaystyle f(x)}$, which can be very costly.[4] Approximating the derivative by means of finite differences is a possible alternative, but it requires multiple evaluations of ${\displaystyle f(x)}$ at each iteration, which again can become very costly. Anderson acceleration requires only one evaluation of the function ${\displaystyle f(x)}$ per iteration, and no evaluation of its derivative. On the other hand, the convergence of an Anderson-accelerated fixed-point sequence is still linear in general.[5]

Several authors have pointed out similarities between the Anderson acceleration scheme and other methods for the solution of non-linear equations. In particular:

• Eyert[6] and Fang and Saad[4] interpreted the algorithm within the class of quasi-Newton and multisecant methods, which generalize the well-known secant method, for the solution of the non-linear equation ${\displaystyle g(x)=0}$; they also showed how the scheme can be seen as a method in the Broyden class;[7]
• Walker and Ni[3][8] showed that the Anderson acceleration scheme is equivalent to the GMRES method in the case of linear problems (i.e. the problem of finding a solution to ${\displaystyle A\mathbf {x} =\mathbf {x} }$ for some square matrix ${\displaystyle A}$), and can thus be seen as a generalization of GMRES to the non-linear case; a similar result was found by Washio and Oosterlee.[9]

Moreover, several equivalent or nearly equivalent methods have been independently developed by other authors,[9][10][11][12][13] although most often in the context of some specific application of interest rather than as a general method for fixed point equations.

## Example MATLAB implementation

The following is an example implementation, in the MATLAB language, of the Anderson acceleration scheme for finding the fixed point of the function ${\displaystyle f(x)=\sin(x)+\arctan(x)}$. Notice that:

• the optimization problem was solved in the form ${\displaystyle \gamma _{k}=\operatorname {argmin} _{\gamma \in \mathbb {R} ^{m_{k}}}\|g_{k}-{\mathcal {G}}_{k}\gamma \|_{2}}$ using QR decomposition;
• the computation of the QR decomposition is sub-optimal: indeed, at each iteration a single column is added to the matrix ${\displaystyle {\mathcal {G}}_{k}}$, and possibly a single column is removed; this fact can be exploited to efficiently update the QR decomposition with less computational effort;[14]
• the algorithm can be made more memory-efficient by storing only the latest few iterations and residuals, if the whole vector of iterations ${\displaystyle x_{k}}$ is not needed;
• the code is straightforwardly generalized to the case of a vector-valued ${\displaystyle f(x)}$.
```matlab
f = @(x) sin(x) + atan(x); % Function whose fixed point is to be computed.
x0 = 1; % Initial guess.

k_max = 100; % Maximum number of iterations.
tol_res = 1e-6; % Tolerance on the residual.
m = 3; % Parameter m.

x = [x0, f(x0)]; % Vector of iterates x.
g = f(x) - x; % Vector of residuals.

G_k = g(2) - g(1); % Matrix of increments in residuals.
X_k = x(2) - x(1); % Matrix of increments in x.

k = 2;
while k < k_max && abs(g(k)) > tol_res
    m_k = min(k, m);

    % Solve the optimization problem by QR decomposition.
    [Q, R] = qr(G_k);
    gamma_k = R \ (Q' * g(k));

    % Compute new iterate and new residual.
    x(k + 1) = x(k) + g(k) - (X_k + G_k) * gamma_k;
    g(k + 1) = f(x(k + 1)) - x(k + 1);

    % Update increment matrices with new elements.
    X_k = [X_k, x(k + 1) - x(k)];
    G_k = [G_k, g(k + 1) - g(k)];

    % Keep only the last m_k columns of the increment matrices.
    n = size(X_k, 2);
    if n > m_k
        X_k = X_k(:, n - m_k + 1:end);
        G_k = G_k(:, n - m_k + 1:end);
    end

    k = k + 1;
end

% Prints result: Computed fixed point 2.013444 after 9 iterations
fprintf("Computed fixed point %f after %d iterations\n", x(end), k);
```


## Notes

1. ^ This formulation is not the same as given by the original author;[1] it is an equivalent, more explicit formulation given by Walker and Ni.[3]

## References

1. Anderson, Donald G. (October 1965). "Iterative Procedures for Nonlinear Integral Equations". Journal of the ACM. 12 (4): 547–560. doi:10.1145/321296.321305.
2. Quarteroni, Alfio; Sacco, Riccardo; Saleri, Fausto (2007). Numerical Mathematics (2nd ed.). Springer. ISBN 978-3-540-49809-4.
3. Walker, Homer F.; Ni, Peng (January 2011). "Anderson Acceleration for Fixed-Point Iterations". SIAM Journal on Numerical Analysis. 49 (4): 1715–1735. doi:10.1137/10078356X.
4. Fang, Haw-ren; Saad, Yousef (March 2009). "Two classes of multisecant methods for nonlinear acceleration". Numerical Linear Algebra with Applications. 16 (3): 197–221. doi:10.1002/nla.617.
5. Evans, Claire; Pollock, Sara; Rebholz, Leo G.; Xiao, Mengying (20 February 2020). "A Proof That Anderson Acceleration Improves the Convergence Rate in Linearly Converging Fixed-Point Methods (But Not in Those Converging Quadratically)". SIAM Journal on Numerical Analysis. 58 (1): 788–810. arXiv:1810.08455. doi:10.1137/19M1245384.
6. Eyert, V. (March 1996). "A Comparative Study on Methods for Convergence Acceleration of Iterative Vector Sequences". Journal of Computational Physics. 124 (2): 271–285. doi:10.1006/jcph.1996.0059.
7. Broyden, C. G. (1965). "A class of methods for solving nonlinear simultaneous equations". Mathematics of Computation. 19 (92): 577–593. doi:10.1090/S0025-5718-1965-0198670-6.
8. Ni, Peng (November 2009). Anderson Acceleration of Fixed-point Iteration with Applications to Electronic Structure Computations (PhD thesis).
9. Oosterlee, C. W.; Washio, T. (January 2000). "Krylov Subspace Acceleration of Nonlinear Multigrid with Application to Recirculating Flows". SIAM Journal on Scientific Computing. 21 (5): 1670–1690. doi:10.1137/S1064827598338093.
10. Pulay, Péter (July 1980). "Convergence acceleration of iterative sequences. The case of SCF iteration". Chemical Physics Letters. 73 (2): 393–398. doi:10.1016/0009-2614(80)80396-4.
11. Pulay, P. (1982). "Improved SCF convergence acceleration". Journal of Computational Chemistry. 3 (4): 556–560. doi:10.1002/jcc.540030413.
12. Carlson, Neil N.; Miller, Keith (May 1998). "Design and Application of a Gradient-Weighted Moving Finite Element Code I: in One Dimension". SIAM Journal on Scientific Computing. 19 (3): 728–765. doi:10.1137/S106482759426955X.
13. Miller, Keith (November 2005). "Nonlinear Krylov and moving nodes in the method of lines". Journal of Computational and Applied Mathematics. 183 (2): 275–287. doi:10.1016/j.cam.2004.12.032.
14. Daniel, J. W.; Gragg, W. B.; Kaufman, L.; Stewart, G. W. (October 1976). "Reorthogonalization and stable algorithms for updating the Gram–Schmidt QR factorization". Mathematics of Computation. 30 (136): 772–795. doi:10.1090/S0025-5718-1976-0431641-8.