# Talk:Broyden–Fletcher–Goldfarb–Shanno algorithm

(Redirected from Talk:BFGS method)

## Asterisk operator in initialization algorithm

Perhaps someone with more specific experience in numerical optimization can explain the meaning of the asterisk operator in the algorithm section in the equation $B_0 = I * x$. I'm assuming $x$ here is $\mathbf{x}_0$ (as there's no other indication as to what $x$ should be) and $I$ is the identity matrix. (DrKBall (talk) 14:25, 12 September 2013 (UTC))

I personally think that $B_0 = I$ rather than $B_0 = I * x$. It is this value of $B_0$ which results in the first step of this algorithm conforming with gradient descent. (Ian Davis)

## Role of y and s

I can't believe this, but the roles of s and y are completely changed from what they should be (God knows since when). I corrected the formula. Please double check.

As a matter of fact I'm running this program with the formula as I input it and it converges. Maybe most people use the inverse update bec. it's better and that's why no one caught this bug?

Right now this formula seems incorrect: $B_{k+1} = B_k + \frac{\mathbf{s}_k \mathbf{s}_k^{\mathrm{T}}}{\mathbf{y}_k^{\mathrm{T}} \mathbf{s}_k} - \frac{B_k \mathbf{y}_k \mathbf{y}_k^{\mathrm{T}} B_k }{\mathbf{y}_k^{\mathrm{T}} B_k \mathbf{y}_k}.$
It can be easily checked by inspecting units of measure.
The correct formula can be found here: http://en.wikipedia.org/wiki/Quasi-Newton_method#Description_of_the_method 109.174.112.67 (talk) 06:11, 2 May 2013 (UTC)

## rank of matrix v. rank of tensor

What's the deal with the rank 2 matrix claim? Do we mean rank 2 tensor? It seems to me that if you take U = u v^T, then U is a matrix with rank of one (i.e. all rows/cols are linearly dependent). In other words, I think there is potential confusion between matrix rank and tensor rank. Birge 22:51, 28 March 2006 (UTC)

Two different rank 1 matrices are added; their sum is rank 2.

## Relation to steepest decent (gradient descent)

Could someone explain what the relationship between BFGS and gradient descent is? I could (very well) be wrong, but isn't it pretty much doing an approximate newton step? My notes say that when the Hessian is approximated by an identity matrix then BFGS will take a gradient descent step. Perhaps there's another relationship that I don't get, but it confused me to read that BFGS is "derived" from gradient descent.129.59.43.172 23:45, 3 May 2006 (UTC)

Newton's Method is typically described as a means of finding assignment of variables x to a function f(x) so that f(x) = 0. We do not wish to know where f(x) is 0 but where f(x) is a minimum (or maximum). This is the case when $f'(x) = 0$. So instead of using : $x_{1} = x_0 - \frac{f(x_0)}{f'(x_0)} \,.$ we would use : $x_{1} = x_0 - \frac{f'(x_0)}{f''(x_0)} \,.$. The Jacobian J generalizes $f'(x_0)$ and the hessian B generalizes $f''(x_0)$. Taking the inverse of the hessian generalizes the division.

Now $x_{1}$ is geometrically the position on the x axis where the tangent intersects it. Gradient descent, as opposed to Newton's method would I think correspond to computing an $x_{1}$ somewhere on the tangent perhaps nearer to the solution depending on the step increment, but not having the property that it was where the tangent interested the x axis.

What I don't get is why one then reintroduces a step length into the calculation of how much to alter the values of $f'(x_0)$. Having any value for this length other than one seems to depart from the whole point of Newton's Method. — Preceding unsigned comment added by Ijdavis (talkcontribs) 05:04, 23 January 2014 (UTC)

## too technical tag

this tag was first added when the article was just a stub; it was just one paragraph long. the article has considerably grown since then, and the stub tag was long ago removed. if the person who added the tag wants to add it back, please do but leave some suggestions on what details need to be better explained to a general audience. Lunch 03:58, 24 September 2006 (UTC)

## under-explained math

I'm totally confused as to what the variables are in the equations. If they are explained in other topics (like Newton's Method, etc.), then links should be made.

I am particularly confused by step 4 of the algorithm. B is a matrix, y and s are vectors, correct? Then the first term on the right is a matrix, but the second is a scalar (i.e., quotient of two vector dot products). The third term is another scalar. So, what am I missing? —Preceding unsigned comment added by Dave.rudolf (talkcontribs) 18:06, 24 October 2008 (UTC)

The second and third terms are not scalars. This article seems to be using the convention that vectors are column vectors, so the denominators vTw are scalar products, but the numerators vwT are n-by-n matrices. Algebraist 12:16, 1 March 2009 (UTC)

## Too Technical

It wasn't me that added the too technical tag originally, but I think it deserves it at the moment. I agree with the above post (under-explained math). The symbols are not explain - what is B? It is too technical because it is not currently possible for someone not familiar with the technical mathematics to use this algorithm. Other articles suggest that this is a very important algorithm for maximisation, therefore people may need to use it who are not experts in optimisation mathematics (which is how I ended up here - I'm going to have to look else where for this algo). Some pseudo-code for this algo would be a massive help. Thanks 81.98.250.190 (talk) 13:39, 11 January 2009 (UTC)

## BFGS or DFP?

according to Nocedal and Wright the updating B version of the algorithm is due to Davidon, Fletcher and Powell, the updating H = B^-1 version is the BFGS algorithm. If that is the case, then step 1 should be compute sk = -dot(Hk, grad) and the updating is as per the 'practically' section at the end. —Preceding unsigned comment added by 192.150.10.200 (talk) 17:35, 6 April 2009 (UTC)

Any type of quasi-Newton algorithm can update either $\mathbf{B}$ or $\mathbf{B}^{-1}$. See the page named "Quasi-Newton method."

## Sherman-Morrison or Woodbury?

According the the article, the update to the inverse of the hessian is via the Sherman–Morrison formula. However, this formula regards a rank one update. Because this is a rank two update, isn't this actually a special case of the Woodbury matrix identity? PDBailey (talk) 00:09, 12 April 2009 (UTC)

## Slight error?

I believe that $\mathbf{\alpha}_k \mathbf{s}_k$ should be used instead of $\mathbf{s}_k$ in the $\mathbf{B}_{k+1}$ and $\mathbf{B}_{k+1}^{-1}$ updates. Is this correct? It seems that the update should take into account how far the step size actually was. See also the page "Quasi-Newton method" which shows the updates to $\mathbf{B}$ and $\mathbf{B^{-1}}$ for several different quasi-Newton methods. Its formulas for BFGS are equivalent to those shown here, except that it uses actual step sizes instead of predicted step sizes. Having analyzed the formulas, this difference is not negligible.

## Slight Error

Yes there was a slight error. The correct result is produced in one of the literature sources referenced, namely Numerical Optimization by Nocedal and Wright, section 8.1. The step size used in the update for $B_k$ was incorrect. I corrected it by introducing $\mathbf{p}_k=\alpha_k \mathbf{s}_k$ and used it to replace $\mathbf{s}_k$ in the expression. I suspect that the expression for $B_k^{-1}$ is also incorrect but I could not find an explicit reference to it in the literature, though it should be possible to derive it from the Sherman-Morrison-Woodbury formula from Appendix A.2 and section 8.1 of the same literature source if someone has some spare time =P . This is actually a pretty serious error since it significantly affects the accuracy of the estimate for the Hessian which can seriously mess up your day. —Preceding unsigned comment added by 129.100.226.147 (talk) 18:48, 9 September 2009 (UTC)

## Slight Error

Yes the inverse formula was incorrect. However the notation $\mathbf{p}_k$ and $\mathbf{s}_k$ is the opposite of the standard notation. I switched them so $\mathbf{s}_k=\alpha_k \mathbf{p}_k$, which is standard in the literature. The inverse formula is now correct. —Preceding unsigned comment added by Drybid (talkcontribs) 21:46, 4 October 2009 (UTC)

## Explanation needed

I think that in the formula $\mathbf{B}_k \mathbf{p}_k = - \nabla \mathbf{f}(x_k)$ is a mistake. $\mathbf{B}_k$ is a matrix, $\mathbf{p}_k$ is a vector, and $\nabla \mathbf{f}(x_k)$ is a matrix. Then on the left-hand side there is a vector (a product of a matrix and a vector) but on the right-hand side there is still a matrix. Is there a mistake or I just miss something? Alex.

f(x) is scalar valued, so its gradient is vector valued. 018 (talk) 15:26, 2 June 2010 (UTC)

But what if I have $\mathbf{f}(x_k)$ vector? What is the formula then? —Preceding unsigned comment added by 130.161.210.163 (talk) 09:52, 8 June 2010 (UTC)

Broyden's method. In line one, it states that BFGS solves argmin{f(x)} where f(x) is an optimization problem. Perhaps this implies that f(x) is a vector to scalar mapping, but this is not explicit. In the second sentence it says that BFGS approximates Newton's method, which is only unambiguously true in the univariate case. For the multivariate case, it can be used to solve scalar f(x), not vector f(x) functions and there are two classes of 'Quasi-Newton methods'. The article also compares BFGS with Broyden's method in "So equivalently, Uk and Vk construct a rank-two update matrix which is robust against the scale problem often suffered in the gradient descent searching (e.g., in Broyden's method)." To me this implies that BFGS and Broyden's method are in some sense comparable, while in fact they solve completely different classes of problem (unless I am still confused, which is entirely possible). I'm sure many students and engineers would greatly appreciate a page which explains what methods are appropriate for solving scalar-to-scalar, vector-to-scalar, vector-to-vector, bounded, unbounded, derivative-available and derivative-free, minimization and root-finding problems noting which focus on accelerating convergence, which focus on robustness and which are or are not subject to numerical precision issues (where unique solutions exist). Finding Brent's method, BFGS (where I take a numerical derivative, which may be sub-optimal) and Broyden's method (where I still have stability problems) within the pages of Wikipedia has been difficult and we'd greatly appreciate help with this. Doug (talk) 17:07, 24 August 2013 (UTC)
Correction, Broyden's method is for root finding, I'm not sure how you are defining the minimum of a vector function, but the length would give you a scalar function. Doug (talk) 19:44, 24 August 2013 (UTC)

$\nabla \mathbf{f}(x_k)$ is not a matrix.  It is a vector containing the gradient at $x_k$.


The hessian discussed later which contains second derivatives is a matrix because one can differentiate twice using different partial derivatives at each step.

Ian — Preceding unsigned comment added by Ijdavis (talkcontribs) 05:15, 22 January 2014 (UTC)

## Differing formulae for incrementally approximating the inverse hessian..

This page was recently changed to give a different formula at the end for how the inverse hessian was to be computed. While the old and the new formula's appear to produce the same results (having implemented both), I think there is merit in showing both formulae, and discussing their differences.

The earlier formula can be computed using far fewer FLOPS, because it can be implemented without multiplying k by k matrices together, instead at worst computing vectors with k elements which are then multiplied with a k by k matrix. Further some parts of the computation reduce to scalars. It can also be computed using much less memory. In my implementation of the new formula I required two additional temporary k by k matrices, one to form the multiplying matrix and the other to handle the fact that the B^(-1) matrix cannot be easily updated in place. The old formula could easily be implemented using only temporary vectors, with B^(-1) simply being updated incrementally by addition of outer product vectors. A small benchmark suggests that while the old formula can be computed multiple times in about 5 seconds, the same number of computations using the new formulae takes about 2 minutes. That is a pretty stark difference.

The advantage of the new formula is that it is conceptually simpler, and highlights the fact that we are computing transpose(M) B^(-1) M + S. If floating point operations are more stable under multiplication than addition, it may offer greater stability, but given that far more FLOPS are being performed this seems somewhat suspect, since there will be more rounding errors introduced.

I would be interested in other peoples take on the merits of the old and new formulae.

Ian Davis (textserver.com@gmail.com) — Preceding unsigned comment added by Ijdavis (talkcontribs) 05:07, 22 January 2014 (UTC)

## A further comment

Arguably all of the information about how one incrementally computes a hessian and its inverse under perturbations would be better placed under the page that describes hessians. It is not relevant only to the BFGS algorithm. — Preceding unsigned comment added by Ijdavis (talkcontribs) 05:09, 23 January 2014 (UTC)

## Why do the formula's work

Nothing is really said to justify why the formulae given for computing the approximation to the hessian and its inverse actually do approximate the hessian. That part of this page seems a bit too close to black magic. Can someone explain what is going on here? — Preceding unsigned comment added by Ijdavis (talkcontribs) 05:14, 23 January 2014 (UTC)