Talk:Broyden–Fletcher–Goldfarb–Shanno algorithm

WikiProject Mathematics (Rated C-class, Low-importance)
This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of Mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Mathematics rating: C-Class, Low-importance. Field: Applied mathematics

Article State

Large amounts of the article are presented without references. Although this is probably OK for the general theory from the original papers, it is not for parts of the algorithm (e.g. "usually obtained") and implementation sections. If you have contributed something to these, please add a citation rather than relying on the bibliography. If nothing is done within a month I will start pruning these. byo (talk) 03:19, 24 March 2016 (UTC)

Asterisk operator in initialization algorithm

Perhaps someone with more specific experience in numerical optimization can explain the meaning of the asterisk operator in the algorithm section in the equation ${\displaystyle B_{0}=I*x}$. I'm assuming ${\displaystyle x}$ here is ${\displaystyle \mathbf {x} _{0}}$ (as there's no other indication as to what ${\displaystyle x}$ should be) and ${\displaystyle I}$ is the identity matrix. (DrKBall (talk) 14:25, 12 September 2013 (UTC))

I personally think it should be ${\displaystyle B_{0}=I}$ rather than ${\displaystyle B_{0}=I*x}$. It is this value of ${\displaystyle B_{0}}$ which makes the first step of the algorithm coincide with gradient descent. (Ian Davis)

Role of y and s

I can't believe this, but the roles of s and y are completely swapped from what they should be (God knows since when). I corrected the formula. Please double check.

As a matter of fact, I'm running this program with the formula as I input it and it converges. Maybe most people use the inverse update because it's better, and that's why no one caught this bug?

Right now this formula seems incorrect: ${\displaystyle B_{k+1}=B_{k}+{\frac {\mathbf {s} _{k}\mathbf {s} _{k}^{\mathrm {T} }}{\mathbf {y} _{k}^{\mathrm {T} }\mathbf {s} _{k}}}-{\frac {B_{k}\mathbf {y} _{k}\mathbf {y} _{k}^{\mathrm {T} }B_{k}}{\mathbf {y} _{k}^{\mathrm {T} }B_{k}\mathbf {y} _{k}}}.}$
It can be easily checked by inspecting units of measure.
The correct formula can be found here: http://en.wikipedia.org/wiki/Quasi-Newton_method#Description_of_the_method 109.174.112.67 (talk) 06:11, 2 May 2013 (UTC)
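For anyone double-checking, here is a quick numerical sketch (the matrix and vectors are illustrative assumptions, not from the article): the corrected update, with y in the first rank-one term, satisfies the secant equation ${\displaystyle B_{k+1}\mathbf {s} _{k}=\mathbf {y} _{k}}$ by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)      # an illustrative symmetric positive-definite B_k
s = rng.standard_normal(n)
y = rng.standard_normal(n)
if s @ y < 0:
    y = -y                       # enforce the curvature condition y^T s > 0

# Corrected update: y in the first (rank-one) term, B s in the second
Bs = B @ s
B_new = B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)

print(np.allclose(B_new @ s, y))  # True: the secant equation holds
```

Swapping s and y in the two terms breaks this cancellation, which is one way to catch the bug mechanically.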

rank of matrix v. rank of tensor

What's the deal with the rank 2 matrix claim? Do we mean rank 2 tensor? It seems to me that if you take U = u v^T, then U is a matrix with rank of one (i.e. all rows/cols are linearly dependent). In other words, I think there is potential confusion between matrix rank and tensor rank. Birge 22:51, 28 March 2006 (UTC)

Two different rank 1 matrices are added; their sum is rank 2.
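This is easy to confirm numerically (a sketch with arbitrary assumed vectors): a single outer product has matrix rank 1, and the sum of two generically independent rank-one terms has rank 2.

```python
import numpy as np

rng = np.random.default_rng(0)
u, v, w, z = (rng.standard_normal(5) for _ in range(4))

U = np.outer(u, v)               # u v^T: every row is a multiple of v
print(np.linalg.matrix_rank(U))                    # 1
print(np.linalg.matrix_rank(U + np.outer(w, z)))   # 2
```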

Relation to steepest descent (gradient descent)

Could someone explain what the relationship between BFGS and gradient descent is? I could (very well) be wrong, but isn't it pretty much doing an approximate Newton step? My notes say that when the Hessian is approximated by the identity matrix, BFGS takes a gradient descent step. Perhaps there's another relationship that I don't get, but it confused me to read that BFGS is "derived" from gradient descent.129.59.43.172 23:45, 3 May 2006 (UTC)

My answer

Newton's method is typically described as a means of finding an assignment of the variables x such that f(x) = 0. Here, however, we do not wish to know where f(x) is 0 but where f(x) is a minimum (or maximum), which is the case when ${\displaystyle f'(x)=0}$. So instead of using ${\displaystyle x_{1}=x_{0}-{\frac {f(x_{0})}{f'(x_{0})}}\,.}$ we would use ${\displaystyle x_{1}=x_{0}-{\frac {f'(x_{0})}{f''(x_{0})}}\,.}$. The Jacobian J (here the gradient, since f is scalar-valued) generalizes ${\displaystyle f'(x_{0})}$ and the Hessian B generalizes ${\displaystyle f''(x_{0})}$. Multiplying by the inverse of the Hessian generalizes the division.
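In one dimension this can be sketched directly (the example function here is an assumption chosen for illustration): applying Newton's root-finding step to ${\displaystyle f'}$ drives the iterate to a stationary point of ${\displaystyle f}$.

```python
# Minimize f(x) = x**4 - 3*x by applying Newton's method to f'(x) = 0.
def fprime(x):
    return 4 * x**3 - 3          # f'(x)

def fsecond(x):
    return 12 * x**2             # f''(x)

x = 1.0
for _ in range(20):
    x -= fprime(x) / fsecond(x)  # x_{k+1} = x_k - f'(x_k) / f''(x_k)

# The stationary point is x = (3/4)**(1/3), where f'(x) = 0.
print(abs(x - (3 / 4) ** (1 / 3)) < 1e-12)  # True
```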

Now ${\displaystyle x_{1}}$ is geometrically the position on the x axis where the tangent intersects it. Gradient descent, as opposed to Newton's method, would I think correspond to computing an ${\displaystyle x_{1}}$ somewhere on the tangent, perhaps nearer to the solution depending on the step increment, but without the property that it is where the tangent intersects the x axis.

What I don't get is why one then reintroduces a step length into the calculation of how much to alter the values of ${\displaystyle f'(x_{0})}$. Having any value for this length other than one seems to depart from the whole point of Newton's method. — Preceding unsigned comment added by Ijdavis (talkcontribs) 05:04, 23 January 2014 (UTC)

too technical tag

This tag was first added when the article was just a stub; it was just one paragraph long. The article has considerably grown since then, and the stub tag was long ago removed. If the person who added the tag wants to add it back, please do, but leave some suggestions on which details need to be better explained to a general audience. Lunch 03:58, 24 September 2006 (UTC)

under-explained math

I'm totally confused as to what the variables are in the equations. If they are explained in other topics (like Newton's Method, etc.), then links should be made.

I am particularly confused by step 4 of the algorithm. B is a matrix, y and s are vectors, correct? Then the first term on the right is a matrix, but the second is a scalar (i.e., quotient of two vector dot products). The third term is another scalar. So, what am I missing? —Preceding unsigned comment added by Dave.rudolf (talkcontribs) 18:06, 24 October 2008 (UTC)

The second and third terms are not scalars. This article seems to be using the convention that vectors are column vectors, so the denominators ${\displaystyle \mathbf {v} ^{\mathrm {T} }\mathbf {w} }$ are scalar products, but the numerators ${\displaystyle \mathbf {v} \mathbf {w} ^{\mathrm {T} }}$ are n-by-n matrices. Algebraist 12:16, 1 March 2009 (UTC)
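A small NumPy check of the shapes (the vectors are illustrative assumptions):

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])

inner = v @ w           # v^T w: a scalar product
outer = np.outer(v, w)  # v w^T: an n-by-n matrix

print(inner)        # 32.0
print(outer.shape)  # (3, 3)
```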

Too Technical

It wasn't me that added the too-technical tag originally, but I think it deserves it at the moment. I agree with the above post (under-explained math). The symbols are not explained: what is B? It is too technical because it is not currently possible for someone unfamiliar with the technical mathematics to use this algorithm. Other articles suggest that this is a very important algorithm for maximisation, so people who are not experts in optimisation mathematics may need to use it (which is how I ended up here; I'm going to have to look elsewhere for this algorithm). Some pseudo-code for this algorithm would be a massive help. Thanks 81.98.250.190 (talk) 13:39, 11 January 2009 (UTC)
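Since pseudo-code is requested above, here is a minimal sketch in Python/NumPy (an illustrative assumption rather than a reference implementation: the backtracking line search, its constants, and the curvature guard are simplifications; production code would add safeguards and a Wolfe-condition search):

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-6, max_iter=200):
    """Minimal BFGS sketch that maintains H, the inverse-Hessian approximation."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)                    # initial inverse-Hessian approximation
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                   # search direction
        alpha = 1.0                  # backtracking (Armijo) line search
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        s = alpha * p                # actual step taken
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g                # change in gradient
        sy = s @ y
        if sy > 1e-12:               # only update when the curvature condition holds
            rho = 1.0 / sy
            I = np.eye(n)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Example: minimize a simple quadratic with minimum at (1, 2).
xmin = bfgs(lambda x: (x[0] - 1) ** 2 + 10 * (x[1] - 2) ** 2,
            lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] - 2)]),
            x0=[0.0, 0.0])
print(np.allclose(xmin, [1.0, 2.0], atol=1e-4))  # True
```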

BFGS or DFP?

According to Nocedal and Wright, the version of the algorithm that updates B is due to Davidon, Fletcher and Powell, while the version that updates H = B^-1 is the BFGS algorithm. If that is the case, then step 1 should be to compute sk = -dot(Hk, grad), and the updating is as per the 'practically' section at the end. —Preceding unsigned comment added by 192.150.10.200 (talk) 17:35, 6 April 2009 (UTC)

Any type of quasi-Newton algorithm can update either ${\displaystyle \mathbf {B} }$ or ${\displaystyle \mathbf {B} ^{-1}}$. See the page named "Quasi-Newton method."

Sherman-Morrison or Woodbury?

According to the article, the update to the inverse of the Hessian is via the Sherman–Morrison formula. However, that formula concerns a rank-one update. Because this is a rank-two update, isn't this actually a special case of the Woodbury matrix identity? PDBailey (talk) 00:09, 12 April 2009 (UTC)
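For what it's worth, a rank-two change can be treated as two successive rank-one changes, so the inverse can be maintained by applying Sherman–Morrison twice; the Woodbury identity is the rank-k generalization of the same idea. A numerical sketch (random data assumed):

```python
import numpy as np

def sherman_morrison(Ainv, u, v):
    """Return (A + u v^T)^-1 given A^-1 (rank-one update)."""
    Au = Ainv @ u
    vA = v @ Ainv
    return Ainv - np.outer(Au, vA) / (1.0 + v @ Au)

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + 10 * np.eye(n)            # well-conditioned SPD matrix
u1 = rng.standard_normal(n)
u2 = rng.standard_normal(n)

A2 = A + np.outer(u1, u1) - np.outer(u2, u2)   # a rank-two change, as in BFGS

Ainv = np.linalg.inv(A)
step1 = sherman_morrison(Ainv, u1, u1)          # add u1 u1^T
step2 = sherman_morrison(step1, -u2, u2)        # subtract u2 u2^T
print(np.allclose(step2, np.linalg.inv(A2)))    # True
```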

Slight error?

I believe that ${\displaystyle \mathbf {\alpha } _{k}\mathbf {s} _{k}}$ should be used instead of ${\displaystyle \mathbf {s} _{k}}$ in the ${\displaystyle \mathbf {B} _{k+1}}$ and ${\displaystyle \mathbf {B} _{k+1}^{-1}}$ updates. Is this correct? It seems that the update should take into account how large the actual step was. See also the page "Quasi-Newton method", which shows the updates to ${\displaystyle \mathbf {B} }$ and ${\displaystyle \mathbf {B^{-1}} }$ for several different quasi-Newton methods. Its formulas for BFGS are equivalent to those shown here, except that it uses actual step sizes instead of predicted step sizes. Having analyzed the formulas, I find this difference is not negligible.

Slight Error

Yes, there was a slight error. The correct result is produced in one of the literature sources referenced, namely Numerical Optimization by Nocedal and Wright, section 8.1. The step size used in the update for ${\displaystyle B_{k}}$ was incorrect. I corrected it by introducing ${\displaystyle \mathbf {p} _{k}=\alpha _{k}\mathbf {s} _{k}}$ and used it to replace ${\displaystyle \mathbf {s} _{k}}$ in the expression. I suspect that the expression for ${\displaystyle B_{k}^{-1}}$ is also incorrect, but I could not find an explicit reference to it in the literature, though it should be possible to derive it from the Sherman-Morrison-Woodbury formula in Appendix A.2 and section 8.1 of the same source if someone has some spare time =P . This is actually a pretty serious error, since it significantly affects the accuracy of the estimate for the Hessian, which can seriously mess up your day. —Preceding unsigned comment added by 129.100.226.147 (talk) 18:48, 9 September 2009 (UTC)

Slight Error

Yes the inverse formula was incorrect. However the notation ${\displaystyle \mathbf {p} _{k}}$ and ${\displaystyle \mathbf {s} _{k}}$ is the opposite of the standard notation. I switched them so ${\displaystyle \mathbf {s} _{k}=\alpha _{k}\mathbf {p} _{k}}$, which is standard in the literature. The inverse formula is now correct. —Preceding unsigned comment added by Drybid (talkcontribs) 21:46, 4 October 2009 (UTC)

Explanation needed

I think there is a mistake in the formula ${\displaystyle \mathbf {B} _{k}\mathbf {p} _{k}=-\nabla \mathbf {f} (x_{k})}$. ${\displaystyle \mathbf {B} _{k}}$ is a matrix, ${\displaystyle \mathbf {p} _{k}}$ is a vector, and ${\displaystyle \nabla \mathbf {f} (x_{k})}$ is a matrix. Then on the left-hand side there is a vector (a product of a matrix and a vector), but on the right-hand side there is still a matrix. Is there a mistake, or am I just missing something? Alex.

f(x) is scalar valued, so its gradient is vector valued. 018 (talk) 15:26, 2 June 2010 (UTC)

But what if I have ${\displaystyle \mathbf {f} (x_{k})}$ vector? What is the formula then? —Preceding unsigned comment added by 130.161.210.163 (talk) 09:52, 8 June 2010 (UTC)

Broyden's method. In line one, it states that BFGS solves argmin{f(x)} where f(x) is the objective of an optimization problem. Perhaps this implies that f(x) is a vector-to-scalar mapping, but this is not explicit. In the second sentence it says that BFGS approximates Newton's method, which is only unambiguously true in the univariate case. For the multivariate case, it can be used to minimize scalar f(x), not vector f(x), functions, and there are two classes of quasi-Newton methods.

The article also compares BFGS with Broyden's method in "So equivalently, Uk and Vk construct a rank-two update matrix which is robust against the scale problem often suffered in the gradient descent searching (e.g., in Broyden's method)." To me this implies that BFGS and Broyden's method are in some sense comparable, while in fact they solve completely different classes of problem (unless I am still confused, which is entirely possible).

I'm sure many students and engineers would greatly appreciate a page which explains which methods are appropriate for solving scalar-to-scalar, vector-to-scalar, vector-to-vector, bounded, unbounded, derivative-available, derivative-free, minimization and root-finding problems, noting which focus on accelerating convergence, which focus on robustness, and which are or are not subject to numerical precision issues (where unique solutions exist). Finding Brent's method, BFGS (where I take a numerical derivative, which may be sub-optimal) and Broyden's method (where I still have stability problems) within the pages of Wikipedia has been difficult, and we'd greatly appreciate help with this. Doug (talk) 17:07, 24 August 2013 (UTC)
Correction: Broyden's method is for root finding. I'm not sure how you are defining the minimum of a vector function, but the length would give you a scalar function. Doug (talk) 19:44, 24 August 2013 (UTC)

Answer

${\displaystyle \nabla \mathbf {f} (x_{k})}$ is not a matrix. It is a vector containing the gradient at ${\displaystyle x_{k}}$.

The Hessian discussed later, which contains second derivatives, is a matrix because one can differentiate twice using different partial derivatives at each step.

Ian — Preceding unsigned comment added by Ijdavis (talkcontribs) 05:15, 22 January 2014 (UTC)

Differing formulae for incrementally approximating the inverse Hessian

This page was recently changed to give a different formula at the end for how the inverse Hessian is to be computed. While the old and the new formulas appear to produce the same results (having implemented both), I think there is merit in showing both formulae and discussing their differences.

The earlier formula can be computed using far fewer FLOPs, because it can be implemented without multiplying k-by-k matrices together, instead at worst computing vectors with k elements which are then multiplied with a k-by-k matrix. Further, some parts of the computation reduce to scalars. It can also be computed using much less memory. In my implementation of the new formula I required two additional temporary k-by-k matrices: one to form the multiplying matrix, and the other to handle the fact that the B^(-1) matrix cannot easily be updated in place. The old formula could easily be implemented using only temporary vectors, with B^(-1) simply being updated incrementally by addition of outer products of vectors. A small benchmark suggests that while the old formula can be computed multiple times in about 5 seconds, the same number of computations using the new formula takes about 2 minutes. That is a pretty stark difference.

The advantage of the new formula is that it is conceptually simpler, and it highlights the fact that we are computing transpose(M) B^(-1) M + S. If floating-point operations are more stable under multiplication than addition, it may offer greater stability, but given that far more FLOPs are being performed this seems somewhat suspect, since more rounding errors will be introduced.
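A quick numerical check (random data and the curvature condition are assumed here) that the two formulations of the inverse update agree, writing ${\displaystyle \rho =1/(\mathbf {y} ^{\top }\mathbf {s} )}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)      # current inverse-Hessian approximation (SPD)
s = rng.standard_normal(n)
y = rng.standard_normal(n)
if s @ y < 0:
    y = -y                       # curvature condition s^T y > 0
rho = 1.0 / (s @ y)
I = np.eye(n)

# "New" product form: transpose(M) H M + S, with M = I - rho y s^T, S = rho s s^T
H_prod = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
         + rho * np.outer(s, s)

# "Old" additive form: needs only matrix-vector and outer products
Hy = H @ y
H_add = H + (s @ y + y @ Hy) * rho**2 * np.outer(s, s) \
          - rho * (np.outer(Hy, s) + np.outer(s, Hy))

print(np.allclose(H_prod, H_add))  # True
```

The additive form never multiplies two k-by-k matrices together, which is the FLOP saving described above.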

I would be interested in other people's takes on the merits of the old and new formulae.

Ian Davis (textserver.com@gmail.com) — Preceding unsigned comment added by Ijdavis (talkcontribs) 05:07, 22 January 2014 (UTC)

A further comment

Arguably all of the information about how one incrementally computes a Hessian and its inverse under perturbations would be better placed on the page that describes Hessians. It is not relevant only to the BFGS algorithm. — Preceding unsigned comment added by Ijdavis (talkcontribs) 05:09, 23 January 2014 (UTC)

Why do the formulae work

Nothing is really said to justify why the formulae given for computing the approximation to the Hessian and its inverse actually do approximate the Hessian. That part of this page seems a bit too close to black magic. Can someone explain what is going on here? — Preceding unsigned comment added by Ijdavis (talkcontribs) 05:14, 23 January 2014 (UTC)
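One standard justification, sketched here from textbook treatments such as Nocedal and Wright rather than from the article itself: a first-order Taylor expansion of the gradient yields the secant equation, and the update is a minimal rank-two change to ${\displaystyle B_{k}}$ that satisfies it.

```latex
% First-order Taylor expansion of the gradient between iterates:
%   \nabla f(x_{k+1}) - \nabla f(x_k) \approx \nabla^2 f(x_{k+1})\,(x_{k+1} - x_k)
% With s_k = x_{k+1} - x_k and y_k = \nabla f(x_{k+1}) - \nabla f(x_k),
% any reasonable Hessian approximation should satisfy the secant equation:
\begin{equation}
  B_{k+1}\, s_k = y_k .
\end{equation}
% The BFGS formula is a rank-two change to B_k chosen to satisfy it while
% keeping symmetry; substituting s_k into the update confirms this directly:
\begin{equation}
  B_{k+1} s_k
  = B_k s_k
  + \frac{y_k\, (y_k^{\top} s_k)}{y_k^{\top} s_k}
  - \frac{B_k s_k\, (s_k^{\top} B_k s_k)}{s_k^{\top} B_k s_k}
  = y_k .
\end{equation}
```

The two correction terms cancel ${\displaystyle B_{k}s_{k}}$ and produce ${\displaystyle y_{k}}$ exactly, so the updated matrix mimics the true Hessian's action along the step just taken.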

External links modified

Hello fellow Wikipedians,

I have just modified one external link on Broyden–Fletcher–Goldfarb–Shanno algorithm. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FAQ for additional information. I made the following changes:

When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at {{Sourcecheck}}).


• If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
• If you found an error with any archives or the URLs themselves, you can fix them with this tool.

If you are unable to use these tools, you may set |needhelp=<your help request> on this template to request help from an experienced user. Please include details about your problem, to help other editors.

Cheers.—InternetArchiveBot 14:24, 9 November 2016 (UTC)

Another proof of B_k

Another derivation of B_k is to minimize the function ${\displaystyle min_{\mathbf {B} }||\mathbf {B-B_{k}} ||}$ subject to ${\displaystyle \mathbf {B=B^{\top },Bs_{k}=y_{k}} }$. Here the matrix norm is ${\displaystyle \mathbf {||A||_{W}\equiv ||W^{1/2}AW^{1/2}||_{F}} }$, where the weight matrix ${\displaystyle \mathbf {W} }$ is chosen as the average Hessian ${\displaystyle \mathbf {G} }$.

Let ${\displaystyle \mathbf {E} =\mathbf {B-B_{k}} ;\gamma =y_{k};\delta =s_{k};\xi =\gamma -\mathbf {B} \delta }$, the Lagrangian function is:

${\displaystyle {\mathcal {L}}={\frac {1}{4}}trace(\mathbf {WE^{\top }WE} )+trace(\mathbf {\Lambda ^{\top }(E^{\top }-E} ))-\lambda ^{\top }\mathbf {W(E\delta -\xi )} }$

Such a derivation is also instructive.--Dzhouaa (talk) 09:11, 25 January 2017 (UTC)