## Search in trial direction

I was under the impression that gradient descent did a line search in the trial direction (direction of steepest gradient). Can anyone else confirm this? njh 04:45, 25 November 2005 (UTC)

I guess you are saying that, given a direction, one should find the optimal step in that direction. That is, I guess, also gradient descent, but from what I know it is very hard to do in practice. So one just takes some reasonable step in one direction, and then from there takes a new direction, as this article illustrates. Did I understand your question right? Oleg Alexandrov (talk) 05:59, 25 November 2005 (UTC)
Usually one finds a trial direction, then uses a one-dimensional minimisation algorithm like Brent's method along the vector thus defined. This is generally a better idea than making arbitrary small steps, as you only need to compute the gradient at the minimum in each direction (of course it still suffers from zigzagging). The conjugate gradient method extends this to ensure that zigzagging doesn't happen. njh 08:37, 25 November 2005 (UTC)
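For readers following this thread, here is a minimal sketch of the "trial direction plus one-dimensional minimisation" idea. It uses a plain golden-section search as a stand-in for Brent's method (simpler, but the same role), and the names `golden_section`, `descent_with_line_search` and `max_step` are hypothetical, chosen only for this illustration. It assumes the function is unimodal along each search segment.

```python
import math

def golden_section(phi, a, b, tol=1e-10):
    """Minimise a one-dimensional function phi on [a, b], assumed unimodal there."""
    inv_phi = (math.sqrt(5) - 1) / 2  # reciprocal of the golden ratio, about 0.618
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while abs(b - a) > tol:
        if phi(c) < phi(d):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2

def descent_with_line_search(f, grad, x, iters=100, max_step=1.0):
    """Gradient descent where each step length is found by a 1-D search."""
    for _ in range(iters):
        g = grad(x)
        if all(abs(gi) < 1e-12 for gi in g):
            break  # gradient is numerically zero, nothing left to do
        # Restrict f to the ray x - t*g and minimise over t in [0, max_step].
        phi = lambda t: f([xi - t * gi for xi, gi in zip(x, g)])
        t = golden_section(phi, 0.0, max_step)
        x = [xi - t * gi for xi, gi in zip(x, g)]
    return x

# Example on an elongated bowl, where the zigzag behaviour shows up.
f = lambda p: p[0] ** 2 + 10 * p[1] ** 2
grad = lambda p: [2 * p[0], 20 * p[1]]
x_min = descent_with_line_search(f, grad, [10.0, 1.0])
```

Note that the gradient is evaluated only once per direction, while the function itself is evaluated many times inside the one-dimensional search.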
I myself used gradient descent only for infinite-dimensional optimization, when the variable of optimization is not a vector but a function. Then it is really a pain to find the optimal step. :) And you don't need to use an arbitrarily small step size as you say; you can make it adaptive, so you try to jump a bit more at each step than at the previous one, and scale back if you jump too much. Of course, in this way you are not guaranteed to find the true minimum, but those infinite-dimensional problems often have an infinite number of local minima to start with, so it is not as if you lost something. Oleg Alexandrov (talk) 15:54, 25 November 2005 (UTC)
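The adaptive scheme described in the last comment could look roughly like the sketch below. The growth and shrink factors (`grow=1.2`, `shrink=0.5`) and the function name `adaptive_descent` are assumptions made for illustration; the comment itself does not give specific values.

```python
def adaptive_descent(f, grad, x, step=0.1, grow=1.2, shrink=0.5, iters=200):
    """Gradient descent with a step that grows on success and shrinks on overshoot."""
    fx = f(x)
    for _ in range(iters):
        g = grad(x)
        trial = [xi - step * gi for xi, gi in zip(x, g)]
        f_trial = f(trial)
        if f_trial < fx:
            # The step reduced f: accept it and try to jump a bit more next time.
            x, fx = trial, f_trial
            step *= grow
        else:
            # Jumped too far: stay where we are and scale the step back.
            step *= shrink
    return x

# Example on the same kind of elongated bowl.
f = lambda p: p[0] ** 2 + 10 * p[1] ** 2
grad = lambda p: [2 * p[0], 20 * p[1]]
x_min = adaptive_descent(f, grad, [10.0, 1.0])
```

Because a step is accepted only when it lowers the function value, the value never increases, but nothing here guarantees that the point reached is a global minimum.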

## Someone needs to correct the definition

Doesn't sound logical to me: "algorithm that approaches a local minimum of a function" means that the algorithm itself is approaching something. What?

I am interested in what I can do (or what I can find) with this algorithm.

Inyuki 21:46, 2 November 2006 (UTC)

You are right. That poor wording has been there from the very beginning. I rewrote the article at some point but failed to notice that.
This algorithm is used to find local minima of functions. Oleg Alexandrov (talk) 04:40, 3 November 2006 (UTC)

## Understanding the algorithm

Excuse me, maybe someone could tell me if I got it right. I tried to write it in procedural code here: http://howto.wikia.com/wiki/Gradient_descent

Thank you.

Inyuki 23:33, 7 November 2006 (UTC)

Looks okay to me, for the very basic algorithm. One problem is that you're unlikely to land on a point where the gradient is exactly zero. Terminate the loop when the gradient is "small enough" (choosing what this means is the hard part). -- Jitse Niesen (talk) 01:10, 8 November 2006 (UTC)
In many applications, the minimum derivative (the threshold below which the gradient counts as "small enough") is a user-input variable. --Cowbert 05:04, 29 December 2006 (UTC)
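A minimal sketch of the stopping rule discussed above: the loop ends when the gradient norm falls below a user-supplied tolerance rather than when it is exactly zero. The parameter names `tol` and `max_iters` and their default values are assumptions for illustration, and the fixed step size is the "very basic" variant.

```python
import math

def gradient_descent(grad, x, step=0.1, tol=1e-8, max_iters=10_000):
    """Basic fixed-step gradient descent; returns the point and the iteration count."""
    for k in range(max_iters):
        g = grad(x)
        # Stop when the Euclidean norm of the gradient is "small enough".
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            return x, k
        x = [xi - step * gi for xi, gi in zip(x, g)]
    # Tolerance never reached: return the last iterate anyway.
    return x, max_iters

# Example: f(x, y) = (x - 3)^2 + (y + 1)^2 has its minimum at (3, -1).
grad = lambda p: [2 * (p[0] - 3), 2 * (p[1] + 1)]
x_min, n_iters = gradient_descent(grad, [0.0, 0.0])
```

The `max_iters` cap is a safety net for the case where the tolerance is set too tight or the step size is too large for the iteration to settle.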

## More elegant formulation?

I've usually seen gradient descent formulated more elegantly as a differential equation x'(t)=-\nabla f(x(t)). Once discretized, it takes the form that is given in the article. I think it would be nice to add this. Does anyone object? —Preceding unsigned comment added by Winblows (talk • contribs) 22:09, 6 August 2008 (UTC)

This "gradient flow" interpretation is neat indeed, but harder to grasp than the "discrete" formulation used in the article, and it is the discrete version of gradient descent which is most used in applications. Some materials on this proposed continuous formulation would be OK, as long as it is further down the article. Oleg Alexandrov (talk) 02:27, 7 August 2008 (UTC)
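For reference, the link between the two formulations can be sketched as follows, assuming a forward Euler discretization with a fixed time step \gamma (the symbols t_n, x_n and \gamma are notation chosen here, not taken from the comments above):

```latex
x'(t) = -\nabla f(x(t)),
\qquad
\frac{x(t_n + \gamma) - x(t_n)}{\gamma} \approx -\nabla f(x(t_n))
\quad\Longrightarrow\quad
x_{n+1} = x_n - \gamma \, \nabla f(x_n),
\quad x_n \approx x(t_n), \; t_n = n\gamma .
```

Under this reading, the step size of the discrete algorithm plays the role of the time step of the flow.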

## Non-linear systems example introduces undefined symbols

As far as I can tell, the symbols z, z_0, and t are used without being introduced or defined earlier in the page. This makes the description very difficult to follow. —Preceding unsigned comment added by 216.244.31.162 (talk) 11:12, 25 February 2010 (UTC)

The description of using gradient descent as a way to solve a non-linear equation is impossible to follow because (as the previous comment suggested), basically none of the symbols are defined before they are used. What are "f" and "z" and everything else in that argument? —Preceding unsigned comment added by 69.196.185.126 (talk) 18:21, 13 May 2010 (UTC)

The "talker" right above me is right. The author(s) skip(s) around a bit changing style and notation. It's a nice attempt but has gone awry. Try Google for something better. <! W. Watson !>

Okay, I hope that helps. 018 (talk) 01:03, 15 June 2010 (UTC)

It also uses the Jacobian matrix of the scalar function F, which is nonsense. Lostella (talk) 09:50, 12 October 2010 (UTC)

Now this is fixed, I believe. We take the transpose of the Jacobian of G with respect to X and multiply it by the G vector, which is equivalent to taking the gradient of our F function. I believe there is a factor of 0.5 which is not accounted for, though. This is a scalar, of course, and since the gamma "step" multiplies the same term, I suppose it doesn't matter. But it should probably be added for completeness. Peryeat (talk) 05:32, 10 January 2011 (UTC)
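The role of the factor 0.5 can be written out as below. This is a sketch that assumes F is defined as the (half) squared norm of G, with J_G denoting the Jacobian of G; the exact definition used in the article is not quoted in this thread.

```latex
F(x) = \tfrac{1}{2}\, G(x)^{\mathsf T} G(x)
\;\Longrightarrow\;
\nabla F(x) = J_G(x)^{\mathsf T} G(x),
\qquad
\tilde F(x) = G(x)^{\mathsf T} G(x)
\;\Longrightarrow\;
\nabla \tilde F(x) = 2\, J_G(x)^{\mathsf T} G(x).
```

Either way the descent direction is the same, and the constant factor can be absorbed into the step size \gamma.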

I believe it also made an error in arithmetic. The first entry in the G(X0) matrix is computed as -1.5. I believe it is -2.5, as the cosine of 0 is 1. It took me an hour and a half to figure out that was wrong. ;-) I'll attempt to fix it. Peryeat (talk) 03:16, 10 January 2011 (UTC)
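The arithmetic can be checked in a couple of lines. The form of the first component of G used below is an assumption (it is not stated anywhere on this page), chosen so that it contains a cosine term and a constant of 1.5 as the comment implies; `g1` is a hypothetical name.

```python
import math

def g1(x1, x2, x3):
    # Assumed form of the first component of G: 3*x1 - cos(x2*x3) - 3/2.
    # This is a guess consistent with the comment above, not a quoted formula.
    return 3 * x1 - math.cos(x2 * x3) - 1.5

value = g1(0.0, 0.0, 0.0)
print(value)  # prints -2.5, since cos(0) = 1
```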

I think it's better now. Many of the calculations needed to be redone in light of this, and the optimization is less exciting, but I think it's correct. Peryeat (talk) 05:32, 10 January 2011 (UTC)

### Introduction of animated example. January 16, 2011

I added an animation to try to aid in the understanding of this example. I know that there is great potential for something like this to work for explanation, but I don't have the proper software to edit the animation sufficiently. If anyone would like to clean it up, I'd appreciate it. I could give the Octave code to anyone who wants it. Peryeat (talk) 19:22, 16 January 2011 (UTC)