Talk:Hessian matrix

From Wikipedia, the free encyclopedia
WikiProject Mathematics (Rated Start-class, Mid-importance; Field: Analysis)

Initial discussion[edit]

This would really be a lot easier to understand if we could see a visual representation, something like

Hf(x,y) = [ ∂²f/∂x²    ∂²f/∂y∂x ]

          [ ∂²f/∂x∂y   ∂²f/∂y²  ]
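For what it's worth, the layout above can be checked numerically. This is a small sketch of my own (the test function and step size are arbitrary choices, not from the thread), using central finite differences:

```python
# Sketch (mine, not from the thread): approximate the 2x2 Hessian layout
# shown above by central finite differences on an arbitrary test function.
def hessian_2x2(f, x, y, h=1e-5):
    """Return [[f_xx, f_yx], [f_xy, f_yy]] at (x, y) by central differences."""
    fxx = (f(x + h, y) - 2.0 * f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2.0 * f(x, y) + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4.0 * h**2)
    # For twice continuously differentiable f, f_xy = f_yx (Schwarz's
    # theorem), so the mixed entry fills both off-diagonal slots.
    return [[fxx, fxy], [fxy, fyy]]

f = lambda x, y: x**2 * y + y**3   # exact Hessian: [[2y, 2x], [2x, 6y]]
H = hessian_2x2(f, 1.0, 2.0)       # exact value there: [[4, 2], [2, 12]]
```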

"Hessian matrices are used in large-scale optimization problems" looks incorrect to me: for high-dimensional problems, second-order methods are usually considered only when the problem has some known exploitable (sparse) structure. In general the Hessian matrix is too big to be stored, so first-order methods are the main choice. --Lostella (talk) 09:44, 10 July 2013 (UTC)

Is this correct?[edit]

Shouldn't the second sentence of 'Second derivative test' ('If the Hessian is positive definite...') be 'If the determinant of the Hessian is positive definite...'?

A positive-definite matrix is a type of symmetric matrix. A determinant is just a real number, which may be positive or negative, but not positive definite. Follow the link. -GTBacchus(talk) 23:12, 5 March 2006 (UTC)
Being positive-definite is not related to being symmetric. It just says that all eigenvalues of the matrix are positive, or equivalently that the bilinear form constructed with it is positive definite (i.e. x^T A x > 0 for all x ≠ 0). You only find both terms (i.e. symmetric and positive definite) going together so often that there is already a shorthand for this: s.p.d. Nonetheless, the terms are distinct. (talk) 12:31, 6 April 2009 (UTC) (ezander)
Right. Consider a rotation matrix such as

R = [ cos θ   −sin θ ]
    [ sin θ    cos θ ]

Note that it has a determinant of 1 but is not symmetric and has complex eigenvalues. —Ben FrantzDale (talk) 13:02, 6 April 2009 (UTC)
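To make the counterexample concrete, here is a quick numeric check of my own (the angle is an arbitrary choice): a 2×2 rotation matrix has determinant 1, is not symmetric, and has complex eigenvalues.

```python
# Numeric check (my illustration): a 2x2 rotation matrix has determinant 1,
# is not symmetric, and has complex eigenvalues. The angle is arbitrary.
import cmath
import math

theta = math.pi / 3
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

det = R[0][0] * R[1][1] - R[0][1] * R[1][0]   # cos^2 + sin^2 = 1
is_symmetric = (R[0][1] == R[1][0])           # False whenever sin(theta) != 0

# Eigenvalues from the characteristic polynomial t^2 - tr*t + det = 0:
tr = R[0][0] + R[1][1]
disc = cmath.sqrt(tr * tr - 4.0 * det)
eigs = ((tr + disc) / 2.0, (tr - disc) / 2.0)  # cos(theta) +/- i*sin(theta)
```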


With regard to the del operator, is it that

H(f) = ∇∇f ?

Or am I just confused? —Ben FrantzDale 08:13, 28 March 2006 (UTC)

I think that is close, but you need to transpose one of the dels as well as write f as a diagonal matrix:

I'm pretty sure this is right--hope it helps. 16:56, 3 Apr 2006 (UTC)
the transpose is redundant, as it is part of the definition of the dyadic product. The ⋅ shouldn't be there, as that would make it a divergence, which is defined for vector functions, whereas f here is a scalar function.
Kaoru Itou (talk) 20:49, 28 January 2009 (UTC)
Also, diagonalising f before multiplying it makes no difference.
Kaoru Itou (talk) 22:14, 28 January 2009 (UTC)


It would be good to have at least one example of the use of Hessians in optimization problems, and perhaps a few words on applications of Hessians to statistical problems, e.g. maximization of a likelihood over its parameters. --Smári McCarthy 16:01, 19 May 2006 (UTC)

I agree. BriEnBest (talk) 20:00, 28 October 2010 (UTC)
Me too. --Kvng (talk) 15:01, 3 July 2012 (UTC)


The Hessian displayed is incorrect; it should be 1/2 of the second derivative matrix. Charlielewis 06:34, 11 December 2006 (UTC)

Bordered Hessian[edit]

It is not clear what a bordered Hessian with more than one constraint should look like. If I knew, I would fix it. --Marra 16:07, 19 February 2007 (UTC)

See the added "If there are, say, m constraints ...". Arie ten Cate 15:08, 6 May 2007 (UTC)

The definition of the bordered Hessian is extremely confusing. I suggest using the definition in Luenberger's book "Linear and Nonlinear Programming". I am adding it to my todo list and will correct it soon. --Max Allen G (talk) 19:31, 8 April 2010 (UTC)

In the Bordered Hessian should it not be the Hessian of the Lagrange function instead of (what is currently presented) the Hessian of f? -- Ben —Preceding unsigned comment added by (talk) 11:27, 8 June 2010 (UTC)

I agree. This seems wrong to me. The article cites Fundamental Methods of Mathematical Economics, by Chiang. Chiang has this as the Hessian of the Lagrange function, as you described. I think it should be changed. — Preceding unsigned comment added by Blossomonte (talkcontribs) 15:07, 10 September 2013 (UTC)
Please can someone fix this. As it is, the property of being a maximum or minimum depends only on the gradient of the constraint function, which is clearly not correct. The curvature of the surface defined by the constraint function must also come into it, through its Hessian (which appears through the Hessian of the Lagrangian). Thus, what is written here is obviously wrong. — Preceding unsigned comment added by (talkcontribs) 03:08, 6 May 2014‎
It seems that you have not correctly read the paragraph beginning with "specifically": all the minors that are considered depend not only on the constraint function, but also on the second derivatives of the function f. Also, please sign your comments on the talk page with four tildes (~~~~). D.Lazard (talk) 14:56, 6 May 2014 (UTC)
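To illustrate why the Hessian of the Lagrangian (not of f alone) has to enter, here is a small sketch of my own; the function, constraint, and critical point are hypothetical choices, not taken from the article:

```python
# Hypothetical example (mine, not from the article): maximize f(x, y) = x + y
# on the circle g(x, y) = x^2 + y^2 - 2 = 0. The Hessian of f itself is zero,
# so only the Hessian of the Lagrangian L = f - lam*g makes the bordered
# Hessian test informative.
def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    a, b, c = M[0]
    d, e, f = M[1]
    g, h, i = M[2]
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

x, y, lam = 1.0, 1.0, 0.5          # stationary point: grad f = lam * grad g
gx, gy = 2.0 * x, 2.0 * y          # gradient of the constraint g

# Bordered Hessian built with the Hessian of L, which is diag(-2*lam, -2*lam):
HL = [[-2.0 * lam, 0.0], [0.0, -2.0 * lam]]
B_L = [[0.0, gx, gy],
       [gx, HL[0][0], HL[0][1]],
       [gy, HL[1][0], HL[1][1]]]

# Same border, but with the Hessian of f alone (identically zero):
B_f = [[0.0, gx, gy],
       [gx, 0.0, 0.0],
       [gy, 0.0, 0.0]]

D_L = det3(B_L)   # 8 > 0: sign of (-1)^n with n = 2, a constrained maximum
D_f = det3(B_f)   # 0: the test degenerates without the Lagrangian's Hessian
```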

The Theorem is Wrong[edit]

I had learned that F_xy = F_yx is Young's theorem, not Schwarz's —The preceding unsigned comment was added by Lachliggity (talkcontribs) 03:02, 16 March 2007 (UTC).

What if det H is zero?[edit]

It would be nice if someone could include what to do when the determinant of the Hessian matrix is zero. I thought you had to check with higher-order derivatives, but I'm not too sure. Aphexer 09:52, 1 June 2007 (UTC)
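As an illustration of the inconclusive case (my own example, not from the thread): two functions with the same vanishing Hessian at the origin can behave differently there, so one does have to look beyond the second derivatives.

```python
# Illustration (my own, not from the thread): when det(H) = 0 the second
# derivative test is inconclusive -- two functions with the same (zero)
# Hessian at the origin behave differently there.
f_min = lambda x, y: x**4 + y**4      # Hessian at (0, 0) is the zero matrix
f_saddle = lambda x, y: x**4 - y**4   # identical (zero) Hessian at (0, 0)

eps = 0.1
# (0, 0) is a strict minimum of f_min: every nearby sample lies above it...
min_holds = all(f_min(dx, dy) > f_min(0.0, 0.0)
                for dx in (-eps, 0.0, eps) for dy in (-eps, 0.0, eps)
                if (dx, dy) != (0.0, 0.0))
# ...while f_saddle decreases along the y axis, so (0, 0) is not a minimum:
saddle_drops = f_saddle(0.0, eps) < f_saddle(0.0, 0.0)
```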

"Univariate" function?[edit]

Please note that "univariate" in the intro refers to a statistical concept, which I believe does not apply here. Even in function (mathematics) there is no mention of "univariate functions", which anyway suggests to me a function of one independent variable, which is not what we are discussing. I'll be bold and remove it; please fix if you know what was meant. Thanks. 05:43, 9 August 2007 (UTC)

"Single-valued" perhaps? But do we really need to specify that? In the intro? I would just leave "function". 05:45, 9 August 2007 (UTC)

I think I made that change. My thought was that I wanted to differentiate "single valued" from (keeping in this notation's spirit) "multi-valued", or, to quote from the second sentence and from the fifth section, "real valued" vs. "vector valued". I did not want the first sentence to be ambiguous about the fact that in general the Hessian is a matrix, which then has a tensor extension for vector-valued functions.
The term "univariate" does show my professional bias, and while I still think it's appropriate, "single-valued" is completely acceptable as well. I still have a concern that not qualifying the first sentence at all allows the tensor to be considered a case of a Hessian matrix, when I think that is better thought of as an extension of the concept, since it's not a matrix per se. However, I will not revert it and will chime in on any discussion and clarification here. Baccyak4H (Yak!) 14:15, 9 August 2007 (UTC)

Vector valued functions[edit]

"If f is instead vector-valued, ..., then the array of second partial derivatives is not a matrix, but a tensor of rank 3."

I think this is wrong. Wouldn't the natural extension of the Hessian to a 3-valued (i.e. ℝⁿ → ℝ³) function just be 3 Hessian matrices?

Is this sentence instead trying to generalize Hessian matrices to higher-order partial derivative tensors of single-valued functions? 07:17, 2 October 2007 (UTC)

I'm sure someone can point to a reference which will answer your question, but it would seem that analogous to the Jacobian of a vector valued function, which is a matrix and not (just) a set of derivative vectors, that a rank 3 tensor makes sense: one could take inner products with such a tensor, say like in a higher order term of a multivariate Taylor series. That operation doesn't make as much sense if all one has is a set of matrices. And it would seem one could always have one extant of the tensor index the elements of the vector of the function, with an arbitrary number of additional extants indexing arbitrary numbers of variables of differentiation. My $0.02. Baccyak4H (Yak!) 13:32, 2 October 2007 (UTC)
I can't see what you're getting at.
What I mean is that if f = (f_1, ..., f_n), where f maps to ℝⁿ and each f_i maps to ℝ, then isn't the natural "Hessian" of f just the n matrices H(f_1), ..., H(f_n)?
And the only function returning a tensor that makes sense is the higher-order partial derivatives of a real-valued function g(). E.g. if the rank-3 tensor T holds the third-order partial derivatives of g(), then T_ijk = ∂³g/∂x_i∂x_j∂x_k.
If you disagree with this, can you explicitly state what the entry T_ijk should be (in terms of f = (f_1, f_2, ..., f_n)) if T is supposed to be the "hessian" of a vector-valued function? 22:57, 3 October 2007 (UTC)
I thought it was a tensor, with T_ijk = ∂²f_i/∂x_j∂x_k. Kaoru Itou (talk) 22:16, 4 February 2009 (UTC)

Riemannian geometry[edit]

Can someone write on this topic from the point of view of Riemannian geometry? (there should be links e.g. to covariant derivative). Commentor (talk) 05:15, 3 March 2008 (UTC)

I think there's something wrong with the indices and the tensor product. We define Hess f = ∇df. Then, in local coordinates, (Hess f)_{ij} = ∂_i∂_j f − Γ^k_{ij} ∂_k f (or, in terms of the nabla, (Hess f)_{ij} = ∇_i∇_j f), and thus Hess f = (Hess f)_{ij} dx^i ⊗ dx^j. So once we have written the Hessian with indices, i.e. (Hess f)_{ij}, we do not need to write any more tensor products dx^i ⊗ dx^j. (Since we need to get a tensor of rank 2!) —Preceding unsigned comment added by (talk) 19:14, 2 April 2011 (UTC)

Local polynomial expansion[edit]

As I understand it, the Hessian describes the second-order shape of a smooth function in a given neighborhood. So is this right?:

f(x + Δx) ≈ f(x) + J(x) Δx + ½ Δxᵀ H(x) Δx

(Noting that the Jacobian matrix is equal to the gradient for scalar-valued functions.) That seems like it should be the vector-domained equivalent of

f(x + Δx) ≈ f(x) + f′(x) Δx + ½ f″(x) Δx²

If that's right, I'll add that to the article. —Ben FrantzDale (talk) 04:30, 23 November 2008 (UTC)

That appears to be right, as discussed (crudely) in Taylor_series#Taylor_series_in_several_variables. —Ben FrantzDale (talk) 04:37, 23 November 2008 (UTC)
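For what it's worth, the proposed expansion can be checked numerically; this is my own sketch with an arbitrary cubic test function, where the second-order model's error should shrink like |Δx|³:

```python
# Numeric check (my sketch) of the second-order expansion discussed above:
# f(x + dx) ≈ f(x) + grad f(x)·dx + (1/2) dxᵀ H(x) dx, with error O(|dx|^3).
# The cubic test function is an arbitrary choice.
f = lambda x, y: x**3 + x * y**2

def grad(x, y):
    return (3.0 * x**2 + y**2, 2.0 * x * y)

def hess(x, y):
    return [[6.0 * x, 2.0 * y],
            [2.0 * y, 2.0 * x]]

x0, y0 = 1.0, 1.0
dx, dy = 1e-2, -2e-2

g = grad(x0, y0)
H = hess(x0, y0)
quad = 0.5 * (H[0][0] * dx * dx + 2.0 * H[0][1] * dx * dy + H[1][1] * dy * dy)

lin_model = f(x0, y0) + g[0] * dx + g[1] * dy   # first-order (gradient) model
quad_model = lin_model + quad                   # adds the Hessian term

err1 = abs(f(x0 + dx, y0 + dy) - lin_model)     # O(|dx|^2): ~3e-4 here
err2 = abs(f(x0 + dx, y0 + dy) - quad_model)    # O(|dx|^3): ~5e-6 here
```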

Approximation with Jacobian[edit]

Some optimization algorithms (e.g., levmar) approximate the Hessian of a cost function (half the sum of squares of a residual) with JᵀJ, where J is the Jacobian matrix of r with respect to x:

H(f(x)) ≈ Jᵀ(x) J(x)

For reference, here's the derivation. I may add it to this page, since the approximation is important and has practical applications (source).

Here's my understanding:

If we have a sum-of-squares cost function:

f(x) = ½ Σ_i r_i(x)²

then simple differentiation gives:

∂f/∂x_j = Σ_i r_i (∂r_i/∂x_j)

Then using the product rule inside the summation:

∂²f/∂x_j∂x_k = Σ_i [ (∂r_i/∂x_j)(∂r_i/∂x_k) + r_i (∂²r_i/∂x_j∂x_k) ]

The second term is H.O.T. for small residuals.

With all that in mind, for small residuals we can approximate the terms of H with

H_jk ≈ Σ_i (∂r_i/∂x_j)(∂r_i/∂x_k), i.e. H ≈ Jᵀ J.
In words this says that for small residuals, the curvature of f (in any combination of directions) is approximated by the sum of squares of the rate of change of the components of the residuals in the same directions. We ignore the curvature of the residual components weighted by the residuals—they are small so they don't have much room to curve and are additionally downweighted by their (small) value. —Ben FrantzDale (talk) 13:24, 10 August 2010 (UTC)
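The neglected term can be made concrete with a toy example (mine, not from levmar): for a two-residual problem near a zero-residual point, the gap between JᵀJ and the exact Hessian is exactly the residual-weighted curvature term.

```python
# Toy example (mine, not from levmar): for f = (1/2) * sum_i r_i^2, compare
# the approximation JᵀJ with the exact Hessian JᵀJ + sum_i r_i * Hess(r_i)
# near a point where the residuals vanish.
def residuals(x1, x2):
    return (x1**2 - x2, x2 - 1.0)       # both are zero at (1, 1)

def jacobian(x1, x2):
    return [[2.0 * x1, -1.0],           # row i = gradient of r_i
            [0.0, 1.0]]

x1, x2 = 1.0, 0.99                      # slightly off the zero-residual point
r = residuals(x1, x2)
J = jacobian(x1, x2)

# J^T J, the Gauss-Newton-style approximation:
JtJ = [[sum(J[i][a] * J[i][b] for i in range(2)) for b in range(2)]
       for a in range(2)]

# The exact Hessian adds r_i * Hess(r_i); here Hess(r_1) = [[2, 0], [0, 0]]
# and Hess(r_2) = 0, so only the top-left entry differs:
H_exact = [[JtJ[0][0] + 2.0 * r[0], JtJ[0][1]],
           [JtJ[1][0], JtJ[1][1]]]

gap = max(abs(H_exact[a][b] - JtJ[a][b]) for a in range(2) for b in range(2))
# gap = 2*|r_1| = 0.02, and it shrinks to zero with the residuals.
```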


There is a problem with notation in the lead. H(f) and H(x) are used, but the arguments are very different. In the first case, f is a function, and in the second case, x is a vector. One should try to unify the notation or at least clarify the differences. Renato (talk) 22:04, 12 September 2011 (UTC)

Good point. I added clarification. —Ben FrantzDale (talk) 11:44, 13 September 2011 (UTC)

Hessian = (transpose of?) Jacobian of the gradient.[edit]

Given a scalar smooth function f: ℝⁿ → ℝ, the gradient will be a vector-valued function ∇f: ℝⁿ → ℝⁿ. We can even write ∇f = (∂f/∂x₁, ..., ∂f/∂xₙ)ᵀ, a column vector where the first entry is the derivative of f with respect to x₁ and so forth.

Then, the Jacobian of ∇f, as stated in Jacobian matrix and determinant, is a matrix where the i-th column has the derivatives of all components of ∇f with respect to x_i. This is the transpose of the Hessian, or is it not? Wisapi (talk) 17:04, 20 January 2016 (UTC)

The distinction is moot due to the symmetry of the Hessian if Clairaut's theorem holds for all the partial derivatives concerned, as in most elementary cases. If not, then consider the order of partial differentiation implied by the operators. Then one gets somewhat of a contradictory situation. Every source (Wolfram MathWorld among others) calls it simply the Jacobian of the gradient. But the same sources use the notation in the article, and for all i and j. I am in favor of the current statement, however, since the idea is that taking the total derivative twice yields a second derivative, without complications such as this.--Jasper Deng (talk) 17:48, 20 January 2016 (UTC)
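A quick numeric check of the question (my own sketch; the test function is an arbitrary choice): differencing an analytic gradient gives the Jacobian of the gradient, which matches the Hessian and is symmetric up to discretization error, so transposing it changes nothing in this smooth case.

```python
# Quick check (my sketch, arbitrary test function f = x^3*y + y^2): finite
# differencing the analytic gradient yields the Jacobian of the gradient,
# which matches the Hessian and is symmetric up to discretization error.
def grad_f(x, y):
    return (3.0 * x**2 * y, x**3 + 2.0 * y)   # gradient of f = x^3*y + y^2

def jacobian_of_gradient(x, y, h=1e-6):
    """Column j holds d(grad f)/dx_j by central differences."""
    gxp = grad_f(x + h, y)
    gxm = grad_f(x - h, y)
    gyp = grad_f(x, y + h)
    gym = grad_f(x, y - h)
    return [[(gxp[0] - gxm[0]) / (2.0 * h), (gyp[0] - gym[0]) / (2.0 * h)],
            [(gxp[1] - gxm[1]) / (2.0 * h), (gyp[1] - gym[1]) / (2.0 * h)]]

Jg = jacobian_of_gradient(1.0, 2.0)    # exact Hessian there: [[12, 3], [3, 2]]
sym_gap = abs(Jg[0][1] - Jg[1][0])     # ~0 by Clairaut/Schwarz symmetry
```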