# Talk:Hessian matrix

WikiProject Mathematics (Rated Start-class, Mid-importance)
Field:  Analysis

## Initial discussion

This would really be a lot easier to understand if we could see a visual representation, something like

${\displaystyle H_{f}(x,y)={\begin{bmatrix}{\frac {\partial ^{2}f}{\partial x^{2}}}&{\frac {\partial ^{2}f}{\partial y\partial x}}\\{\frac {\partial ^{2}f}{\partial x\partial y}}&{\frac {\partial ^{2}f}{\partial y^{2}}}\end{bmatrix}}}$
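For anyone reading along, the layout sketched above can be checked numerically with central differences. A minimal sketch; the helper name and the example function f(x, y) = x²y are assumptions for illustration only:

```python
# Hypothetical example; h is the finite-difference step.
def hessian_2d(f, x, y, h=1e-3):
    """Return [[f_xx, f_yx], [f_xy, f_yy]] by central differences."""
    f_xx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h**2
    f_yy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h**2
    f_xy = (f(x + h, y + h) - f(x + h, y - h)
            - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    return [[f_xx, f_xy], [f_xy, f_yy]]

f = lambda x, y: x**2 * y      # f_xx = 2y, f_xy = 2x, f_yy = 0
H = hessian_2d(f, 1.0, 2.0)    # close to [[4, 2], [2, 0]]
```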


"Hessian matrices are used in large-scale optimization problems" looks incorrect to me: for high-dimensional problems, second-order methods are usually considered only when the problem has some known exploitable (sparse) structure. In general the Hessian matrix is too big to be stored, so first-order methods are the main choice. --Lostella (talk) 09:44, 10 July 2013 (UTC)
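The storage point is easy to make concrete with back-of-envelope arithmetic (the function name below is just for illustration): a dense Hessian of an n-variable problem needs n² entries.

```python
# Illustrative arithmetic only.
def dense_hessian_bytes(n):
    """Memory for a dense n-by-n Hessian stored in 8-byte floats."""
    return n * n * 8

tib = dense_hessian_bytes(10**6) / 2**40   # one million variables: ~7.3 TiB
```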

## Is this correct?

Shouldn't the second sentence of 'Second derivative test' ('If the Hessian is positive definite...') instead read 'If the determinant of the Hessian is positive definite...'?

A positive-definite matrix is a type of symmetric matrix. A determinant is just a real number, which may be positive or negative, but not positive definite. Follow the link. -GTBacchus(talk) 23:12, 5 March 2006 (UTC)
Being positive-definite is not related to being symmetric. It just says that all eigenvalues of this matrix are positive, or that the bilinear form constructed with it is positive (i.e. x^T A x > 0 for all x ≠ 0). You only find both terms (i.e. symmetric and positive definite) going together so often that there is already a shorthand for this: s.p.d. Nonetheless, the terms are distinct. 134.169.77.186 (talk) 12:31, 6 April 2009 (UTC) (ezander)
Right. Consider a rotation matrix such as
R=${\displaystyle {\begin{bmatrix}0&1\\-1&0\end{bmatrix}}}$
Note that it has determinant of 1 but is not symmetric and has complex eigenvalues. —Ben FrantzDale (talk) 13:02, 6 April 2009 (UTC)
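A quick numerical sketch of this example (numpy here is an illustrative choice): R has determinant 1 and purely imaginary eigenvalues, and since it is skew-symmetric the quadratic form xᵀRx vanishes for every x, so a positive determinant says nothing about definiteness.

```python
import numpy as np

# The rotation matrix from the comment above.
R = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

det = np.linalg.det(R)               # 1, yet R is far from positive definite
eigvals = np.linalg.eigvals(R)       # purely imaginary: +i and -i
symmetric = bool(np.array_equal(R, R.T))
x = np.array([2.0, 3.0])             # arbitrary test vector
quad_form = x @ R @ x                # 0 for every x, since R is skew-symmetric
```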

## Del

With regard to the del operator, is it that

${\displaystyle H=\nabla \otimes \nabla \cdot f}$?

Or am I just confused? —Ben FrantzDale 08:13, 28 March 2006 (UTC)

I think that is close, but you need to transpose one of the dels as well as write f as a diagonal matrix:
${\displaystyle H=\nabla \otimes \nabla ^{T}\cdot \mathrm {diag} (f)={\begin{bmatrix}{\frac {\partial }{\partial x_{1}}}\\{\frac {\partial }{\partial x_{2}}}\\\vdots \\{\frac {\partial }{\partial x_{n}}}\end{bmatrix}}\otimes {\begin{bmatrix}{\frac {\partial }{\partial x_{1}}}&{\frac {\partial }{\partial x_{2}}}&\cdots &{\frac {\partial }{\partial x_{n}}}\end{bmatrix}}\cdot \mathrm {diag} (f)}$
${\displaystyle H={\begin{bmatrix}{\frac {\partial ^{2}}{\partial x_{1}^{2}}}&{\frac {\partial ^{2}}{\partial x_{1}\partial x_{2}}}&\cdots &{\frac {\partial ^{2}}{\partial x_{1}\partial x_{n}}}\\{\frac {\partial ^{2}}{\partial x_{2}\partial x_{1}}}&{\frac {\partial ^{2}}{\partial x_{2}^{2}}}&\cdots &{\frac {\partial ^{2}}{\partial x_{2}\partial x_{n}}}\\\vdots &\vdots &\ddots &\vdots \\{\frac {\partial ^{2}}{\partial x_{n}\partial x_{1}}}&{\frac {\partial ^{2}}{\partial x_{n}\partial x_{2}}}&\cdots &{\frac {\partial ^{2}}{\partial x_{n}^{2}}}\end{bmatrix}}\cdot {\begin{bmatrix}f&0&\cdots &0\\0&f&\cdots &0\\\vdots &\vdots &\ddots &\vdots \\0&0&\cdots &f\end{bmatrix}}={\begin{bmatrix}{\frac {\partial ^{2}f}{\partial x_{1}^{2}}}&{\frac {\partial ^{2}f}{\partial x_{1}\partial x_{2}}}&\cdots &{\frac {\partial ^{2}f}{\partial x_{1}\partial x_{n}}}\\{\frac {\partial ^{2}f}{\partial x_{2}\partial x_{1}}}&{\frac {\partial ^{2}f}{\partial x_{2}^{2}}}&\cdots &{\frac {\partial ^{2}f}{\partial x_{2}\partial x_{n}}}\\\vdots &\vdots &\ddots &\vdots \\{\frac {\partial ^{2}f}{\partial x_{n}\partial x_{1}}}&{\frac {\partial ^{2}f}{\partial x_{n}\partial x_{2}}}&\cdots &{\frac {\partial ^{2}f}{\partial x_{n}^{2}}}\end{bmatrix}}}$

I'm pretty sure this is right--hope it helps. 16:56, 3 Apr 2006 (UTC)
The transpose is redundant, as it is part of the definition of the dyadic product. The ${\displaystyle \cdot }$ shouldn't be there either, as that would make it a divergence, which is defined for vector functions, whereas f here is a scalar function.
${\displaystyle H=\nabla \otimes \nabla f={\begin{bmatrix}{\frac {\partial }{\partial x_{1}}}\\{\frac {\partial }{\partial x_{2}}}\\\vdots \\{\frac {\partial }{\partial x_{n}}}\end{bmatrix}}\otimes {\begin{bmatrix}{\frac {\partial f}{\partial x_{1}}}\\{\frac {\partial f}{\partial x_{2}}}\\\vdots \\{\frac {\partial f}{\partial x_{n}}}\end{bmatrix}}={\begin{bmatrix}{\frac {\partial }{\partial x_{1}}}\\{\frac {\partial }{\partial x_{2}}}\\\vdots \\{\frac {\partial }{\partial x_{n}}}\end{bmatrix}}{\begin{bmatrix}{\frac {\partial f}{\partial x_{1}}}&{\frac {\partial f}{\partial x_{2}}}&\cdots &{\frac {\partial f}{\partial x_{n}}}\end{bmatrix}}={\begin{bmatrix}{\frac {\partial ^{2}f}{\partial x_{1}^{2}}}&{\frac {\partial ^{2}f}{\partial x_{1}\partial x_{2}}}&\cdots &{\frac {\partial ^{2}f}{\partial x_{1}\partial x_{n}}}\\{\frac {\partial ^{2}f}{\partial x_{2}\partial x_{1}}}&{\frac {\partial ^{2}f}{\partial x_{2}^{2}}}&\cdots &{\frac {\partial ^{2}f}{\partial x_{2}\partial x_{n}}}\\\vdots &\vdots &\ddots &\vdots \\{\frac {\partial ^{2}f}{\partial x_{n}\partial x_{1}}}&{\frac {\partial ^{2}f}{\partial x_{n}\partial x_{2}}}&\cdots &{\frac {\partial ^{2}f}{\partial x_{n}^{2}}}\end{bmatrix}}}$
Kaoru Itou (talk) 20:49, 28 January 2009 (UTC)
Also, diagonalising f before multiplying it makes no difference:
${\displaystyle {\begin{bmatrix}{\frac {\partial ^{2}}{\partial x_{1}^{2}}}&{\frac {\partial ^{2}}{\partial x_{1}\partial x_{2}}}&\cdots &{\frac {\partial ^{2}}{\partial x_{1}\partial x_{n}}}\\{\frac {\partial ^{2}}{\partial x_{2}\partial x_{1}}}&{\frac {\partial ^{2}}{\partial x_{2}^{2}}}&\cdots &{\frac {\partial ^{2}}{\partial x_{2}\partial x_{n}}}\\\vdots &\vdots &\ddots &\vdots \\{\frac {\partial ^{2}}{\partial x_{n}\partial x_{1}}}&{\frac {\partial ^{2}}{\partial x_{n}\partial x_{2}}}&\cdots &{\frac {\partial ^{2}}{\partial x_{n}^{2}}}\end{bmatrix}}\cdot \mathrm {diag} (f)={\begin{bmatrix}{\frac {\partial ^{2}}{\partial x_{1}^{2}}}&{\frac {\partial ^{2}}{\partial x_{1}\partial x_{2}}}&\cdots &{\frac {\partial ^{2}}{\partial x_{1}\partial x_{n}}}\\{\frac {\partial ^{2}}{\partial x_{2}\partial x_{1}}}&{\frac {\partial ^{2}}{\partial x_{2}^{2}}}&\cdots &{\frac {\partial ^{2}}{\partial x_{2}\partial x_{n}}}\\\vdots &\vdots &\ddots &\vdots \\{\frac {\partial ^{2}}{\partial x_{n}\partial x_{1}}}&{\frac {\partial ^{2}}{\partial x_{n}\partial x_{2}}}&\cdots &{\frac {\partial ^{2}}{\partial x_{n}^{2}}}\end{bmatrix}}\cdot {\begin{bmatrix}1&0&\cdots &0\\0&1&\cdots &0\\\vdots &\vdots &\ddots &\vdots \\0&0&\cdots &1\end{bmatrix}}f={\begin{bmatrix}{\frac {\partial ^{2}}{\partial x_{1}^{2}}}&{\frac {\partial ^{2}}{\partial x_{1}\partial x_{2}}}&\cdots &{\frac {\partial ^{2}}{\partial x_{1}\partial x_{n}}}\\{\frac {\partial ^{2}}{\partial x_{2}\partial x_{1}}}&{\frac {\partial ^{2}}{\partial x_{2}^{2}}}&\cdots &{\frac {\partial ^{2}}{\partial x_{2}\partial x_{n}}}\\\vdots &\vdots &\ddots &\vdots \\{\frac {\partial ^{2}}{\partial x_{n}\partial x_{1}}}&{\frac {\partial ^{2}}{\partial x_{n}\partial x_{2}}}&\cdots &{\frac {\partial ^{2}}{\partial x_{n}^{2}}}\end{bmatrix}}f}$
Kaoru Itou (talk) 22:14, 28 January 2009 (UTC)
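The identity above, that the Hessian is the gradient operator applied to each component of the gradient, can be sanity-checked with nested central differences. A sketch under illustrative assumptions (the helper names and the example function are not from the thread):

```python
import numpy as np

def grad(f, x, h=1e-5):
    """Central-difference gradient of a scalar function f at point x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def hessian(f, x, h=1e-4):
    """Differentiate each component of grad f, i.e. the outer product
    nabla (x) nabla f discussed above."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        H[:, j] = (grad(f, x + e) - grad(f, x - e)) / (2 * h)
    return H

f = lambda x: x[0]**2 * x[1] + x[1]**3   # hypothetical example function
H = hessian(f, [1.0, 2.0])               # analytic Hessian: [[4, 2], [2, 12]]
```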

## Examples

It would be good to have at least one example of the use of Hessians in optimization problems, and perhaps a few words on the subject of applications of Hessians to statistical problems, e.g. maximization of parameters. --Smári McCarthy 16:01, 19 May 2006 (UTC)

I agree. BriEnBest (talk) 20:00, 28 October 2010 (UTC)
Me too. --Kvng (talk) 15:01, 3 July 2012 (UTC)

## HUH??

The Hessian displayed is incorrect; it should be 1/2 of the second derivative matrix. Charlielewis 06:34, 11 December 2006 (UTC)

## Bordered Hessian

It is not clear what a bordered Hessian with more than one constraint should look like. If I knew, I would fix it. --Marra 16:07, 19 February 2007 (UTC)

See the added "If there are, say, m constraints ...". Arie ten Cate 15:08, 6 May 2007 (UTC)

The definition of the bordered Hessian is extremely confusing. I suggest using the definition from Luenberger's book "Linear and Nonlinear Programming". I am adding it to my todo list and will correct it soon. --Max Allen G (talk) 19:31, 8 April 2010 (UTC)

In the Bordered Hessian should it not be the Hessian of the Lagrange function instead of (what is currently presented) the Hessian of f? -- Ben —Preceding unsigned comment added by 130.238.11.97 (talk) 11:27, 8 June 2010 (UTC)

I agree. This seems wrong to me. The article cites Fundamental Methods of Mathematical Economics, by Chiang. Chiang has this as the Hessian of the Lagrange function, as you described. I think it should be changed. — Preceding unsigned comment added by Blossomonte (talkcontribs) 15:07, 10 September 2013 (UTC)
Please can someone fix this. As it is, the property of being a maximum or minimum depends only on the gradient of the constraint function, which is clearly not correct. The curvature of the surface defined by the constraint function must also come into it, through its Hessian (which appears through the Hessian of the Lagrangian). Thus, what is written here is obviously wrong. — Preceding unsigned comment added by 150.203.215.137 (talkcontribs) 03:08, 6 May 2014‎
It seems that you have not correctly read the paragraph beginning with "specifically": all the minors that are considered depend not only on the constraint function, but also on the second derivatives of the function f. Also, please sign your comments on the talk page with four tildes (~). D.Lazard (talk) 14:56, 6 May 2014 (UTC)
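For what it's worth, here is a small sketch of the convention the thread attributes to Chiang, with the Hessian of the Lagrangian in the inner block. The example problem (maximize f = xy subject to x + y = 2) and the sign interpretation are illustrative assumptions, since sign conventions vary between sources:

```python
import numpy as np

# Hypothetical example: maximize f(x, y) = x*y subject to g(x, y) = x + y = 2.
# Border: gradient of the constraint. Inner block: Hessian of the Lagrangian
# L = x*y - lam*(x + y - 2), evaluated at the candidate point (1, 1).
g_x, g_y = 1.0, 1.0                 # gradient of the constraint
L_xx, L_xy, L_yy = 0.0, 1.0, 0.0    # Hessian of L (the constraint is linear,
                                    # so lam drops out of these entries)

B = np.array([[0.0, g_x, g_y],
              [g_x, L_xx, L_xy],
              [g_y, L_xy, L_yy]])

det_B = np.linalg.det(B)   # positive here, consistent with a constrained max
```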

## The Theorem is Wrong

I had learned that Fxy=Fyx is Young's theorem, not Schwarz's. —The preceding unsigned comment was added by Lachliggity (talkcontribs) 03:02, 16 March 2007 (UTC).

## What if det H is zero?

It would be nice if someone could include what to do when the determinant of the Hessian matrix is zero. I thought you had to check higher-order derivatives, but I'm not too sure. Aphexer 09:52, 1 June 2007 (UTC)
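The test is indeed inconclusive when det H = 0. A standard pair of examples (an assumption here, not from the article) is x⁴ + y⁴ versus x⁴ - y⁴, which have identical (zero) Hessians at the origin yet a minimum and a saddle respectively:

```python
# Two hypothetical functions with the same (zero) Hessian at the origin:
f_min    = lambda x, y: x**4 + y**4   # origin is a strict minimum
f_saddle = lambda x, y: x**4 - y**4   # origin is a saddle point

# All second partials of both functions vanish at (0, 0), so det H = 0
# and the second-derivative test says nothing; the behavior differs anyway.
nearby = [(0.1, 0.1), (-0.1, 0.1), (0.1, -0.1), (-0.1, -0.1)]
min_values = [f_min(x, y) for x, y in nearby]   # all positive
up   = f_saddle(0.1, 0.0)                       # positive along the x-axis
down = f_saddle(0.0, 0.1)                       # negative along the y-axis
```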

## "Univariate" function?

Please note that "univariate" in the intro refers to a statistical concept, which I believe does not apply here. Even in function (mathematics) there is no mention of "univariate functions", which anyway suggests to me a function of one independent variable, and that is not what we are discussing. I'll be bold and remove it; please fix if you know what was meant. Thanks. 83.67.217.254 05:43, 9 August 2007 (UTC)

"Single-valued" perhaps? But do we really need to specify that? In the intro? I would just leave "function". 83.67.217.254 05:45, 9 August 2007 (UTC)

I think I made that change. My thought was that I wanted to differentiate "single valued" from (keeping in this notation's spirit) "multi-valued", or, to quote from the second sentence and from the fifth section, "real valued" vs. "vector valued". I did not want the first sentence to be ambiguous about the fact that in general the Hessian is a matrix, which then has a tensor extension for vector-valued functions.
The term "univariate" does show my professional bias, and while I still think it's appropriate, "single-valued" is completely acceptable as well. I still have a concern that not qualifying the first sentence at all allows the tensor to be considered a case of a Hessian matrix, when I think that is better thought of as an extension of the concept, since it's not a matrix per se. However, I will not revert it and will chime in on any discussion and clarification here. Baccyak4H (Yak!) 14:15, 9 August 2007 (UTC)

## Vector valued functions

"If ${\displaystyle f}$ is instead vector-valued, ..., then the array of second partial derivatives is not a matrix, but a tensor of rank 3."

I think this is wrong. Wouldn't the natural extension of the Hessian to a vector-valued (i.e. ${\displaystyle \mathbb {R} ^{3}}$-valued) function just be 3 Hessian matrices?

Is this sentence instead trying to generalize Hessian matrices to higher-order partial derivative tensors of single-valued functions?

68.107.83.19 07:17, 2 October 2007 (UTC)

I'm sure someone can point to a reference which will answer your question, but it would seem that, analogous to the Jacobian of a vector-valued function (which is a matrix, not just a set of derivative vectors), a rank-3 tensor makes sense: one could take inner products with such a tensor, say in a higher-order term of a multivariate Taylor series. That operation doesn't make as much sense if all one has is a set of matrices. And it would seem one could always have one index of the tensor run over the components of the function's vector value, with an arbitrary number of additional indices running over the variables of differentiation. My $0.02. Baccyak4H (Yak!) 13:32, 2 October 2007 (UTC)
I can't see what you're getting at.
What I mean is that if ${\displaystyle f=(f_{1},f_{2},...,f_{n})\,\!}$ where f maps to R^n and each f_i maps to R, then isn't
${\displaystyle H(f)=(H(f_{1}),H(f_{2}),...,H(f_{n}))\,\!}$
And the only function returning a tensor that makes sense is the higher-order partial derivatives of a real-valued function g(). E.g. if rank-3 tensor T holds the 3-rd order partial derivatives of g(), then:
${\displaystyle T_{i,j,k}={\frac {\partial ^{3}g}{\partial x_{i}\partial x_{j}\partial x_{k}}}\,\!}$
If you disagree with this, can you explicitly state what entry ${\displaystyle T_{ijk}\,\!}$ should be (in terms of f=(f1,f2,...,fn)) if T is supposed to be the "hessian" of a vector-valued function? 68.107.83.19 22:57, 3 October 2007 (UTC)
I thought ${\displaystyle H(f)=(H(f_{1}),H(f_{2}),...,H(f_{n}))}$ was a tensor, with ${\displaystyle T_{ijk}={\frac {\partial ^{2}f_{k}}{\partial x_{i}\partial x_{j}}}}$ Kaoru Itou (talk) 22:16, 4 February 2009 (UTC)
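Kaoru's ${\displaystyle T_{ijk}={\frac {\partial ^{2}f_{k}}{\partial x_{i}\partial x_{j}}}}$ is exactly a stack of the per-component Hessians, so the two views above agree. A finite-difference sketch with a hypothetical f: R² → R² (helper name and example function are assumptions):

```python
import numpy as np

def component_hessian(fk, x, h=1e-4):
    """Central-difference Hessian of one scalar component fk at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ei[i] = h
            ej = np.zeros(n)
            ej[j] = h
            H[i, j] = (fk(x + ei + ej) - fk(x + ei - ej)
                       - fk(x - ei + ej) + fk(x - ei - ej)) / (4 * h**2)
    return H

# Hypothetical f: R^2 -> R^2 with components x^2*y and x*y^2.
components = [lambda x: x[0]**2 * x[1], lambda x: x[0] * x[1]**2]
x0 = np.array([1.0, 2.0])

# T[i, j, k] = d^2 f_k / (dx_i dx_j): a stack of the per-component Hessians.
T = np.stack([component_hessian(fk, x0) for fk in components], axis=-1)
```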

## Riemannian geometry

Can someone write on this topic from the point of view of Riemannian geometry? (there should be links e.g. to covariant derivative). Commentor (talk) 05:15, 3 March 2008 (UTC)

I think there's something wrong with the indices and the tensor product. We define ${\displaystyle Hess(f):=\nabla \nabla f=\nabla df}$. Then, in local coordinates, ${\displaystyle H_{ij}(f)=Hess(f)dx^{i}\otimes dx^{j}}$ (or, in terms of the nabla, ${\displaystyle H_{ij}(f)=\nabla \nabla f(dx^{i}\otimes dx^{j})=\nabla df(dx^{i}\otimes dx^{j})}$) and thus ${\displaystyle H_{ij}(f)=(\nabla _{i}\partial _{j}f)={\frac {\partial ^{2}f}{\partial x^{i}\partial x^{j}}}-\Gamma _{ij}^{k}{\frac {\partial f}{\partial x^{k}}}}$. So once we have written the Hessian with indices, i.e. ${\displaystyle H_{ij}}$, we do not need to write any more tensor products ${\displaystyle dx^{i}\otimes dx^{j}}$. (Since we need to get a tensor of rank 2!) —Preceding unsigned comment added by 86.32.173.12 (talk) 19:14, 2 April 2011 (UTC)
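A concrete coordinate check of the formula H_ij = ∂_i∂_j f - Γ^k_ij ∂_k f may help here. Polar coordinates on the plane and f = r² are illustrative assumptions: the covariant Hessian of r² comes out as 2·diag(1, r²) = 2g, matching the Cartesian result 2I for x² + y².

```python
# Polar coordinates (r, theta) on R^2; nonzero Christoffel symbols are
# Gamma^r_tt = -r and Gamma^t_rt = Gamma^t_tr = 1/r (hard-coded here).
def hessian_polar_r_squared(r):
    """Covariant Hessian H_ij = d_i d_j f - Gamma^k_ij d_k f for f = r^2."""
    df_dr, df_dt = 2.0 * r, 0.0    # first partials of f = r^2
    # Plain second partials of f in (r, theta):
    d2f = [[2.0, 0.0],
           [0.0, 0.0]]
    # Only Gamma^r_tt = -r contributes: the Gamma^theta terms multiply
    # df/dtheta, which is zero for f = r^2.
    H = [[d2f[0][0], d2f[0][1]],
         [d2f[1][0], d2f[1][1] - (-r) * df_dr]]
    return H

H = hessian_polar_r_squared(3.0)   # [[2, 0], [0, 18]], i.e. 2*diag(1, r^2)
```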

## Local polynomial expansion

As I understand it, the Hessian describes the second-order shape of a smooth function in a given neighborhood. So is this right?:

${\displaystyle y=f(\mathbf {x} +\Delta \mathbf {x} )\approx f(\mathbf {x} )+J(\mathbf {x} )\Delta \mathbf {x} +{\frac {1}{2}}\Delta \mathbf {x} ^{\mathrm {T} }H(\mathbf {x} )\Delta \mathbf {x} }$

(Noting that the Jacobian matrix is equal to the gradient for scalar-valued functions.) That seems like it should be the vector-domained equivalent of

${\displaystyle y=f(x+\Delta x)\approx f(x)+f'(x)\Delta x+{\frac {1}{2}}f''(x)\Delta x^{2}}$

If that's right, I'll add that to the article. —Ben FrantzDale (talk) 04:30, 23 November 2008 (UTC)

That appears to be right, as discussed (crudely) in Taylor_series#Taylor_series_in_several_variables. —Ben FrantzDale (talk) 04:37, 23 November 2008 (UTC)
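One caveat: the second-order term needs the Taylor factor of 1/2, i.e. f(x + Δx) ≈ f(x) + J(x)Δx + (1/2)ΔxᵀH(x)Δx. A quick numerical check on a hypothetical polynomial with hand-computed gradient and Hessian:

```python
import numpy as np

# Hypothetical smooth function with exact derivatives at x = (1, 1).
f = lambda v: v[0]**2 + 3 * v[0] * v[1] + v[1]**3

x = np.array([1.0, 1.0])
grad = np.array([2 * x[0] + 3 * x[1],          # df/dx0
                 3 * x[0] + 3 * x[1]**2])      # df/dx1
H = np.array([[2.0, 3.0],
              [3.0, 6 * x[1]]])                # exact Hessian at x

dx = np.array([1e-2, -1e-2])
second_order = f(x) + grad @ dx + 0.5 * dx @ H @ dx
err = abs(f(x + dx) - second_order)            # O(|dx|^3): tiny here
```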

## Approximation with Jacobian

Some optimization algorithms (e.g., levmar) approximate the Hessian of a cost function (half the sum of squares of a residual) with ${\displaystyle J^{\top }J}$ where J is the Jacobian matrix of r with respect to x:

${\displaystyle J_{ij}={\frac {\partial r_{i}}{\partial x_{j}}}}$.

For reference, here's the derivation. I may add it to this page, since the approximation is important and has practical applications (source).

Here's my understanding:

If we have a sum-of-squares cost function:

${\displaystyle f(x)={\frac {1}{2}}\|r(x)\|^{2}={\frac {1}{2}}\sum _{k}r_{k}^{2}}$

then simple differentiation gives:

${\displaystyle {\frac {\partial f}{\partial x_{i}}}=\sum _{k}{\frac {\partial r_{k}}{\partial x_{i}}}r_{k}=(J^{\top }r)_{i}}$.

Then using the product rule inside the summation:

${\displaystyle {\frac {\partial ^{2}f}{\partial x_{i}\partial x_{j}}}={\frac {\partial }{\partial x_{j}}}\left[\sum _{k}{\frac {\partial r_{k}}{\partial x_{i}}}r_{k}\right]=\sum _{k}\left[{\frac {\partial ^{2}r_{k}}{\partial x_{i}\partial x_{j}}}r_{k}+{\frac {\partial r_{k}}{\partial x_{i}}}{\frac {\partial r_{k}}{\partial x_{j}}}\right]=\sum _{k}r_{k}{\frac {\partial ^{2}r_{k}}{\partial x_{i}\partial x_{j}}}+(J^{\top }J)_{ij}}$
${\displaystyle =(J^{\top }J)_{ij}+{}}$ H.O.T. for small ${\displaystyle r_{k}}$s.

With all that in mind, for small residuals we can approximate the terms of H with

${\displaystyle H_{ij}={\frac {\partial ^{2}f}{\partial x_{i}\partial x_{j}}}\approx \sum _{k}{\frac {\partial r_{k}}{\partial x_{i}}}{\frac {\partial r_{k}}{\partial x_{j}}}=(J^{\top }J)_{ij}}$.

In words this says that for small residuals, the curvature of f (in any combination of directions) is approximated by the sum of squares of the rate of change of the components of the residuals in the same directions. We ignore the curvature of the residual components weighted by the residuals: they are small, so they don't have much room to curve, and they are additionally downweighted by their (small) values. —Ben FrantzDale (talk) 13:24, 10 August 2010 (UTC)
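The derivation can be sanity-checked numerically. The residual function below is a hypothetical example chosen so that the residuals at x = (1, 1) are small:

```python
import numpy as np

# Hypothetical least-squares problem r(x) = (x0^2 - 1.001, x1^3 - 0.999),
# whose residuals are small near x = (1, 1).
x = np.array([1.0, 1.0])
r = np.array([x[0]**2 - 1.001, x[1]**3 - 0.999])
J = np.array([[2 * x[0], 0.0],
              [0.0, 3 * x[1]**2]])   # J[k, i] = dr_k/dx_i

gauss_newton = J.T @ J               # approximate Hessian of f = 0.5*||r||^2

# Full Hessian adds the sum of r_k times the Hessian of each r_k:
full = gauss_newton + (r[0] * np.array([[2.0, 0.0], [0.0, 0.0]])
                       + r[1] * np.array([[0.0, 0.0], [0.0, 6 * x[1]]]))

max_gap = np.abs(full - gauss_newton).max()   # small because residuals are small
```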

## Notation

There is a problem with notation in the lead. H(f) and H(x) are used, but the arguments are very different. In the first case, f is a function, and in the second case, x is a vector. One should try to unify the notation or at least clarify the differences. Renato (talk) 22:04, 12 September 2011 (UTC)

Good point. I added clarification. —Ben FrantzDale (talk) 11:44, 13 September 2011 (UTC)

## Hessian = (transpose of?) Jacobian of the gradient.

Given a smooth scalar function ${\displaystyle f:{\mathbb {R}}^{n}\to {\mathbb {R}}}$, the gradient will be a vector-valued function ${\displaystyle \operatorname {grad} f:{\mathbb {R}}^{n}\to {\mathbb {R}}^{n}}$. We can even write ${\displaystyle g:=\operatorname {grad} f}$, a column vector whose first entry is the derivative of f with respect to ${\displaystyle x_{1}}$, and so forth.

Then, the Jacobian of ${\displaystyle g}$, as stated in Jacobian matrix and determinant, is a matrix where the i-th column has the derivative of all components of ${\displaystyle g}$ with respect to ${\displaystyle x_{i}}$. This is the transpose of the Hessian, or is it not? Wisapi (talk) 17:04, 20 January 2016 (UTC)

The distinction is moot due to the symmetry of the Hessian if Clairaut's theorem holds for all the partial derivatives concerned, as in most elementary cases. If not, then consider the order of partial differentiation implied by the operators. Then one gets somewhat of a contradictory situation. Every source (Wolfram MathWorld among others) calls it simply the Jacobian of the gradient. But the same sources use the notation in the article, and ${\displaystyle {\frac {\partial ^{2}f}{\partial x_{i}\partial x_{j}}}:={\frac {\partial }{\partial x_{i}}}({\frac {\partial f}{\partial x_{j}}})}$ for all i and j. I am in favor of the current statement, however, since the idea is that taking the total derivative twice yields a second derivative, without complications such as this.--Jasper Deng (talk) 17:48, 20 January 2016 (UTC)
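A quick numerical illustration of why the distinction is moot for smooth functions (the example f and helper names are assumptions): the Jacobian of the gradient comes out symmetric, so it equals its own transpose.

```python
import numpy as np

def grad(x, h=1e-5):
    """Central-difference gradient of the example f at x."""
    f = lambda v: v[0]**3 * v[1] + np.sin(v[1])   # hypothetical smooth f
    g = np.zeros(2)
    for i in range(2):
        e = np.zeros(2)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def jacobian_of_grad(x, h=1e-4):
    """Column j holds the derivative of grad f with respect to x_j."""
    Jg = np.zeros((2, 2))
    for j in range(2):
        e = np.zeros(2)
        e[j] = h
        Jg[:, j] = (grad(x + e) - grad(x - e)) / (2 * h)
    return Jg

Jg = jacobian_of_grad(np.array([1.0, 2.0]))
sym_gap = np.abs(Jg - Jg.T).max()   # essentially zero: Jg equals its transpose
```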