= Fisher information metric =

In information geometry, the Fisher information metric is a particular Riemannian metric which can be defined on a smooth statistical manifold, i.e., a smooth manifold whose points are probability distributions. It can be used to calculate the distance between probability distributions.

The metric is interesting in several respects. By Chentsov's theorem, the Fisher information metric on statistical models is the only Riemannian metric (up to rescaling) that is invariant under sufficient statistics.

It can also be understood to be the infinitesimal form of the relative entropy (i.e., the Kullback–Leibler divergence); specifically, it is the Hessian of the divergence. Alternatively, it can be understood as the metric induced by the flat-space Euclidean metric, after appropriate changes of variable. When extended to complex projective Hilbert space, it becomes the Fubini–Study metric; when written in terms of mixed states, it is the quantum Bures metric.

Considered purely as a matrix, it is known as the Fisher information matrix. Considered as a measurement technique, where it is used to estimate hidden parameters in terms of observed random variables, it is known as the observed information.

==Definition==
Given a statistical manifold with coordinates $\theta=(\theta_1, \theta_2, \ldots, \theta_n)$, one writes $p(x \mid \theta)$ for the likelihood, that is, the probability density of $x$ as a function of $\theta$. Here $x$ is drawn from the value space $R$ for a (discrete or continuous) random variable $X$. The likelihood is normalized over $x$ but not over $\theta$:
$\int_R p(x \mid \theta) \,dx = 1.$

The Fisher information metric then takes the form:

 $g_{jk}(\theta)
=
- \int_R
 \frac{\partial^2 \log p(x \mid \theta)}{\partial \theta_j \,\partial \theta_k}
 p(x \mid \theta) \, dx = \mathbb E_{x\sim p(x|\theta)}\left[-\frac{\partial^2 \log p(x \mid \theta)}{\partial \theta_j \,\partial \theta_k}\right]
.$

The integral is performed over all values $x$ in $R$. The variable $\theta$ is now a coordinate on a Riemannian manifold. The labels $j$ and $k$ index the local coordinate axes on the manifold.

When the probability is derived from the Gibbs measure, as it would be for any Markovian process, then $\theta$ can also be understood to be a Lagrange multiplier; Lagrange multipliers are used to enforce constraints, such as holding the expectation value of some quantity constant. If there are n constraints holding n different expectation values constant, then the dimension of the manifold is n less than that of the original space. In this case, the metric can be explicitly derived from the partition function; a derivation and discussion is presented in the article on the partition function.

Equivalent forms are given by

 $g_{jk}(\theta)
=
\int_R\frac{\partial p(x \mid \theta)}{\partial \theta_j}
 \frac{\partial \log p(x \mid \theta)}{\partial \theta_k} dx = \int_R\frac{1}{p(x \mid \theta)}\frac{\partial p(x \mid \theta)}{\partial \theta_j}
 \frac{\partial p(x \mid \theta)}{\partial \theta_k} dx
.$

To show that the equivalent form equals the above definition, note that, because $p\cdot\partial \log p=\partial p$ and the normalization $\int_R p(x \mid \theta)\,dx = 1$ does not depend on $\theta$,

 $\mathbb E_{x\sim p(x|\theta)}
\left[
 \frac{\partial\log{}p(x \mid \theta)}{\partial \theta_j}
\right]=0,$

and apply $\frac{\partial}{\partial\theta_{k}}$ on both sides.
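
Written out, the step reads as follows. This is a sketch that assumes differentiation may be exchanged with the integral, a regularity condition not stated above:

$\begin{align}
0 &= \frac{\partial}{\partial\theta_k}\int_R \frac{\partial \log p(x \mid \theta)}{\partial\theta_j}\, p(x \mid \theta)\,dx \\
&= \int_R \frac{\partial^2 \log p(x \mid \theta)}{\partial\theta_j\,\partial\theta_k}\, p(x \mid \theta)\,dx
+ \int_R \frac{\partial \log p(x \mid \theta)}{\partial\theta_j}\,\frac{\partial p(x \mid \theta)}{\partial\theta_k}\,dx ,
\end{align}$

so that $g_{jk}(\theta) = \int_R \frac{\partial \log p(x \mid \theta)}{\partial\theta_j}\,\frac{\partial p(x \mid \theta)}{\partial\theta_k}\,dx$. Since $\partial \log p = \partial p / p$, this expression is symmetric in $j$ and $k$.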

Multiplying and dividing by $p$ in the integrand, this can also be expressed as
 $\mathbb E_{x\sim p(x|\theta)}\left[\frac{\partial \log p(x\mid\theta)}{\partial\theta_j} \frac{\partial \log p(x\mid\theta)}{\partial\theta_k}\right].$
Substituting $i(x \mid \theta) = -\log{}p(x \mid \theta)$ from information theory, an equivalent form of the above definition is:

 $g_{jk}(\theta)
=
\int_R
 \frac{\partial^2 i(x \mid \theta)}{\partial \theta_j \,\partial \theta_k}
 p(x \mid \theta) \, dx
=
\mathbb E_{x\sim p(x|\theta)}
\left[
 \frac{\partial^2 i(x \mid \theta)}{\partial \theta_j \,\partial \theta_k}
\right]

=
\mathbb E_{x\sim p(x|\theta)}
\left[
 \frac{\partial i(x \mid \theta)}{\partial \theta_j} \frac{\partial i(x \mid \theta)}{\partial \theta_k}
\right].$

This last identity may look like a logical error at first, since it seems as though the second partial derivative has erroneously been factorized as a product of two first derivatives. The identity is nevertheless true, and follows from the above manipulations of the log-likelihood derivatives within the expectation integral. Furthermore, because the expectation of each of the two first-derivative factors is zero, the Fisher information metric tensor is a covariance matrix, namely the covariance matrix of the score functions (the partial derivatives of the log-likelihood with respect to each parameter; the overall sign of $i = -\log p$ cancels in the product).
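
As a numerical sanity check of the identity, the following sketch evaluates both the Hessian form and the score-product form for a three-outcome categorical model, where the integral over $R$ becomes a finite sum. The toy model, the function names and the finite-difference step are illustrative assumptions, not part of the definition.

```python
import math

def probs(theta):
    """Assumed toy model: three outcomes with probabilities (t1, t2, 1 - t1 - t2)."""
    t1, t2 = theta
    return [t1, t2, 1.0 - t1 - t2]

def log_p(i, theta):
    return math.log(probs(theta)[i])

def shift(theta, j, h):
    """Copy of theta with coordinate j moved by h."""
    out = list(theta)
    out[j] += h
    return out

def d1(i, theta, j, h=1e-4):
    """Central difference for d log p_i / d theta_j."""
    return (log_p(i, shift(theta, j, h)) - log_p(i, shift(theta, j, -h))) / (2 * h)

def d2(i, theta, j, k, h=1e-4):
    """Central difference for d^2 log p_i / (d theta_j d theta_k)."""
    return (d1(i, shift(theta, k, h), j, h) - d1(i, shift(theta, k, -h), j, h)) / (2 * h)

def fisher_hessian(theta):
    """g_jk = -sum_i p_i * d^2 log p_i / (d theta_j d theta_k)."""
    p, n = probs(theta), len(theta)
    return [[-sum(p[i] * d2(i, theta, j, k) for i in range(3))
             for k in range(n)] for j in range(n)]

def fisher_score(theta):
    """g_jk = sum_i p_i * (d log p_i / d theta_j) * (d log p_i / d theta_k)."""
    p, n = probs(theta), len(theta)
    return [[sum(p[i] * d1(i, theta, j) * d1(i, theta, k) for i in range(3))
             for k in range(n)] for j in range(n)]

theta = (0.2, 0.3)
g_hess = fisher_hessian(theta)
g_score = fisher_score(theta)
for row_h, row_s in zip(g_hess, g_score):
    print([round(v, 4) for v in row_h], [round(v, 4) for v in row_s])
```

For this particular model the metric has the closed form $g_{jk} = \delta_{jk}/\theta_j + 1/(1-\theta_1-\theta_2)$, so both printed matrices should agree with it up to finite-difference error.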

Finally, and perhaps most geometrically relevant, we can view the Fisher-Rao information metric as four times the inner product between two tangent vectors on the square-root embedding of probability distributions, namely,

$g_{jk}(\theta) = 4\int_R \frac{\partial\sqrt{p(x \mid \theta)}}{\partial\theta_j} \frac{\partial\sqrt{p(x \mid \theta)}}{\partial\theta_k}\,dx.$

== Examples ==
The Fisher information metric is particularly simple for the exponential family, which has

 $p(x \mid \theta) = \exp\!\bigl[\ \eta(\theta) \cdot T(x) - A(\theta) + B(x)\ \bigr].$

The metric is

 $g_{jk}(\theta) = \frac{\partial^2 A(\theta)}{\partial \theta_j \,\partial \theta_k} - \frac{\partial^2 \eta(\theta)}{\partial \theta_j \,\partial \theta_k} \cdot \mathrm{E}[T(x)].$

The metric has a particularly simple form if we are using the natural parameters. In this case, $\eta(\theta) = \theta$, so the second term vanishes and the metric is just $\nabla^2_\theta A$.
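
A minimal sketch of the natural-parameter case, using the Poisson family as an assumed example: with $\theta = \log\lambda$ one has $\eta(\theta)=\theta$, $T(x)=x$ and $A(\theta)=e^\theta$, so the metric should equal $A''(\theta)=e^\theta$. The code compares a finite-difference second derivative of $A$ with the expected squared score; the truncation point `x_max` and the step `h` are illustrative choices.

```python
import math

def log_partition(theta):
    """A(theta) = exp(theta) for the Poisson family with theta = log(rate)."""
    return math.exp(theta)

def metric_from_log_partition(theta, h=1e-4):
    """Second central difference of A, approximating A''(theta)."""
    return (log_partition(theta + h) - 2.0 * log_partition(theta)
            + log_partition(theta - h)) / (h * h)

def metric_from_scores(theta, x_max=200):
    """E[(d log p / d theta)^2], with the sum over x truncated at x_max."""
    rate = math.exp(theta)
    total = 0.0
    for x in range(x_max + 1):
        log_pmf = theta * x - rate - math.lgamma(x + 1)
        score = x - rate  # d/dtheta of (theta * x - exp(theta))
        total += math.exp(log_pmf) * score * score
    return total

theta = math.log(3.0)
print(metric_from_log_partition(theta), metric_from_scores(theta))
```

Both values should be close to the rate $\lambda = 3$, which is also the variance of the sufficient statistic $T(x) = x$.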

=== Normal distribution ===
For the multivariate normal distribution $\mathcal N(\mu, \Sigma)$, the negative log-likelihood is

 $-\ln p(x | \mu, \Sigma) = \frac 12 (x - \mu)^T \Sigma^{-1}(x - \mu) + \frac 12 \ln\det(\Sigma) + C.$

Let $T = \Sigma^{-1}$ be the precision matrix.

The metric is block-diagonal, splitting into a mean part and a precision/variance part, that is, $g_{\mu, \Sigma} = 0$. This is because $\partial_{\Sigma}(\partial_\mu\ln p(x | \mu, \Sigma))$ is the product of $(x -\mu)$ with a term that depends only on $\Sigma$. Since $\mathrm{E}[x - \mu] = 0$, its expectation is zero.

The mean part is the precision matrix: $g_{\mu_i, \mu_j} = T_{ij}$. The precision part is $g_{T,T} = -\frac 12 \nabla_T^2 \ln\det T$.

An approximate expression can be derived for a Gaussian mixture model.

==== Single-variable normal distribution ====
In particular, for the single-variable normal distribution, $g = \begin{bmatrix} t & 0 \\ 0 & (2t^2)^{-1} \end{bmatrix}$ where $t = \sigma^{-2}$. Let $x = \mu/\sqrt 2, y = \sigma$; then

 $ds^2 = t\,d\mu^2 + \frac{1}{2t^2}dt^2 = \sigma^{-2}(d\mu^2 + 2d\sigma^2) = 2\frac{dx^2 + dy^2}{y^2}.$

This is the Poincaré half-plane model.
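
The matrix above can be checked numerically. The sketch below estimates $\mathrm{E}[\partial_j \log p\,\partial_k \log p]$ in the coordinates $(\mu, t)$ with a plain trapezoidal rule; the grid size, the integration window of ten standard deviations and the function names are assumptions made for illustration.

```python
import math

def normal_pdf(x, mu, t):
    """Density of N(mu, 1/t), written in terms of the precision t."""
    return math.sqrt(t / (2.0 * math.pi)) * math.exp(-0.5 * t * (x - mu) ** 2)

def scores(x, mu, t):
    """Partial derivatives of log p with respect to mu and t."""
    return (t * (x - mu), 0.5 / t - 0.5 * (x - mu) ** 2)

def fisher_matrix(mu, t, n=20000, width=10.0):
    """Trapezoidal estimate of E[score_j * score_k] over mu +/- width * sigma."""
    sigma = 1.0 / math.sqrt(t)
    lo, hi = mu - width * sigma, mu + width * sigma
    dx = (hi - lo) / n
    g = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(n + 1):
        x = lo + i * dx
        weight = 0.5 if i in (0, n) else 1.0  # trapezoid end-point weights
        s = scores(x, mu, t)
        w = weight * normal_pdf(x, mu, t) * dx
        for j in range(2):
            for k in range(2):
                g[j][k] += w * s[j] * s[k]
    return g

mu, t = 1.5, 4.0  # sigma = 0.5
g = fisher_matrix(mu, t)
print(g)  # compare with [[t, 0], [0, 1 / (2 t^2)]]
```

The off-diagonal entries vanish because the product of the two scores is odd in $x - \mu$.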

The shortest paths (geodesics) between two univariate normal distributions are either parallel to the $\sigma$ axis, or half circular arcs centered on the $\mu/\sqrt 2$-axis.

The geodesic connecting the point masses $\delta_{\mu_0}$ and $\delta_{\mu_1}$ has the formula

 $\phi \mapsto \mathcal N\left( \frac{\mu_0 + \mu_1}{2} + \frac{\mu_1 - \mu_0}{2} \cos\phi, \sigma^2 \sin^2\phi \right),$

where $\sigma = \frac{\mu_1 - \mu_0}{2\sqrt 2}$, and the arc-length parametrization is $s = \sqrt 2 \ln \tan(\phi/2)$.
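
Because the metric is twice the Poincaré half-plane metric, the geodesic distance between two univariate normals can be sketched with the standard hyperbolic distance formula, scaled by $\sqrt 2$. That closed form is not stated in the text above and is brought in here as an assumption; the code checks it against the arc-length parametrization for two points on the same semicircle. All function names are illustrative.

```python
import math

def fisher_rao_distance(mu1, sigma1, mu2, sigma2):
    """Geodesic distance under ds^2 = (d mu^2 + 2 d sigma^2) / sigma^2."""
    # Half-plane coordinates from the text: x = mu / sqrt(2), y = sigma.
    dx = (mu1 - mu2) / math.sqrt(2.0)
    dy = sigma1 - sigma2
    # Standard half-plane distance, scaled by sqrt(2) because ds^2 carries a factor 2.
    return math.sqrt(2.0) * math.acosh(1.0 + (dx * dx + dy * dy) / (2.0 * sigma1 * sigma2))

def point_on_geodesic(mu0, mu1, phi):
    """(mean, standard deviation) at angle phi on the semicircle through delta_mu0 and delta_mu1."""
    sigma = (mu1 - mu0) / (2.0 * math.sqrt(2.0))
    mean = 0.5 * (mu0 + mu1) + 0.5 * (mu1 - mu0) * math.cos(phi)
    return mean, sigma * math.sin(phi)

def arc_length(phi):
    """Arc-length parameter s = sqrt(2) * ln tan(phi / 2)."""
    return math.sqrt(2.0) * math.log(math.tan(phi / 2.0))

mu0, mu1 = -1.0, 3.0
phi_a, phi_b = math.pi / 3.0, 2.0 * math.pi / 3.0
m_a, s_a = point_on_geodesic(mu0, mu1, phi_a)
m_b, s_b = point_on_geodesic(mu0, mu1, phi_b)
d_closed = fisher_rao_distance(m_a, s_a, m_b, s_b)
d_arc = abs(arc_length(phi_b) - arc_length(phi_a))
print(d_closed, d_arc)
```

For two normals with equal means the same function reduces to $\sqrt 2\,|\ln(\sigma_2/\sigma_1)|$, the length of the geodesic parallel to the $\sigma$ axis.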

==Relation to the Kullback–Leibler divergence==
Alternatively, the metric can be obtained as the second derivative of the relative entropy or Kullback–Leibler divergence. To obtain this, one considers two probability distributions $P(\theta)$ and $P(\theta_0)$, which are infinitesimally close to one another, so that

 $P(\theta) = P(\theta_0) + \sum_j \Delta\theta^j \left.\frac{\partial P}{\partial \theta^j}\right|_{\theta_0}$

with $\Delta\theta^j$ an infinitesimally small change of $\theta$ in the j direction. Then, since the Kullback–Leibler divergence $D_{\mathrm{KL}}[P(\theta_0)\| P(\theta)]$ has an absolute minimum of 0 when $P(\theta) = P(\theta_0)$, one has an expansion up to second order around $\theta = \theta_0$ of the form

 $f_{\theta_0}(\theta) := D_{\mathrm{KL}}[P(\theta_0) \| P(\theta)] = \frac{1}{2} \sum_{jk}\Delta\theta^j\Delta\theta^k g_{jk}(\theta_0) + \mathrm{O}(\Delta\theta^3)$.

The symmetric matrix $g_{jk}$ is positive (semi) definite and is the Hessian matrix of the function $f_{\theta_0}(\theta)$ at the extremum point $\theta_0$. Intuitively, this states that the distance between two infinitesimally close points on a statistical differential manifold is the informational difference between them.
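
The second-order expansion can be illustrated on a one-parameter Bernoulli model, chosen here only as an assumed example: the ratio $2 D_{\mathrm{KL}} / \Delta\theta^2$ should approach the Fisher information $1/(\theta_0(1-\theta_0))$ as the step shrinks.

```python
import math

def kl_bernoulli(a, b):
    """D_KL[Bernoulli(a) || Bernoulli(b)] in nats."""
    return a * math.log(a / b) + (1.0 - a) * math.log((1.0 - a) / (1.0 - b))

def fisher_bernoulli(theta):
    """Fisher information of a Bernoulli(theta) distribution, 1 / (theta (1 - theta))."""
    return 1.0 / (theta * (1.0 - theta))

theta0 = 0.3
for delta in (1e-2, 1e-3, 1e-4):
    ratio = 2.0 * kl_bernoulli(theta0, theta0 + delta) / delta ** 2
    print(delta, ratio, fisher_bernoulli(theta0))
```

The remaining gap between the two printed values shrinks roughly in proportion to the step, reflecting the $\mathrm{O}(\Delta\theta^3)$ term in the expansion.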

== Relation to Ruppeiner geometry==
The Ruppeiner metric and Weinhold metric are the Fisher information metric calculated for Gibbs distributions, such as those found in equilibrium statistical mechanics.

== Change in free entropy==
The action of a curve on a Riemannian manifold is given by

$A=\frac{1}{2}\int_a^b
\frac{\partial\theta^j}{\partial t}
g_{jk}(\theta)\frac{\partial\theta^k}{\partial t} dt$

The path parameter here is time t; this action can be understood to give the change in free entropy of a system as it is moved from time a to time b. Specifically, one has

 $\Delta S = (b-a) A \,$

as the change in free entropy. This observation has resulted in practical applications in the chemical and processing industries: in order to minimize the change in free entropy of a system, one should follow the minimum geodesic path between the desired endpoints of the process. The geodesic minimizes the entropy, due to the Cauchy–Schwarz inequality, which states that the action is bounded below by the length of the curve, squared.
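
With the factor $\tfrac12$ in the definition of $A$, the Cauchy–Schwarz inequality reads $2(b-a)A \ge L^2$, where $L$ is the curve length, with equality for a constant-speed parametrization. The sketch below illustrates this on an assumed one-parameter Bernoulli model. On a one-dimensional manifold every monotone path between the same endpoints has the same length, so the example isolates the effect of the speed profile. The path functions and quadrature settings are illustrative.

```python
import math

def fisher(theta):
    """Fisher information of a Bernoulli(theta) model (assumed one-parameter example)."""
    return 1.0 / (theta * (1.0 - theta))

def action_and_length(path, a=0.0, b=1.0, n=4000, h=1e-6):
    """Midpoint-rule estimates of A = 1/2 int g theta'^2 dt and L = int sqrt(g) |theta'| dt."""
    dt = (b - a) / n
    action = length = 0.0
    for i in range(n):
        t = a + (i + 0.5) * dt
        speed = (path(t + h) - path(t - h)) / (2.0 * h)  # d theta / dt
        g = fisher(path(t))
        action += 0.5 * g * speed * speed * dt
        length += math.sqrt(g) * abs(speed) * dt
    return action, length

start, end = 0.2, 0.8

def straight(t):
    """Path that is linear in theta."""
    return start + (end - start) * t

phi0, phi1 = math.asin(math.sqrt(start)), math.asin(math.sqrt(end))

def geodesic(t):
    """Constant-speed path: theta = sin^2(phi) with phi linear in t."""
    return math.sin(phi0 + (phi1 - phi0) * t) ** 2

for name, path in (("straight", straight), ("geodesic", geodesic)):
    action, length = action_and_length(path)
    print(name, 2.0 * action, length ** 2)  # 2 (b - a) A versus L^2, with b - a = 1
```

The constant-speed path should attain the bound, while the path that is linear in $\theta$ gives a strictly larger action for the same length.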

==Relation to the Jensen–Shannon divergence==
The Fisher metric also allows the action and the curve length to be related to the Jensen–Shannon divergence. Specifically, one has

 $(b-a)\int_a^b \frac{\partial\theta^j}{\partial t} g_{jk}\frac{\partial\theta^k}{\partial t} \,dt =
8\int_a^b dJSD$

where the integrand dJSD is understood to be the infinitesimal change in the Jensen–Shannon divergence along the path taken. Similarly, for the curve length, one has

 $\int_a^b \sqrt{\frac{\partial\theta^j}{\partial t} g_{jk}\frac{\partial\theta^k}{\partial t}} \,dt = \sqrt{8}\int_a^b \sqrt{dJSD}$

That is, the square root of the Jensen–Shannon divergence is just the Fisher metric (divided by the square root of 8).
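
The factor of 8 can be checked for a single small step. The sketch below, again using a Bernoulli model as an assumed example, compares $g\,\Delta\theta^2$ with eight times the Jensen–Shannon divergence between the two neighbouring distributions.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between two finite distributions, in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL divergence to the midpoint distribution."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def bernoulli(theta):
    return [theta, 1.0 - theta]

theta0, delta = 0.3, 1e-4
g = 1.0 / (theta0 * (1.0 - theta0))  # Fisher information of Bernoulli(theta0)
lhs = g * delta ** 2                  # g_jk dtheta^j dtheta^k for one small step
rhs = 8.0 * jsd(bernoulli(theta0), bernoulli(theta0 + delta))
print(lhs, rhs)
```

The two numbers agree to leading order in the step; the residual difference is of higher order in $\Delta\theta$.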

==As Euclidean metric==
For a discrete probability space, that is, a probability space on a finite set of objects, the Fisher metric can be understood to simply be the Euclidean metric restricted to a positive orthant (e.g. "quadrant" in $\mathbb R^2$) of a unit sphere, after appropriate changes of variable.

Consider a flat, Euclidean space, of dimension N+1, parametrized by points $y=(y_0,\cdots,y_N)$. The metric for Euclidean space is given by

 $h=\sum_{i=0}^N dy_i \; dy_i$

where the $\textstyle dy_i$ are 1-forms; they are the basis vectors for the cotangent space. Writing $\textstyle \frac{\partial}{\partial y_j}$ as the basis vectors for the tangent space, so that

 $dy_j\left(\frac{\partial}{\partial y_k}\right) = \delta_{jk}$,

the Euclidean metric may be written as

 $h^\mathrm{flat}_{jk} = h\left(\frac{\partial}{\partial y_j}, \frac{\partial}{\partial y_k}\right) = \delta_{jk}$

The superscript 'flat' is there to remind that, when written in coordinate form, this metric is with respect to the flat-space coordinate $y$.

An N-dimensional unit sphere embedded in (N + 1)-dimensional Euclidean space may be defined as

 $\sum_{i=0}^N y_i^2 = 1$

This embedding induces a metric on the sphere, inherited directly from the Euclidean metric on the ambient space. It takes exactly the same form as the above, taking care to ensure that the coordinates are constrained to lie on the surface of the sphere. This can be done, e.g., with the technique of Lagrange multipliers.

Consider now the change of variable $p_i=y_i^2$. The sphere condition now becomes the probability normalization condition

 $\sum_i p_i = 1$

while the metric becomes

$\begin{align} h &=\sum_i dy_i \; dy_i
= \sum_i d\sqrt{p_i} \; d\sqrt{p_i} \\
&= \frac{1}{4}\sum_i \frac{dp_i \; dp_i}{p_i}
= \frac{1}{4}\sum_i p_i\; d(\log p_i) \; d(\log p_i)
\end{align}$

The last can be recognized as one-fourth of the Fisher information metric. To complete the process, recall that the probabilities are parametric functions of the manifold variables $\theta$, that is, one has $p_i = p_i(\theta)$. Thus, the above induces a metric on the parameter manifold:

$\begin{align} h
& = \frac{1}{4}\sum_i p_i(\theta) \; d(\log p_i(\theta))\; d(\log p_i(\theta)) \\
&= \frac{1}{4}\sum_{jk} \sum_i p_i(\theta) \;
\frac{\partial \log p_i(\theta)} {\partial \theta_j}
\frac{\partial \log p_i(\theta)} {\partial \theta_k}
d\theta_j d\theta_k
\end{align}$

or, in coordinate form, the Fisher information metric is:

$\begin{align}
g_{jk}(\theta)
 = 4h_{jk}^\mathrm{fisher}
&= 4 h\left(\frac{\partial}{\partial \theta_j},
\frac{\partial}{\partial \theta_k}\right) \\
& = \sum_i p_i(\theta) \;
\frac{\partial \log p_i(\theta)} {\partial \theta_j} \;
\frac{\partial \log p_i(\theta)} {\partial \theta_k} \\
& = \mathrm{E}\left[
\frac{\partial \log p_i(\theta)} {\partial \theta_j} \;
\frac{\partial \log p_i(\theta)} {\partial \theta_k}
\right]
\end{align}$

where, as before,

 $d\theta_j\left(\frac{\partial}{\partial \theta_k}\right) = \delta_{jk}.$

The superscript 'fisher' is present to remind that this expression is applicable for the coordinates $\theta$; whereas the non-coordinate form is the same as the Euclidean (flat-space) metric. That is, the Fisher information metric on a statistical manifold is simply (four times) the Euclidean metric restricted to the positive orthant of the sphere, after appropriate changes of variable.
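
The construction can be followed numerically for a small discrete model. The sketch below maps an assumed three-outcome model to the sphere via $y_i = \sqrt{p_i}$, estimates the tangent vectors $\partial y/\partial\theta_j$ by central differences, and compares four times their Euclidean inner products with the closed-form Fisher metric of that model. The model and all names are illustrative.

```python
import math

def probs(theta):
    """Assumed toy model: three outcomes with probabilities (t1, t2, 1 - t1 - t2)."""
    t1, t2 = theta
    return [t1, t2, 1.0 - t1 - t2]

def sphere_point(theta):
    """Coordinates y_i = sqrt(p_i) on the positive orthant of the unit sphere."""
    return [math.sqrt(p) for p in probs(theta)]

def tangent(theta, j, h=1e-6):
    """Central-difference estimate of dy / d theta_j."""
    up, down = list(theta), list(theta)
    up[j] += h
    down[j] -= h
    return [(a - b) / (2.0 * h) for a, b in zip(sphere_point(up), sphere_point(down))]

def induced_metric(theta):
    """Four times the Euclidean inner product of the tangent vectors dy / d theta_j."""
    n = len(theta)
    t = [tangent(theta, j) for j in range(n)]
    return [[4.0 * sum(a * b for a, b in zip(t[j], t[k])) for k in range(n)]
            for j in range(n)]

def fisher_exact(theta):
    """Closed form for this model: g_jk = delta_jk / theta_j + 1 / (1 - t1 - t2)."""
    p3 = 1.0 - sum(theta)
    return [[(1.0 / theta[j] if j == k else 0.0) + 1.0 / p3 for k in range(2)]
            for j in range(2)]

theta = (0.2, 0.3)
print(induced_metric(theta))
print(fisher_exact(theta))
```

The two printed matrices should agree up to finite-difference error, which is the coordinate statement of the factor of four above.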

When the random variable is not discrete, but continuous, the argument still holds. This can be seen in one of two different ways. One way is to carefully recast all of the above steps in an infinite-dimensional space, defining limits appropriately in order to make sure that all manipulations are well-defined and convergent. The other way, as noted by Gromov, is to use a category-theoretic approach; that is, to note that the above manipulations remain valid in the category of probabilities. Here such a category would have the Radon–Nikodym property, that is, the Radon–Nikodym theorem holds in this category. This includes the Hilbert spaces; these are square-integrable, and in the manipulations above, this is sufficient to safely replace the sum over squares by an integral over squares.

==As Fubini–Study metric==
The above manipulations deriving the Fisher metric from the Euclidean metric can be extended to complex projective Hilbert spaces. In this case, one obtains the Fubini–Study metric. This should perhaps be no surprise, as the Fubini–Study metric provides the means of measuring information in quantum mechanics. The Bures metric, also known as the Helstrom metric, is identical to the Fubini–Study metric, although the latter is usually written in terms of pure states, as below, whereas the Bures metric is written for mixed states. By setting the phase of the complex coordinate to zero, one obtains exactly one-fourth of the Fisher information metric, exactly as above.

One begins with the same trick, of constructing a probability amplitude, written in polar coordinates, so:

 $\psi(x;\theta) = \sqrt{p(x; \theta)} \; e^{i\alpha(x;\theta)}$

Here, $\psi(x;\theta)$ is a complex-valued probability amplitude; $p(x; \theta)$ and $\alpha(x;\theta)$ are strictly real. The previous calculations are obtained by
setting $\alpha(x;\theta)=0$. The usual condition that probabilities lie within a simplex, namely that

 $\int_X p(x;\theta) \, dx =1$

is equivalently expressed by the requirement that the squared amplitude be normalized:

 $\int_X \vert \psi(x;\theta)\vert^2 \, dx = 1$

When $\psi(x;\theta)$ is real, this is the surface of a sphere.

The Fubini–Study metric, written in infinitesimal form, using quantum-mechanical bra–ket notation, is

 <math>ds^2 = \frac{\langle \delta \psi \mid \delta \psi \rangle}
{\langle \psi \mid \psi \rangle} -
\frac {\langle \delta \psi \mid \psi \rangle \;
\langle \psi \mid \delta \psi \rangle}
