# Talk:Fisher information

WikiProject Statistics (Rated Start-class, Mid-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

Start  This article has been rated as Start-Class on the quality scale.
Mid  This article has been rated as Mid-importance on the importance scale.

## Untitled

The first line in 'Example' is missing a closing parenthesis ")". Thank you for a nice article!

There seem to be some superfluous brackets in the expectation notation: Neither $\mathbb E X ^2 \,$ nor $\left[\mathbb E X\right]^2\,$ is ambiguous, and in fact their difference should be the variance of X if I remember right. $\mathbb E X ^2 \,$ does not need to be disambiguated according to standard order of operations. -- 130.94.162.61 04:45, 10 February 2006 (UTC)
Also there are some philosophical issues not discussed in the article. Without some prior probability distribution on $\theta$, how can we ever hope to extract information about it? For example, take a person's height. We usually start with some a priori idea or expectation of what a person's height ought to be before taking any measurements at all. If we measure a person's height to be 13 feet, we would normally assume the measurement was wrong and probably discard it (as a so-called "outlier"). But if more and more measurements gave a result in the vicinity of 13 feet, it might dawn on us that we are measuring a giant. On the other hand, a single measurement of 5 feet 5 inches would probably convince us of someone else's height to a reasonable degree of accuracy. Fisher information doesn't say anything about a priori probability distributions on θ. A maximum likelihood estimator which assumes a "uniform" distribution over all the reals (w.r.t. the Lebesgue measure) is an absurdity. I'm not sure I'm making any sense (and feel free to delete this comment if I'm not), but I don't believe any information can be extracted about an unknown parameter without having beforehand some rough estimate of the a priori probability distribution of that parameter. -- 130.94.162.61 13:54, 10 February 2006 (UTC)

The above comment is specious. The writer brings up a point that Fisher information does not speak to. Fisher information assumes that one is estimating a parameter and that there is no a priori distribution of that parameter. This is one of the weaknesses of Fisher information. However, it is not relevant to an article about Fisher information except in the context of "Other formulations." There is, however, an important error in this article. The second derivative version of the definition of Fisher information is only valid if the proper regularity condition is met. I added the condition, though this may not be the best representation of it. The formula looks rather ugly to me, but I don't have time to make it pretty. Sorry! --67.85.203.239 22:15, 12 February 2006 (UTC)

My comment above was somewhat specious, but when I carry out the differentiation of the second derivative version of the Fisher information, I get a term
$\mathbb E \left[ \frac { \frac {\partial^2} {\partial\theta^2} f(X|\theta) } {f(X|\theta)} \right] \mathrm{\ or\ } \int_X \frac {\partial^2} {\partial\theta^2} f(x|\theta)\,dx$
that must be equal to zero. Is this valid for a regularity condition or at all what is wanted here? The regularity condition that was added to the article doesn't make much sense to me, since it contains a capital X and no expectation taken over it. Please excuse my ignorance. As to my comment above, I still think something belongs in the article (in the way of introduction) to tell someone like me what Fisher information is used for as well as when or why it should or shouldn't be used. As the article stands, it's just a bunch of mathematical formulae without much context or discussion. -- 130.94.162.61 22:06, 8 March 2006 (UTC)
There should be a little more discussion of the Cramér-Rao inequality, too. -- 130.94.162.61 22:31, 8 March 2006 (UTC)

But isn't it generally going to be the case (assuming the 2nd derivative exists) that
$\int \frac{\partial^2}{\partial \theta^2}f(X ; \theta ) \, dx = \frac{\partial^2}{\partial \theta^2} \int f(X ; \theta ) \, dx = \frac{\partial^2}{\partial \theta^2} 1 = 0$
71.221.255.155 07:35, 8 December 2006 (UTC)
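The interchange-of-limits argument above is easy to check numerically. A small sketch (Python with NumPy; the N(θ, 1) model, grid, and step size are illustrative choices, not from the thread): differentiate the density twice in θ by a central difference, then integrate over x, and the result is zero up to numerical error.

```python
import numpy as np

# For f(x; theta) = N(theta, 1), the integral over x of the second
# theta-derivative of the density should be ~0, because the density
# integrates to 1 for every theta.
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
theta, h = 0.5, 1e-3

def density(t):
    # N(t, 1) density, written out so the example needs only NumPy
    return np.exp(-0.5 * (x - t) ** 2) / np.sqrt(2.0 * np.pi)

# central second difference approximates d^2 f / d theta^2
d2f = (density(theta + h) - 2.0 * density(theta) + density(theta - h)) / h**2
integral = (d2f * dx).sum()
print(abs(integral))  # ~0, up to finite-difference noise
```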

## Some things unclear(/wrong?)

In the expression

$\int \frac{\partial^2}{\partial \theta^2}f(X ; \theta ) \, dx = 0,$

might it be $f(x ; \theta )$?

Also, it is unclear whether the $\theta$'s must cover the whole parameter space, or could cover some subspace. In discussing the N-variate Gaussian, it is said that the information matrix has indices running from 1 to $N$, but there are $(N+1)(N+2)/2$ parameters to describe a Gaussian. This is probably a mistake. PhysPhD

## Say more about Roy Frieden's work

I should admit that I have studied mathematical statistics. Even so, by Wiki standards, this entry is not unduly technical. I've added some links (and am sure more could be added) that should help the novice reader along. The first person to contribute to this talk page is an unwitting Bayesian, when (s)he calls for a "prior distribution" on θ. Information measures and entropy are bridges connecting classical and Bayesian statistics. This entry should sketch bits of those bridges, if only by including a few links. This entry should say more comparing and contrasting Fisher information with the measures of Shannon, Kullback-Leibler, and possibly others.

Wiki should also say more, somewhere, about the extraordinary work of Roy Frieden. Frieden, a respectable physicist, has written a nearly 500-page book arguing that a great deal of theoretical physics can be grounded in Fisher information and the calculus of variations. This should not come as a complete surprise to anyone who has mastered Hamiltonian mechanics and has thought about the principle of least action, but even so, Frieden's book is a breathtaking high-wire act. It appears that classical mechanics, electromagnetism, thermodynamics, general relativity, and quantum electrodynamics are all merely different applications of a few core information-theoretic and variational principles. Frieden (2004) also includes a chapter on what he thinks his EPI approach could contribute to unsolved problems, such as quantum gravitation, turbulence, and topics in particle physics. Could EPI even prove to be the eventual gateway to that Holy Grail of contemporary science, the unification of the three fundamental forces, electroweak, strong, and gravitation? I should grant that EPI doesn't answer everything; for example, it sheds no light on why the fundamental dimensionless constants take on the values that they do. Curiously, Frieden says little about optics even though that was his professional specialty. 202.36.179.65 13:19, 11 April 2006 (UTC)

The physical and mathematical correctness of Frieden's ideas has been characterized as highly dubious by several knowledgeable observers; see, for example, Ralph F. Streater's Lost Causes in Theoretical Physics: Physics from Fisher Information, and Cosma Shalizi's review of Physics from Fisher Information. QuispQuake 14:55, 12 July 2006 (UTC)
Hey, 202.36.179.65, don't be coy. You must be the man himself! 81.178.157.195 (talk) 11:40, 31 January 2012 (UTC)

## B. Roy Frieden's anonymous POV-pushing edits

B. Roy Frieden claims to have developed a "universal method" in physics, based upon Fisher information. He has written a book about this. Unfortunately, while Frieden's ideas initially appear interesting, his claimed method has been characterized as highly dubious by knowledgeable observers (Google for a long discussion in sci.physics.research from some years ago.)

Note that Frieden is Prof. Em. of Optical Sciences at the University of Arizona. The data.optics.arizona.edu anon has used the following IPs to make a number of questionable edits:

1. 150.135.248.180 (talk · contribs)
   1. 20 May 2005: confesses to being Roy Frieden in real life
   2. 6 June 2006: adds cites of his papers to Extreme physical information
   3. 23 May 2006: adds uncritical description of his own work in Lagrangian and uncritically cites his own controversial book
   4. 22 October 2004: attributes the uncertainty principle to the Cramér–Rao inequality in Uncertainty Principle, which is potentially misleading
   5. 21 October 2004: adds uncritical mention of his controversial claim that the Maxwell–Boltzmann distribution can be obtained via his "method"
   6. 21 October 2004: adds uncritical mention of his controversial claim that the Klein–Gordon equation can be "derived" via his "method"
2. 150.135.248.126 (talk · contribs)
   1. 9 September 2004: adds uncritical description of his work to Fisher information
   2. 8 September 2004: adds uncritical description of his highly dubious claim that EPI is a general approach to physics to Physical information
   3. 16 August 2004: confesses IRL identity
   4. 13 August 2004: creates uncritical account of his work in new article, Extreme physical information

These POV-pushing edits should be modified to more accurately describe the status of Frieden's work.---CH 21:54, 16 June 2006 (UTC)

Hear, hear! I totally agree with the first few sentences of this talk section, and perhaps it should appear in the article as a health warning. 81.178.157.195 (talk) 11:39, 31 January 2012 (UTC)

## Graphs to improve technical accessibility

In addressing the technical accessibility tag above, I would recommend the addition of some graphs. For example, this concept could be related to the widely understood concept of the Gaussian bell curve. -- Beland 21:35, 4 November 2006 (UTC)

## Minus sign missing?

In the one-dimensional equation, there is a minus sign in the equation linking the second derivative of the log-likelihood to the variance of theta. This stands to reason, as we want maximum, not minimum, likelihood, so the second derivative becomes negative. In the matrix formulation below, there is no minus sign. Should it not be there, too? In practice, of course, one often minimizes sums of squares, or other "loss" functions, instead. This already is akin to -log(L). I am not a professional statistician, but I use statistics a lot in my profession, microbiology. I did not find the article too technical. After all, the subject itself is somewhat technical. Wikipedia does a great job of making gems such as this accessible. 82.73.149.14 19:51, 30 December 2006 (UTC) Bart Meijer
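The minus sign can be seen concretely in the simplest case. For X ~ N(θ, 1) the score is x − θ, the second derivative of the log-likelihood is −1 everywhere, and the Fisher information is 1 either way. A sketch (Python with NumPy; the sample size and seed are arbitrary choices):

```python
import numpy as np

# For X ~ N(theta, 1) the Fisher information about theta is 1. It can be
# computed as the variance of the score, or as MINUS the expected second
# derivative of the log-likelihood -- the minus sign is essential in both
# the scalar and the matrix form.
rng = np.random.default_rng(0)
theta = 2.0
x = rng.normal(theta, 1.0, size=200_000)

score = x - theta                  # d/dtheta log f(x|theta) for N(theta, 1)
info_from_score = np.var(score)    # ~E[score^2], since E[score] = 0

d2_loglik = -1.0                   # d^2/dtheta^2 log f(x|theta), constant here
info_from_hessian = -d2_loglik     # note the minus sign

print(info_from_score, info_from_hessian)  # both ~1.0
```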

## Style

I think that the style in which parts of this article are written is more appropriate for a textbook than for an encyclopedia article. For example: "To informally derive the Fisher Information, we follow the approach described by Van Trees (1968) and Frieden (2004)" This type of comment is only really appropriate in a textbook where a single author or a few authors are writing a book with a coherent theme. An encyclopedia article ought to adopt a different style: in particular, I object to the use of the term "we", as on wikipedia, with so many authors and with anonymous authors, it is not clear who the word "we" refers to. Instead, I think we should word things "Van Trees (1968) and Frieden (2004) provide the following method of deriving the Fisher information informally:". I am going to rewrite this to try to eliminate these sorts of comments. But...I think this style problem goes beyond just the use of the word "we"...it's pretty pervasive and it needs deep changes. Cazort (talk) 18:14, 10 January 2008 (UTC)

## Informal Derivation & Definition

This derivation doesn't seem to be a derivation of the Fisher information, but rather, a derivation of the relationship between Fisher information and the bound on the variance of an estimator. Does everyone agree with me that this should be renamed? Also, this remark relates to the definition of Fisher information. For example, the comment "The Fisher information is the amount of information" is loaded, because it is not defined what information means. I am going to weaken this statement accordingly. If we can come up with a more rigorous and more precise definition then we should include it! Cazort (talk) 18:22, 10 January 2008 (UTC)

## How about putting in 'Mutual Information' and 'Joint Information' discussion

I've heard mention of "mutual information" and "joint information" (bivariate discrete random variables); shouldn't these terms be discussed? 199.196.144.13 (talk) 21:08, 29 May 2008 (UTC)

## Merge “Observed information”

I suggest that the article Observed information be merged with the current one, since it repeats the definition of the Fisher information, only substituting the expected value with respect to the sample probability distribution for the expected value with respect to the population. As such, the observed information is simply the sample Fisher information.  … stpasha »  07:20, 24 January 2010 (UTC)

Well, the two are different things, and there is even an article contrasting the two in the refs for observed information, so I'm not sure why you think they need to be merged. Is it because there is not much detail in observed information? --Zvika (talk) 08:20, 24 January 2010 (UTC)
The observed information is given by the formula
$I_\hat\theta = -\frac{\partial^2}{\partial\theta\partial\theta'}\ell(\hat\theta) = -\frac{\partial^2}{\partial\theta\partial\theta'} \frac1n \sum_{i=1}^n \ln f(x_i|\hat\theta) = \widehat{\operatorname{E}}\bigg[{-\frac{\partial^2\ln f(x_i|\hat\theta)}{\partial\theta\partial\theta'}}\bigg]$
The “expected Fisher information” is given by a similar formula, only with the population expectation:
$\mathcal{I}_\hat\theta = \operatorname{E}\bigg[{-\frac{\partial^2\ln f(x_i|\hat\theta)}{\partial\theta\partial\theta'}}\bigg]$
Both of these are valid estimators for the Fisher information quantity, which is
$\mathcal{I} = \operatorname{E}\bigg[{-\frac{\partial^2\ln f(x_i|\theta_0)}{\partial\theta\partial\theta'}}\bigg]$
The article you are referring to compares properties of these two estimators and finds that the first one gives more accurate confidence intervals than the second one (although of course asymptotically they are equivalent). Anyways, the concept of “observed information” is just an estimator of the Fisher information of the model, and thus should be merged with this article, in my opinion.  … stpasha »  08:45, 24 January 2010 (UTC)
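The estimator/estimand distinction above can be made concrete with a model where the two genuinely differ, e.g. the Cauchy location family (my choice of example, not from the thread; for simplicity the observed information is evaluated at the true θ rather than at the MLE). The expected Fisher information is exactly 1/2, while the observed information is a random sample average:

```python
import numpy as np

# Cauchy location model, scale 1: f(x|theta) = 1 / (pi * (1 + (x - theta)^2)).
# The expected Fisher information is exactly 1/2; the observed information
# is the sample average of -d^2/dtheta^2 log f(x_i|theta), a random quantity
# that converges to 1/2. Seed and sample size are arbitrary.
rng = np.random.default_rng(1)
theta0 = 0.0
u = rng.standard_cauchy(100_000) - theta0     # x_i - theta0

# -d^2/dtheta^2 log f(x|theta) = (2 - 2u^2) / (1 + u^2)^2 with u = x - theta
neg_hessian = (2.0 - 2.0 * u**2) / (1.0 + u**2) ** 2
observed_info = neg_hessian.mean()
expected_info = 0.5

print(observed_info, expected_info)  # observed is ~0.5 but varies by sample
```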
I vote to keep the two articles separate, for clarity purposes and so that it's more likely to find both phrases when searching on Google. 70.22.219.191 (talk) 22:34, 31 January 2010 (UTC)
Keep separate. To argue that " “observed information” is just an estimator of the Fisher information " ignores the fact that it is better to use the “observed information” in computations and statistical inference, as indicated in the reference in observed information. Melcombe (talk) 14:54, 14 September 2010 (UTC)

Merge tag removed, as no support or action for 2 years. Melcombe (talk) 00:22, 8 February 2012 (UTC)

## Fisher Information and Its Relation to Entropy

Thanks for correcting my edits to the Fisher information page, and sorry for saying something that wasn't quite correct (and also for getting the sign wrong!). The claim that the Fisher information is the Hessian of the entropy was in the article before I edited it, so it's good that it's gone now.

Correct me if I'm wrong, but it seems the Fisher information is always equal to the negative Hessian of the entropy for discrete probability distributions. I'd worked it out for discrete distributions and naively assumed it was true in general, but this looks like one of the many quirks of the definition of the continuous entropy as

$H=-\int p(x) \ln p(x) \, dx$.

(OT rant: IMO the continuous entropy should never have been defined that way, since it's not equal to the continuous limit of the discrete entropy, which actually diverges to infinity, and lacks many of the desirable properties of the discrete version. If you put in a scaling factor to prevent divergence, and are careful to make it invariant to coordinate changes, you always end up with a relative entropy instead of H as defined above.)

Anyway, if it is true that the Fisher information is equal to the negative Hessian of the entropy for discrete distributions I'd like to put the $-\partial^2 H/\partial\theta_i \partial\theta_j$ formula at some early point in the article (along with a caveat about continuous distributions), since it would help someone with my background get a handle on the Fisher information a bit more easily.

Nathaniel Virgo (talk) 14:19, 7 October 2010 (UTC)

If I take the difference of the two expressions, I find that they are equal if and only if
$\int \frac{\partial^2 f(x\mid\theta)}{\partial \theta_i \partial \theta_j} \ln f(x\mid\theta) \,dx = 0\,,$
or the discrete equivalent.
So, for instance, if I parameterize my distribution such that the probability (or probability density) of an outcome is some linear combination of the parameters (i.e., so that $\frac{\partial^2 f(x\mid\theta)}{\partial \theta_i \partial \theta_j} = 0$ for all i, j, and x) then I have equivalence. In particular, if θ and (1 − θ) are used to weight the linear mixture of two distributions, discrete or continuous, then I have equivalence. However, when things are non-linear, I may find myself in deep trouble. Quantling (talk) 19:47, 7 October 2010 (UTC)
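The linear-mixture case above is easy to verify numerically. A sketch (Python with NumPy; the particular p, q, and θ are arbitrary): for f(·|θ) = θp + (1 − θ)q the second θ-derivative of f vanishes, and the Fisher information matches minus the second derivative of the entropy, computed here by a central difference:

```python
import numpy as np

# Linear (mixture) parameterization of a discrete distribution:
# f(x|t) = t*p(x) + (1-t)*q(x), so d^2 f / dt^2 = 0 and the Fisher
# information should equal minus the second derivative of the entropy.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.2, 0.3, 0.5])
t = 0.4

def entropy(t):
    f = t * p + (1.0 - t) * q
    return -np.sum(f * np.log(f))

f = t * p + (1.0 - t) * q
score = (p - q) / f                    # d/dt log f(x|t)
fisher_info = np.sum(f * score**2)     # E[score^2]

h = 1e-4                               # central second difference of H(t)
neg_hess_entropy = -(entropy(t + h) - 2.0 * entropy(t) + entropy(t - h)) / h**2

print(fisher_info, neg_hess_entropy)   # agree up to finite-difference error
```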

## f(x;θ): what is the proper definition?

Hi All,

Firstly, does the ";" symbol mean the same as "|" (given)? And secondly, I'm assuming f(x|θ) is a pdf for a continuous variable?

Thanks, Sachin Sachinabey (talk) 08:12, 9 May 2011 (UTC)

I believe yes, and yes continuous for θ and x, though they could both be vectors. I don't know why they used ; instead of | in the article. Dmcq (talk) 09:44, 9 May 2011 (UTC)
The semicolon may be subtly different from the pipe symbol. The latter is used to indicate a conditional probability,
$\Pr(x|\theta) = \frac{\Pr(x,\theta)}{\Pr(\theta)}\,$
and similarly for probability densities. Thus, the pipe symbol is most meaningful when it is reasonable to talk about $\Pr(\theta)$, a probability of $\theta$. On the other hand $\Pr(x; \theta)$ and ${\Pr}_{\theta}(x)$ indicate that the probability distribution is parameterized by $\theta$, but do not imply the existence of a probability distribution on $\theta$. Does that make sense? —Quantling (talk | contribs) 16:40, 10 May 2011 (UTC)

## Inverse of FIM for multivariate Gaussian

Nowhere in the article does it say that the Fisher information matrix is the inverse of the covariance matrix in the multivariate normal case. Yet this fact is used in many sources, especially in the context of Bayesian networks (e.g. see http://en.wikipedia.org/wiki/Kalman_filter#Information_filter) — Preceding unsigned comment added by 89.204.138.242 (talk) 12:34, 23 January 2013 (UTC)
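For the mean parameters with known covariance the claim is exact: the Hessian of log f in μ is −Σ⁻¹ for every x, so I(μ) = Σ⁻¹. A Monte Carlo sketch via the score (Python with NumPy; the particular Σ, μ, seed, and sample size are arbitrary choices):

```python
import numpy as np

# X ~ N(mu, Sigma) with known Sigma: the score with respect to mu is
# Sigma^{-1} (x - mu), and E[score score^T] = Sigma^{-1}, i.e. the Fisher
# information matrix for mu is the inverse covariance matrix.
rng = np.random.default_rng(2)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
mu = np.array([1.0, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)
score = (x - mu) @ Sigma_inv        # row i holds Sigma^{-1} (x_i - mu)
info_mc = score.T @ score / len(x)  # Monte Carlo E[score score^T]

print(np.round(info_mc, 2))         # close to Sigma_inv
print(np.round(Sigma_inv, 2))
```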

## Far too technical

But $X$ doesn't depend on $\theta$ at all. $X$ is external data. $\theta$ is a model parameter with which we are modelling $X$.