# Talk:Maximum entropy probability distribution

WikiProject Statistics (Rated Start-class, Mid-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

WikiProject Physics (Rated Start-class, Mid-importance)
This article is within the scope of WikiProject Physics, a collaborative effort to improve the coverage of Physics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.

## History

Who was the first to show that the normal has maximum entropy among all distributions on the real line with specified mean μ and standard deviation σ? A reference for this would be a nice addition to the article. Btyner 21:21, 13 May 2006 (UTC)

It was Shannon who showed that if all second moments are known, the n-dimensional multivariate Gaussian maximizes the entropy over all distributions. See Theory of Communication, pp. 88-89. --Preceding unsigned comment added by Lovewarcoffee (talk • contribs) 05:04, 17 September 2008 (UTC)

## Merging

I am against merging these articles. The principle of maximum entropy and the maximum entropy probability distributions are completely different topics; there will not be enough room on the maximum entropy page to treat this topic well. It certainly needs its own article. I am removing the tag. Agalmic (talk) 01:19, 4 July 2008 (UTC)

## Maximum entropy in other coordinate systems

I am wondering what the maximum entropy distribution would be for a variable in angular space. Note that for such a variable with support [-180°,180°], the points -180° and 180° are the same. So saying that the mean is μ=0°, μ=90° or any other value would be equivalent. I suspect that with no specification of the standard deviation σ, the maxent distribution would be continuous uniform in angular space, and not a Gaussian. Do you have any references on this? It would be good to write something about the coordinate system dependence in this article. Jorgenumata (talk) 10:04, 29 March 2009 (UTC)

OK, so I found out the answer. For a random variable with a circular distribution, given the mean and the (analog of) the variance, the von Mises distribution maximizes entropy. Jorgenumata (talk) 10:27, 28 April 2009 (UTC)

## Wrong information measure

The formula given for the Entropy of continuous probability distributions,

$H(X) = - \int_{-\infty}^\infty p(x)\log p(x) dx$

does not follow from Shannon's theory. It lacks invariance under a change of variables $x \to y(x)$. That is, if I look at the distribution of the radii of circles, and someone else looks at the distribution of their areas, then we will get different entropies.

In my opinion, this article should either be rewritten to describe only discrete-valued probability distributions, or a proper continuous information measure should be introduced, e.g., something like the Kullback-Leibler divergence,

$H^c(p(x)\|m(x)) = -\int p(x)\log\frac{p(x)}{m(x)}\,dx.$

See e.g. E.T.Jaynes, "Probability Theory", Chapter 12.3, Cambridge University Press, 2003; or Entropy_(information_theory)#Extending_discrete_entropy_to_the_continuous_case:_differential_entropy

Please tell me your opinion of what should be done! If I hear nothing, I'll just do something at some stage. Hanspi (talk) 12:23, 24 August 2009 (UTC)
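
The difference between the two formulas above can be checked numerically. The following is only a sketch under assumed choices that do not come from the article: the density p(x) = 2x on (0, 1), the reference measure m(x) = 1, the change of variables y = x², and a hypothetical helper `midpoint_integral`. It suggests that the plain integral changes under the substitution, while the form with m(x) does not, provided m is transformed like a density.

```python
import math

def midpoint_integral(f, a, b, n=200_000):
    """Approximate the integral of f over (a, b) with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Assumed example: density p(x) = 2x on (0, 1), reference measure m(x) = 1.
def p_x(x):
    return 2.0 * x

def m_x(x):
    return 1.0

# Change of variables y = x**2, so x = sqrt(y) and dx/dy = 1 / (2 sqrt(y)).
# Both p and m are transformed with the same Jacobian factor.
def p_y(y):
    return p_x(math.sqrt(y)) / (2.0 * math.sqrt(y))

def m_y(y):
    return m_x(math.sqrt(y)) / (2.0 * math.sqrt(y))

def differential_entropy(p):
    """-integral of p log p over (0, 1)."""
    return midpoint_integral(lambda t: -p(t) * math.log(p(t)), 0.0, 1.0)

def relative_entropy(p, m):
    """-integral of p log(p / m) over (0, 1)."""
    return midpoint_integral(lambda t: -p(t) * math.log(p(t) / m(t)), 0.0, 1.0)

h_x = differential_entropy(p_x)   # entropy computed in the x coordinate
h_y = differential_entropy(p_y)   # same distribution, y coordinate
hc_x = relative_entropy(p_x, m_x)
hc_y = relative_entropy(p_y, m_y)

print(f"differential entropy: x -> {h_x:.4f}, y -> {h_y:.4f}")
print(f"with measure m:       x -> {hc_x:.4f}, y -> {hc_y:.4f}")
```

For this particular p, the closed-form value in the x coordinate is 1/2 − log 2, and the density in the y coordinate is uniform, so its plain differential entropy is 0.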

I think most applications would not need the general underlying measure m. So it would probably be best to just note the assumption being made, with a link to the wiki article/section you have noted above. If there are any important applications that need something more complicated the situation might be different, but even then it may still be best to allow the simpler form to stand as an initial version as this is how it is likely to be expressed in literature most people will come across. Are there any known results for mixed discrete/continuous distributions? Melcombe (talk) 10:49, 25 August 2009 (UTC)
The "assumption being made" is precisely the problem; m(x) is the result of starting from a discrete probability distribution with n points and letting n go to infinity through a well-defined limit process. m=constant can only happen if x is limited to a finite range. m(x) is, apart from a constant factor, the prior probability distribution expressing complete ignorance of x; e.g., for scale parameters, it would be the Jeffreys prior, proportional to 1/x. So the underlying assumption contradicts all of the examples on the page, which are probability distributions with an infinite range of x.
I see that the general reader would not be interested in this and would never even try to evaluate the entropy of a continuous probability distribution. You say that "this is how it is likely to be expressed in literature most people will come across"; I think this is so, but is it then OK to repeat the error here when we know about it? I feel this is arguing along the lines of "I'll tell you a lie, but it is a lie you can understand." Hanspi (talk) 05:59, 27 August 2009 (UTC)
This isn't the place to decide what is "right". Wikipedia is meant to reflect how terms and concepts are actually being used ... hence the need for citations (but this is too often flouted in the maths/stats articles). It would be OK to put in something cautionary either if you can find a citation that makes this point, or if the wikipedia article/section you mentioned already makes the point clearly enough. However, if you know of a source that does treat all the cases covered in a unified framework, and that you find more satisfactory, then perhaps you should revise the whole article to reflect that approach. After all, there are presently no in-line citations to support any of the stuff presently in the article. Melcombe (talk) 09:26, 27 August 2009 (UTC)
Oh, it is not necessary to decide here what is right, I can easily cite (and have done so above, Jaynes 2003) a place where this is done. The continuous entropy measure on that page is very clearly wrong. Imagine the following: you have a selection of squares of side lengths k that are integer multiples of 1 mm. You know the distribution $p_k$. Now your colleague looks at the same set of squares, but he looks at the areas $a=k^2$ and gets a $p_a$. Both of you calculate the entropy; both of you get the same result, as it should be, because the two of you are looking at the same thing. Now you simply let go of the restriction that the k are multiples of 1 mm, so you permit k to be any real number. Same thing; one of you will get $p(k)$, the other $p(a)$, still you are looking at the same collection of objects, but now you will get different entropies. Does this not worry you? It worries me!
Anyway, I saw that the page Principle_of_maximum_entropy that we cite already explains all this and refers to our page as an example page. We may just want to remove the theory from this page!Hanspi (talk) 06:17, 28 August 2009 (UTC)
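
The discrete half of the squares example can be sketched in a few lines. The probabilities below are made up purely for illustration, and `shannon_entropy` is a hypothetical helper; the point is only that relabelling each side length k by its area a = k² is one-to-one, so the discrete entropy should not change.

```python
import math

def shannon_entropy(pmf):
    """Entropy in nats of a discrete distribution given as {outcome: probability}."""
    return -sum(q * math.log(q) for q in pmf.values() if q > 0)

# Hypothetical distribution of side lengths k (in mm); the numbers are made up.
p_k = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}

# The colleague records areas a = k**2. The map is one-to-one, so each
# probability is simply carried over to the relabelled outcome.
p_a = {k * k: q for k, q in p_k.items()}

H_k = shannon_entropy(p_k)
H_a = shannon_entropy(p_a)
print(H_k, H_a)
```

In the continuous case there is no such simple relabelling, because the density picks up a Jacobian factor under the substitution.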

(Unindenting for convenience) I have found two recent sources that both explicitly define "maximum entropy distributions" to be those with maximum entropy defined for m=1. Both sources explicitly recognise that this is something that is not invariant, but go on with it anyway. See (1) Williams, D. (2001) Weighing the Odds (pages 197-199) Cambridge UP ISBN 0-521-00618-x ; (2) Bernardo, J.M., Smith, A.F.M. (2000) Bayesian Theory (pages 209, 366) Wiley. ISBN 0-471-49464-x. (And Bernardo&Smith reference Jaynes's work (earlier, but on the same topic), so did know of it.) The article must at least recognise that this is how some/many sources define "maximum entropy distributions". However, the article could go on to give the fuller definition but, if it does, each of the examples should be modified to state m explicitly (possibly with a justification). I note that the present exponential distribution example uses m=1, rather than something like 1/x. If there are some examples with a different m, then they should be included, as otherwise even discussing the more general case will look odd.

I see that the Principle_of_maximum_entropy article suggests that maximum entropy solves the problem of specifying a prior distribution. Your comments suggest a background in Bayesian stuff, so you might like to consider this in relation to Bernardo&Smith's assertion that it does not (page 366, point (iv)). However, I don't think these articles should become too Bayes-specific.

Melcombe (talk) 11:35, 28 August 2009 (UTC)

What do you mean by "too Bayes-specific"? Yes, of course I know about Bayes's rule, I have an information theory background, and I have read several articles by Jaynes from the time when the orthodox vs. Bayes war seems to have been going on ferociously, but I thought this was over! A colleague who does information theory research assured me just a few weeks ago that at the research front, nobody doubts the Bayesian way anymore. Was this not correct?
Now to the problem at hand: Principle_of_maximum_entropy and this article together are inconsistent. One must be changed. Which one? I would definitely change this one. I don't have your literature, but together we could do something beautiful: we could omit the theory and refer the reader to Principle_of_maximum_entropy. If we then listed all distributions for both m=1 and m=1/x, we could make two sections: m=1, "use these if the unknown quantity is a location parameter", and m=1/x, "use these if the unknown parameter is a scale parameter". I could explain this intuitively, and then we would have an article that is useful in practice, giving the answer to "what distribution shall I use in which situation?" What do you think? Hanspi (talk) 14:51, 31 August 2009 (UTC)
P.S. Melcombe, I have ordered the books you cited from the Swiss network of libraries and information centres. I'll get them next week and can then read what you cited. Hanspi (talk) 18:20, 1 September 2009 (UTC)
By "not too Bayes-specific" I meant two things, neither related to Bayes/frequentist controversy. First, there are some long-established results ... for example that the normal distribution is the maximum entropy distribution under given conditions ... that exist without a Bayesian background. Secondly, there is a need to cover the physics-based context already in the article for which giving a meaning to the reference measure may be problematic enough on its own, without having to find a Bayesian connection.
To move forward, I have added the more general definition to the article, keeping also the original with some citations (one of which is new). I have not added the Jaynes reference as I don't know whether that was in the context of "maximum entropy" or just "entropy". Do edit it as you think fit, but remember that this is meant to reflect what is in published sources, not to decide that Jaynes is either right or wrong and so to exclude anything different. Melcombe (talk) 10:45, 7 September 2009 (UTC)

## Caveats

I added the 'citation needed' because I'd really like some clarification on that point. Is it that in general there is no maximum entropy distribution given the first three moments, or is it that there's something particular about this combination (seems unlikely)?

This is also unclear (to me): "bounded above but there is no distribution which attains the maximal entropy"

Is it asymptotic? Then can't we just take the limit as you approach the bound to get the answer?

If not, then can't we just look at all distributions that do exist and meet these conditions, and choose the one with the highest entropy (or the limit as you approach the highest entropy)? There is probably something I misunderstand, but if it's not asymptotic to the maximum entropy, then to me it sounds like the article says:

#! python
assert all([f(x)<=1 for x in R]),"no f(x) is above 1"
assert max([f(x) for x in R]) == 1,"so the maximum value must be 1"


which is obviously wrong; it's okay if the maximum turns out not to be right on the theoretical limit.

(of course my example is iterating over the real numbers... so who am I to complain about things seeming wrong)

Sukisuki (talk) 13:37, 10 April 2010 (UTC)
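
One possible reading of "bounded above but not attained" is a supremum that is approached but never reached. The sketch below is a toy analogy only, not the entropy calculation from the article; the sequence f(n) = 1 − 1/n is an assumption chosen for illustration.

```python
def f(n):
    """Illustrative sequence: f(n) = 1 - 1/n, strictly below 1 for every n >= 1."""
    return 1.0 - 1.0 / n

values = [f(n) for n in range(1, 10_001)]

# Every value is strictly below the bound 1 ...
all_below_bound = all(v < 1.0 for v in values)

# ... yet the largest value seen so far keeps creeping towards the bound.
gap = 1.0 - max(values)

print(all_below_bound, gap)
```

Here the bound 1 is the least upper bound, but no n gives f(n) = 1. In the distribution setting, the analogous situation would be that the limit of the approaching distributions no longer satisfies the stated constraints, so it is not itself a candidate.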

## Examples of maximum entropy distributions

Please include the Gamma distribution. — Preceding unsigned comment added by 187.64.42.156 (talk) 00:30, 19 April 2012 (UTC)

The initially empirical Zipf distribution and its derived form, the Mandelbrot distribution (of which Zipf's is a special case), happen to be maximum entropy distributions too, if the constraint imposed is the order of magnitude (the mean value of the log of the rank for Zipf, and the mean value of a more complicated expression involving usage cost and storage cost for Mandelbrot). I was shown that in 1974. I do not remember the demonstrations very well, but they seemed pretty obvious at the time. Perhaps somebody can find a link to them? 212.198.148.24 (talk) 19:06, 10 May 2013 (UTC)

## Alternative way of writing entropy

Is it significant that $H(X) = E[-\log p(X)]$ (or $E[-\log(p(X)/m(X))]$ for the version with a measure)? It seems like an easy way of remembering this formula to me.
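
The expectation form also suggests a simple Monte Carlo estimate: draw samples of X and average −log p(X). The sketch below assumes a standard normal distribution and uses hypothetical helper names (`normal_pdf`, `entropy_as_expectation`); it compares the sample average with the known closed form ½ log(2πe) for the standard normal.

```python
import math
import random

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def entropy_as_expectation(sample, pdf):
    """Estimate H(X) = E[-log p(X)] by averaging -log p over draws of X."""
    return sum(-math.log(pdf(x)) for x in sample) / len(sample)

rng = random.Random(0)  # fixed seed so the run is repeatable
sample = [rng.gauss(0.0, 1.0) for _ in range(200_000)]

estimate = entropy_as_expectation(sample, normal_pdf)
exact = 0.5 * math.log(2.0 * math.pi * math.e)  # closed form for the standard normal

print(f"Monte Carlo estimate: {estimate:.4f}, closed form: {exact:.4f}")
```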

## conditions of example don't apply

I removed this:

In physics, this occurs when gravity acts on a gas that is kept at constant pressure and temperature: if X describes the distance of a molecule from the bottom, then the variable X is exponentially distributed (which also means that the density of the gas depends on height proportional to the exponential distribution). The reason: X is clearly positive and its mean, which corresponds to the average potential energy, is fixed. Over time, the system will attain its maximum entropy configuration, according to the second law of thermodynamics.

You can't just conclude this from the purely mathematical example (given in the article just above the passage I removed) of a maximum-entropy distribution constrained to have a given mean; you have to do the physics! It can be viewed as a coincidence that the two examples both produce an exponential distribution. 178.38.142.81 (talk) 00:11, 2 February 2015 (UTC)