# Talk:Differential entropy

## Maximization of differential entropy

The normal distribution maximizes differential entropy for a distribution with a given mean and a given variance. Perhaps it should be emphasized that there are distributions for which the mean and variance do not exist, such as the Cauchy distribution.

--Kaba3 (talk) 11:27, 20 October 2014 (UTC)
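
A quick numerical comparison may help here. The sketch below uses the standard closed-form differential entropies (in nats) of three distributions scaled to the same variance. The function names are illustrative, and the choice of the Laplace and uniform distributions as comparison cases is an assumption of this example, not something taken from the article.

```python
import math

def normal_entropy(sigma):
    # Differential entropy of N(mu, sigma^2): 0.5 * ln(2*pi*e*sigma^2)
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def laplace_entropy(sigma):
    # A Laplace distribution with scale b has variance 2*b^2 and entropy 1 + ln(2*b)
    b = sigma / math.sqrt(2.0)
    return 1.0 + math.log(2.0 * b)

def uniform_entropy(sigma):
    # A uniform distribution of width w has variance w^2 / 12 and entropy ln(w)
    width = sigma * math.sqrt(12.0)
    return math.log(width)

if __name__ == "__main__":
    for name, fn in [("normal", normal_entropy),
                     ("laplace", laplace_entropy),
                     ("uniform", uniform_entropy)]:
        print(name, fn(1.0))
```

For any fixed standard deviation the normal value comes out largest. The Cauchy distribution cannot be placed in this comparison at all, because it has no variance to match.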

## unnecessary treatment

I am not sure how familiar the average information theorist is with measure theory and integration. The article does use language such as "almost everywhere", "random variable", etc., so it must not be foreign. In that case, the division between "differential" and "non-differential" entropy is unnecessary and quite misleading. One can simply use a single definition of entropy for any random variable on an arbitrary measure space, and surely it has been done somewhere. Mct mht 03:53, 18 June 2006 (UTC)

While not being an expert on the subject, I do not agree; see Information entropy#Extending_discrete_entropy_to_the_continuous_case:_differential_entropy. It is probably possible to give such a single definition, but it would add unnecessary complication to the article; a Computer Science student, say, does not normally need to study measure theory (I have only a certain informal knowledge of it), and not all probability students need to know it. Moreover, Shannon entropy is almost always used in its discrete version because that is the natural application, and the two versions have different properties (the other article dismisses differential entropy as being of little use).--Blaisorblade 18:34, 10 September 2007 (UTC)

I am an average CS theory student, far from an expert. I was trying to find a general definition of entropy but could not find anything. After a bit of thinking, I realized that the entropy (differential or not) of a probability distribution (equivalently, of a random variable) must be defined with respect to a base measure. For the most common, discrete, entropy the base measure is the counting measure. For "continuous" probability distributions the entropy is computed with respect to the Lebesgue measure on the real line. In principle, however, you can choose any base measure. If you change the base measure, the entropy changes as well. (This seems counter-intuitive, since most people think that entropy is some inherent quantity.) In this respect the article is sloppy: it does not mention with respect to which base measure the integral is computed, and at the same time it pretends that the measure space can be arbitrary. Either stick to the Lebesgue measure on the real line or do it in full generality. —Preceding unsigned comment added by 129.97.84.19 (talk) 22:22, 29 February 2008 (UTC)
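
The dependence on the base measure can be written out explicitly. The following is a sketch under the assumption that the distribution P is absolutely continuous with respect to the base measure μ, so that the density dP/dμ exists; the notation h_μ is introduced here for illustration and is not the article's.

```latex
% Entropy of P relative to a base measure \mu, assuming P \ll \mu
h_\mu(P) = -\int \frac{dP}{d\mu} \, \log \frac{dP}{d\mu} \, d\mu

% \mu = counting measure on a countable set: the usual discrete entropy
h_\mu(P) = -\sum_i p_i \log p_i

% \mu = Lebesgue measure on \mathbb{R}: differential entropy
h_\mu(P) = -\int f(x) \log f(x) \, dx

% Rescaling the base measure, \mu' = c\,\mu with c > 0, shifts the value:
% dP/d\mu' = (1/c)\, dP/d\mu, hence
h_{\mu'}(P) = h_\mu(P) + \log c
```

The last line makes the point concrete: even a constant rescaling of the base measure changes the entropy by an additive term.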

I am a practicing statistician with a Ph.D. in statistics, and I think there is a need to present the definitions the way the original developer developed the thought process about the problems, concepts, and solutions. Academically, I have been trained in measure theory. Having a single mathematical definition of information is not what matters; it is more important that all types of people, with different backgrounds, understand the problems, concepts, and solutions. My vote is to keep it simple (this is, after all, Wikipedia, not a contest on how to make it precise with measure theory). The number of pages allowed on the internet is unlimited. Please use a separate section for measure-theory-based definitions. In fact, by separating them out this way, one will appreciate the beauty of measure theory for conceptualizing problems and solving them elegantly.

It is not an "unnecessary treatment". It is very much a necessary treatment. Thanks —Preceding unsigned comment added by 198.160.96.25 (talk) 06:00, 19 May 2008 (UTC)

The problem here is that differential entropy doesn't really mean much. Among other things, it's not even dimensionally correct. The term ${\displaystyle f(x)}$ would have units of ${\displaystyle {\frac {1}{dx}}}$, so it cannot be fed into the logarithm without some sort of scaling factor applied. This entire formula was the result of a mistake made by Shannon when he was really trying to define the limiting density of discrete points. As a result, this is useful in that it is part of the LDDP in the cases where the measure is constant in x over some interval, and the probability ${\displaystyle p(x)}$ is nearly zero outside of that interval, but the formula itself doesn't really mean anything. It is essentially unrelated to the ability to transmit information (which is the basis of entropy), and in fact, the entropy of any continuous distribution would always be infinite unless it is quantized. I think this article needs stronger pointers to LDDP, and the LDDP articles need to be much more clear about the relationship with discrete entropy. Vertigre (talk) 15:33, 8 April 2016 (UTC)
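
The unit dependence is easy to see numerically. The sketch below is only an illustration: it takes a normally distributed voltage (the 10 mV standard deviation is an invented value) and evaluates the closed-form differential entropy once in volts and once in millivolts.

```python
import math

def normal_entropy(sigma):
    # Differential entropy of N(mu, sigma^2) in nats: 0.5 * ln(2*pi*e*sigma^2)
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

# Hypothetical measurement noise: standard deviation of 10 mV.
sigma_volts = 0.01
sigma_millivolts = sigma_volts * 1000.0

h_volts = normal_entropy(sigma_volts)            # same signal, expressed in volts
h_millivolts = normal_entropy(sigma_millivolts)  # same signal, expressed in millivolts

if __name__ == "__main__":
    print(h_volts, h_millivolts, h_millivolts - h_volts)
```

The two values describe the same physical signal but differ by ln(1000), and the one in volts is negative. This follows from the scaling property h(aX) = h(X) + log|a|.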

I added a warning to the top of the page redirecting people to LDDP a little more strongly than the previous versions had. I think this is the correct treatment for this subject. This page is worthy of note, and worth keeping, because differential entropy is used (incorrectly) all over the literature, so it's worth being aware of it, and understanding its properties, such as they are. Vertigre (talk) 15:41, 8 April 2016 (UTC)

## Error in Text

I believe there is an error in the differential entropy for the multivariate Gaussian (it should have a minus sign). I have no access to TeX to correct it. — Preceding unsigned comment added by 93.173.58.105 (talk) 17:08, 6 October 2014 (UTC)

I believe there is an error in the example of a uniform(0,1/2) distribution: the integral should evaluate to log(1/2). I do not have access to TeX to correct it. —The preceding unsigned comment was added by 141.149.218.145 (talk) 07:46, 30 March 2007 (UTC).
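
The claimed value can be checked with a few lines of numerical integration. This is a minimal sketch using a midpoint rule and natural logarithms; the helper names are made up for the example.

```python
import math

def differential_entropy(pdf, lo, hi, n=100_000):
    """Midpoint-rule estimate of -integral of p(x) ln p(x) over [lo, hi], in nats."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        p = pdf(lo + (i + 0.5) * dx)
        if p > 0.0:  # the integrand is taken as 0 where the density vanishes
            total -= p * math.log(p) * dx
    return total

def uniform_pdf(x):
    # Density of the uniform distribution on (0, 1/2)
    return 2.0 if 0.0 < x < 0.5 else 0.0

if __name__ == "__main__":
    print(differential_entropy(uniform_pdf, 0.0, 0.5), math.log(0.5))
```

Both numbers agree with log(1/2), which is negative: about -0.693 nats, or exactly -1 bit if base-2 logarithms are used.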

September 2015: It seems to me there is one more error, in the "Table of differential entropies" in the section "Differential entropies for various distributions", in the "Chi-squared" row, column "Entropy in nats". I suppose there should be a plus sign before (1-k/2), i.e. it should be ln 2Γ(k/2) + (1-k/2)*psi(k/2) + k/2. This can be derived from the general formula for the "Gamma" distribution (next lines in the same table) with scale parameter θ=2 and shape parameter k/2. Besides, the correct formula for the entropy is given on the wiki page for "Gamma distribution" in the section "Information entropy". So I ask anybody who is a specialist on the subject and can confirm my supposition: is there really an error? If yes, could you please correct it. — Preceding unsigned comment added by Ashc~ukwiki (talkcontribs) 18:58, 13 September 2015 (UTC)
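
The sign can be checked numerically without special libraries. The sketch below compares the closed form with the plus sign against a midpoint-rule integration of -p ln p for the chi-squared density. The digamma function is approximated by a central difference of `math.lgamma`, and the integration limits and step count are assumptions chosen for moderate k (the density is unbounded at 0 for k < 2, which this simple rule handles poorly).

```python
import math

def digamma(x, h=1e-5):
    """Approximate psi(x) as a central-difference derivative of ln Gamma(x)."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def chi2_entropy_closed_form(k):
    # k/2 + ln(2 * Gamma(k/2)) + (1 - k/2) * psi(k/2), in nats
    half = k / 2.0
    return half + math.log(2.0) + math.lgamma(half) + (1.0 - half) * digamma(half)

def chi2_entropy_numeric(k, hi=100.0, n=200_000):
    """Midpoint-rule estimate of -integral of p ln p for the chi-squared density."""
    half = k / 2.0
    log_norm = half * math.log(2.0) + math.lgamma(half)
    dx = hi / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * dx
        log_p = (half - 1.0) * math.log(x) - x / 2.0 - log_norm
        total -= math.exp(log_p) * log_p * dx
    return total

if __name__ == "__main__":
    for k in (4, 6):
        print(k, chi2_entropy_closed_form(k), chi2_entropy_numeric(k))
```

For k = 4 and k = 6 the two columns agree to several decimal places, which supports the plus sign; with a minus sign in front of (1-k/2) the closed form would be off by 2(1-k/2)psi(k/2).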

## Merge proposal

This article talks about the same thing as Information entropy#Extending discrete entropy to the continuous case: differential entropy, but the two give totally different (though not contradictory) results and information. Since differential entropy is of little use in Computer Science (at least so I suppose), it can be seen as an off-topic section in Information entropy, i.e. something deserving just a mention in "See also". --Blaisorblade 18:41, 10 September 2007 (UTC)

I think it would be too quick to call differential entropy "of little use" in information theory. There are some hints to the opposite:
* Brand [1] suggested minimization of posterior differential entropy as a criterion for model selection.
* Neumann [2] shows that maximization of differential entropy under a constraint on expected model entropy is equivalent to maximization of relative entropy with a particular reference measure. That reference measure satisfies the demand from Jaynes [3] that (up to a multiplicative factor) the reference measure also had to be the "total ignorance" prior.
* Last but not least, in physics there are frequently recurring claims that differential entropy is useful, or in some settings even more powerful than relative entropy (e.g. Garbaczewski [4]). It would seem strange to me if such things did not find their counterparts in probability / information theory.
Webtier (talk) 12:52, 20 December 2007 (UTC)
[1] Matthew Brand: "Structure Learning in Conditional Probability Models via an Entropic Prior and Parameter Extinction". Neural Comp. vol. 11 (1999), pp. 1155-1182. Preprint: http://citeseer.ist.psu.edu/247075.html
[2] Tilman Neumann: "Bayesian Inference Featuring Entropic Priors", in Proceedings of 27th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, American Institute of Physics, vol. 954 (2007), pp. 283–292. Preprint: http://www.tilman-neumann.de/docs/BIEP.pdf
[3] Edwin T. Jaynes, "Probability Theory: The Logic of Science", Corrected reprint, Cambridge University Press 2004, p. 377.
[4] Piotr Garbaczewski: "Differential Entropy and Dynamics of Uncertainty". Journal of Statistical Physics, vol. 123 (2006), no. 2, pp. 315-355. Preprint: http://arxiv.org/abs/quant-ph/0408192v3

## WikiProject class rating

This article was automatically assessed because at least one WikiProject had rated the article as start, and the rating on other projects was brought up to start class. BetacommandBot 09:48, 10 November 2007 (UTC)

## Continuous mutual information

Note that the continuous mutual information I(X;Y) has the distinction of retaining its fundamental significance as a measure of discrete information since it is actually the limit of the discrete mutual information of partitions of X and Y as these partitions become finer and finer. Thus it is invariant under quite general transformations of X and Y, and still represents the amount of discrete information that can be transmitted over a channel that admits a continuous space of values.

I actually added this statement some time ago to the article (anonymously under my IP address), and it was recently marked as needing a citation or reference. I made the statement a little weaker by changing "quite general transformations" to "linear transformations", and added Reza's book as a reference. However, it is still true that I(X;Y) is invariant under any bijective, continuous (and thus monotonic) transformations of the continuous spaces X and Y. This fact needs a reference, though. Deepmath (talk) 06:57, 12 March 2008 (UTC)

Mutual information is invariant under diffeomorphisms of X and Y, which is not hard to show. Don't know though whether the same is true of homeomorphisms. --Kaba3 (talk) 22:15, 4 November 2011 (UTC)
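
The simplest case of this invariance, linear rescaling, can be illustrated with a jointly Gaussian pair, for which the mutual information has the closed form -(1/2) ln(1 - ρ²). This is only a sketch of the linear case, not a demonstration for general diffeomorphisms; the function names and the covariance values are invented for the example.

```python
import math

def gaussian_mutual_information(var_x, var_y, cov_xy):
    """I(X;Y) in nats for a bivariate Gaussian: -0.5 * ln(1 - rho^2)."""
    rho_sq = cov_xy ** 2 / (var_x * var_y)
    return -0.5 * math.log(1.0 - rho_sq)

def rescale(var_x, var_y, cov_xy, a, b):
    """Second moments of (a*X, b*Y) given those of (X, Y)."""
    return a * a * var_x, b * b * var_y, a * b * cov_xy

def gaussian_entropy(var):
    """Differential entropy of a univariate Gaussian, for contrast."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

if __name__ == "__main__":
    moments = (1.0, 2.0, 0.8)                    # hypothetical var(X), var(Y), cov(X, Y)
    scaled = rescale(*moments, a=1000.0, b=0.01)
    print(gaussian_mutual_information(*moments), gaussian_mutual_information(*scaled))
    print(gaussian_entropy(moments[0]), gaussian_entropy(scaled[0]))
```

The mutual information is the same before and after rescaling, while the differential entropy of X shifts by ln(1000).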

## WTF! This is a mess.

<rant> Hmm. At the top of this talk page, there is discussion of measure theory, but there is not a drop of measure theory to be found in this article. FWIW, at least some textbooks on probability *start* with defining measure theory (I own one such), so I'm not sure why this should be so scary.

Anyway, this article *completely* glosses over the hard problem of defining the continuous entropy. Consider a digital electronic oscillator, which nominally outputs two voltages: 0 Volts and 1 Volt. Now, if the oscillator is running at very high speed, it typically won't hit exactly 0 and 1, it will have a (continuous) distribution of voltages around 0 and around 1. If the distribution is very sloppy, and I compute entropy with the integral formula, I get a nice positive entropy. But if I sharpen up the oscillator, so that it hits the voltages 0 and 1 more and more precisely (more and more sharply peaked at 0 and 1), instead of being sloppy, the "continuous entropy" will go negative -- more and more negative, the sharper the peaks get. This is very disconcerting if one was hoping to get the entropy resembling that of a discrete coin toss out of the system.

This is an actual problem which does show up, and is partly solvable by using bins, and doing bin-counting. But this hardly merits the la-dee-da treatment given here! </rant> linas (talk) 15:21, 29 August 2008 (UTC)
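
The oscillator example can be reproduced numerically. The sketch below models the output as an equal-weight mixture of two Gaussians centred at 0 V and 1 V with a common width sigma (a modelling assumption made for this example) and integrates -p ln p with a midpoint rule; the integration range and step count are likewise chosen by hand.

```python
import math

def mixture_pdf(x, sigma):
    """Equal-weight mixture of N(0, sigma^2) and N(1, sigma^2)."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    low = math.exp(-0.5 * (x / sigma) ** 2)
    high = math.exp(-0.5 * ((x - 1.0) / sigma) ** 2)
    return 0.5 * norm * (low + high)

def mixture_entropy(sigma, lo=-4.0, hi=5.0, n=200_000):
    """Midpoint-rule estimate of the differential entropy in nats."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        p = mixture_pdf(lo + (i + 0.5) * dx, sigma)
        if p > 0.0:  # skip points where the density underflows to zero
            total -= p * math.log(p) * dx
    return total

if __name__ == "__main__":
    for sigma in (0.5, 0.05, 0.005):
        print(sigma, mixture_entropy(sigma))
    print("discrete coin toss:", math.log(2.0))
```

Once the peaks are well separated, the value is close to ln 2 + (1/2) ln(2πe sigma²): the discrete coin-toss entropy ln 2 plus a term that tends to minus infinity as sigma shrinks. The sloppy oscillator gives a positive number and the sharp ones give increasingly negative numbers.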

## Normal Distribution Maximizes The Differential Entropy For a Given Variance

This property is given in the article; I thought it would be valuable to add a full proof.
One proof could be made for smooth PDFs using Lagrange multipliers.

There is a more general proof:
Let ${\displaystyle g\left(x\right)}$ be a Normal Distribution PDF.
Without loss of generality (see Differential Entropy Properties) we can assume ${\displaystyle g\left(x\right)}$ to be centered with second moment equal to 1:
${\displaystyle \int _{-\infty }^{\infty }g\left(x\right)xdx=0,\ \int _{-\infty }^{\infty }g\left(x\right){x}^{2}dx=1}$
Let ${\displaystyle f\left(x\right)}$ be any arbitrary PDF with the same statistical properties.

We apply the Kullback–Leibler divergence to them (mind the minus sign):
${\displaystyle -{D}_{KL}\left(f\left(x\right)\|g\left(x\right)\right)=}$

${\displaystyle =-\int _{-\infty }^{\infty }f\left(x\right)\log \left({\frac {f\left(x\right)}{g\left(x\right)}}\right)dx}$
${\displaystyle =\int _{-\infty }^{\infty }f\left(x\right){\underset {\log \left(x\right)\leq x-1}{\underbrace {\log \left({\frac {g\left(x\right)}{f\left(x\right)}}\right)} }}dx}$
${\displaystyle \leq \int _{-\infty }^{\infty }f\left(x\right)\left({\frac {g\left(x\right)}{f\left(x\right)}}-1\right)dx}$
${\displaystyle =\int _{-\infty }^{\infty }\left(g\left(x\right)-f\left(x\right)\right)dx=0}$

Looking back on ${\displaystyle \int _{-\infty }^{\infty }f\left(x\right)\log \left({\frac {g\left(x\right)}{f\left(x\right)}}\right)dx}$ we get:
${\displaystyle \int _{-\infty }^{\infty }f\left(x\right)\log \left({\frac {g\left(x\right)}{f\left(x\right)}}\right)dx}$

${\displaystyle ={h}_{f\left(x\right)}\left(X\right)+{\underset {=-{h}_{g\left(x\right)}\left(X\right)}{\underbrace {\int _{-\infty }^{\infty }f\left(x\right)\log \left(g\left(x\right)\right)dx} }}}$
${\displaystyle ={h}_{f\left(x\right)}\left(X\right)-{h}_{g\left(x\right)}\left(X\right)}$

Applying the inequality we showed previously, we get:
${\displaystyle {h}_{f\left(x\right)}\left(X\right)-{h}_{g\left(x\right)}\left(X\right)\leq 0}$
with equality holding only for ${\displaystyle g\left(x\right)=f\left(x\right)}$. --Royi A (talk) 13:07, 30 January 2010 (UTC)

## Transformations of Random Variables

In the properties section, it is stated that the entropy of a transformed RV can be calculated by adding the expected value of the log of the Jacobian to the entropy of the original RV. I believe, however, that this is only true for bijective transforms, but not for the general case. —Preceding unsigned comment added by 129.27.140.190 (talk) 09:08, 19 July 2010 (UTC)
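
A small worked example shows how the formula can fail for a non-injective map. The choice of a uniform X and of Y = |X| is made here purely for illustration; logarithms are natural.

```latex
X \sim \mathrm{Uniform}(-1, 1), \qquad Y = |X| \sim \mathrm{Uniform}(0, 1)

h(X) = \log 2, \qquad h(Y) = \log 1 = 0

\left| \frac{dY}{dX} \right| = 1 \ \text{almost everywhere}
\quad\Longrightarrow\quad
\mathbb{E}\left[ \log \left| \frac{dY}{dX} \right| \right] = 0

h(X) + \mathbb{E}\left[ \log \left| \frac{dY}{dX} \right| \right] = \log 2 \;\neq\; 0 = h(Y)
```

The discrepancy of log 2 corresponds to the sign of X, which the map discards. For a bijective map nothing is discarded and the formula holds.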

## Erroneous proof?

The current proof that a normal distribution maximizes differential entropy seems erroneous to me. First, it does not use normality anywhere. Second, the last integral is actually not ${\displaystyle h_{g}}$.

--Kaba3 (talk) 15:52, 30 August 2010 (UTC)

I removed the erroneous proof. It can still be seen above. I have proved the result by using variational calculus together with the fact that differential entropy is a strictly concave function (when functions differing on a set of measure zero are taken as equivalent). That takes 5 pages as a LaTeX PDF. I wonder whether the proof would be worthy of its own article? Are you aware of shorter routes to a proof?

--Kaba3 (talk) 00:11, 23 September 2010 (UTC)

I don't see the error in the proof. Could you point me to the error?

I do use properties of the normal distribution. Please let me know, as I intend to put it back on the page. --Royi A (talk) 16:33, 27 January 2011 (UTC)

The first part of your proof shows that the Kullback-Leibler divergence is non-negative. That is correct. The error is that what you define as ${\displaystyle h_{g}}$ is not the differential entropy of ${\displaystyle g}$ (it is a differential cross entropy). In particular, the Kullback-Leibler divergence is not a difference of two differential entropies. Therefore, the first part of the proof does not apply. Note also that you are not using any of the assumptions: the first part works for any pdfs f and g.

--Kaba3 (talk) 17:22, 15 February 2011 (UTC)

The proof seems ok. Here are some missing steps: http://min.us/lkIWMq I can incorporate this into the article if you think it will be useful. --Vladimir Iofik (talk) 12:45, 5 April 2011 (UTC)

Adding those missing steps would be much appreciated:) After that the proof is fine. --Kaba3 (talk) 00:59, 20 April 2011 (UTC)
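
For reference, one way to fill the gap can be sketched as follows. It assumes natural logarithms and that f has the same mean μ and variance σ² as the Gaussian g; it is not necessarily identical to what the linked note contains.

```latex
g(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)

% The cross term depends on f only through its mean and variance:
\int_{-\infty}^{\infty} f(x) \log g(x) \, dx
  = -\tfrac{1}{2} \log\left(2\pi\sigma^2\right)
    - \frac{1}{2\sigma^2} \int_{-\infty}^{\infty} f(x) (x-\mu)^2 \, dx
  = -\tfrac{1}{2} \log\left(2\pi\sigma^2\right) - \tfrac{1}{2}
  = -\tfrac{1}{2} \log\left(2\pi e \sigma^2\right)
  = -h(g)

% Combined with non-negativity of the Kullback-Leibler divergence:
0 \le D_{KL}(f \,\|\, g)
  = -h(f) - \int_{-\infty}^{\infty} f(x) \log g(x) \, dx
  = h(g) - h(f)
```

This is where normality enters: log g is a quadratic polynomial in x, so the cross entropy of f against g coincides with the entropy of g whenever the first two moments match.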

## Maximization in the normal distribution - Proof is incorrect

The normal distribution maximizes the entropy for a fixed variance and MEAN. In the final step of the proof, it says that

${\displaystyle \int _{-\infty }^{\infty }f(x){\frac {(x-\mu )^{2}}{2\sigma ^{2}}}\,dx={\frac {1}{2}}}$

which is true only if ${\displaystyle \mu }$ is also the mean of f(x). I will give an alternate proof based on Lagrange multipliers if no one has an objection. PAR (talk) 01:56, 27 July 2011 (UTC)

Since differential entropy is translation invariant we can assume that f(x) has the same mean as g(x). I've added this note to the proof. Vladimir Iofik (talk) 12:17, 2 October 2011 (UTC)

## Application to improper probability distributions

I undid an edit of User:Portabili, but perhaps there is something to discuss, so I bring it up here. The undone edit spoke of the limit of a normal distribution as the variance increases without bound and how that is similar to a uniform distribution and how the uniform distribution on the real line maximizes differential entropy. Unfortunately, the uniform distribution on the real line is at best an improper probability distribution. The article does not discuss these at all, and it is not at all clear from the present article how differential entropy would apply to an improper distribution. If someone has a source that discusses this, it might merit adding material to the article. 𝕃eegrc (talk) 13:39, 13 January 2017 (UTC)