Talk:Entropy (information theory)/Archive2



The statement at the end of the second paragraph is simply not true: "the shortest number of bits necessary to transmit the message is the Shannon entropy in bits/symbol multiplied by the number of symbols in the original message." -- the formula of (bit/symbol * number of symbols) does not give the entropy when multiplied by the number of symbols in the original message! The original should be replaced with something like the "shortest possible representation".—Preceding unsigned comment added by (talkcontribs)

Units and the Continuous Case

The extension to the continuous case has a subtle problem: the distribution f(x) has units of inverse length and the integral contains "log f(x)" in it. Logarithms should be taken on dimensionless quantities (quantities without units). Thus, the logarithm should be of the ratio of f(x) to some characteristic length L. Something like log [ f(x) / L ] would be more proper.

The problem with taking a transcendental function of a quantity with units arises from the way we define arithmetic operations for quantities with units. 5 m + 2 m is defined (5 m + 2 m = 7 m) but 5 m + 2 kg is not defined because the units are different among the quantities to be added. Transcendental functions (such as logarithms), of a variable x with units, present problems for determining the resulting units of the results of the functions of x. This is why scientists and engineers try to form ratios of quantities in which all the units cancel, and then apply transcendental functions to these ratios rather than the original quantities. As an example, in exp[-E/(kT)] the constant k has the proper units for canceling the units of energy E and temperature T so units cancel in the quantity E/(kT). Then the result of the operation, of a typical transcendental function on its dimensionless argument, is also dimensionless.

My suggested solution to the problem with the units raises another question: what choice of length L should be used in the expression log [ f(x) / L ]? I think any choice can work. —The preceding unsigned comment was added by (talk) 18:06, 17 December 2006 (UTC).

To cancel the inverse unit of length (actually the inverse unit of x), a product of f(x) and a length L should appear under the logarithm, i.e. log [ f(x) L ]. This would indeed be bizarre, as any length L would work, unless we are in the frame of quantum mechanics. In that case, we would simply use the smallest quantum-mechanically distinguishable value for L. If x is truly a length, then L could be Planck's length. But this is already too obfuscating for me. I would rather recommend concentrating on the discrete formula of entropy: S = - Sum [ p(i) log p(i) ]. Now, in the continuous case, the probability is infinitesimal and it is dP = f(x) dx. Thus, the exact transcription of the above formula with this probability would give S = - Sum [ f(x) dx log ( f(x) dx ) ]. Now Sum would become Integral and log ( f(x) dx ) is a functional which must take the form L(x) dx. The worst problem now is that there are two dx under one integral. This problem appears in the above modified formula for S and must be worked out somehow. Its source is the product in the initial Shannon entropy.
If you want to work with continuous variables, you're on much stronger ground if you work with the relative entropy, ie the Kullback-Leibler distance from some prior distribution, rather than the Shannon entropy. This avoids all the problems of the infinities and the physical dimensionality; and often, when you think it through, you'll find that it may make a lot more sense philosophically in the context of your application, too. Jheald 19:32, 7 February 2007 (UTC)
Of course, the relative entropy is very good for the continuous case, but, unlike Shannon entropy, it is relative, as it needs a second distribution from which to depart. I was thinking of a formula that would give a good absolute entropy, similar to the Shannon entropy, for the continuous case. This is purely speculative, though. —The preceding unsigned comment was added by (talk) 13:52, 8 February 2007 (UTC).
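The point about units and relative entropy can be checked on a case with closed forms. The sketch below assumes Gaussian densities, chosen only because their differential entropy and Kullback-Leibler divergence have simple formulas; the helper names and the scale factor c are hypothetical. Rescaling x by c (as in a change of units) shifts the differential entropy by log c, while the KL divergence stays the same.

```python
import math

def gaussian_diff_entropy(sigma):
    """Differential entropy (in nats) of a normal density with std dev sigma."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    """KL divergence D(N(mu1, sigma1^2) || N(mu2, sigma2^2)) in nats."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

c = 1000.0  # hypothetical unit change, e.g. metres -> millimetres

# Differential entropy before and after rescaling x by c
h_m = gaussian_diff_entropy(0.5)
h_mm = gaussian_diff_entropy(0.5 * c)
entropy_shift = h_mm - h_m  # depends on the choice of units

# KL divergence before and after rescaling both densities by c
kl_m = gaussian_kl(0.0, 0.5, 1.0, 2.0)
kl_mm = gaussian_kl(0.0 * c, 0.5 * c, 1.0 * c, 2.0 * c)
```

Here entropy_shift equals log(c) up to rounding, whereas kl_m and kl_mm agree, which is the sense in which the relative entropy avoids the dimensionality problem.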

Extending discrete entropy to the continuous case: differential entropy

Q —The preceding unsigned comment was added by (talk) 10:18, 12 February 2007 (UTC).

The last definition of the differential entropy (second last formula) seems to malfunction. Actually, it should read

h[f] = lim (Delta -> 0) [ H^Delta + log Delta * Sum [ f(xi) Delta ] ]

This would ensure the complete canceling of the second sum in H^Delta. With the current formula, there would remain a non-canceling term:

h[f] = lim (Delta -> 0) [ H^Delta + log Delta ] = - Integral[ f(x) log f(x) dx ] - lim (Delta -> 0) [ log Delta * ( Sum [ f(xi) Delta ] - 1 ) ] .

The last limit does not go to zero. Actually, through a l'Hopital applied to (1-Sum) / (1/log Delta) , it would go to

- lim (Delta -> 0) [ Delta (log Delta)^2 Sum[f(xi)] ],

and, as Delta -> 0, Sum[f(xi)] -> infinity as 1/Delta (since Sum[f(xi) Delta] -> 1), so it would cancel the first Delta in the limit above, and there would be only

- lim (Delta -> 0) [ (log Delta)^2 ] -> - infinity

Thus, the last definition of h[f] could not even be used. I recommend checking this against a reliable source and then, if the formula is wrong, removing it. Unfortunately, I do not (yet) know how formulas are written in Wikipedia.
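One way to probe the behaviour of H^Delta + log Delta is a direct numerical check. The sketch below is only an illustration under stated assumptions: the density is a standard normal, the sample points x_i are bin midpoints, and the range is truncated to [-12, 12]; the function names are hypothetical. It does not settle which form the article should use, but it shows how Sum[ f(xi) Delta ] behaves for a smooth density. (If x_i is instead chosen by the mean value theorem so that f(xi) Delta equals the probability of the bin, the sum is exactly 1 for every Delta.)

```python
import math

def f(x):
    """Standard normal density (assumed test density)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def discretized(delta, lo=-12.0, hi=12.0):
    """Return (H_delta, mass) for bins of width delta, using bin midpoints as x_i."""
    n_bins = int(round((hi - lo) / delta))
    h_delta = 0.0
    mass = 0.0
    for i in range(n_bins):
        p = f(lo + (i + 0.5) * delta) * delta  # p_i = f(x_i) * Delta
        if p > 0.0:
            h_delta -= p * math.log(p)         # discrete entropy term, in nats
            mass += p                          # running value of Sum[f(x_i) Delta]
    return h_delta, mass

# Closed form of -Integral[f log f] for the standard normal
exact = 0.5 * math.log(2 * math.pi * math.e)

results = {}
for delta in (0.5, 0.1, 0.01):
    h_delta, mass = discretized(delta)
    results[delta] = (h_delta + math.log(delta), mass)
```

In this setup each entry of results pairs H^Delta + log Delta with the discretized mass; for this density the mass stays numerically indistinguishable from 1, and the first component stays close to exact.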

Roulette Example

In the roulette example, the entropy of a combination of numbers hit over P spins is defined as Omega/T, but the entropy is given as lg(Omega), which then calculates to the Shannon definition. Why is lg(Omega) used? (Note: I'm using the notation "lg" to denote "log base 2") 20:41, 31 March 2006 (UTC)

moved to talk page because wikipedia is not a textbook

Derivation of Shannon's entropy

Since the entropy was given as a definition, it does not need to be derived. On the other hand, a "derivation" can be given which gives a sense of the motivation for the definition as well as the link to thermodynamic entropy.

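Q. Given a roulette with n pockets which are all equally likely to be landed on by the ball, what is the probability of obtaining a distribution (A1, A2, …, An) where Ai is the number of times pocket i was landed on and

$P = \sum_{i=1}^{n} A_i$

is the total number of ball-landing events?

A. The probability is a multinomial distribution, viz.

$p = \frac{\Omega}{T} = \frac{P!}{A_1! \, A_2! \cdots A_n!} \left( \frac{1}{n} \right)^{P}$

where

$\Omega = \frac{P!}{A_1! \, A_2! \cdots A_n!}$

is the number of possible combinations of outcomes (for the events) which fit the given distribution, and

$T = n^{P}$

is the number of all possible combinations of outcomes for the set of P events.

Q. And what is the entropy?

A. The entropy of the distribution is obtained from the logarithm of Ω:

$H = \log \Omega = \log P! - \sum_{x=1}^{n} \log A_x! = \sum_{i=1}^{P} \log i - \sum_{x=1}^{n} \sum_{i=1}^{A_x} \log i$

The summations can be approximated closely by being replaced with integrals:

$H \approx \int_{1}^{P} \log t \, dt - \sum_{x=1}^{n} \int_{1}^{A_x} \log t \, dt$

The integral of the logarithm is

$\int \log t \, dt = t \log t - t$

So the entropy is

$H = (P \log P - P + 1) - \sum_{x=1}^{n} (A_x \log A_x - A_x + 1) = P \log P - \sum_{x=1}^{n} A_x \log A_x + (1 - n)$

By letting px = Ax/P and doing some simple algebra we obtain:

$H = (1 - n) - P \sum_{x=1}^{n} p_x \log p_x$

and the term (1 − n) can be dropped since it is a constant, independent of the px distribution. The result, after also dividing by the number of events P, is

$\frac{H}{P} = -\sum_{x=1}^{n} p_x \log p_x$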

(Isn't factor of P dropped in the formula above?) —Preceding unsigned comment added by (talk) 22:43, 18 November 2008 (UTC)
Not sure what you mean. At first glance it looks good to me. CRETOG8(t/c) 23:13, 18 November 2008 (UTC)

Thus, the Shannon entropy is a consequence of the equation

$H = \log \Omega$

which relates to Boltzmann's definition,

$S = k \ln \Omega$

of thermodynamic entropy, where k is the Boltzmann constant.

—The preceding unsigned comment was added by MisterSheik (talkcontribs) 17:34, 1 March 2007.

H(X), H(Ω), and the word 'outcome'

Recent edits to this page now stress the word "outcome" in the opening sentence:

information entropy is a measure of the average information content associated with the outcome [emphasised] of a random variable.

and have changed formulas like


There appears to have been a confusion between two meanings of the word "outcome". Previously, the word was being used on these pages in a loose, informal, everyday sense to mean "the range of the random variable X", i.e. the set of values {x1, x2, x3, ...} that might be revealed for X.

But "outcome" also has a technical meaning in probability, meaning the possible states of the universe {ω1, ω2, ω3, ...}, which are then mapped down onto the states {x1, x2, x3, ...} by the random variable X (considered to be a function mapping Ω -> R).

It is important that the mapping X may in general be many-to-one, so H(X) and H(Ω) are not in general the same. In fact we can say definitely that H(X) <= H(Ω), with equality holding only if the mapping is one-to-one over all subsets of Ω with non-zero measure (the "data processing theorem").

The correct equations are therefore

$H(X) = -\sum_i p(x_i) \log p(x_i)$

and

$H(\Omega) = -\sum_i p(\omega_i) \log p(\omega_i)$

But in general the two are not the same. -- Jheald 11:37, 4 March 2007 (UTC).
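The inequality H(X) <= H(Ω) for a many-to-one mapping can be seen on a toy case. The sketch below assumes a fair six-sided die as the sample space Ω and takes X to be the parity of the face; both choices, and the helper names, are illustrative assumptions rather than anything from the article.

```python
import math
from collections import defaultdict

def entropy_bits(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Assumed toy sample space: a fair six-sided die
omega = {w: 1 / 6 for w in range(1, 7)}

def X(w):
    """A many-to-one random variable: parity of the die face."""
    return w % 2

# Push the distribution on Omega forward through X
p_x = defaultdict(float)
for w, p in omega.items():
    p_x[X(w)] += p

h_omega = entropy_bits(omega.values())  # entropy over the six outcomes
h_x = entropy_bits(p_x.values())        # entropy over the two values of X
```

With these choices h_omega is log2(6), about 2.585 bits, while h_x is 1 bit, because three outcomes collapse onto each value of X.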

Sorry, I don't get it

Self-information of an event is a number, right? Not a random variable. Yes?

So how can entropy be the expectation of self-information? I sort-of understand what the formula is coming from, but it doesn't look theoretically sound... Thanks. 13:19, 4 March 2007 (UTC)

Ok, maybe I understand. I(omega) is a number, but I(X) is itself a random variable. I have fixed the formula. 13:27, 4 March 2007 (UTC)

Uh-oh, what have I done? "Failed to parse (Missing texvc executable; please see math/README to configure.)" Could you please fix? Thank you. 13:30, 4 March 2007 (UTC)

Compression of English Text

If I take the text of the book "Uncle Tom's Cabin", it's about a megabyte of text. If I compress it using WinZip I get 395 KB; bzip2 gives 295 KB and paq8l 235 KB. This isn't normal English text, but I think you get the idea. Daniel.Cardenas 19:06, 13 May 2007 (UTC)

Compression software does give a nice rule-of-thumb entropy estimate, but in this case the actual entropy is a lot lower because compression software designed for general-purpose use doesn't have the extensive knowledge of the language that allows humans to see more redundancy in the text. More rigorous experiments usually show lower entropy rates for English, typically between 1.0 and 1.5 bits per character, as described in the reference I've added. 19:23, 21 May 2007 (UTC)
Thanks, that was a good one.  :-) Daniel.Cardenas 19:35, 21 May 2007 (UTC)
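To relate the quoted file sizes to the bits-per-character figures in the reply, the sizes can be converted directly. The sketch below rests on two assumptions: the original is taken as 1000 KB ("about a megabyte") and each character occupies one byte. It also includes a small helper, with a hypothetical name, that measures the same quantity with Python's standard-library compressors; the sample string is a placeholder, not the book.

```python
import bz2
import lzma
import zlib

# Figures quoted above; the original size is ASSUMED to be 1000 KB
# ("about a megabyte") with one byte per character.
original_kb = 1000
quoted_kb = {"winzip": 395, "bzip2": 295, "paq8l": 235}
quoted_bits_per_char = {name: 8 * kb / original_kb for name, kb in quoted_kb.items()}

def bits_per_char(text, compress):
    """Compressed size in bits divided by the number of characters."""
    return 8 * len(compress(text.encode("utf-8"))) / len(text)

# Placeholder text (hypothetical); substitute a real book for a meaningful estimate.
sample = "the quick brown fox jumps over the lazy dog " * 500

measured = {
    "zlib": bits_per_char(sample, lambda data: zlib.compress(data, 9)),
    "bz2": bits_per_char(sample, lambda data: bz2.compress(data, 9)),
    "lzma": bits_per_char(sample, lzma.compress),
}
```

Under those assumptions the quoted sizes correspond to roughly 3.16, 2.36 and 1.88 bits per character, all upper bounds that sit above the 1.0 to 1.5 range mentioned in the reply. The placeholder sample is highly repetitive, so its measured values are far lower than any real text would give.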

Entropy of English text

The article currently says "The entropy of English text is between 1.0 and 1.5 bits per letter.". Shouldn't the entropy in question decrease as one discovers more and more patterns in the language, making a text more predictable? If so, I think it would be a good idea to be a little less precise, saying "The entropy of English text can be regarded as being between 1.0 and 1.5 bits per letter." or similar instead. —Bromskloss 11:43, 7 June 2007 (UTC)

No, that's like saying "The sum of 2 plus 2 can be regarded as 4." Entropy has a precise mathematical definition. It isn't just possible to "regard" it as having an exact value, it actually does have an exact value. At most it can be said that entropy is hard to measure, which (along with differences between receivers and in what's called "English") is the reason a range instead of a single value is given. It's true that knowing more about the language (i.e. having more ability to predict the text) decreases the entropy; the studies on which the referenced statement is based are generally assuming something like the average user of English. Anyway, the statement in the article is what's in the reference and it's not appropriate for us to second-guess it. 13:18, 26 June 2007 (UTC)

Boltzmann's lectures on entropy

Since entropy was formally introduced by Ludwig Boltzmann the article should refer to his work:

Boltzmann, Ludwig (1896, 1898). Vorlesungen über Gastheorie : 2 Volumes - Leipzig 1895/98 UB: O 5262-6. English version: Lectures on gas theory. Translated by Stephen G. Brush (1964) Berkeley: University of California Press; (1995) New York: Dover ISBN 0-486-68455-5

—The preceding unsigned comment was added by Algorithms (talkcontribs) 19:35, 7 June 2007.

log basis

Hmmm, this article seems to assume that logs must always be taken to base 2 - which is not the case. We can define entropy to whatever base we like (in coding it often makes things easier to define it to a base equal to the number of code symbols, which in computer science is typically 2). This leads to different units of measurements: bits vs. nats vs. hartleys.

The article should probably be modified to reflect this HyDeckar 01:16, 13 June 2007 (UTC)
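The effect of the logarithm base is only a change of units, which a short calculation makes concrete. The sketch below uses an assumed example distribution and a hypothetical helper name; it computes the same entropy in bits (base 2), nats (base e) and hartleys (base 10).

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution, in units set by the log base."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

dist = [0.5, 0.25, 0.125, 0.125]  # assumed example distribution

h_bits = entropy(dist, 2)          # base 2  -> bits
h_nats = entropy(dist, math.e)     # base e  -> nats
h_hartleys = entropy(dist, 10)     # base 10 -> hartleys
```

The three values differ only by constant factors: h_nats is h_bits times ln 2, and h_hartleys is h_bits times log10 2.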

Mistake inside an external reference

Regarding the reference: Information is not entropy, information is not uncertainty! - a discussion of the use of the terms "information" and "entropy".

The referenced article is mistaken. It refutes the claim that "information is proportional to physical randomness". However, the more random a system is, the more information we need in order to describe it. I suggest we remove this reference.

—The preceding unsigned comment was added by (talk) 07:32, 13 June 2007

I agree. That reference reads more like a rant than a discussion. Its author appears to lack some basic understanding of thermodynamic vs. information-theoretic entropy. The above comment is absolutely correct in that "the more random a system is the more information we need in order to describe it." 16:36, 25 September 2007 (UTC)

Looking for reference

I'm looking for reliable, hard references for the following phrase in the article:

"Shannon's entropy measures the information contained in a message as opposed to the portion of the message that is determined (or predictable). Examples of the latter include redundancy in language structure or statistical properties relating to the occurrence frequencies of letter or word pairs, triplets etc. See Markov chain."

I'm sorry if the above concept is a bit basic and present in basic textbooks. I have not studied the subject formally, but I may have to apply the entropy concept in a small analysis for my master's dissertation.

Units in the continuous case

I think there needs to be some explanation on the matter of units for the continuous case.

f(x) will have the unit 1/x. Unless x is dimensionless, the unit of entropy will include the log of a unit, which is weird. This is a strong reason why it is more useful for the continuous case to use the relative entropy of a distribution, where the general form is the Kullback-Leibler divergence from the distribution to a reference measure m(x). It could be pointed out that a useful special case of the relative entropy is:

$H = -\int_{x_{min}}^{x_{max}} f(x) \log_2 \left[ f(x) \, (x_{max} - x_{min}) \right] dx$

which should correspond to a rectangular distribution of m(x) between xmin and xmax. It is the entropy of a general bounded signal, and it gives the entropy in bits.

Petkr 13:38, 6 October 2007 (UTC)

Entropy vs Entropy Rate

not sure about the section `Limitations of entropy as information content'.

quote Consider a source that produces the string ABABABABAB... in which A is always followed by B and vice versa. If the probabilistic model considers individual letters as independent, the entropy rate of the sequence is 1 bit per character. But if the sequence is considered as "AB AB AB AB AB..." with symbols as two-character blocks, then the entropy rate is 0 bits per character. endquote

the average number of bits needed to encode this string is zero (asymptotically)

also, treating this as a markov chain (order 1), we can see from the formula in and also in this article that the entropy rate is 0

also in the next paragraph quote However, if we use very large blocks, then the estimate of per-character entropy rate may become artificially low. endquote

isn't the `per-character entropy rate' redundant? should be either the `per-character entropy' or the `entropy rate' —Preceding unsigned comment added by (talk) 07:23, 16 January 2008 (UTC)
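The two figures discussed for the ABABAB... source can be reproduced from the standard entropy-rate formula for a stationary first-order Markov chain, the sum over states of the stationary probability times the entropy of that state's transition row. The sketch below is an illustration with hypothetical helper names; the transition matrix and stationary distribution are read off from the description of the source.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def markov_entropy_rate(stationary, transition):
    """Entropy rate of a stationary first-order Markov chain, in bits per symbol."""
    return sum(mu * entropy_bits(row) for mu, row in zip(stationary, transition))

# Source ABABAB...: A is always followed by B and vice versa.
transition = [[0.0, 1.0],   # from A: P(next = A), P(next = B)
              [1.0, 0.0]]   # from B: P(next = A), P(next = B)
stationary = [0.5, 0.5]     # each letter appears half the time

iid_rate = entropy_bits(stationary)                        # letters treated as independent
markov_rate = markov_entropy_rate(stationary, transition)  # dependence taken into account
```

Treating the letters as independent gives 1 bit per character, while the first-order Markov model gives 0, matching the observation that the figure quoted from the article depends entirely on the probabilistic model chosen.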


Since "uncertainty" (whatever that may mean) is used as a motivating factor in this article, it might be good to have a brief discussion about what is meant by "uncertainty." Should the reader simply assume the common definition of uncertainty? Or is there a specific technical meaning to this word that should be introduced? —Preceding unsigned comment added by (talk) 19:41, 27 January 2008 (UTC)

The article states: “Equivalently, the Shannon entropy is a measure of the average information content the recipient is missing when he does not know the value of the random variable.” This has also been interpreted as an uncertainty in a system, not a measure of the information.
This interpretation is valid if we are sending a message from a sender to a receiver along a noisy channel, which may make the message uncertain. But there is an alternative interpretation where information entropy is hardly a measure of uncertainty.
For instance if we replace a generation with Gaussian distributed quantitative characters of one billion individuals in a large population with a new generation, the situation is quite different. This is like sending one billion different Gaussian distributed messages in parallel from parents to offspring. Every new message is a random – noisy - recombination of messages from two randomly chosen parents, for instance.
As I see it, there is per definition no uncertainty with respect to the survival of the parents, and a moment matrix of their characters may as well exist. Thus a Gaussian distribution may serve as a good approximation of the region of acceptability, A, determining the possible spread of parents along A. See also the article about "Entropy in thermodynamics ... [[1]]--Kjells (talk) 13:30, 8 June 2008 (UTC)

Limitations of entropy as information content

This section needs a major rewrite. It correctly states that Shannon entropy depends crucially on a probabilistic model. Several important points need to be made, though.

  • When we are talking about the information content of an individual message, we are talking about its self-information, not entropy. Entropy is a measure of the complexity of the whole probability distribution, not of an individual message. Entropy is the expected self-information of a message, given our probabilistic model.
  • The Kolmogorov complexity is, as stated, a measure of the complexity of an individual message, independent of any probability distribution, however it is only defined up to an additive constant, which depends on the specific model of computation chosen.
  • Nonetheless the information entropy provides a lower bound on the expected Kolmogorov complexity of a message, i.e.:

$H(X) \le \operatorname{E}[K(X)]$

Such a bound would be extremely difficult to obtain in the case of a single message, due to the halting problem. Deepmath (talk) 21:28, 15 July 2008 (UTC)

The example given about the sequence ABABAB... sounds like utter nonsense to me: a source that always produces the same sequence has entropy 0, regardless of whether the sequence consists of a single symbol or not. For instance, the sequence of integers produced by counting from 0 has entropy 0, even though each symbol (integer) is different. —Preceding unsigned comment added by (talk) 19:30, 21 January 2010 (UTC)

Name change suggestion to alleviate confusion

Resolved: Page moved.

I suggest renaming this article to either "Entropy (information theory)", or preferably, "Shannon entropy". The term "Information entropy" seems to be rarely used in a serious academic context, and I believe the term is redundant and unnecessarily confusing. Information is entropy in the context of Shannon's theory, and when it is necessary to disambiguate this type of information-theoretic entropy from other concepts such as thermodynamic entropy, topological entropy, Rényi entropy, Tsallis entropy etc., "Shannon entropy" is the term almost universally used. For me, the term "information entropy" is too vague and could easily be interpreted to include such concepts as Rényi entropy and Tsallis entropy, and not just Shannon entropy (which this article exclusively discusses). Most if not all uses of the term "entropy" in some sense quantify the "information", diversity, dissipation, or "mixing up" that is present in a probability distribution, stochastic process, or the microstates of a physical system.

I would do this myself, but this article is rather frequently viewed, so I am seeking some input first. Deepmath (talk) 01:29, 23 August 2008 (UTC)

Support - Sounds sensible to me. I favour the name "Entropy (information theory)" rather than "Shannon entropy" - because that will make it clearer to newbies that this is where they come to find out what the unqualified word "entropy" means when they come across it in an information-theory context. Very often "entropy" is discussed in the literature without specifying that it's "Shannon entropy", even in many cases where the discussion only applies to Shannon entropy. --mcld (talk) 08:36, 23 August 2008 (UTC)
Yes, and the article could begin with "In information theory, entropy is a measure of [...]. Several types of entropy can be introduced, the most common one is Shannon entropy, defined as [...]. Other definitions include Rényi entropy, which is a generalization of Shannon entropy, [...]. Throughout this article, the unqualified word entropy will refer to Shannon entropy.", or something similar. --A r m y 1 9 8 7  10:59, 23 August 2008 (UTC)
Thank you both for the input. If no major objections are forthcoming in the next few days, I say we go ahead with the move to "Entropy (information theory)". However, I counted 403 articles that link here, excluding user and talk pages. Is there a bot somewhere we could use to at least pipe those links to the new article title? That would let the people watching those articles know about the new title and avoid all those annoying redirects for people who are just browsing Wikipedia. I guess I'm just not familiar with what's generally done in cases like this. Deepmath (talk) 21:05, 23 August 2008 (UTC)
Moving the page is quite easy, see WP:MV for a howto. A redirect will automatically be put in place. In general there's no need to go around fixing the 403 articles - they will gradually get fixed, either by bots or by users, and most users won't even notice the difference since the redirect thing happens so transparently. Some things will need tweaking I think, but nothing like 403. But that link I gave has all the info. --mcld (talk) 13:40, 25 August 2008 (UTC)
Entropy (information theory) already has an edit history. Deepmath (talk) 00:46, 27 August 2008 (UTC)
Still can't do the move, even though I tried to move the old page out of the way. An administrator needs to do this. Deepmath (talk) 06:42, 27 August 2008 (UTC)
Already having an edit history isn't a valid reason not to move - the edit history would just have to be copied to the talk page to preserve it. Dcoetzee 04:27, 27 August 2008 (UTC)
I'm not trying to suggest it is. The pre-existing edit history for the target page simply makes it technically more difficult to accomplish the move without an administrator's help. I'll see what I can do. Deepmath (talk) 06:22, 27 August 2008 (UTC)
Oh, I see now. :-) I'm an admin and I'll do the move once the discussion settles (has it already?) Dcoetzee 07:09, 27 August 2008 (UTC)
seems pretty settled to me... --mcld (talk) 08:22, 19 September 2008 (UTC)
The article was moved, indeed. I'm adding a {{resolved}} tag at the top. A r m y 1 9 8 7 ! ! ! 10:00, 19 September 2008 (UTC)
oops ok, thanks --mcld (talk) 14:27, 19 September 2008 (UTC)
And by the way, as to Army1987's suggestions, it might be a little confusing for newbies to talk about Rényi entropy right in the intro. I did try to edit the intro a little, but if you feel you can word things there a little more clearly, please go right ahead. Or perhaps a section later in the article about generalizations of Shannon's entropy would be more appropriate for mentioning Rényi entropy. Also by the way, I read Hartley's 1928 paper "Transmission of Information" after somebody posted a link to it in the Information theory article. This guy was apparently the first one to recognize that the amount of information that could be transmitted was proportional to the logarithm of the number of choices available. He did not attempt to analyze the mathematics behind unequal probability distributions like Shannon did, but he basically invented the concept of "bandwidth" as we know it today: that the rate of information that can be transmitted over a continuous channel is proportional to the width of the range of frequencies that one is allowed to use. And the formula for unequal probability distributions, $-\sum_i p_i \log p_i$, was already known to Boltzmann and Gibbs from their study of the entropy discovered by Clausius based upon Carnot's work improving the efficiency (and theoretical understanding) of steam engines. Deepmath (talk) 22:03, 23 August 2008 (UTC)

Scientists make simple things complicated

Very many scientists like to make simple things complicated and earn respect for it. Information entropy is a very good example of such an attempt. Actually, entropy is only the number of possible permutations, expressed in bits, divided by the length of the message. And the concept is simple as well. For a given statistical distribution of symbols we can calculate the number of possible permutations and enumerate all messages. If we do that, we can send the statistics and the index of the message in the enumeration list instead of the message, and the message can be restored. But the index of the message has a length as well, and it can be very long, so we consider the worst-case scenario and take the longest index, which is the number of possible permutations. For example, take a message of 1000 symbols over the alphabet A, B, C with counts 700, 200 and 100. The number of possible permutations is (1000!) / (700! * 200! * 100!). The approximate bit length of this number divided by the number of symbols is (log(1000!) – log(700!) – log(200!) – log(100!))/1000 = 1.147 bits/symbol, where all logarithms have base 2. If you calculate the entropy it is 1.157. The figures are close, and they approach each other asymptotically as the size of the message grows. The limits are explained by Stirling's formula, so there is no trick, just approximation. Obviously, when writing his famous article Claude Shannon did not have an idea of what was going on and could not explain clearly what entropy is. He simply noticed that, in compression by making binary trees similar to a Huffman tree, the bit length of a symbol is close to –log(p) but always larger, and introduced entropy as a compression limit without clear understanding. The article was published in 1948, when the Huffman algorithm did not exist, but there were other similar algorithms that provided slightly different binary trees with the same concept as the Huffman tree, so Shannon knew them.
What is surprising is not Shannon’s entropy but the other scientists who have used obscure and misleading terminology for 60 years. Entropy is a measure of the number of different messages that can possibly be constructed under the constraint of a given frequency for every symbol. That is all, simple and clear. —Preceding unsigned comment added by (talk) 17:47, 24 June 2008 (UTC)
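The two figures in the comment above (1.147 and 1.157 bits/symbol) can be checked without forming the huge factorials. The sketch below is a minimal check, assuming only the counts given in the comment; the helper name is hypothetical, and log2(n!) is obtained from the log-gamma function in Python's math module.

```python
import math

def log2_factorial(n):
    """log2(n!) via the log-gamma function, avoiding huge integers."""
    return math.lgamma(n + 1) / math.log(2)

counts = [700, 200, 100]  # symbol counts from the example above
n = sum(counts)

# Bits needed to index one arrangement among n! / (700! * 200! * 100!)
log2_arrangements = log2_factorial(n) - sum(log2_factorial(c) for c in counts)
bits_per_symbol = log2_arrangements / n

# Shannon entropy of the empirical symbol frequencies, in bits per symbol
shannon = -sum((c / n) * math.log2(c / n) for c in counts)
```

Both quoted numbers come out as stated: about 1.147 bits/symbol for the counting argument and about 1.157 for the entropy. The gap comes from the lower-order terms in Stirling's formula and shrinks, per symbol, as the message grows.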

Ok, so you're angry and thinking Claude Shannon sucks. Even as I type I realize this is a pointless post, but seriously, expressing it unambiguously in mathematical terms that are irrefutable is essential, especially in a subject area such as this. (talk) 02:15, 25 November 2008 (UTC)

This is the way that Kardar introduced the information entropy in his book Statistical Physics of Particles. There is also a wikibook at the external connection named An Intuitive Guide to the Concept of Entropy Arising in Various Sectors of Science, which this kind of opinion might be contributed to. Tschijnmotschau (talk) 09:02, 3 December 2010 (UTC)

Missing figure for continuous case

The section about the entropy of a continuous function refers to a figure, but no figure is present. —Preceding unsigned comment added by Halberdo (talkcontribs) 17:10, 22 December 2008 (UTC)

The corresponding text apparently was added almost three years ago, and apparently the figure itself never was added as an image but only as a comment:

<!-- Figure: Discretizing the function $ f$ into bins of width $ \Delta$
 \includegraphics[width=\textwidth]{function-with-bins.eps} -->

Furthermore, apparently the text was copied and pasted from PlanetMath (see here) without proper attribution of the authors as I think would be required by the GNU Free Documentation License. This talk page mentions that the article incorporates material from PlanetMath, which is licensed under the GFDL, but I am not sure that is enough? So, should the section be removed as a copyright violation? — Tobias Bergemann (talk) 21:28, 22 December 2008 (UTC)

I haven't read the article or preceding comments, but AFAIK it is not a copyright violation to copy GFDL-licensed material to Wikipedia as long as it has proper attribution (maybe you need to change the attribution above to more closely reflect the kind of attribution PlanetMath wants, at most). Shreevatsa (talk) 21:38, 22 December 2008 (UTC)
I think you are right. As far as I understand the history of the article at PlanetMath, all visible versions of that article were authored by Kenneth Shum. — Tobias Bergemann (talk) 22:21, 22 December 2008 (UTC)