Talk:Entropy (information theory)

This article is of interest to the following WikiProjects:

WikiProject Statistics (Rated B-class, High-importance)
WikiProject Mathematics (Rated B-class, High-importance; Field: Applied mathematics; one of the 500 most frequently viewed mathematics articles)
WikiProject Physics (Rated C-class, Mid-importance)
WikiProject Engineering (Rated B-class, High-importance)
WikiProject Telecommunications (Rated B-class, High-importance)
WikiProject Electronics (Rated B-class, High-importance)
WikiProject Computing (Rated C-class, Mid-importance; an image has been requested for this article)
WikiProject Systems (Rated B-class, High-importance; not associated with a particular field)

This article has an assessment summary page.

Independent and identically distributed?

I would like to remove the statement in the introduction that states that entropy is only defined for messages with an independent and identically distributed alphabet. This is not true; it is defined for any message, each letter of which may have different probabilities, and for which there may be correlations among the letters. For example, it is stated that in the English language, the pair "qu" occurs very frequently compared to other pairs beginning with "q". In other words, for the letter "q", the letter following it is highly correlated with it, not independent. Also, in English, the characters are not identically distributed: "e" is more probable than "x". For a message which is N letters long, with an alphabet of m letters, there are m^N possible messages, each with its own probability p_i, with no restrictions on these probabilities other than that they are non-negative and their sum is equal to unity. The entropy of this set of messages is the sum of -p_i log(p_i) over all m^N possible messages. All sorts of different letter frequencies and correlations may result from this set of m^N probabilities. PAR (talk) 11:50, 30 April 2011 (UTC)
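A minimal sketch of the point above (plain Python; the two-letter alphabet and the numbers are toy choices of my own, not from the article): the entropy -sum p_i log2 p_i is computed directly over whole messages, with no i.i.d. assumption, and correlation between the letters simply lowers it below the i.i.d. value.

from math import log2

def entropy(probs):
    # Shannon entropy in bits of a list of probabilities
    return -sum(p * log2(p) for p in probs if p > 0)

# Toy alphabet {a, b}, messages of length 2, with correlated letters:
# the second letter tends to repeat the first.
joint = {('a', 'a'): 0.45, ('a', 'b'): 0.05,
         ('b', 'a'): 0.05, ('b', 'b'): 0.45}

H_message = entropy(joint.values())                  # ~1.47 bits
p_a = sum(p for (first, _), p in joint.items() if first == 'a')
H_iid = 2 * entropy([p_a, 1 - p_a])                  # 2.00 bits if the letters were i.i.d.

print(H_message, H_iid)
# The correlation lowers the message entropy below the i.i.d. value,
# but the definition itself never required independence.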

It is a long time since I thought about this topic, but I think that text in the lead is talking about the best possible system (the one which conveys most information with each "character" that is transmitted). It looks out of place, or perhaps poorly expressed. Johnuniq (talk) 03:56, 1 May 2011 (UTC)

To say Shannon's entropy is only defined for independent and identically distributed spaces is perhaps a bit misleading. Perhaps the better way of saying this is that Shannon's entropy is a quantification of information in an i.i.d. setting. You can apply it to English, but then it no longer represents the amount of information transmitted (see Kolmogorov Complexity). Note that you have also completely misunderstood what i.i.d. means in terms of English. It does NOT mean that the characters all have the same probability. It means that the probability of each character at any given position is unaffected by knowledge of characters at other positions (independent), and that the probability of each character is the same in every position (identically distributed). Naturally English doesn't abide by this, which means that in theory Shannon's entropy is probably not a lower bound for compressing English (again, see Kolmogorov Complexity). Elemental Magician (talk) 08:55, 11 April 2013 (UTC)


Entropy as an apparent effect of conservation of information.

If entropy is considered an equilibrium property as in energy physics, then it conflicts with the conservation of information. But the second law of thermodynamics may simply be an apparent effect of the conservation of information; that is, entropy is really the amount of information it takes to describe a system, and each reaction creates new information but information cannot be destroyed. That means the second law of thermodynamics is not an independent law of physics at all, but just an apparent effect of the fact that information can be created but not destroyed. The arrow of time is thus not about destruction, but about the continuous creation of information. This explains how the same laws of physics can cause self-organization. Organized systems are not any less chaotic than non-organized systems, and the waste heat life produces can, in an information-physical sense, be considered beings eliminated by evolution rather than a step towards equilibrium. It is possible that overload of information will cause the arrow of time to extract vacuum energy into usefulness rather than heat death. 217.28.207.226 (talk) 10:51, 23 August 2011 (UTC)Martin J Sallberg

This sounds interesting, but the details are not explained clearly. For example, information can be destroyed--we do it whenever we delete a file. And there is no reason given that "The arrow of time is...about the continuous creation of information." There is no basis to edit the article until these claims are explained more clearly (and with references). David Spector (talk) 11:31, 17 February 2012 (UTC)

In response to the above response, you are wrong about information being destroyed when you delete a file. The information is dispersed and thereafter unretrievable, but it is not destroyed. Information is never destroyed.— Preceding unsigned comment added by 69.143.174.166 (talkcontribs) 06:21, 17 October 2013

Ambiguous intro

The 2nd paragraph ends: "As another example, the entropy rate of English text is between 1.0 and 1.5 bits per letter,[6] or as low as 0.6 to 1.3 bits per letter, according to estimates by Shannon based on experiments where humans were asked to predict the next letter in a sample of English text.[7]"

The first range cited means even (50/50) or worse; the second means better than even or worse than even. If neither of these is a typo, wouldn't it be better to just say "0.6 to 1.5"?

robotwisdom (talk) 21:09, 27 March 2013 (UTC)

As far as I understand it, these were experimental studies with humans, and therefore there's a high variance in the results. The two studies found slightly different ranges, and you can't combine them like this. What really should be mentioned here is a study that analyzes English text electronically and computes the entropy rate in some way.
Btw, 50/50 means that, when given an initial sequence of text such as "The fox w", people were able to exclude, on average, half of the 26 possible letters as the next letter. ylloh (talk) 18:15, 28 March 2013 (UTC)
I think rather it means they can guess the next letter half the time -- much, much harder. (I'm trying to gauge the entropy of Joyce's invented language in Finnegans Wake.) robotwisdom (talk) 00:54, 29 March 2013 (UTC)
Then it would not be 1 bit per letter but log2(26)/2 ≈ 2.35 bits per letter, I think. ylloh (talk) 23:56, 29 March 2013 (UTC)
(Is there somewhere else on the Web we should move this discussion?) If you arrange the 26 letters from commonest down (etaoinshrdlu, we used to say) couldn't you just blindly guess the top 13 and be right far more than half the time? There's a Web app somewhere that gave me values between 4 and 5 for both FW and plain English. I'm looking for a metric that conveys how much harder it is to proofread FW. robotwisdom (talk) 08:40, 30 March 2013 (UTC)
They weren't giving 13 guesses for the next letter, they were giving one -- and were right (more than) half the time.
It's an illustration of the high redundancy of English. If the letters were completely random, you would need to transmit log2 26 ≈ 4.70 bits of information to identify each one. But actually English is not so random, and is quite compressible: you can often make quite a good guess as to what the next letter is. So if you build a good predictive model for the possibilities for the next letter, on average you only need to send about one bit per symbol to identify it. See e.g. videos of Dasher in action. Jheald (talk) 10:00, 30 March 2013 (UTC)
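A small illustration of the redundancy argument (Python; the sample string is made up for the sketch, and a real estimate would use a large corpus): even a zeroth-order letter-frequency model needs fewer than log2 26 ≈ 4.70 bits per letter, and conditioning on context pushes the figure down further.

from math import log2
from collections import Counter

sample = ("the quick brown fox jumps over the lazy dog and then "
          "the dog jumps over the quick brown fox once again")
letters = [c for c in sample if c.isalpha()]
n = len(letters)

H0 = -sum((c / n) * log2(c / n) for c in Counter(letters).values())

print(round(log2(26), 2))   # 4.7 -- uniform, independent letters
print(round(H0, 2))         # lower (around 4) -- letter frequencies alone
# Conditioning on preceding letters or words, as in the prediction
# experiments, pushes the estimate down towards ~1 bit per letter.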

Now I'm imagining a gigantic decision-tree including every imaginable message that might be sent and every variation in how it might be expressed. This would allow some sort of optimal compression... and FW strains this via puns and riddles and allusions. robotwisdom (talk) 18:04, 30 March 2013 (UTC)

Absurdly based and highly misleading estimate of entropy of English

From the article: "As another example, the entropy rate of English text is between 1.0 and 1.5 bits per letter,[6] or as low as 0.6 to 1.3 bits per letter, according to estimates by Shannon based on experiments where humans were asked to predict the next letter in a sample of English text."

Since information content is only equal to Shannon's entropy given an independent and identically distributed language, which English is not, asking people to predict the next letter in a sample of English text would underestimate the Shannon entropy of English if people could see the parts of the text that they were not predicting. If people were not given any information, but were merely asked to give a probability for each character (or something like that), this should be included.

Judging by the fact that Shannon derived a rather lower value of entropy with this experiment, my guess is that this wasn't actually measuring the Shannon entropy (but something like the Algorithmic probability).Elemental Magician (talk) 09:03, 11 April 2013 (UTC)

This seems to be an extension of the discussion further up-page of whether Shannon entropy is only defined for processes that are i.i.d., which came to the view that it was not so restricted.
In the entropy of English case, it seems not unreasonable to consider the average entropy per character to be the average of -\sum_i p_i \log p_i, where the probabilities p_i give the conditional probability distribution for a character at position n given all that have been seen previously. This is in fact the very example that PAR (talk · contribs) gave above for the application of Shannon entropy to a non-IID process, where for example the probabilities p_i for a letter following a 'q' will not be the same as those for a letter following a 't'. Jheald (talk) 16:57, 11 April 2013 (UTC)
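A rough sketch of this conditional-entropy view (Python; the corpus string is only a stand-in for a real body of text, and this is a bigram estimate, not Shannon's human-prediction experiment):

from math import log2
from collections import Counter

corpus = "the quick brown fox jumps over the lazy dog " * 200   # stand-in text
pairs = list(zip(corpus, corpus[1:]))
N = len(pairs)

pair_counts = Counter(pairs)
prev_counts = Counter(prev for prev, _ in pairs)

# H(X_n | X_{n-1}) = -sum over (prev, cur) of p(prev, cur) * log2 p(cur | prev)
H_cond = -sum((c / N) * log2(c / prev_counts[prev])
              for (prev, cur), c in pair_counts.items())

print(round(H_cond, 2))   # bits per character, given one character of context
# Longer contexts (and human prediction, as in Shannon's experiments)
# reduce the estimate further, towards the ~0.6-1.3 bits per letter quoted.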

Characterization

The characterization section can probably be rewritten by a professional. If I am correct, "A Characterization of Entropy in Terms of Information Loss" by Baez et al. requires only functoriality, convex linearity, and continuity for the definition of Shannon's entropy. The current statements are mutually overlapping and probably not entirely true. Andy (talk) 14:10, 2 July 2013 (UTC)

It seems characterization is central to understanding the motivation for the definition of entropy. Perhaps it could be mentioned or referred to somewhere in the introduction? 130.243.214.198 (talk) 13:17, 7 March 2014 (UTC)

Definition

There are two vague points in the Definition section.

1. "When taken from a finite sample, the entropy can explicitly be written as...": here it sounds like the author means that the entropy is taken from a finite sample, which is certainly not the case.

2. The term n_i in the expanded form for H(X) is never defined. Scorwin (talk) 17:03, 3 October 2013 (UTC)

Units of entropy

The introduction gives bits, nats, and bans as common units of Shannon's entropy, while the Definition section uses dits rather than bans. — Preceding unsigned comment added by 129.177.143.236 (talk) 10:56, 18 November 2013 (UTC)

Kullback–Leibler divergence

The definition given here looks different from that at http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence. Please reconcile. 108.6.15.34 (talk) 21:04, 15 June 2014 (UTC)

Note that f, in the article here, equals p divided by m. Substituting p / m instead of f into the definition used in the article gives
D_{\mathrm{KL}}(p\|m) = \int p(x) \ln\left(\frac{p(x)}{m(x)}\right) \, \mathrm{d}x,
the form used in the Kullback–Leibler article. Jheald (talk) 07:09, 16 June 2014 (UTC)
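A quick numerical check of that substitution in a discrete setting (Python; the two distributions are arbitrary examples, and the "article-style" expression below is only my reading of the f-based form being described, namely sum of m·f·ln f with f = p/m):

from math import log

p = [0.5, 0.3, 0.2]       # distribution of interest
m = [0.4, 0.4, 0.2]       # reference measure
f = [pi / mi for pi, mi in zip(p, m)]

article_style = sum(mi * fi * log(fi) for mi, fi in zip(m, f))
kl_style      = sum(pi * log(pi / mi) for pi, mi in zip(p, m))

print(article_style, kl_style)   # both ~0.0253 nats -- the two forms agree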

Estimating entropy from a sample

If you know about estimating entropy from a finite sample that is randomly drawn with replacement according to a probability distribution on a (possibly infinite) population, would you please add it to the article? For example, drawing a sample of size 1 gives p=1 for the drawn value and p=0 for all other values, yielding an estimated entropy of 0 regardless of the actual probability distribution on the population. More generally, the expected value of the entropy computed from a finite sample will be less than the actual entropy of the population. And so on. Leegrc (talk) 15:43, 9 March 2015 (UTC)
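In the meantime, here is a minimal Monte Carlo sketch of the bias being described (Python; the population, probabilities, and sample sizes are arbitrary choices): the "plug-in" entropy of a finite sample underestimates, on average, the true entropy of the distribution it was drawn from.

import random
from math import log2
from collections import Counter

values = ['a', 'b', 'c', 'd']
weights = [0.4, 0.3, 0.2, 0.1]
H_true = -sum(w * log2(w) for w in weights)          # ~1.85 bits

def plugin_entropy(sample):
    n = len(sample)
    return -sum((c / n) * log2(c / n) for c in Counter(sample).values())

random.seed(0)
for size in (1, 5, 20, 100):
    trials = [plugin_entropy(random.choices(values, weights, k=size))
              for _ in range(2000)]
    print(size, round(sum(trials) / len(trials), 3), round(H_true, 3))
# The mean estimate is below H_true for every finite sample size
# (exactly 0 for size 1), approaching it only as the sample grows.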

Grammar, please?

Would someone familiar with the terms and practices AND with the English language please go through and fix this?

Bits vs. shannons

While it may be true that there is a unit called the "shannon" that measures Shannon entropy using logarithms base 2, it is my experience that it is never used. Always the identical unit called the "bit" is used instead. Shouldn't the article reflect this common usage? Leegrc (talk) 17:20, 18 May 2015 (UTC)

The lead does mention the common usage, which should prevent problems even for the lay reader. It would be non-encyclopaedic to use a term that perpetuates an ambiguity and hence a misconception, though, with the only motivation that common usage dominates. The concept of bits used to represent information and bits as units of information are so close as to engender confusion (especially since they become numerically equal under the illustrative assumption of uniform probability); many people cannot distinguish the two. The alternative is to go to lengths to belabour the distinction between the two so that it is clear that they are different quantities and different units, even though they are both called "information" measured in "bits". My feeling is that this encyclopaedic onus is discharged more naturally by using the unambiguous, albeit uncommon, use of the "correct" units. However, more opinions and debate on this issue would be useful. —Quondum 18:08, 18 May 2015 (UTC)
My strong recommendation would be to use "bits" throughout. In my opinion, the use of "shannons" is as wrongheaded as having as a unit of capacity the litre, and using it to measure the capacity of empty vessels, but then introducing a different unit -- the "pemberton" perhaps -- for talking about the amount of fluid that one is going to put into those vessels.
As User:Leegrc says, the shannon is simply not used in the real world; and IMO its use here is positively confusing, by suggesting to people there is a difference between shannons and bits, when actually there isn't. If you want to send the information to identify one stream of data out of a probability distribution of possibilities, you need to transmit so many bits per second. It's as simple as that. Jheald (talk) 19:10, 18 May 2015 (UTC)
Did you even read my sentence "The concept of bits used to represent information and bits as units of information are so close as to engender confusion (especially since they become numerically equal under the illustrative assumption of uniform probability); many people cannot distinguish the two."? Let's stick to an encyclopaedic approach, shall we? Not much of what you say here is even correct: it is akin to saying "683 candela = 1 watt/sr, simple as that." Are you saying that we should replace the article Shannon (unit) with a redirect to Bit? Was Alan Turing daft to propose a definition of a unit of information (the ban) distinct from the decimal digit? Is IEC 80000-13 to be ignored? —Quondum 19:56, 18 May 2015 (UTC)

The edit using another WP article as a citation violates WP guidelines. Besides, that article is largely written from the perspective of a computer scientist who has little knowledge of or understanding about entropy and information theory. —Quondum 20:22, 18 May 2015 (UTC)

I appreciate the distinction between the bit as a unit of storage (for RAM, disks, etc.) and the bit as a unit of information. It is perhaps unfortunate that the same word is used in both contexts, but it is nonetheless fact. Using "shannon" instead of the latter use of "bit" would make the distinction, but unfortunately the use of the "shannon" unit is pretty darn rare, and thus is problematic. Leegrc (talk) 20:32, 18 May 2015 (UTC)
Okay, cool. Perhaps we can consider using the bit instead of the shannon, with suitable explanation of the distinction between "bit of information or entropy" and "bit of data". But we still cannot equate the two. In the context of entropy, I would prefer using the nat as the dominant unit of reference for the article though; what are the feelings on that? —Quondum 21:59, 18 May 2015 (UTC)
@Quondum:. The overwhelming majority of textbooks on information theory, coding theory, quantum computing etc use bits, not shannons. In fact, there's not a single textbook on my shelves that uses shannon. So yes, in this case I would ignore IEC 80000-13, and follow the overwhelming usage by reliable sources instead. There is previous form for this -- for example at Gibbs free energy we follow the traditional usage, rather than IUPAC's "Gibbs energy". (And no, I don't want to know about Gibibits either).
The situation is different from the candela because, unlike the raw power intensity, the candela incorporates a biological factor specific to human physiology. In contrast, if a source has an entropy rate of n bits per second, that corresponds to an information rate of n bits per second, which can be encoded (asymptotically) in n storage bits per second. This is the content of Shannon's source coding theorem, and introducing anything other than bits is simply to bring in unnecessary confusion -- there is no advantage to be had in trying to make the distinction you're trying to make.
The "ban" (more specifically the "deciban") was introduced as a bit of a joke -- a play on "Banbury sheet" and "decibel", being a logarithmic scale for Bayes factors that would otherwise multiply. But it has the advantage of being shorter and punchier than "hartley" or "decimal digit", which is why I personally prefer it. In this regard it fits in well with the rest of the family: "bit", "byte" and "nat". It may also be worth remembering that Good and Turing were working in a specific area, before the publication of Shannon's communication theory, which is when the equivalence measures of uncertainty and measures of information capacity really got put on the front page.
Finally, should we merge Shannon (unit) into Bit ? I'd prefer not to, because as far as possible I'd prefer not to introduce the confusion of the Shannon at all. Better IMO to leave it like Gibibit in its own ghetto of an article, so with luck most people will be able to get along without ever needing to be troubled by it. (To which a good start would be not troubling them with it here). Jheald (talk) 21:47, 18 May 2015 (UTC)
You seem to be firmly of the mindset that data and information are the same quantity. Your argument above is merely one of using "bit" to mean the same as "shannon", but not to merge the concepts. Whether we adopt the unit "bit" or not (I'd prefer the nat), it must be made clear that data and information are distinct quantities, and that bits as units of information are distinct, as units, from bits of data. And, no, the candela example applies: just as there is a function modifying the quantity there, the function of probability is present in information but not in data. —Quondum 21:54, 18 May 2015 (UTC)
@Quondum: It's more like saying that wood is weighed in kilograms and metal is weighed in kilograms. That doesn't mean wood = metal. It means they're weighed on the same scale, so it makes sense to use the same unit. Jheald (talk) 22:14, 18 May 2015 (UTC)
Then how would you express that 8 bits of data might have 2.37 bits of entropy in your analogy (how can the same object have a mass of 3 kilograms and 7 kilograms simultaneously)? The candela example seems apt: the wattage per unit area of radiation from a lamp provides an upper bound on its luminous intensity, just as the data capacity of a register gives an upper bound on its entropy. In the luminous intensity example, the spectrum connects the two, and in the information example it is the probability density function that connects the two. —Quondum 23:20, 18 May 2015 (UTC)
@Quondum: It's quite straightforward. It means you can compress that 8 bits of data and store it in (on average) 2.37 bits. Jheald (talk) 01:11, 19 May 2015 (UTC)
I know what it means. I don't particularly see the point of this discussion. —Quondum 01:18, 19 May 2015 (UTC)
@Quondum: The point is that you're compressing bits into bits -- just like you crush down a ball of silver foil from a larger volume to a smaller volume. You don't need a separate unit for the volume of crushed-down silver foil, and it's a bad idea to introduce one because it hides what's going on.
It's a really important idea that Shannon entropy measures the number of bits that you can compress something into -- that those are the same bits that you measure your storage in, and that there is a formula (Shannon's entropy formula) that tells you how many of them you will need. It's all about bits. Introducing a spurious new unit makes that harder to see, and makes things harder to understand than they should be -- things like Kullback-Leibler divergence, or Minimum message length/Minimum description length approaches to inference with arguments like "bits back". Better to just to get used to the idea from the very start that Shannon entropy is measured in bits. Jheald (talk) 01:58, 19 May 2015 (UTC)
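A small sketch of that compression claim (Python; the skewed distribution is invented for illustration, and the 2.37 figure above was just an example): symbols stored in 8 bits each can carry far fewer than 8 bits of entropy per symbol, and by the source coding theorem roughly the entropy, in the very same bits, suffices on average.

from math import log2

# A skewed distribution over the 256 possible byte values: a handful of
# values carry most of the probability, the rest share the remainder.
probs = [0.5, 0.2, 0.1, 0.05, 0.05] + [0.1 / 251] * 251
assert abs(sum(probs) - 1.0) < 1e-9

H = -sum(p * log2(p) for p in probs)

print(8, "storage bits per symbol")
print(round(H, 2), "bits of entropy per symbol")   # ~2.9 here
# Shannon's source coding theorem: about H bits per symbol suffice on average
# (asymptotically), and they are the same kind of bits as the storage bits.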
And what of the remainder of the family of Rényi entropy? —Quondum 02:10, 19 May 2015 (UTC)
To be honest, I've never really thought about using any unit for something like the collision entropy H2. Can you give any examples of such use?
Measurement of Shannon entropy in bits has an operational meaning anchored by Shannon's source coding theorem, ie how many storage bits one would need (on average) to express a maximally compressed message. But the data compression theorem is specific to Shannon entropy.
Rényi entropy can be regarded as the logarithm of an "effective number of types" (see diversity index), where different orders of the entropy correspond to whether more probable species should be given a more dominant or a less dominant weighting than they are in the Shannon entropy. But I don't think I've ever seen a Rényi entropy quoted in bits, or indeed in shannons -- I think these specifically relate to Shannon entropy. Jheald (talk) 09:11, 19 May 2015 (UTC)
On further investigation, it seems that there are people who do give units to Rényi entropies. The words "in units of shannons" under the graph in the Rényi entropy article were added by you (diff), so I don't think they can be taken as much of a straw either way. But here's a Masters thesis which measures the Rényi entropy in bits [1] (with a somewhat better graph); and I guess the use of bits isn't so inappropriate if they stand for the number of binary digits, this time in the "effective number of types". Jheald (talk) 09:45, 19 May 2015 (UTC)
Your dismissiveness is not helping. You really do not seem open to considering the merits of other ideas before you write them off. Besides, this really does not belong on this talk page. —Quondum 15:55, 19 May 2015 (UTC)
Fine, let's just agree to rip out all instances of the unit "shannon" from the page, as Leegrc originally suggested, and then I will let it drop.
I have given it consideration, and the result of that consideration is that the more I have considered it the more certain I am that the use of "shannon", rather than "bit" in the context Shannon originally introduced the word, with its equivalence to storage bits, is a stumbling-block that we help nobody by putting in their path. Jheald (talk) 16:51, 19 May 2015 (UTC)
I agree with JHeald:
"Then how would you express that 8 bits of data might have 2.37 bits of entropy in your analogy (how can the same object have a mass of 3 kilograms and 7 kilograms simultaneously)?"
First of all, an instance of data does not have entropy. Data is measured in bits, entropy is measured in bits. They are *operationally* connected by the idea that any data instance may be compressed, on average, down to (but not below) the entropy of the process. Simplistically, entropy is the number of yes/no questions you have to ask to determine an instance of the data, and knowing a data instance answers that many questions. Same units ("number of questions"): entropy asks, data answers. The number of bits in the data minus the number of questions the data instance answers (answer = yes/no, a bit) is the redundancy of the data (in bits). You cannot have an equation like that with mixed units, so they are the same, and finagling the equation with unit conversion constants is counterproductive, in my mind.
In thermodynamics, the (thermodynamic) entropy is measured in joules per kelvin, which are the same units as heat capacity. This does not mean that the two are equal for a given system. They can be related to each other, however.
I agree that there is a distinction, but I have never had any conceptual difficulty dealing with bits of data and bits of entropy. The connection between the two is more important than their difference. I have also never seen the "Shannon" used in the literature. Two reasons not to use it.
I prefer the bit when it comes to information theoretic discussions. It is intuitively more obvious, particularly when probabilities are equal. Entropy can then be viewed as "how many questions do I have to ask to determine a particular data instance?" Even in thermodynamics, the thermodynamic entropy in bits can be expressed as the number of questions I have to ask to determine the microstate, given the macrostate. PAR (talk) 06:04, 21 May 2015 (UTC)
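A short sketch of the "number of yes/no questions" view (Python; the distribution is a toy example, and the Huffman construction is just one optimal questioning strategy): the average number of questions an optimal strategy needs lies between H and H+1 bits, and equals H exactly when the probabilities are powers of 1/2, as here.

import heapq
from math import log2

probs = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}

def huffman_lengths(p):
    # Returns, for each symbol, the depth of its leaf in a Huffman tree,
    # i.e. the number of yes/no questions asked along the way to it.
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(p.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

lengths = huffman_lengths(probs)
H = -sum(w * log2(w) for w in probs.values())
avg_questions = sum(probs[s] * lengths[s] for s in probs)

print(round(H, 3), round(avg_questions, 3))   # 1.75 and 1.75 for this example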