# Talk:Entropy (information theory)

WikiProject Statistics (Rated B-class, High-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

B  This article has been rated as B-Class on the quality scale.
High  This article has been rated as High-importance on the importance scale.
WikiProject Mathematics (Rated B-class, High-importance)
This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of Mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Mathematics rating:
 B Class
 High Importance
Field: Applied mathematics
One of the 500 most frequently viewed mathematics articles.
WikiProject Physics (Rated C-class, Mid-importance)
This article is within the scope of WikiProject Physics, a collaborative effort to improve the coverage of Physics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
C  This article has been rated as C-Class on the project's quality scale.
Mid  This article has been rated as Mid-importance on the project's importance scale.

WikiProject Engineering (Rated B-class, High-importance)
This article is within the scope of WikiProject Engineering, a collaborative effort to improve the coverage of engineering on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
B  This article has been rated as B-Class on the project's quality scale.
High  This article has been rated as High-importance on the project's importance scale.
WikiProject Telecommunications (Rated B-class, High-importance)
This article is within the scope of WikiProject Telecommunications, a collaborative effort to improve the coverage of Telecommunications on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
B  This article has been rated as B-Class on the quality scale.
High  This article has been rated as High-importance on the importance scale.
WikiProject Electronics (Rated B-class, High-importance)
This article is part of WikiProject Electronics, an attempt to provide a standard approach to writing articles about electronics on Wikipedia. If you would like to participate, you can choose to edit the article attached to this page, or visit the project page, where you can join the project and see a list of open tasks. Leave messages at the project talk page
B  This article has been rated as B-Class on the project's quality scale.
High  This article has been rated as High-importance on the project's importance scale.
WikiProject Computing (Rated C-class, Mid-importance)
This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
C  This article has been rated as C-Class on the project's quality scale.
Mid  This article has been rated as Mid-importance on the project's importance scale.
WikiProject Systems (Rated B-class, High-importance)
B  This article has been rated as B-Class on the project's quality scale.
High  This article has been rated as High-importance on the project's importance scale.

## Independent and identically distributed?

I would like to remove the statement in the introduction that states that entropy is only defined for messages with an independent and identically distributed alphabet. This is not true, it is defined for any message, each letter of which may have different probabilities, and for which there may be correlations among the letters. For example, it is stated that in the English language, the pair "qu" occurs very frequently compared to other pairs beginning with "q". In other words, for the letter "q", the letter following it is highly correlated to it, not independent. Also, in English, the characters are not identically distributed, "e" is more probable than "x". For a message which is N letters long, with an alphabet of m letters, there are m^N possible messages, each with their own probability pi, with no restrictions on these probabilities other than that they are non-negative and their sum is equal to unity. The entropy of this set of messages is the sum of -pi log(pi over all m^N possible messages. All sorts of different letter frequencies and correlations may result from this set of m^N probabilities. PAR (talk) 11:50, 30 April 2011 (UTC)

It is a long time since I thought about this topic, but I think that text in the lead is talking about the best possible system (the one which conveys most information with each "character" that is transmitted). It looks out of place, or perhaps poorly expressed. Johnuniq (talk) 03:56, 1 May 2011 (UTC)

To say Shannon's entropy is only defined for independent and identically distributed spaces is perhaps a bit misleading. Perhaps the better way of saying this is that Shannon's entropy is a quantification of information in an i.i.d. You can apply it to English, but then it no longer represents the amount of information transmitted (see Kolmogorov Complexity). Note that you have also completely misunderstood what i.i.d. means in terms of English. It does NOT mean that the characters all have the same probability. It means that the probability of each character any a given position in unaffected by knowledge of characters at other positions (independent), and that the probability of each character is the same in every position (identically distributed). Naturally English doesn't abide by this, which means that in theory Shannon's entropy is probably not a lower bound for compressing English (again, see Kolmogorov Complexity).Elemental Magician (talk) 08:55, 11 April 2013 (UTC)

## Entropy as an apparent effect of conservation of information.

If entropy is considered an equilibrium property as in energy physics, then it conflicts with the conservation of information. But the second law of thermodynamics may simply be a apparent effect of the conservation of information, that is, entropy is really the amount of information it takes to describe a system, each reaction creates new information but information cannot be destroyed. That means the second law of thermodynamics is not an independent law of physics at all, but just an apparent effect of the fact that information can be created but not destroyed. The arrow of time is thus not about destruction, but about the continuous creation of information. This explains how the same laws of physics can cause self-organization. Organized systems are not anyhow less chaotic than non-organized systems at all, and the spill heat life produces can, in an information-physical sense, be considered beings eliminated by evolution rather than a step towards equilibrium. It is possible that overload of information will cause the arrow of time to extract vacuum energy into usefulness rather than heat death. 217.28.207.226 (talk) 10:51, 23 August 2011 (UTC)Martin J Sallberg

This sounds interesting, but the details are not explained clearly. For example, information can be destroyed--we do it whenever we delete a file. And there is no reason given that "The arrow of time is...about the continuous creation of information." There is no basis to edit the article until these claims are explained more clearly (and with references). David Spector (talk) 11:31, 17 February 2012 (UTC)

In response to the above response, you are wrong about information being destroyed when you delete a file. The information is dispersed and thereafter unretrievable, but it is not destroyed. Information is never destroyed.— Preceding unsigned comment added by 69.143.174.166 (talkcontribs) 06:21, 17 October 2013

## Ambiguous intro

The 2nd paragraph ends: "As another example, the entropy rate of English text is between 1.0 and 1.5 bits per letter,[6] or as low as 0.6 to 1.3 bits per letter, according to estimates by Shannon based on experiments where humans were asked to predict the next letter in a sample of English text.[7]"

The first range cited means even (50/50) or worse, the second means better than even or worse than even. If neither of these are typos wouldn't it be better to just say "0.6 to 1.5"?

robotwisdom (talk) 21:09, 27 March 2013 (UTC)

As far as I understand it, these were experimental studies with humans and therefore there's a high variance in the results. The two studies found slightly different ranges and you can't combine them like this. What really should be mentioned here is a study that analyzes English text electronically and computes the entropy rate in some way.
Btw, 50/50 means that, when given an initial sequence of text such as "The fox w", people were able to exclude, on average, half of all possible 26 letters as the next letter. ylloh (talk) 18:15, 28 March 2013 (UTC)
I think rather it means they can guess the next letter half the time-- much much harder. (I'm trying to gauge the entropy of Joyce's invented language in Finnegans Wake.) robotwisdom (talk) 00:54, 29 March 2013 (UTC)
Then it would not be 1 bit per letter but $\log_2(26) / 2 \approx 2.35$ bits per letter, I think. ylloh (talk) 23:56, 29 March 2013 (UTC)
(Is there somewhere else on the Web we should move this discussion?) If you arrange the 26 letters from commonest down (etaoinshrdlu, we used to say) couldn't you just blindly guess the top 13 and be right far more than half the time? There's a Web app somewhere that gave me values between 4 and 5 for both FW and plain English. I'm looking for a metric that conveys how much harder it is to proofread FW. robotwisdom (talk) 08:40, 30 March 2013 (UTC)
They weren't giving 13 guesses for the next letter, they were giving one -- and were right (more than) half the time.
It's an illustration of the high redundancy of English. If the letters were completely random, you would need to transmit log2 26 = 4.70 bits of information to identify each one. But actually English is not so random, and is quite compressible: you can often have a quite good guess as to what the next letter is. So if you build a good predictive model for the possibilities for the next letter, on average you only need to send one bit per symbol to identify it. See eg videos of Dasher in action. Jheald (talk) 10:00, 30 March 2013 (UTC)

Now I'm imagining a gigantic decision-tree including every imaginable message that might be sent and every variation in how it might be expressed. This would allow some sort of optimal compression... and FW strains this via puns and riddles and allusions. robotwisdom (talk) 18:04, 30 March 2013 (UTC)

## Absurdly based and highly misleading estimate of entropy of English

From the article: "As another example, the entropy rate of English text is between 1.0 and 1.5 bits per letter,[6] or as low as 0.6 to 1.3 bits per letter, according to estimates by Shannon based on experiments where humans were asked to predict the next letter in a sample of English text."

Since information content is only equal to Shannon's entropy given an independent and identically distributed language, which English is not, asking people to predict the next letter in a sample of English text would underestimate the Shannon entropy of English if people could see the parts of the text that they were not predicting. If people were not given any information, but were merely asked to give a probability of each character (or something like that) this should be included.

Judging by the fact that Shannon derived a rather lower value of entropy with this experiment, my guess is that this wasn't actually measuring the Shannon entropy (but something like the Algorithmic probability).Elemental Magician (talk) 09:03, 11 April 2013 (UTC)

This seems to be an extension of the discussion further up-page of whether Shannon entropy is only defined for processes that are i.i.d., which came to the view that it was not so restricted.
In the entropy of English case, it seems not unreasonable to consider the average entropy per character to be the average of $\scriptstyle{\sum p_i \log p_i}$, where the probabilities pi give the conditional probability distribution for a character at position n given all that have been seen previously. This is in fact the very example that PAR (talk · contribs) gave above for application of Shannon entropy to a non-IID process, where for example the probabilities pi for a letter following a 'q' will not be the same as those for a letter following a 't'. Jheald (talk) 16:57, 11 April 2013 (UTC)

## Characterization

The characterization section can probably be rewritten by a professional. If I am correct "A Characterization of Entropy in Terms of Information Loss" by Baez et al. only require functoriality, convex linearity, and continuity for the definition of Shannon's entropy. The current statements are mutually overlapping and probably not entirely true. Andy (talk) 14:10, 2 July 2013 (UTC)

It seems characterization is central to understanding the motivation for the definition of entropy. Perhaps it could be mentioned or referred to somewhere in the introduction? 130.243.214.198 (talk) 13:17, 7 March 2014 (UTC)

## Definition

Two vaguenesses in the Definition section.

1. "When taken from a finite sample, the entropy can explicitly be written as...": here it sounds like the author means that the entropy is taken from a finite sample, which is certainly not the case.

2. The term n_i in the expanded form for H(X) is never defined. Scorwin (talk) 17:03, 3 October 2013 (UTC)

## Units of entropy

The introduction states bits, nats and bans as common units of Shannons entropy, while the definition section uses dits, rather than bans. — Preceding unsigned comment added by 129.177.143.236 (talk) 10:56, 18 November 2013 (UTC)

## Kullback–Leibler divergence

The definition given here looks different from that on http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence Please reconcile 108.6.15.34 (talk) 21:04, 15 June 2014 (UTC)

Note that f, in the article here, equals p divided by m. Substituting p / m instead of f into the definition used in the article gives
$D_{\mathrm{KL}}(p\|m) = \int \ln\left(\frac{p(x)}{m(x)}\right) p(x) \, {\rm d}x, \!$
the form used in the Kullback-Liebler article. Jheald (talk) 07:09, 16 June 2014 (UTC)