Talk:Information theory
== Shannon's H = entropy/symbol ==
From A Mathematical Theory of Communication, By C. E. SHANNON
Reprinted with corrections from The Bell System Technical Journal, Vol. 27, pp. 379–423, 623–656, July, October, 1948.
The choice of a logarithmic base corresponds to the choice of a unit for measuring information. If the base 2 is used the resulting units may be called binary digits, or more briefly bits, a word suggested by J. W. Tukey. ...
Suppose we have a set of possible events whose probabilities of occurrence are <math>p_1, p_2, \ldots, p_n</math>. These probabilities are known but that is all we know concerning which event will occur. Can we find a measure of how much “choice” is involved in the selection of the event or of how uncertain we are of the outcome? If there is such a measure, say <math>H(p_1, p_2, \ldots, p_n)</math>, it is reasonable to require of it the following properties:
- H should be continuous in the <math>p_i</math>.
- If all the <math>p_i</math> are equal, <math>p_i = 1/n</math>, then H should be a monotonic increasing function of n. With equally likely events there is more choice, or uncertainty, when there are more possible events. (emphasis mine)
- If a choice be broken down into two successive choices, the original H should be the weighted sum of the individual values of H. ...
Theorem 2: The only H satisfying the three above assumptions is of the form:
<math>H = -K\sum_{i=1}^{n} p_i \log p_i</math>
where K is a positive constant.
Quite simply, if we have an m-symbol alphabet and n equally likely independent symbols in a message, the probability of a given n-symbol message is <math>m^{-n}</math>. There are <math>m^n</math> such messages. Shannon's H defined above (K=1) is then H=n log(m). For a 2-symbol alphabet, log base 2, H=n bits, the number of symbols in a message, clearly NOT an entropy per symbol. PAR (talk) 05:50, 10 December 2015 (UTC)
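An added illustration (not part of PAR's comment): a minimal Perl sketch that computes Shannon's H with K = 1 and log base 2 for a list of probabilities, and checks that the set of <math>m^n</math> equally likely messages gives H = n·log2(m).
#!/usr/bin/perl
# Minimal sketch: Shannon's H (K=1, log base 2) for a list of probabilities,
# checked against n*log2(m) for a set of m^n equally likely messages.
use strict; use warnings;

sub H {                                   # H = -sum p_i * log2(p_i)
    my $h = 0;
    $h -= $_ * log($_) / log(2) for grep { $_ > 0 } @_;
    return $h;
}

my ($m, $n) = (2, 8);                     # 2-symbol alphabet, 8 symbols per message
my @messages = (1 / $m**$n) x $m**$n;     # m^n equally likely messages, each with p = m^-n
printf "H over the message set = %.4f bits; n*log2(m) = %.4f bits\n",
       H(@messages), $n * log($m) / log(2);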
- I have removed statements to the contrary - The removed material provides a double definition of H, the definition by Shannon is the only one we should be using. The removed material confuses the information content of a message composed of independently distributed symbols, with probabilities estimated from a single message, with the entropy - the expected value of the information content of a message averaged over all messages, whether or not the symbols are independent. Furthermore, when a file or message is viewed as all the data a source will ever generate, the Shannon entropy H in bits per symbol is zero. The probability of the message is 1 and 1 log(1)=0. Strictly speaking, entropy has no meaning for a single message drawn from a population of multiple possible messages. It is a number associated with the population, not with a message drawn from that population. It cannot be defined without an a priori estimate or knowledge of those message probabilities. PAR (talk) 07:11, 11 December 2015 (UTC)
No, "per symbol" is per symbol of data received by a source, just as the article stated before you followed me here. It is not "per unique symbol" as you're interpreting it the units. You did not quote Shannon where he stated the units of H. To actually quote Shannon in section 1.7: "This (H) is the entropy per symbol of text" and "H or H' measures the amount of information generated by the source per symbol or per second." Where you reference his saying the p's should sum to 1 in section 1.6 is where he says "H is the entropy of a set of probabilities" and since they sum to 1 in this context, this is on the basis of 1 complete symbol from a source, that is, entropy per 1 symbol. This was clearly explained with quotes from Shannon in what you reverted. Also, by your reasoning a 10 MB of data from a source would have and entropy of no more than 1. My contributions survived 5 days without complaint, so please let someone else who's got a correct point of view edit it. Ywaz (talk) 18:46, 11 December 2015 (UTC)
- Agreed.
- I think we are using two different sources. The source I used is listed above and the quotes are correct. Could you please give the "Shannon" source you are using? PAR (talk) 20:14, 11 December 2015 (UTC)
Your quotes are fine. I'm quoting your source, top of page 11, literally 2 sentences before your quote, and page 13. Ywaz (talk) 00:19, 12 December 2015 (UTC)
- Well this Shannon paper is not as straightforward as I thought. He does use two symbols for H, one for the general definition which is tied only to a probability mass function associated with an "event", no mention of symbols, etc, and another which is an entropy per symbol of a source of symbols behaving as a Markoff process. For the Markoff process, an entropy Hi is defined for a state according to the general definition, and then a new H, the H of the Markoff process is defined in terms of those Hi. That's an unfortunate duplication, but the first, general, definition is preferred as a definition of entropy. The second is an application of the concept of entropy to a specific situation.
- I did quote Shannon where he stated the units - the quote involving Tukey is on page 1. Also, Shannon in various places makes it clear that the frequencies of symbol occurrence in a finite message are NOT the frequencies associated with the source, only an approximation that gets better and better with longer messages, being exact only in the limit of infinite length. If you think I am wrong on this, please point out where Shannon says otherwise.
- Also, yes, by my reasoning a 10 MB .gif file from a source would have an entropy of zero, IF IT WERE THE ONLY MESSAGE THE SOURCE WOULD AND COULD PROVIDE. Entropy is a measure of MISSING information, and once you have that 10 MB, there is no missing information. If a library contains 1024 books, and every day I send someone in to pick one out at random, the entropy of the process is log2(1024)=10 bits. It's the amount of information given to me by knowing which book is brought out, or the amount of missing information if I don't. It has nothing to do with a completely different process, the information IN the book, or in your case, the .gif file. If there were only one book in the library (or one .gif file), the entropy would be log2(1)=0, i.e. complete certainty.
- I have to admit that if it is the only book in the library, then the letter frequencies in the book are, by definition, the exact frequencies of the process which created the book. But the process created no other books. You can use these frequencies to compare the entropy or entropy/symbol of the first half of the book to the last half, etc. etc., but the minute you apply them to another book, you have changed the process, and they become only approximations. You cannot talk about the exact entropy of a lone message. If it's truly alone, the entropy of the message is zero. If you divide it in half, then those halves are not alone, and you can compare their entropies.
- In statistical mechanics, this corresponds to knowing the microstate at a particular instant. If we just think classically, then the future microstate history of the system can be predicted from that microstate, from the positions and velocities of each particle. There is no need for thermodynamics; it's a pure mechanics problem. Entropy is a flat zero, complete certainty.
- I will continue to read the article, since, as I said, it's not as straightforward as I thought. PAR (talk) 06:55, 13 December 2015 (UTC)
If there is a distribution on finitely many objects then there is an entropy. A non-standard way (by current standards) to specify a sample from a distribution of symbols is to juxtapose the symbols from the sample. The entropy of this sample can then be computed and it is an approximation to the entropy of the true distribution. If you want to write up a short example of this, that is fine, but please use modern notation for a set (or tuple) and please do not call it "entropy per symbol", which is a complete bastardization of the current meaning of that term. Please give the answer in "bits", "nats", or similar, not in the bastardized "bits/symbol", "nats/symbol", etc.
On the other hand, if there is a stream of data or a collection of infinite strings, such as described in the coding theory section, then the entropy as just described for the case of a finite sample or distribution can be problematic. In this case, there could be an unbounded number of possibilities with negligible probability each and thus the computed entropy could be unboundedly large. In this case it can make sense to talk about how the entropy grows with the length of the string. This is explained in the Rate subsection of the Coding Theory section. In this case one gets as the answer an amount of entropy per (additional) symbol. This is an entropy rate (or information rate) and has units of "bits/symbol", "nats/symbol", etc. In this case it is common to juxtapose the symbols into what is commonly known as a "string", because each string has its own probability that is not necessarily directly computable from the characters that comprise it. If you think the Coding Theory or Rate sections need improvement, please feel free to make improvements there.
The language in the last few days misuses (by current standards) the meaning of entropy per symbol. Unfortunately it is therefore a significant negative for the reader. I am removing it. 𝕃eegrc (talk) 19:17, 14 December 2015 (UTC)
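An added illustration of the counting estimate discussed above (a sketch, not taken from any cited source): it estimates a distribution from symbol counts in a short sample and reports both the plug-in entropy of that distribution and N times it, i.e. the two quantities being called "bits/symbol" and "bits" in this thread.
#!/usr/bin/perl
# Sketch: plug-in entropy from symbol counts in a sample string.
# This is only an estimate of the source entropy (see the discussion above).
use strict; use warnings;

my $sample = "AABBC";                  # a hypothetical sample of juxtaposed symbols
my %count;
$count{$_}++ for split //, $sample;    # count each distinct symbol

my $n = length $sample;
my $H = 0;                             # H = -sum (c/n) * log2(c/n)
for my $c (values %count) {
    my $p = $c / $n;
    $H -= $p * log($p) / log(2);
}
printf "plug-in entropy of the estimated distribution: %.4f bits\n", $H;
printf "N*H for the %d-symbol sample (if symbols were independent): %.4f bits\n", $n, $n * $H;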
- Leegrc, do you have a reference to show that Shannon's definition of entropy is not the accepted definition? Honestly, your position is not tenable and PAR does not agree with you. Please stop undoing my edits on the entropy pages in your proclamation that Shannon's classic book should be ignored. Let someone else undo my edits. Ywaz (talk) 21:57, 17 January 2016 (UTC)
It appears that this conversation is being continued in the following section "Please provide a reference or remove this material". That is where you will find my response. 𝕃eegrc (talk) 17:31, 19 January 2016 (UTC)
== Please provide a reference or remove this material ==
Ywaz - This is the same material that was previously removed. Please provide a clear reference for the statement "When a file or message is viewed as all the data a source will ever generate, the Shannon entropy H in bits per symbol is...". Either that or remove the statement, and the conclusions drawn from it. PAR (talk) 03:06, 18 January 2016 (UTC)
- Leegrc removed it before you and I had reached agreement that H is in "bits/symbol". I believe I reposted about half as much as the previous text that I had. How about "When a set of symbols is big enough to accurately characterize the probabilities of each symbol, the Shannon entropy H in bits per symbol is..."? The primary clarification I want to get across to readers is that when they see "Shannon Entropy" they have to determine if the writer is talking about specific entropy H (per symbol) or total entropy N*H in bits (or shannons). Ywaz (talk) 16:28, 18 January 2016 (UTC)
I don't have Shannon's original paper. Surely, if his terminology as you see it is still in use you can find a modern textbook that uses "bits/symbol" in the way you believe to be correct. Would you cite it? I would cite a textbook that says that "bits/symbol" is an incorrect description for entropy (other than in the case of strings of unbounded length), if publishing negative results were commonplace; but such is not commonplace. :-( 𝕃eegrc (talk) 17:31, 19 January 2016 (UTC)
- You can see in the discussion directly above where I showed PAR the quotes from Shannon himself, so I just removed the "disputed" tag you just added. PAR provided a link to Shannon's text at the beginning of the previous topic above. In addition to the quotes I showed PAR on pages 13 and 14, Shannon says on pages 16, 17, 18, 19, and 20 that H is "bits per symbol". Do a "CTRL-F" on that link and search for "per symbol". Ywaz (talk) 18:37, 19 January 2016 (UTC)
I have moved the paragraphs in question to the historical section of the article. I trust that you are faithfully reporting what you see in Shannon's paper and thus I do not dispute that it is a part of history. Until you can cite a modern textbook that uses the same terminology, please leave these paragraphs in the historical section. 𝕃eegrc (talk) 12:52, 20 January 2016 (UTC)
- There is not a historical difference. I do not know where you think I am disagreeing with anyone. Shannon called H "entropy" like everyone else does today. I would like to make sure people understand that Shannon's entropy H, the whole world over in all times, is in units of entropy per symbol, like shannons, as modern researchers say, which means it is an intensive entropy S0 and not the regular physical entropy S that you can get from S=N*H. As far as a modern reference, I believe this well-known NIH researcher's home page is clear enough: https://schneider.ncifcrf.gov/
- PAR, I think I've got the connection between physical and information entropy worked out for an ideal gas on the [Sackur-Tetrode talk page]. In short, they are the same if you send messages by taking a finite number of identical marbles out of a bag to use as "symbols" to be placed in a much larger number of slots in space (or time) to represent phase space. In other words, there is a bigger difference than I hoped for in the simplest case. It may turn out that Gibbs and QM entropy is closer to Shannon's H, but it may require an equally strained view of information to get it there. At least Landauer's principle implies a direct and deep connection. Ywaz (talk) 23:37, 20 January 2016 (UTC)
Thank you for the link to Tom Schneider's work. That helps me to understand why you have been advocating for "bits/symbol" where I thought "bits" was more appropriate. I now see it as a tomAYto vs. tomAHto thing; both right, depending upon how exactly things are framed. In particular, with a few copy edits I think I have managed to preserve your "bits/symbol" with enough context that makes it meaningful to us tomAYto folks too. 𝕃eegrc (talk) 18:51, 21 January 2016 (UTC)
- No. "Bits" for H is wrong in every way: logically, mathematically, historically, currently, and factually. Can you cite any authority in information theory that says H is in bits? The core problem is that Shannon called H "entropy" instead of "specific entropy". You can see his error blatantly in section 1.7 of his booklet where he says H is in units of "entropy/symbol". That's exactly like calling some function X "meters" and then saying it has units of "meters/second". Anyone saying it is in "bits" is doing a disservice to others and they are unable to do correct calculations. For example, by Landauer's principle, the minimal physical entropy generated by erasing 1 bit is S=kT*ln(2). So if you want to build a super-efficient supercomputer in the future and need to know the minimal heat it releases and you calculate H and think it is in "bits", then you will get the wrong answer in Joules of the minimal energy needed to operate it and how much cooling it will need. Ywaz (talk) 10:46, 22 January 2016 (UTC)
I suspect that we are 99% in agreement; let me explain my thoughts further. It does not have to be a distribution over symbols in order to have an entropy, right? For my first meal tomorrow I can have eggs, cereal, stir fry, etc. each with some probability, and that distribution has an entropy. I can report that entropy in bits (or nats, etc.). If it is the case that the distributions for subsequent breakfasts are independent and identically distributed with tomorrow's breakfast then I can report the same numerical quantity and give it units of bits/day. Furthermore, if it turns out that your breakfasts are chosen from a distribution that is independent and identically distributed to mine then I can report the same numerical quantity but give it units of bits/day/person. How many of "per day", "per person", etc. are appropriate depends upon the precise wording of the problem. The edits I made to the text are meant to make the wording yield "bits/symbol" as the natural units regardless of whether the reader is you or me. 𝕃eegrc (talk) 14:10, 22 January 2016 (UTC)
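Worked numbers added for illustration (the probabilities are made up): if tomorrow's breakfast is eggs, cereal, or stir fry with probabilities 1/2, 1/4, 1/4, then
<math>H = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} = 1.5\ \text{bits},</math>
which can equally be reported as 1.5 bits/day if breakfasts on different days are independent and identically distributed; a week of such breakfasts then carries 7 × 1.5 = 10.5 bits.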
As a side note, I was not the one who removed your examples and formulae that used "counts", etc., though I was tempted to at least make revisions there. I suspect that we still have some disagreements to resolve for that part. 𝕃eegrc (talk) 14:10, 22 January 2016 (UTC)
- No, H is always a statistical measure applied to symbols. "Eggs" is a symbol as far as entropy is concerned, not something you eat. But this seems beside the point. A distribution of symbols has an "entropy" only because Shannon made the horrendous mistake of not clarifying that it is a specific entropy, not a total entropy. No, you can't report H "entropy" in bits according to Shannon or any other good authority. The N*H is the entropy in bits. H is bits/symbol. Getting units right is not subjective or optional if you want to be correct. You just made a mistake by not keeping the units straight: you can't say bits/day/person because then you would calculate our combined entropy H in bits/day as 2 people * H bits/day/person = 2*H bits/day. But the H calculated for us individually or together gives the same H.
- How are you and Barrelproof going to calculate the probabilities of symbols in data if you do not count them? It seems to me that he should not be reverting edits simply because it "seems" to him (his wording) that the edit was not correct. But I suppose that means I have to prove with references that is how it is done if someone interested in the article is skeptical. I've seen programs that calculate the entropy of strings and usually they use the variable "count" just like I did. Ywaz (talk) 23:12, 22 January 2016 (UTC)
- Probabilities are often estimated by counting, but that is a matter of practical application and approximation – not theory. The article is about information theory. The relationship between the results of counting and true probabilities can be complex to analyze properly (e.g., issues relating to sampling and the questions of ergodicity and stationarity must be considered), and those details are unnecessary to consider in this article and are generally not part of an introduction to information theory. For example, the results from counting will only be reasonably valid if the number of samples measured by the counts is very large. Shannon's paper and other well-regarded sources on information theory do not equate counting with true probability. Probability is not the same thing as Statistics, which is why there are separate articles on those two subjects. Much of the study of information theory is conducted using idealized source models, such as by studying what happens with Bernoulli trials or with a Markov process or Gaussian random variables. No counting is used in such analysis. I am personally very skeptical of anything written in an introduction to information theory that is based on the idea that Shannon made some "horrendous mistake". I have seen no evidence that people who work on information theory have found any significant mistakes in Shannon's work. His work is universally held in the very highest regard among mainstream theoreticians. —BarrelProof (talk) 21:33, 23 January 2016 (UTC)
BarrelProof, I had stated the count method was if the data was all the source ever generated, which negates your ergodicity, sampling, and stationarity complaints. Shannon's section 1.7 relies on a count method for large N. Shannon may not have mentioned small N, but I do not know why "respected" texts, however you define that, would not use it on small N in the same way the Fourier transform is applied to non-repeating signals: you simply treat it as if it was repeated. Also, physical entropy does not depend on an infinite N, or even a large N, and Landauer's limit shows how deep the connection is. I was able to derive the entropy equation for a monoatomic ideal gas by using the sum of the surprisals, which is a form of H. The surprisal form was needed to maintain the random variable requirement of H. You are using an appeal to authority to justify an obviously terrible mistake in Shannon's nomenclature that makes it harder for newcomers to "catch on". It is a horrendous mistake for Shannon to call "entropy/symbol" an "entropy" instead of "specific entropy". Look how hard it was to convince Leegrc and PAR that H is entropy/symbol even after quoting Shannon himself. If you disagree, then please explain to me why it would be OK to call a speed in meters/second a distance. Ywaz (talk) 14:39, 24 January 2016 (UTC)
- I am not convinced that Shannon made a mistake, I am not convinced he didn't. It's dumb to argue about what he wrote, when it is what it is. Can we agree that this source [1] is what we are talking about?
- After all of the above, my request has still not been answered: Please provide a clear reference for the statement "When a file or message is viewed as all the data a source will ever generate, the Shannon entropy H in bits per symbol is...". PAR (talk) 00:35, 25 January 2016 (UTC)
- That phrase needs to be specified more clearly and I do not have a book reference (I have not looked other than Shannon) to justify programmers who are doing it this way (and probably unknowingly making the required assumptions like it being a random variable), so I have not complained at it being deleted. We already agreed on that as the source, and he did not mention short strings. I am not currently arguing for going back to the things I had written simply because you guys seem to want references to at least show "it's done that way" even if the math and logic are correct. My last few comments are to show the other excuses being made for deleting what I wrote are factually wrong, as evidenced by Barrelproof and Leegrc not (so far) negating my rebuttals. I am not saying Shannon had a factual error, but made a horrendous nomenclature error in the sense that it makes entropy harder to understand for all the less-brilliant people who do not understand he meant his H is specific entropy. If he had not formulated his definitions to require the p's in H to sum to 1, then his Boltzmann-type H could have been a real Gibbs and QM entropy S.Ywaz (talk) 02:11, 25 January 2016 (UTC)
I've been reading Shannon & Weaver [2] and what I have read makes sense to me. I have no problem with the present "entropy per symbol" as long as it is clear that this is a special case of a streaming information source as contrasted with the more general case of an entropy being associated with a set of probabilities pi.
The "error" that Shannon & Weaver make is the assignment of the letter H to a set of probabilities pi ([1] page 50) as:
and also to the entropy of an information source delivering a stream of symbols, each symbol having an INDEPENDENT probability of <math>p_i</math>, in which case the above sum yields the entropy per symbol ([1] page 53). Both definitions are purely mathematical, devoid of any interpretation per se; they only acquire meaning in the real world through the association of the probabilities <math>p_i</math> with some process or situation in the real world. The second is more restrictive in the assumption that each symbol in the message is INDEPENDENT.
As a physicist, I am naturally interested in information entropy as it relates to Boltzmann's equation H=k log(W). This uses the first definition of entropy. Every macrostate (given by e.g. temperature and pressure and volume for an ideal gas) consists of a set of microstates - in classical stat. mech., the specification of the position and velocity of each particle in the gas. It is ASSUMED that each microstate has a probability <math>p_i = 1/W</math> where W is the number of microstates which yield the given macrostate. Using these probabilities, the Shannon entropy is just log(W) (in bits if the log is base 2, in nats if base e), and Boltzmann's entropy is the Boltzmann constant (k) times the Shannon entropy in nats.
Each microstate can be considered a single "symbol" out of W possible symbols. If we have a collection of, say, 5 consecutive microstates of the gas taken at 1-second intervals, all having the same macrostate (the gas is in equilibrium), then we have a "message" consisting of 5 "symbols". According to the second definition of H, the entropy is log(W) for each microstate and the entropy of the 5-microstate "message" is 5 log(W), and we might say that Boltzmann's H is the "entropy per symbol", or "entropy per microstate". I believe this was the source of confusion. It's a semantic problem, clearly illustrated by this example. Saying that H is "entropy per microstate" DOES NOT MEAN that each microstate has an entropy. Furthermore, we cannot specify H without the <math>p_i</math> which were ASSUMED known.
The idea that a single symbol (microstate) has an entropy is dealt with in Shannon & Weaver, and it is clearly stated that a single message has entropy zero, in the absence of noise. To quote Shannon from [1] page 62:
If a source can produce only one particular message its entropy is zero, and no channel is required.
Also Weaver page 20 (parenthesis mine):
The discussion of the last few paragraphs centers around the quantity "the average uncertainty in the message source when the received signal (symbol) is known." It can equally well be phrased in terms of the similar quantity "the average uncertainty concerning the received signal when the message sent is known." This latter uncertainty would, of course, also be zero if there were no noise.
See page 40 where Shannon makes it very clear that the entropy of a message (e.g. AABBC) is a function of the symbol probabilities AND THEIR CORRELATIONS, all of which must be known before entropy can be calculated. Shannon draws a clear distinction ([1] page 60-61) between an estimation of entropy for a given message and "true" entropy (parentheses mine):
The average number HI of binary digits used per symbol of original message is easily estimated....
(by counting frequencies of symbols in "original message")
We see from this that the inefficiency in coding, when only a finite delay of N symbols is used, need not be greater than 1/N plus the difference between the true entropy H and the entropy <math>G_N</math> calculated for sequences of length N.
In other words, if you are going to estimate entropy from messages, "true" entropy is only found in the limit of infinitely long messages. PAR (talk) 06:48, 25 January 2016 (UTC)
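A small simulation added to illustrate the convergence point just made (the Bernoulli source and its parameter are made up for the example): the plug-in estimate from a finite message approaches the true source entropy only as the message grows.
#!/usr/bin/perl
# Sketch: the plug-in entropy estimated from a finite message approaches the
# true source entropy only as the message length grows.
use strict; use warnings;

my $p_true = 0.9;                        # hypothetical source: P(A) = 0.9, P(B) = 0.1
my $H_true = -($p_true*log($p_true) + (1-$p_true)*log(1-$p_true)) / log(2);

for my $len (10, 100, 10_000) {
    my $ones = 0;
    $ones += (rand() < $p_true ? 1 : 0) for 1 .. $len;   # generate one message
    my $H_hat = 0;
    for my $q ($ones/$len, 1 - $ones/$len) {
        $H_hat -= $q * log($q) / log(2) if $q > 0;
    }
    printf "message length %6d: estimated %.4f vs true %.4f bits per symbol\n",
           $len, $H_hat, $H_true;
}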
References
- H in Shannon's words (on page 10 and then again on page 11) is simply the entropy of the set of p's ONLY if the p's sum to 1, which means it is "per symbol" as it is in the rest of the book. I mentioned this before. On both pages he says "H is the entropy of the set of probabilities" and "If the p's are equal p=1/n". This is completely different from Gibbs and QM entropy even though Shannon's H looks the same. [edit: PAR corrected this and the next statement] They are not the same because of the above requirement Shannon placed on H, that the p's sum to 1. Boltzmann's equation is not H=k*ln(states). It's S, not H. Boltzmann's H comes from S = k*N*H where NH=ln(states). This is what makes Shannon's H the same as Boltzmann's H and why he mentioned Boltzmann's H-theorem as the "entropy", not Boltzmann's S=k*ln(states), and not Gibbs or QM entropy. H works fine as a specific entropy where the bulk material is known to be independent from other bulks. The only problem in converting H to physical entropy on a microscopic level is that even in the case of an ideal gas the atoms are not independent. When there are N atoms in a gas, all but 1 of them have the possibility of being motionless. Only 1 carrying all the energy is a possibility. That's one of the states physical entropy has that is missed if you err in treating the N atoms as independent. I came across this problem when trying to derive the entropy of an Einstein solid from information theory: physical entropy always comes out higher than information theory if you do not carefully calculate the information content in a way that removes the N's mutual dependency on energy. So H is hard to use directly. The definition of "symbol" would get ugly if you try to apply H directly. You would end up just using Boltzmann's methods, i.e. the partition function. Ywaz (talk) 15:42, 25 January 2016 (UTC)
First of all, yes, I should have said S=k log(W), not H=k log(W). Then S=k H. W is the number of microstates in a macrostate, (the number of "W"ays a macrostate can be realized). Each microstate is assumed to have equal probability pi = 1/W and then the sum of -(1/W)log(1/W) over all W microstates is just H=log(W). This is not completely different from Gibbs entropy, it is the same, identical. Multiplying by N gives a false result, S != k N H.
This is the core of the misunderstanding. We are dealing with two different processes, two different "H"s. Case 1 is when you have a probability distribution <math>p_i</math> (all of which are positive and which sum to 1, by definition). There is an entropy associated with those <math>p_i</math> given by the usual formula. There are no "symbols" per se. Case 2 is when you have a stream of symbols. If those symbols are independent, then you can assign a probability <math>p_i</math> to the occurrence of a symbol. If they are not, you cannot; you have to have a more complicated set of probabilities. Case 2 is dealt with using the more general concepts of case 1. If the <math>p_i</math> are independent, they represent the simple probability that a symbol will occur in a message, and the entropy associated with THAT set of probabilities is the "entropy per symbol". In other words, it is the entropy of THE FULL SET OF N-symbol messages divided by the number of symbols in a message (N). THIS DOES NOT MEAN THAT A SYMBOL HAS AN ENTROPY, IT DOES NOT. THIS DOES NOT MEAN THAT A MESSAGE HAS AN ENTROPY, IT DOES NOT. Entropy is a number associated with the full set of N-symbol messages, NOT with any particular symbol or message. As Shannon said on pages 60-61 of the 1964 Shannon-Weaver book, you can estimate probabilities from a message to get a "tentative entropy", and this tentative entropy will approach the "true" entropy as the number of symbols increases and goes to infinity. But strictly speaking, as Shannon said on page 62 of the book, the entropy of a lone message, in which the "full set" consists of a single message, is zero. You cannot speak about entropy without speaking about the "full set" of messages.
PLEASE - I am repeating myself over and over, yet your objections are simply to state your argument without analyzing mine. I have tried to analyze yours, and point out falsehoods, backed up by statements from Shannon and Weaver. If there is something that you don't understand about the above, say so, don't just ignore it. If there is something you disagree with, explain why. This will be a fruitless conversation if you just repeat your understanding of the situation without trying to analyze mine.
You say "They are not the same because of the above requirement Shannon placed on H, that the p's sum to 1." and also " If he had not formulated his definitions to require the p's in H to sum to 1, then his Boltzmann-type H could have been a real Gibbs and QM entropy S". This is WRONG. The p's ALWAYS sum to one, otherwise they are not probabilities. You are not considering the case where the pi's are not independent. In other words, the statistical situation cannot be represented by a probability pi associated with each symbol. In this case you don't have pi, you have something more complicated, pi j for example where the probability of getting symbol j is a function of the previous symbol i. Those pi j always sum to 1 by definition, and the entropy is the sum of -pi j log pi j over all i and j.
You say "the only problem in converting H to physical entropy on a microscopic level is that even in the case of an ideal gas the atoms are not independent." No, that's only a problem only if you insist on your narrow definition of entropy, in which the "symbols" are all independent. Shannon dealt with this on page 12 of the Shannon-Weaver book [1]. Suppose you have a bag of 100 coins, 50 are trick coins with two heads, 50 are normal coins with a head and a tail. You pick a coin out of the bag, and without looking at it, flip it twice. You will have a two-symbol message. You will have HH occur with probability pHH=1/2+1/8=5/8, HT, TH and TT will occur with probability pHT=pTH=pTT=1/8. There is no pi here. There is no way to assign a probability to H or T, such that the probability of XY=pXpY where X and Y are H and/or T. There is no pi associated with each symbol that lets you calculate the sum over H and T of -pi log(pi). You are stuck with the four probabilites pHH, pHT, pTH, pTT. They sum to 1. The entropy of the process is the sum of -pi j log(pi j>) over all four possible i,j pairs.
Things only get "ugly" because you are not doing it right. When you do it right, things are beautiful. PAR (talk) 19:56, 25 January 2016 (UTC)
- I didn't realize the p's summed to 1 in Gibbs and QM. Why does the Wikipedia H-theorem page use S=k*N*H? As I said, the entropy of a message is only valid if it is considered all the data the source will ever generate, so it is the full set of messages, not just a single message. Specific entropy is useful. Of course I do not expect a single symbol by itself to carry H, but H is the entropy the source sends per symbol on average in case 1 and 2. Although case 1 is an "event", I do not see why that can't be assigned a symbol. If an event is a thing that can be defined, then a symbol can be assigned to it. H is discrete, so I do not know how symbols can't be assigned to whatever is determining the probability. I do not see any difference between your case 1 and case 2 if it isn't merely a matter of dependencies and independence. As far as a message carrying information, it seems clear a message of length M of independent symbols from a source known to have H will have H*M entropy (information) in it, if M displays the anticipated distribution of p's, and the degree to which it does not is something I believe you described to me before. Concerning your coin example, you gave me the symbols: HH, HT, TH, and TT are the symbols for that example. The symbols are the things to which you assign the p's. Ywaz (talk) 20:52, 25 January 2016 (UTC)
- Regarding the H-Theorem page, I think it's wrong. It has a strange definition of H as <log(P)> rather than - <log(P)>, but then there is a negative sign in the entropy equation which is written S = -NkH. Each microstate has an equal probability, so P=1/W, and H *should be* H=-Σ P log(P) = -Σ (1/W)log(1/W)=log(W). Boltzmann's entropy equation is S=k log(W), and I don't know where the N comes from. There is no reference for the statement.
- Ok, I see what you are saying, I agree, you can say that each <math>p_i</math> in Case 1 refers to a symbol, and Case 1 is a one-symbol message. In the case of thermodynamics and stat. mech., the full set is all W microstates, so there are W symbols, each with the same probability 1/W. If you consider, say, a gas at a particular pressure, temperature, volume, it exists in a particular macrostate, and one of the W microstates; it's a one-symbol message. That microstate (symbol) is not known, however. We cannot measure the microstate of a large system. We can quantify our lack of knowledge by calculating the information entropy (H) of the macrostate, and it is log(W). log(W) in bits is the average number of yes/no questions we would have to ask about the microstate of the gas in order to determine what that microstate is. It's a huge number. The thermodynamic entropy S is then k log(W) and k is a small number, and we get a reasonable number for S.
- Regarding the HH, HT, TH, TT example, we are on the same page here. H and T are not the symbols, HH, HT, TH, and TT are.PAR (talk) 07:30, 26 January 2016 (UTC)
S=k N H is only one of the equations they gave. They say what you said a month or two ago: it's only valid for independent particles, so I do not think it is in error. Someone on the talk page also complains about them not keeping the -1 with H. In the 2nd paragraph you sum it up well. Ywaz (talk) 10:53, 26 January 2016 (UTC)
- Well, I guess I don't know what I meant when I said that. I don't know of a case where H is not log(W). That's the same as saying I don't know of a case where the probabilities of a microstate are not p_i = 1/W for every microstate. The devil is in the details of what constitutes a microstate, what with quantum versus classical, identical particles, Bose statistics, Fermi statistics, etc. etc. PAR (talk) 22:37, 26 January 2016 (UTC)
I do not follow your 3rd sentence, and I'll look into the possibility that it's not a valid formula under that condition. To me it is the same specific entropy: take two boxes of gas that are exactly the same. The entropy of both is 2 times the entropy of 1. So S=kH is the entropy of 1, and S=kHN is the entropy of 2.
== From Shannon's H to ideal gas S ==
This is a way I can go from information theory to the Sackur-Tetrode equation by simply using the sum of the surprisals or in a more complicated way by using Shannon's H. It gives the same result and has 0.16% difference from ST for neon at standard conditions.
I came up with the following while trying to view it as a sum of surprisals. When fewer atoms are carrying the same total kinetic energy E, E/i each, they will each have a larger momentum, which increases the number of possible states they can have inside the volume in accordance with the uncertainty principle. A complication is that momentum increases as the square root of energy and can go in 3 different directions (it's a vector, not just a magnitude), so there is a 3/2 power involved.
<math>\frac{S}{k_B} = \sum_{i=1}^{N} \ln\frac{\Omega_i}{i}, \qquad \Omega_i = \Omega\left(\frac{N}{i}\right)^{3/2}, \qquad \Omega = \left(\frac{V^{1/3}\sqrt{2mE/N}}{h/(4\pi\sigma^2)}\right)^{3}</math>
where <math>\sigma \approx 0.341</math> comes from the standard deviation of a normal distribution (see the program below) and E is the total kinetic energy.
You can make the following substitutions to get the Sackur-Tetrode equation:
The probability of encountering an atom with a certain momentum depends (through the total energy constraint) on the momentum of the other atoms. So the probability of the states of the individual atoms is not a random variable with regard to the other atoms, so I can't easily write H as a function of the independent atoms (I can't use S=N*H) directly. But by looking at the math, it seems valid to consider it an S=ΩH. The H below inside the parenthesis is entropy of each possible message where the energy is divided evenly among the atoms. The 1/i makes it entropy per moving atom, then the sum over N gives total entropy. Is it summing N messages or N atoms? Both? The sum for i over N was for messages, but then the 1/i re-interpreted it to be per atom. Notice my sum for j does not actually use j as it is just adding the same pi up for all states.
This is the same as the Sackur-Tetrode equation, although I used the standard deviation to count states through the uncertainty principle instead of the 4 π / 3 constants that were in the ST equation, which appears to have a small error (0.20%).
Here's how I used Shannon's H in a more direct but complicated way but got the same result:
There will be N symbols to count in order to measure the Shannon entropy: empty states (there are Ω-N of them), states with a moving atom (i of them), and states with a still atom (N-i). Total energy determines what messages the physical system can send, not the length of the message. It can send many different messages. This is why it's hard to connect information entropy to physical entropy: physical entropy has more freedom than a normal message source. So in this approximation, N messages will have their Shannon H calculated and averaged. Total energy is evenly split among "i" moving atoms, where "i" will vary from 1 to N, giving N messages. The number of phase states (the length of each message) increases as "i" (moving atoms) decreases because each atom has to have more momentum to carry the total energy. The Ωi states (message length) for a given "i" of moving atoms is a 3/2 power of energy because the uncertainty principle determining number of states is 1D and volume is 3D, and momentum is a square root of energy.
Use <math>k_B</math> to convert to physical entropy. Shannon's entropy <math>H_i</math> for the 3 symbols is the sum over the probabilities of encountering an empty state, a moving-atom state, and a "still" atom state. I am employing a cheat beyond the above reasoning by counting only 1/2 the entropy of the empty states. Maybe that's a QM effect.
Notice H*Ω simplifies to the count equation programmers use. E=empty state, M=state with moving atom, S=state with still atom.
By some miracle the above simplifies to the previous equation. I couldn't do it, so I wrote a Perl program to calculate it directly and compare it to the ST equation for neon gas at ambient conditions in a 0.2 micron cube (small, to keep the number of loops to N<1 million). Ω/N=3.8 million. I confirmed the ST equation is correct with the official standard molar entropy S0 for neon: 146.22 J/(mol·K) / 6.022E23 * N. It was within 0.20%. I changed P, T, or N by 1/100 and 100x and the difference ranged from 0.24% to 0.12%.
#!/usr/bin/perl
# Neon gas entropy three ways: Sackur-Tetrode (ST), sum of surprisals (SS), and Shannon's H (SH).
$T=298; $V=8E-21; $kB=1.381E-23; $m=20.8*1.66E-27; $h=6.6262E-34; $P=101325;  # neon, 1 atm, 0.2 micron sides cube
$N = int($P*$V/$kB/$T+0.5);     # number of atoms from the ideal gas law
$U = $N*3/2*$kB*$T;             # total kinetic energy
$ST = $kB*$N*(log($V/$N*(4*3.142/3*$m*$U/$N/$h**2)**1.5)+5/2)/log(2.718);   # Sackur-Tetrode entropy
$x = $V**0.33333;               # box side length
$p = (2*$m*$U/$N)**0.5;         # momentum of an atom carrying the average energy
$O = ($x*$p/($h/(4*3.142*0.341**2)))**3;   # Omega: number of position-momentum states
for ($i=1;$i<$N;$i++) {
    $Oi = $O*($N/$i)**1.5;      # Omega_i when only i atoms carry the energy
    $SH += 0.5*($Oi-$N)*log($Oi/($Oi-$N)) + $i*log($Oi/$i) + ($N-$i)*log($Oi/($N-$i));
    $SS += log($Oi/($i));
}
$SH += 0.5*($O-$N)*log($O/($O-$N)) + $N*log($O/$N);   # for $i=$N
$SH = $kB*$SH/log(2.718)/$N;
$SS = $kB*$SS/log(2.718);
print "SH=$SH, SS=$SS, ST=$ST, SH/ST=".$SH/$ST.", N=".int($N).", Omega=".int($O).", O/N=".int($O/$N);
exit;
Ywaz (talk) 19:54, 27 January 2016 (UTC)
- Hm - I'm still trying to understand the above. What Ben Naim did was to say the info entropy S/kB is the sum of four entropies: N(hpos+hmom+hquant+hex).
- hpos is the classical entropic uncertainty in position of a particle. Basically the probability is evenly distributed in the box, so the probability it's in a {dx,dy,dz} box at {x,y,z} is some constant times dx dy dz, and that constant is 1/V where V is the volume, so when you integrate over the box you get 1. So hpos=log(V).
- hmom is the classical entropic uncertainty in momentum of the particle. The probability is given by the Maxwell-Boltzmann distribution, and when you calculate it out, it's hmom=(3/2)(1+log(2 π m k T)).
- hquant is the uncertainty due to Heisenberg uncertainty. I don't have Ben-Naim's book with me, so I don't remember exactly how it's derived, but it's hquant=-3 log(h).
- hex is the uncertainty (reduction) due to the exchange degeneracy; it amounts to the correction for "correct Boltzmann counting", because a state with particle 1 in state 1 and particle 2 in state 2 is the same microstate as the state with particle 2 in state 1 and particle 1 in state 2. Even deeper, you cannot label particles, so there is no such thing as "particle 1" and "particle 2", just two particles. It's the usual division by N!, so hex=-log(N!).
- Add them all up, multiply by N and you get Sackur-Tetrode. PAR (talk) 22:59, 28 January 2016 (UTC)
- Looking at the above, I don't think it's good. If you set your equation equal to Sackur-Tetrode, the σ correction factor is <math>\left(8(6\pi)^{3/2}\right)^{-1/6} = 0.339...</math>, which is coincidentally close to your (1/2) Erf(Sqrt(1/2)) = 0.341..., but the 0.341 is not good, it's a hack from Heisenberg's uncertainty principle assuming a normal distribution. We shouldn't be mixing variances and entropies as measures of uncertainty. Second, in your program, I can make no connection between the long drawn-out expression for the sum in the program and the simple sum you listed at the top. I also worry about the introduction of particles standing still. This never happens (i.e. happens with measure zero) in the Maxwell-Boltzmann distribution, which is what is assumed in this derivation of the STE. PAR (talk) 01:30, 29 January 2016 (UTC)
- Concerning the empty states and particles standing still, I was following information theory without regard to physics. Any excess should cancel as it is not throwing extra information in. It should cancel, and apparently it did. QM distributions are actually normal distributions, as I heard Feynman point out, so I do not think it is a hack. I do not think it was an accident that my view of the physics came out so close. If I plug in the ST constants instead, the error increases to 0.39%, which is due to my sum being more accurate by not using the Stirling approximation. My hack is more accurate than the error in the Stirling approximation, giving my method more accuracy than the Stirling-adjusted ST. The program is long because it calculates all 3. "SS" is the variable that does the first calculation. I think you're looking for this, which does the first equation. I should put it in javascript.
$O = ($x*$p/($h/(4*3.142*0.341**2)))**3;
for ($i=1;$i<$N;$i++) {
    $Oi = $O*($N/$i)**1.5;
    $SS += log($Oi/($i));
}
An interesting simplification occurs if I divide out the N^3/2 in the 1st equation and thereby let Ω be based on the momentum as if only 1 atom were carrying all the energy:
I changed some of my explanations above, and mostly took out the first explanation. It's really hard to see how it came out that simple in terms of Shannon entropy without the longer equation.
Let energy = society's resources = total happiness, let the p vector be different ways people (atoms) can direct the energy (happiness) (buying choice vector like maybe services, products, and leisure), let volume determine how far it is between people changing each other's p vector (for an increase or decrease in happiness), i.e., market transactions. High entropy results from longer distance between transactions, fewer people, and more energy. High entropy would be median happiness. Low entropy is wealth concentration. Just a thought.
Ywaz (talk) 12:43, 29 January 2016 (UTC)
- Have you checked out http://www.eoht.info/page/Human+thermodynamics ? PAR (talk) 20:55, 26 February 2016 (UTC)
== First sentence second paragraph needs editing. ==
Currently the sentence reads "A key measure in information theory is 'entropy'", but this sentence is inconsistent with the definition of "measure" in measure theory. For consistency with measure theory, I recommend that the "measure" of information theory be called "Shannon's measure." The "entropy" is then "Shannon's measure of the set difference between two state spaces", while the "mutual information" is "Shannon's measure of the intersection of the same two state spaces." 199.73.112.65 (talk) 01:56, 7 November 2016 (UTC)