Information theory
- This article is not to be confused with library and information science or information technology.
Introduction
Information theory is the mathematical theory of data communication and storage, generally considered to have been founded in 1948 by Claude E. Shannon. The central paradigm of classical information theory is the engineering problem of the transmission of information over a noisy channel. The main result of this theory is Shannon's fundamental theorem of information theory. Briefly, the theorem states that, when the appropriate terms are properly defined, messages can be transmitted over a noisy channel at any rate less than (but arbitrarily close to) the channel capacity with an arbitrarily small probability of error.
Actually finding ways to encode information so it can be transmitted as efficiently as the theorem shows possible is the subject of coding theory, which is concerned with data compression and error correction. Some of the most sophisticated and efficient codes ever developed for the transmission of information are used today in cell phones. The codes of coding theory are unrelated to and not to be confused with cryptographic ciphers, but nevertheless concepts from coding theory and information theory are much used in cryptography and cryptanalysis. See the article on deciban for an interesting historical application.
Mathematical preliminaries
Information theory is based on and intimately connected with probability theory.
In information theory, logarithms are used extensively. The base of the logarithms used here is deliberately left unspecified, following standard practice in information theory. As in many other applications of logarithms, it really does not matter much what base is used as long as one is consistent. In the case of information theory, however, the choice of logarithmic base determines the unit that is used to measure information. Common choices of logarithmic base, and the resulting units of information, are as follows:
- base 2: the bit (also called the shannon)
- base e: the nat (natural unit)
- base 10: the hartley (also called the ban)
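As a small illustration (a sketch of our own, not part of the article's sources), the same distribution yields the same uncertainty expressed in different units, depending only on the base of the logarithm:

    import math

    def entropy(probs, base):
        # -sum of p * log_base(p) over the distribution
        return -sum(p * math.log(p, base) for p in probs)

    fair_coin = [0.5, 0.5]
    print(entropy(fair_coin, 2))        # 1.0 bit
    print(entropy(fair_coin, math.e))   # ~0.693 nats
    print(entropy(fair_coin, 10))       # ~0.301 hartleys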
In what follows, expressions of the form
$$p \log p$$
will frequently occur. Such an expression should be understood to be equal to zero whenever $p = 0$, rather than being left undefined. (Note: This usage may be non-standard in general mathematics, but it is the standard in information theory, since it does simplify matters greatly in this field.)
Basic Concepts for Discrete Channels
Self-information
Shannon defined a measure of information content, called the self-information or surprisal of a message m:
$$I(m) = \log\frac{1}{p(m)} = -\log p(m),$$
where $p(m)$ is the probability of the message m taken from an implicit message space $M$.
Note: Strictly speaking, $p(m)$ is the probability density function (with respect to the counting measure) for the discrete random variable $M$. Throughout this article, $p(x)$, for example, denotes the marginal probability density function for the random variable $X$, and $p(x, y)$ the joint probability density function for $X$ and $Y$. This usage is rather informal, though, and only the context makes it clear what probability density functions we are talking about.
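As a small worked example (the probability is ours, chosen only for illustration): a message drawn from eight equally likely alternatives has
$$I(m) = -\log_2 p(m) = -\log_2\tfrac{1}{8} = 3 \text{ bits},$$
so the rarer (more surprising) a message is, the more self-information it carries.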
Entropy
The entropy, or uncertainty, of a discrete message space $M$ is the expected self-information of a message from that message space:
$$H(M) = \mathbb{E}\bigl[I(m)\bigr] = \sum_{m \in M} p(m)\log\frac{1}{p(m)} = -\sum_{m \in M} p(m)\log p(m).$$
This is an important quantity because, when applied to an information source, it determines the capacity of the channel required to transmit the source. If the logarithm in the formula is taken to base 2, then it gives a measure of message information and entropy in bits. Shannon's measure of entropy came to be taken as a measure of the information contained in a message, as opposed to the portion of the message that is strictly determined (hence predictable) by inherent structure, such as redundancy in the structure of languages or the statistical properties of a language relating to the frequencies of occurrence of different letter or word pairs, triplets, etc. See Markov chains.
The entropy is maximized when all the messages in the message space are equiprobable. In this case $H(M) = \log |M|$.
Note: A message space is the same as a random variable, and will hereinafter be referred to as such.
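Here is a minimal Python sketch of our own (the function name entropy_bits is an assumption, not a standard API) computing the entropy of a few distributions over a four-message space; the uniform distribution attains the maximum of $\log_2 4 = 2$ bits:

    import math

    def entropy_bits(probs):
        # H(M) = -sum p * log2(p), with the convention 0 * log 0 = 0
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy_bits([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits (uniform: the maximum)
    print(entropy_bits([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits
    print(entropy_bits([1.0, 0.0, 0.0, 0.0]))       # 0.0 bits (no uncertainty)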
Joint Entropy
The joint entropy of two discrete random variables $X$ and $Y$ is defined as:
$$H(X, Y) = -\sum_{x, y} p(x, y)\log p(x, y),$$
where $x$ and $y$ range over all the values that $X$ and $Y$ can jointly take on. If $X$ and $Y$ are independent, the joint entropy of $X$ and $Y$ is just the sum of their entropies taken individually. The joint entropy is actually the entropy of the joint distribution of $X$ and $Y$.
(Note: The joint entropy is not to be confused with the cross entropy, despite similar notation.)
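For instance (a worked illustration of ours): for two independent fair coin flips $X$ and $Y$, every joint outcome has probability $1/4$, so
$$H(X, Y) = -\sum_{x, y}\tfrac{1}{4}\log_2\tfrac{1}{4} = 2 \text{ bits} = H(X) + H(Y),$$
in agreement with the additivity of entropy for independent variables.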
Conditional Entropy (Equivocation)
The conditional entropy of a discrete random variable $X$ given the value $y$ of a discrete random variable $Y$ is defined as:
$$H(X \mid y) = -\sum_{x} p(x \mid y)\log p(x \mid y),$$
where $p(x \mid y) = p(x, y)/p(y)$ is the conditional probability of $x$ given $y$.
The (expected) conditional entropy of $X$ given $Y$, or the equivocation of $Y$ about $X$, is then given by:
$$H(X \mid Y) = \sum_{y} p(y)\,H(X \mid y) = -\sum_{x, y} p(x, y)\log p(x \mid y).$$
If $X$ represents the space of messages received and $Y$ the space of messages transmitted through a communications channel, then $H(X \mid Y)$ is called the equivocation of the channel.
A basic property of the conditional entropy or equivocation is that:
$$H(X \mid Y) = H(X, Y) - H(Y).$$
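As a quick check, this identity follows directly from the definitions above, using $p(x, y) = p(x \mid y)\,p(y)$:
$$H(X, Y) = -\sum_{x, y} p(x, y)\log p(x, y) = -\sum_{x, y} p(x, y)\bigl[\log p(x \mid y) + \log p(y)\bigr] = H(X \mid Y) + H(Y).$$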
Mutual Information (Transinformation)
It turns out that one of the most useful and important measures of information is the mutual information, or transinformation. This is a measure of how much information can be obtained about one random variable by observing another. The transinformation of $X$ relative to $Y$ (which represents conceptually the amount of information about $X$ that can be learned by observing $Y$) is given by:
$$I(X; Y) = \sum_{x, y} p(x, y)\log\frac{p(x, y)}{p(x)\,p(y)}.$$
A basic property of the transinformation is that:
$$I(X; Y) = H(X) - H(X \mid Y).$$
If $X$ represents the space of messages received and $Y$ the space of messages transmitted through a communications channel, then $I(X; Y)$ is called the transinformation of the channel. Rewriting the last equation as
$$H(X) = I(X; Y) + H(X \mid Y),$$
we can see that the entropy of the space of messages received through a communications channel consists of two portions: the transinformation and the equivocation. That portion due to the transinformation represents the signal, and that portion due to the equivocation represents the noise.
The transinformation is symmetric:
$$I(X; Y) = I(Y; X) = H(X) + H(Y) - H(X, Y),$$
and it is often called simply the mutual information of $X$ and $Y$ because it represents the information that is "shared" between $X$ and $Y$.
Mutual information is closely related to the log-likelihood ratio test in the context of contingency tables and the multinomial distribution and to Pearson's χ2 test: mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution. Also, mutual information can be expressed through the Kullback-Leibler divergence:
$$I(X; Y) = D_{\mathrm{KL}}\bigl(p(x, y)\,\|\,p(x)\,p(y)\bigr).$$
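As a hedged numerical sketch (the joint distribution below is an arbitrary example of ours, not taken from the article), the transinformation can be computed directly from a joint probability table and checked against the entropy identities above:

    import math

    # Joint distribution p(x, y) for binary X and Y (rows indexed by x, columns by y).
    joint = [[0.4, 0.1],
             [0.1, 0.4]]

    px = [sum(row) for row in joint]            # marginal distribution of X
    py = [sum(col) for col in zip(*joint)]      # marginal distribution of Y

    def H(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    Hxy = H([p for row in joint for p in row])  # joint entropy H(X, Y)

    # I(X; Y) = sum over x, y of p(x, y) * log[ p(x, y) / (p(x) p(y)) ]
    I = sum(joint[i][j] * math.log2(joint[i][j] / (px[i] * py[j]))
            for i in range(2) for j in range(2) if joint[i][j] > 0)

    print(I)                        # ~0.278 bits
    print(H(px) + H(py) - Hxy)      # same value, via I(X;Y) = H(X) + H(Y) - H(X, Y)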
Continuous Channels
Shannon information is appropriate for measuring uncertainty over a discrete space. Its basic measures have been extended by analogy to continuous spaces. The continuous entropy, joint entropy, and conditional entropy are not quite as meaningful as the corresponding measures in the discrete case. Here they can take on negative as well as positive values, and they are not invariant under linear transformation. The mutual information has the distinction of being the only one of Shannon's basic measures of information that remains fundamentally meaningful in the continuous case, in the sense of remaining invariant under linear transformation, and furthermore it is non-negative. (Note that in the discrete case, all these quantities are always non-negative.) The self-information is undefined in the continuous case.
When the random variables in question are continuous rather than discrete, the sums can be replaced with integrals and densities used in place of probability mass functions. By analogy with the discrete case, entropy, joint entropy, conditional entropy, and mutual information can be defined as follows, with the appropriate probability density functions f(x), g(y), and h(x,y):
$$H(X) = -\int f(x)\log f(x)\,dx$$
$$H(X, Y) = -\iint h(x, y)\log h(x, y)\,dx\,dy$$
$$H(X \mid Y) = -\iint h(x, y)\log\frac{h(x, y)}{g(y)}\,dx\,dy$$
$$I(X; Y) = \iint h(x, y)\log\frac{h(x, y)}{f(x)\,g(y)}\,dx\,dy$$
The same basic relationships hold among these as among the corresponding quantities in the discrete case.
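For instance (a worked illustration of ours): a random variable distributed uniformly on $[0, \tfrac{1}{2}]$ has density $f(x) = 2$ on that interval, so
$$H(X) = -\int_0^{1/2} 2\log_2 2\,dx = -1 \text{ bit};$$
unlike in the discrete case, the continuous entropy can be negative, and rescaling the variable shifts it by the logarithm of the scale factor, which is why it is not invariant under linear transformation.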
An alternative measure of information was created by Fisher for measuring uncertainty over a continuous space. For information about the value of a continuous parameter such as a person's height, Fisher information is used, as estimated heights do have a well-defined distance.
Channel Capacity
Let us return for the time being to our consideration of the communications process over a discrete channel. At this time it will be helpful to have a simple model of the process:
                      +-------+
                      | Noise |
                      +-------+
                          |
                          V
+-----------+    Y    +-------+    X    +--------+
|Transmitter|-------->|Channel|-------->|Receiver|
+-----------+         +-------+         +--------+
Here again X represents the space of messages received, and Y the space of messages transmitted during a unit time over our channel. However, this time we shall be much more explicit about our use of probability distribution functions. Let $p(x \mid y)$ be the conditional probability distribution function of X given Y. We will consider $p(x \mid y)$ to be an inherent fixed property of our communications channel (representing the nature of the noise of our channel). Then the joint distribution of X and Y is completely determined by our channel and by our choice of $p(y)$, the marginal distribution of messages we choose to send over the channel. Under these constraints, we would like to maximize the amount of information, or the signal, we can communicate over the channel. The appropriate measure for this is the transinformation, and this maximum transinformation is called the channel capacity and is given by:
$$C = \max_{p(y)} I(X; Y).$$
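As a hedged sketch of this definition (the binary symmetric channel and the brute-force search over input distributions are our own illustration), the capacity of a binary symmetric channel with crossover probability 0.1 works out to about $1 - H(0.1) \approx 0.531$ bits per use:

    import math

    def H(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def transinformation(p_one, e):
        """I between input and output of a binary symmetric channel.

        p_one is the probability of sending a 1; e is the crossover probability."""
        q = 1.0 - p_one
        out0 = q * (1 - e) + p_one * e            # probability of receiving a 0
        out1 = q * e + p_one * (1 - e)            # probability of receiving a 1
        return H([out0, out1]) - H([e, 1 - e])    # H(output) - H(output | input)

    e = 0.1
    capacity = max(transinformation(t / 1000.0, e) for t in range(1, 1000))
    print(capacity)   # ~0.531 bits per channel use, i.e. 1 - H(0.1)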
Source Theory
Now we will set aside the problem of the transmission of information over a channel and instead consider the nature of its source. Let M be the space of messages generated in unit time by a source of information.
Rate
The rate of a source of information is simply $r = H(M)$, the expected, or average, self-information per message (i.e. per unit time). It is very common in information theory to speak of the "rate of a language." This is appropriate, for example, when the source of information is English prose.
Absolute Rate
The absolute rate of a source is the maximum rate, maximized over all probability distributions on the message space. Recall that the entropy is maximized for a uniform distribution in which all the messages are equiprobable. Thus the absolute rate is simply $R = \log |M|$.
Absolute Redundancy
The absolute redundancy is the difference between the absolute rate and the rate of a source and is defined as $D = R - r$.
Relative Redundancy
The relative redundancy of a source is its absolute redundancy expressed as a fraction of its absolute rate, $D/R$.
Efficiency
The efficiency of a source is its rate expressed as a fraction of its absolute rate, $r/R$, which equals $1 - D/R$.
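As a hedged worked example (the measured rate below is a hypothetical figure chosen only for illustration): for a source over a 26-letter alphabet with a rate of $r = 1.5$ bits per letter,
$$R = \log_2 26 \approx 4.70,\qquad D = R - r \approx 3.20,\qquad D/R \approx 0.68,\qquad r/R \approx 0.32,$$
so roughly two thirds of the source's absolute rate is redundancy.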
Fundamental Theorem
By now, enough concepts should have been defined that we can give a reasonably formal statement of the fundamental theorem of information theory.
Related Concepts and Applications
Coding Theory
Coding theory is the most important and direct application of information theory. It can be subdivided into data compression theory and error correction theory. Using a statistical description for data, information theory quantifies the number of bits needed to describe the data. There are two formulations for the compression problem: in lossless data compression the data must be reconstructed exactly, whereas lossy data compression examines how many bits are needed to reconstruct the data to within a specified fidelity level. This fidelity level is measured by a function called a distortion function. In information theory this is called rate distortion theory. Both lossless and lossy source codes produce bits at the output which can be used as the inputs to the channel codes mentioned above.
The idea is to first compress the data, i.e. remove as much of its redundancy as possible, and then add just the right kind of redundancy (i.e. error correction) needed to transmit the data efficiently and faithfully across a noisy channel.
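To make the idea concrete, here is a deliberately simple Python sketch of our own (a 3-fold repetition code, which is far from the sophisticated codes mentioned above) showing redundancy being added so that transmission over a noisy channel becomes more reliable:

    import random

    def transmit(bits, flip_prob):
        """Binary symmetric channel: flips each bit independently with probability flip_prob."""
        return [b ^ (random.random() < flip_prob) for b in bits]

    def encode(bits):
        return [b for b in bits for _ in range(3)]           # repeat each bit 3 times

    def decode(received):
        return [int(sum(received[i:i+3]) >= 2)               # majority vote per group of 3
                for i in range(0, len(received), 3)]

    random.seed(0)
    message = [random.randint(0, 1) for _ in range(10000)]
    noisy_raw = transmit(message, 0.1)
    noisy_coded = decode(transmit(encode(message), 0.1))

    print(sum(a != b for a, b in zip(message, noisy_raw)) / len(message))    # ~0.10
    print(sum(a != b for a, b in zip(message, noisy_coded)) / len(message))  # ~0.028

The repetition code triples the number of transmitted bits for only a modest gain; the point of coding theory is to approach channel capacity with far less added redundancy.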
This division of coding theory into compression and transmission is justified by the information transmission theorems, or source-channel separation theorems that justify the use of bits as the universal currency for information in many contexts. However, these theorems only hold in the situation where one transmitting user wishes to communicate to one receiving user. In scenarios with more than one transmitter (the multiple-access channel), more than one receiver (the broadcast channel) or intermediary "helpers" (the relay channel), or more general networks, compression followed by transmission may no longer be optimal. Network information theory refers to these multi-agent communication models.
Measure Theory
Here is an interesting and illuminating connection between information theory and measure theory:
If to arbitrary discrete random variables $X$ and $Y$ we associate the existence of sets $\tilde{X}$ and $\tilde{Y}$, somehow representing the information borne by $X$ and $Y$, respectively, such that:
- $\tilde{X}$ and $\tilde{Y}$ are disjoint whenever $X$ and $Y$ are independent, and
- $\tilde{X} = \tilde{Y}$ whenever $X$ and $Y$ are such that either one is completely determined by the other (i.e. by a bijection);
and $\mu$ is a measure over these sets, and we set:
$$H(X) = \mu(\tilde{X}),\qquad H(Y) = \mu(\tilde{Y}),\qquad H(X, Y) = \mu(\tilde{X} \cup \tilde{Y}),$$
$$H(X \mid Y) = \mu(\tilde{X} \setminus \tilde{Y}),\qquad I(X; Y) = \mu(\tilde{X} \cap \tilde{Y}),$$
we find that Shannon's "measure" of information content satisfies all the postulates and basic properties of a formal measure over sets. This can be a handy mnemonic device in some situations. Certain extensions to the definitions of Shannon's basic measures of information are necessary to deal with the σ-algebra generated by the sets that would be associated to three or more arbitrary random variables. These are discussed informally in Reza pp. 106-108.
This connection is important for two reasons: first, it reiterates and clarifies the fundamental properties of these basic concepts of information theory, and second, it justifies, in a certain formal sense, the practice of calling Shannon's entropy a "measure" of information.
Cryptography
Information theoretic concepts are widely used in making and breaking cryptographic ciphers. For an interesting historical example, see the article on deciban.
Detection and Estimation Theory
(See those articles.)
Music
Composer James Tenney, among others such as his teacher Lejaren Hiller, has used information theory in the composition of musical works such as Ergodos.
Kolmogorov Complexity
A. N. Kolmogorov introduced an alternative information measure that is based on the length of the shortest algorithm to produce a message; see Kolmogorov complexity.
Relation with thermodynamic entropy
See main article: Entropy in thermodynamics and information theory.
There are close links between information entropy as defined by Shannon and thermodynamic entropy, and in particular its statistical interpretation, established by Ludwig Boltzmann and J. Willard Gibbs in the 1870s. A quantity defined by the entropy formula was first introduced by Boltzmann in 1872, in the context of his H-theorem.
The quantity of information defined by Shannon for entropy refers to the relative rate of information transmitted over a channel through time, whereas the Boltzmann equation defines the "disorder" of a physical system at a specific point in time. Therefore, the two concepts cannot be considered equivalent: one measures probability through the domain of time, and the other measures probability through the domain of space.
Furthermore, the Boltzmann equation refers to the probability of a macrostate being in one of many microstates, whereas Shannon defines the probability of a symbol occurring from one among many symbols. In the Boltzmann equation, the actual microstate probability is not equal to the entropy of the system; the entropy refers to the probability of a microstate existing as one among a set of microstates corresponding to a macrostate. The information content of a physical system obeying stochastic behavior is equal to the information content of a Markov process.
At a theoretical level, as E. T. Jaynes pointed out in 1957, equilibrium statistical mechanics gives the prescription that the probability distribution which should be assigned for the unknown microstate of a thermodynamic system is that which has maximum Shannon entropy, given that it must also satisfy the macroscopic description of the system. But this is just an application of a quite general rule in information theory, used whenever one wishes to choose a maximally uninformative distribution. The thermodynamic entropy, measuring the disorder of this equilibrium distribution, is then just this maximum Shannon entropy in nats, multiplied for historical reasons by Boltzmann's constant. (See article: MaxEnt thermodynamics.)
At a physical level, the connection between physics and information theory establishes physical limits to computation: for example, it is not possible to physically erase one bit of information without releasing at least $kT \ln 2$ of heat into surroundings at an absolute temperature $T$, where $k$ is Boltzmann's constant. This has been applied to the paradox of Maxwell's demon, which would need to process information to reverse thermodynamic entropy; but erasing that information, to begin again, exactly balances out the thermodynamic gain that the demon would otherwise achieve.
Landauer [1961] shows that a counterpart of this process can occur as well: an array of ordered bits of memory can become "thermalized," or populated with random data, and in the process cool off its surroundings. N bits would then increase the entropy of the system by $Nk \ln 2$ as they thermalize.
All these connections are based on an implicit correspondence between the information-theoretic entropy of Shannon and Hartley, usually expressed as H, and the thermodynamic entropy of Clausius and Carnot, usually denoted by S, of a physical system, which can be expressed simply as
$$S = k H,$$
provided H is expressed in nats.
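As a hedged numerical illustration of the Landauer limit mentioned above (the room-temperature figure is an arbitrary choice of ours):

    import math

    k = 1.380649e-23   # Boltzmann's constant, in joules per kelvin
    T = 300.0          # absolute temperature, in kelvin (roughly room temperature)

    # Minimum heat released when one bit of information is erased: k * T * ln 2
    print(k * T * math.log(2))   # ~2.87e-21 joules per bit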
In a sense it is not surprising that computers must obey the same physical laws that steam engines do, even though they are radically different devices.
See Brillouin for an earlier treatment of this connection, which Landauer also referred to. Moreover, Stephen Hawking often speaks of the thermodynamic entropy of black holes in terms of their information content.
Quantum Information Theory
See this tutorial.
History
Claude E. Shannon (1916–2001) founded information theory with his classic paper "A Mathematical Theory of Communication", published in the Bell System Technical Journal in July and October of 1948. This work drew on earlier publications by Harry Nyquist and Ralph Hartley. At the beginning of his paper, Shannon asserted that
- "The fundamental problem of communication is that of reproducing at one point, either exactly or approximately, a message selected at another point."
His theory for the first time considered communication as a rigorously stated mathematical problem in statistics and gave communications engineers a way to determine the capacity of a communication channel in terms of the common currency of bits. This problem is called the channel coding problem. The transmission part of the theory is not concerned with the meaning (semantics) of the message conveyed.
Several early notions of information entropy predated Shannon's paper.
One version was defined and used during the Second World War by Alan Turing at Bletchley Park. Turing named it "weight of evidence" and measured it in units called bans and decibans. (This is not to be confused with the weight of evidence defined by I.J. Good and described in the article statistical inference, which Turing also tackled and named "log-odds" or "lods".) Turing and Shannon collaborated during the war but it appears that they independently created the concept.
The hartley, a unit of entropy named after Ralph Hartley, was introduced by him in 1928. (The hartley coincides with Turing's "ban", since both are based on base-10 logarithms.)
A very similar concept was developed by Boltzmann during the 1800s in his work on statistical mechanics. Shannon himself did not realize the deep connection between his work and Boltzmann's earlier work, but John von Neumann did. The story goes that when Shannon was deciding what to call the quantity, fearing that 'information' was already over-used, von Neumann told him: "You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage." This connection, between information-theoretic entropy and thermodynamic entropy, was later studied by Rolf Landauer and is described in this article.
See also
- Algorithmic information theory
- Coding Theory from an Artistic Viewpoint
- Detection theory
- Estimation theory
- Fisher information
- Hubert Yockey
- Information geometry
- Important publications in information theory
- Information entropy
References
The Classic Paper
- Claude E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, July/October 1948.
Other Journal Articles
- R.V.L. Hartley, "Transmission of Information," Bell System Technical Journal, July 1928
- R. Landauer, "Information is Physical", Proc. Workshop on Physics and Computation PhysComp '92 (IEEE Comp. Sci. Press, Los Alamitos, 1993), pp. 1-4.
- R. Landauer, "Irreversibility and Heat Generation in the Computing Process", IBM J. Res. Develop., Vol. 5, No. 3, 1961.
Textbooks on Information Theory
- Claude E. Shannon, Warren Weaver. The Mathematical Theory of Communication. Univ of Illinois Press, 1963. ISBN 0252725484
- Robert B. Ash. Information Theory. New York: Dover 1990. ISBN 0486665216
- Thomas M. Cover, Joy A. Thomas. Elements of Information Theory. New York: Wiley, 1991. ISBN 0471062596
- Stanford Goldman. Information Theory. Mineola, N.Y.: Dover, 2005. ISBN 0486442713
- Fazlollah M. Reza. An Introduction to Information Theory. New York: Dover, 1994. ISBN 048668210
- David J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press, 2003. ISBN 0521642981
Other Books
- Leon Brillouin, Science and Information Theory, Mineola, N.Y.: Dover, [1956, 1962] 2004. ISBN 0486439186
- H. S. Leff and A. F. Rex, Editors, Maxwell's Demon: Entropy, Information, Computing, Princeton University Press, Princeton, NJ (1990). ISBN 069108727X
External links
- Journal of Chemical Education, Shuffled Cards, Messy Desks, and Disorderly Dorm Rooms - Examples of Entropy Increase? Nonsense!
- Claude E. Shannon's original paper (PDF), plus notes and other formats
- IEEE Information Theory Society and its review articles.
- On-line textbook: Information Theory, Inference, and Learning Algorithms, by David MacKay - gives an entertaining and thorough introduction to Shannon theory, including state-of-the-art methods from coding theory, such as arithmetic coding, low-density parity-check codes, and Turbo codes.
- A tutorial on quantum information