# Quantities of information

A simple information diagram illustrating the relationships among some of Shannon's basic quantities of information.

The mathematical theory of information is based on probability theory and statistics, and measures information with several quantities of information. The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. The most common unit of information is the bit, based on the binary logarithm. Other units include the nat, based on the natural logarithm, and the hartley, based on the base 10 or common logarithm.

In what follows, an expression of the form $p \log p \,$ is considered by convention to be equal to zero whenever p is zero. This is justified because $\lim_{p \rightarrow 0+} p \log p = 0$ for any logarithmic base.

## Self-information

Shannon derived a measure of information content called the self-information or "surprisal" of a message m:

$I(m) = \log \left( \frac{1}{p(m)} \right) = - \log( p(m) ) \,$

where $p(m) = \mathrm{Pr}(M=m)$ is the probability that message m is chosen from all possible choices in the message space $M$. The base of the logarithm only affects a scaling factor and, consequently, the units in which the measured information content is expressed. If the logarithm is base 2, the measure of information is expressed in units of bits.

Information is transferred from a source to a recipient only if the recipient of the information did not already have the information to begin with. Messages that convey information that is certain to happen and already known by the recipient contain no real information. Infrequently occurring messages contain more information than more frequently occurring messages. This fact is reflected in the above equation - a certain message, i.e. of probability 1, has an information measure of zero. In addition, a compound message of two (or more) unrelated (or mutually independent) messages would have a quantity of information that is the sum of the measures of information of each message individually. That fact is also reflected in the above equation, supporting the validity of its derivation.

An example: The weather forecast broadcast is: "Tonight's forecast: Dark. Continued darkness until widely scattered light in the morning." This message contains almost no information. However, a forecast of a snowstorm would certainly contain information since such does not happen every evening. There would be an even greater amount of information in an accurate forecast of snow for a warm location, such as Miami. The amount of information in a forecast of snow for a location where it never snows (impossible event) is the highest (infinity).

## Entropy

The entropy of a discrete message space $M$ is a measure of the amount of uncertainty one has about which message will be chosen. It is defined as the average self-information of a message $m$ from that message space:

$H(M) = \mathbb{E} \{I(M)\} = \sum_{m \in M} p(m) I(m) = -\sum_{m \in M} p(m) \log p(m).$

where

$\mathbb{E} \{ \}$ denotes the expected value operation.

An important property of entropy is that it is maximized when all the messages in the message space are equiprobable (e.g. $p(m) = 1/M$). In this case $H(M) = \log |M|.$.

Sometimes the function H is expressed in terms of the probabilities of the distribution:

$H(p_1, p_2, \ldots , p_k) = -\sum_{i=1}^k p_i \log p_i,$ where each $p_i \geq 0$ and $\sum_{i=1}^k p_i = 1.$

An important special case of this is the binary entropy function:

$H_\mbox{b}(p) = H(p, 1-p) = - p \log p - (1-p)\log (1-p).\,$

## Joint entropy

The joint entropy of two discrete random variables $X$ and $Y$ is defined as the entropy of the joint distribution of $X$ and $Y$:

$H(X, Y) = \mathbb{E}_{X,Y} [-\log p(x,y)] = - \sum_{x, y} p(x, y) \log p(x, y) \,$

If $X$ and $Y$ are independent, then the joint entropy is simply the sum of their individual entropies.

(Note: The joint entropy should not be confused with the cross entropy, despite similar notations.)

## Conditional entropy (equivocation)

Given a particular value of a random variable $Y$, the conditional entropy of $X$ given $Y=y$ is defined as:

$H(X|y) = \mathbb{E}_{{X|Y}} [-\log p(x|y)] = -\sum_{x \in X} p(x|y) \log p(x|y)$

where $p(x|y) = \frac{p(x,y)}{p(y)}$ is the conditional probability of $x$ given $y$.

The conditional entropy of $X$ given $Y$, also called the equivocation of $X$ about $Y$ is then given by:

$H(X|Y) = \mathbb E_Y \{H(X|y)\} = -\sum_{y \in Y} p(y) \sum_{x \in X} p(x|y) \log p(x|y) = \sum_{x,y} p(x,y) \log \frac{p(y)}{p(x,y)}.$

A basic property of the conditional entropy is that:

$H(X|Y) = H(X,Y) - H(Y) .\,$