# Information content

In information theory, the information content, self-information, surprisal, or Shannon information is a basic quantity derived from the probability of a particular event occurring from a random variable. It can be thought of as an alternative way of expressing probability, much like odds or log-odds, but which has particular mathematical advantages in the setting of information theory.

The Shannon information can be interpreted as quantifying the level of "surprise" of a particular outcome. As it is such a basic quantity, it also appears in several other settings, such as the length of a message needed to transmit the event given an optimal source coding of the random variable.

The Shannon information is closely related to entropy, which is the expected value of the self-information of a random variable, quantifying how surprising the random variable is "on average". This is the average amount of self-information an observer would expect to gain about a random variable when measuring it.

The information content can be expressed in various units of information, of which the most common is the "bit" (more correctly called the shannon), as explained below.

## Definition

Claude Shannon's definition of self-information was chosen to meet several axioms:

1. An event with probability 100% is perfectly unsurprising and yields no information.
2. The less probable an event is, the more surprising it is and the more information it yields.
3. If two independent events are measured separately, the total amount of information is the sum of the self-informations of the individual events.

The detailed derivation is below, but it can be shown that there is a unique function of probability that meets these three axioms, up to a multiplicative scaling factor. Broadly, given a real number $b>1$ and an event $x$ with probability $P$ , the information content is defined as follows:

$\mathrm {I} (x):=-\log _{b}{\left[\Pr {\left(x\right)}\right]}=-\log _{b}{\left(P\right)}.$

The base $b$ corresponds to the scaling factor above. Different choices of $b$ correspond to different units of information: when $b=2$, the unit is the shannon (symbol Sh), often called a "bit"; when $b=e$, the unit is the natural unit of information (symbol nat); and when $b=10$, the unit is the hartley (symbol Hart).
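As a minimal sketch of the definition, the following Python snippet evaluates $-\log_b(P)$ for the three bases mentioned. The function name `information_content` and the probability 1/8 are illustrative assumptions, not part of the definition.

```python
import math

def information_content(p: float, base: float = 2.0) -> float:
    """Return -log_base(p) for a probability p in (0, 1] and a base > 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    if base <= 1.0:
        raise ValueError("base must be greater than 1")
    return -math.log(p) / math.log(base)

p = 0.125  # hypothetical event with probability 1/8

print(information_content(p, 2))       # in shannons (bits), about 3
print(information_content(p, math.e))  # in nats
print(information_content(p, 10))      # in hartleys
```

Changing the base only rescales the result by a constant factor, which is why the choice of unit does not affect comparisons between events.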

Formally, given a random variable $X$ with probability mass function $p_{X}{\left(x\right)}$ , the self-information of measuring $X$ as outcome $x$ is defined as

$\operatorname {I} _{X}(x):=-\log {\left[p_{X}{\left(x\right)}\right]}=\log {\left({\frac {1}{p_{X}{\left(x\right)}}}\right)}.$

The use of the notation $I_{X}(x)$ for self-information above is not universal. Since the notation $I(X;Y)$ is also often used for the related quantity of mutual information, many authors use a lowercase $h_{X}(x)$ for self-entropy instead, mirroring the use of the capital $H(X)$ for the entropy.

## Properties

### Monotonically decreasing function of probability

For a given probability space, the measurement of rarer events is intuitively more "surprising", and yields more information content, than the measurement of more common events. Thus, self-information is a strictly decreasing monotonic function of the probability, sometimes called an "antitonic" function.

While standard probabilities are represented by real numbers in the interval $[0,1]$ , self-informations are represented by extended real numbers in the interval $[0,\infty ]$ . In particular, we have the following, for any choice of logarithmic base:

• If a particular event has a 100% probability of occurring, then its self-information is $-\log(1)=0$ : its occurrence is "perfectly non-surprising" and yields no information.
• If a particular event has a 0% probability of occurring, then its self-information is $-\log(0)=\infty$ : its occurrence is "infinitely surprising".
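The two endpoint cases and the strictly decreasing behaviour can be checked numerically. This is a sketch that assumes base 2 and adopts the convention $-\log(0)=\infty$ explicitly; the name `surprisal_bits` and the sample probabilities are illustrative.

```python
import math

def surprisal_bits(p: float) -> float:
    """Self-information in shannons, extended to p = 0 by convention."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p == 0.0:
        return math.inf  # -log(0) is taken to be +infinity
    if p == 1.0:
        return 0.0       # avoids returning the float -0.0
    return -math.log2(p)

# Probabilities listed from most to least likely (illustrative values).
probs = [1.0, 0.5, 0.25, 1e-6, 0.0]
values = [surprisal_bits(p) for p in probs]

for p, s in zip(probs, values):
    print(f"p = {p:<8g} I = {s:.3f} Sh")
```

As the probability falls from 1 to 0, the printed values rise from 0 towards infinity.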

From this, we can get a few general properties:

• Intuitively, more information is gained from observing an unexpected event—it is "surprising".
• For example, if there is a one-in-a-million chance of Alice winning the lottery, her friend Bob will gain significantly more information from learning that she won than that she lost on a given day. (See also Lottery mathematics.)
• This establishes an implicit relationship between the self-information of a random variable and its variance.

### Relationship to log-odds

The Shannon information is closely related to the log-odds. In particular, given some event $x$ , suppose that $p(x)$ is the probability of $x$ occurring, and that $p(\lnot x)=1-p(x)$ is the probability of $x$ not occurring. Then we have the following definition of the log-odds:

${\text{log-odds}}(x)=\log \left({\frac {p(x)}{p(\lnot x)}}\right)$

This can be expressed as a difference of two Shannon informations:

${\text{log-odds}}(x)=\mathrm {I} (\lnot x)-\mathrm {I} (x)$

In other words, the log-odds can be interpreted as the level of surprise when the event does not happen, minus the level of surprise when the event does happen.
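The identity can be verified numerically. The sketch below assumes natural logarithms and a hypothetical probability of 0.8; the helper names `surprisal` and `log_odds` are chosen for illustration.

```python
import math

def surprisal(p: float) -> float:
    """Self-information in nats for a probability p in (0, 1)."""
    return -math.log(p)

def log_odds(p: float) -> float:
    """Natural-log odds of an event with probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

p = 0.8  # hypothetical probability that the event occurs

lhs = log_odds(p)
rhs = surprisal(1.0 - p) - surprisal(p)  # I(not x) - I(x)

print(lhs, rhs)  # the two values agree up to floating-point rounding
```

The log-odds is positive exactly when the event is less surprising than its complement, that is, when $p(x) > 1/2$.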

### Additivity of independent events

The information content of two independent events is the sum of each event's information content. This property is known as additivity in mathematics, and as sigma additivity in particular in measure and probability theory. Consider two independent random variables ${\textstyle X,\,Y}$ with probability mass functions $p_{X}(x)$ and $p_{Y}(y)$ respectively. The joint probability mass function is

$p_{X,Y}\!\left(x,y\right)=\Pr(X=x,\,Y=y)=p_{X}\!(x)\,p_{Y}\!(y)$

because ${\textstyle X}$ and ${\textstyle Y}$ are independent. The information content of the outcome $(X,Y)=(x,y)$ is

${\begin{aligned}\operatorname {I} _{X,Y}(x,y)&=-\log _{2}\left[p_{X,Y}(x,y)\right]=-\log _{2}\left[p_{X}\!(x)p_{Y}\!(y)\right]\\[5pt]&=-\log _{2}\left[p_{X}{(x)}\right]-\log _{2}\left[p_{Y}{(y)}\right]\\[5pt]&=\operatorname {I} _{X}(x)+\operatorname {I} _{Y}(y)\end{aligned}}$

See § Two independent, identically distributed dice below for an example.
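As a quick numerical check of additivity, the sketch below assumes two fair six-sided dice rolled independently, so that each face has probability 1/6. The function name `info_bits` is illustrative.

```python
import math

def info_bits(p: float) -> float:
    """Self-information in shannons for a probability p in (0, 1]."""
    return -math.log2(p)

# Assumed setup: two fair, independent six-sided dice.
p_x = 1 / 6          # probability of a specific face on the first die
p_y = 1 / 6          # probability of a specific face on the second die
p_joint = p_x * p_y  # independence: the joint probability is the product

joint = info_bits(p_joint)
separate = info_bits(p_x) + info_bits(p_y)

print(joint, separate)  # both are approximately log2(36)
```

The product of probabilities turns into a sum of information contents because the logarithm maps multiplication to addition.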

The corresponding property for likelihoods is that the log-likelihood of independent events is the sum of the log-likelihoods of each event. Interpreting log-likelihood as "support" or negative surprisal (the degree to which an event supports a given model: a model is supported by an event to the extent that the event is unsurprising, given the model), this states that independent events add support: the information that the two events together provide for statistical inference is the sum of their independent information.

## Relationship to entropy

The Shannon entropy of the random variable $X$ above is defined as

${\begin{alignedat}{2}\mathrm {H} (X)&=\sum _{x}{-p_{X}{\left(x\right)}\log {p_{X}{\left(x\right)}}}\\&=\sum _{x}{p_{X}{\left(x\right)}\operatorname {I} _{X}(x)}\\&{\overset {\underset {\mathrm {def} }{}}{=}}\ \operatorname {E} {\left[\operatorname {I} _{X}(X)\right]},\end{alignedat}}$

which is by definition equal to the expected information content of a measurement of $X$. The expectation is taken over the discrete values in the support of $X$.
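The entropy can therefore be computed as a probability-weighted average of self-informations. The following sketch assumes base 2 and a hypothetical three-outcome distribution; the names `self_information`, `entropy`, and `pmf` are illustrative.

```python
import math

def self_information(p: float) -> float:
    """Self-information in shannons for a probability p in (0, 1]."""
    return -math.log2(p)

def entropy(pmf: dict) -> float:
    """Expected self-information, summed over outcomes with p > 0."""
    return sum(p * self_information(p) for p in pmf.values() if p > 0)

# Hypothetical distribution over three outcomes.
pmf = {"a": 0.5, "b": 0.25, "c": 0.25}

h = entropy(pmf)
print(h)  # 0.5 * 1 + 0.25 * 2 + 0.25 * 2 = 1.5 Sh
```

Outcomes with zero probability are skipped, which restricts the sum to the support of the distribution.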

Sometimes, the entropy itself is called the "self-information" of the random variable, possibly because the entropy satisfies $\mathrm {H} (X)=\operatorname {I} (X;X)$ , where $\operatorname {I} (X;X)$ is the mutual information of $X$ with itself.
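The identity $\mathrm {H} (X)=\operatorname {I} (X;X)$ can be checked on a small example. This sketch assumes the standard discrete definition of mutual information, $\sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$, which is not stated in this article, and uses a hypothetical distribution. The joint distribution of $X$ with itself puts all mass on the diagonal pairs $(x,x)$.

```python
import math

def entropy(pmf: dict) -> float:
    """Shannon entropy in shannons of a discrete distribution."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def mutual_information(joint: dict, px: dict, py: dict) -> float:
    """Mutual information in shannons from a joint pmf and its marginals."""
    return sum(
        pxy * math.log2(pxy / (px[x] * py[y]))
        for (x, y), pxy in joint.items()
        if pxy > 0
    )

# Hypothetical distribution of X.
pmf = {"a": 0.5, "b": 0.25, "c": 0.25}

# Joint distribution of (X, X): only the diagonal pairs have nonzero mass.
joint_xx = {(x, x): p for x, p in pmf.items()}

h = entropy(pmf)
i = mutual_information(joint_xx, pmf, pmf)
print(h, i)  # both equal 1.5 Sh for this distribution
```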

For continuous random variables the corresponding concept is differential entropy.