# Self-information

In information theory, self-information or surprisal is a measure of the information content associated with an event in a probability space or with the value of a discrete random variable. It is expressed in a unit of information, for example bits, nats, or hartleys, depending on the base of the logarithm used in its calculation.

The term self-information is also sometimes used as a synonym of the related information-theoretic concept of entropy. These two meanings are not equivalent, and this article covers the first sense only.

## Definition

By definition, information is transferred from an originating entity possessing the information to a receiving entity only when the receiver had not known the information a priori. If the receiving entity had previously known the content of a message with certainty before receiving the message, the amount of information of the message received is zero.

For example, quoting a character (the Hippy Dippy Weatherman) of comedian George Carlin, “Weather forecast for tonight: dark. Continued dark overnight, with widely scattered light by morning.” Assuming one not residing near the Earth's poles or polar circles, the amount of information conveyed in that forecast is zero because it is known, in advance of receiving the forecast, that darkness always comes with the night.

When the content of a message is known a priori with certainty, with probability of 1, there is no actual information conveyed in the message. Only when the advanced knowledge of the content of the message by the receiver is less certain than 100% does the message actually convey information.

Accordingly, the amount of self-information contained in a message informing of the occurrence of an event ${\displaystyle \omega _{n}}$ depends only on the probability of that event:

${\displaystyle \operatorname {I} (\omega _{n})=f(\operatorname {P} (\omega _{n}))}$

for some function ${\displaystyle f(\cdot )}$ to be determined below. If ${\displaystyle \operatorname {P} (\omega _{n})=1}$, then ${\displaystyle \operatorname {I} (\omega _{n})=0}$. If ${\displaystyle \operatorname {P} (\omega _{n})<1}$, then ${\displaystyle \operatorname {I} (\omega _{n})>0}$.

Further, by definition, the measure of self-information is nonnegative and additive. If a message informs of event ${\displaystyle C}$, the intersection of two independent events ${\displaystyle A}$ and ${\displaystyle B}$, then it is the compound message informing that both independent events ${\displaystyle A}$ and ${\displaystyle B}$ have occurred. The quantity of information of the compound message ${\displaystyle C}$ should equal the sum of the amounts of information of the individual component messages ${\displaystyle A}$ and ${\displaystyle B}$:

${\displaystyle \operatorname {I} (C)=\operatorname {I} (A\cap B)=\operatorname {I} (A)+\operatorname {I} (B)}$.

Because of the independence of events ${\displaystyle A}$ and ${\displaystyle B}$, the probability of event ${\displaystyle C}$ is

${\displaystyle \operatorname {P} (C)=\operatorname {P} (A\cap B)=\operatorname {P} (A)\cdot \operatorname {P} (B)}$.

Applying the function ${\displaystyle f(\cdot )}$ and combining the two relations above gives

${\displaystyle {\begin{aligned}\operatorname {I} (C)&=\operatorname {I} (A)+\operatorname {I} (B)\\f(\operatorname {P} (C))&=f(\operatorname {P} (A))+f(\operatorname {P} (B))\\&=f{\big (}\operatorname {P} (A)\cdot \operatorname {P} (B){\big )}\end{aligned}}}$

The class of functions ${\displaystyle f(\cdot )}$ with the property that

${\displaystyle f(x\cdot y)=f(x)+f(y)}$

is the class of logarithm functions (to any base). The only operational difference between logarithms of different bases is a constant scaling factor:

${\displaystyle f(x)=K\log(x)}$

Since the probabilities of events always lie between 0 and 1, their logarithms are nonpositive; for the information associated with these events to be nonnegative, this requires that ${\displaystyle K<0}$.
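The two requirements above, additivity over independent events and nonnegativity on probabilities, can be checked numerically for the choice ${\displaystyle K=-1}$; a minimal sketch:

```python
import math

def f(x, K=-1.0):
    # f(x) = K * log(x); K < 0 makes f nonnegative for 0 < x <= 1
    return K * math.log(x)

x, y = 0.5, 0.25
# Additivity over independent events: f(x*y) == f(x) + f(y)
assert math.isclose(f(x * y), f(x) + f(y))
# Nonnegative on (0, 1], and zero for a certain event
assert f(0.3) > 0 and f(1.0) == 0.0
```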

Taking into account these properties, the self-information ${\displaystyle \operatorname {I} (\omega _{n})}$ associated with outcome ${\displaystyle \omega _{n}}$ with probability ${\displaystyle \operatorname {P} (\omega _{n})}$ is defined as:

${\displaystyle \operatorname {I} (\omega _{n})=-\log(\operatorname {P} (\omega _{n}))=\log \left({\frac {1}{\operatorname {P} (\omega _{n})}}\right)}$

The smaller the probability of event ${\displaystyle \omega _{n}}$, the larger the quantity of self-information associated with the message that the event indeed occurred. If the above logarithm is base 2, the unit of ${\displaystyle \operatorname {I} (\omega _{n})}$ is the bit; this is the most common practice. When using the natural logarithm (base ${\displaystyle e}$), the unit is the nat. For the base 10 logarithm, the unit of information is the hartley.
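The definition and the choice of unit translate directly into code; a minimal sketch, where the function name `self_information` is illustrative:

```python
import math

def self_information(p, base=2):
    """I(p) = -log_base(p): bits (base 2), nats (base e), hartleys (base 10)."""
    return -math.log(p, base)

p = 0.125
print(self_information(p, 2))        # bits: exactly 3.0 here
print(self_information(p, math.e))   # nats
print(self_information(p, 10))       # hartleys
```

The three results differ only by the constant scaling factor between logarithm bases, as derived above.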

As a quick illustration, the information content associated with an outcome of 4 heads (or any specific outcome) in 4 consecutive tosses of a fair coin would be 4 bits (probability 1/16), and the information content associated with getting a result other than the one specified would be approximately 0.093 bits (probability 15/16). See below for detailed examples.
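The coin-toss figures can be reproduced in a couple of lines (a minimal check):

```python
import math

def bits(p):
    # Self-information in bits: -log2(p)
    return -math.log2(p)

# Four specified outcomes in four fair tosses: p = (1/2)**4 = 1/16
print(bits(1 / 16))   # 4.0 bits
# Any other outcome: p = 15/16
print(bits(15 / 16))  # ≈ 0.093 bits
```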

This measure has also been called surprisal, as it represents the "surprise" of seeing the outcome (a highly improbable outcome is very surprising). This term was coined by Myron Tribus in his 1961 book Thermostatics and Thermodynamics.

The information entropy of a random event is the expected value of its self-information.

Self-information is an example of a proper scoring rule.

## Examples

• On tossing a fair coin, the probability of 'tail' is 0.5. When it is proclaimed that indeed 'tail' occurred, this amounts to
I('tail') = log2 (1/0.5) = log2 2 = 1 bit of information.
• When throwing a fair die, the probability of 'four' is 1/6. When it is proclaimed that 'four' has been thrown, the amount of self-information is
I('four') = log2 (1/(1/6)) = log2 (6) = 2.585 bits.
• When, independently, two dice are thrown, the amount of information associated with {throw 1 = 'two' & throw 2 = 'four'} equals
I('throw 1 is two & throw 2 is four') = log2 (1/P(throw 1 = 'two' & throw 2 = 'four')) = log2 (1/(1/36)) = log2 (36) = 5.170 bits.
This outcome equals the sum of the individual amounts of self-information associated with {throw 1 = 'two'} and {throw 2 = 'four'}; namely 2.585 + 2.585 = 5.170 bits.
• In the same two-dice situation, we can also consider the information present in the statement "The sum of the two dice is five":
I('The sum of throws 1 and 2 is five') = log2 (1/P('throws 1 and 2 sum to five')) = log2 (1/(4/36)) = log2 (9) = 3.170 bits. The probability is 4/36 because four of the 36 equally likely outcomes sum the two dice to five. This shows that more complex or ambiguous events can still carry information.
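The dice examples above can be verified with the same base-2 formula; a minimal sketch:

```python
import math

def bits(p):
    # Self-information in bits: -log2(p)
    return -math.log2(p)

print(round(bits(1 / 2), 3))    # 'tail' on a fair coin: 1.0 bit
print(round(bits(1 / 6), 3))    # 'four' on a fair die: 2.585 bits
print(round(bits(1 / 36), 3))   # two specified independent throws: 5.17 bits
# Sum of two dice equals five: 4 favourable outcomes out of 36
print(round(bits(4 / 36), 3))   # 3.17 bits
```

Note that bits(1/36) equals bits(1/6) + bits(1/6), illustrating the additivity over independent events used in the derivation above.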

## Self-information of a partitioning

The self-information of a partitioning of elements within a set (or clustering) is the expected information of a test object: if we select an element at random and observe which partition/cluster it lies in, what quantity of information do we expect to obtain? With ${\displaystyle \operatorname {P} (k)}$ denoting the fraction of elements within partition ${\displaystyle k}$, the information of a partitioning ${\displaystyle C}$ is[1]

${\displaystyle \operatorname {I} (C)=\operatorname {E} (-\log(\operatorname {P} (C)))=-\sum _{k=1}^{n}\operatorname {P} (k)\log(\operatorname {P} (k))}$
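This expectation can be computed directly from the partition fractions; a minimal sketch using a hypothetical clustering of 10 elements:

```python
import math

def partition_information(fractions):
    """Expected information (in bits) of observing which partition a
    randomly chosen element lies in: -sum_k P(k) * log2(P(k))."""
    return -sum(p * math.log2(p) for p in fractions if p > 0)

# Hypothetical clustering: 10 elements split into partitions of sizes 5, 3, 2
sizes = [5, 3, 2]
total = sum(sizes)
print(round(partition_information([s / total for s in sizes]), 3))
```

As the next section notes, this quantity is exactly the entropy of the distribution of partition fractions.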

## Relationship to entropy

The entropy is the expected value of the self-information of the values of a discrete random variable. Sometimes, the entropy itself is called the "self-information" of the random variable, possibly because the entropy satisfies ${\displaystyle \operatorname {H} (X)=\operatorname {I} (X;X)}$, where ${\displaystyle \operatorname {I} (X;X)}$ is the mutual information of ${\displaystyle X}$ with itself.[2]

## References

1. ^ Meilă, Marina (2007). "Comparing clusterings—an information based distance". Journal of Multivariate Analysis. 98 (5).
2. ^ Cover, Thomas M.; Thomas, Joy A. (1991). Elements of Information Theory. p. 20.
• Shannon, C. E. (1948). "A Mathematical Theory of Communication". Bell System Technical Journal. 27: 379–423.