Categorical distribution
In probability theory and statistics, a categorical distribution (occasionally termed the "discrete distribution",[citation needed] which properly refers to a general class of distributions) is a probability distribution that describes the result of a random event that can take on one of K possible outcomes, with the probability of each outcome separately specified. There is not necessarily an underlying ordering of these outcomes, but numerical labels are attached for convenience in describing the distribution, often in the range 1 to K. Note that the K-dimensional categorical distribution is the most general distribution over a K-way event; any other discrete distribution over a size-K sample space is a special case. The parameters specifying the probabilities of each possible outcome are constrained only by the fact that each must be in the range 0 to 1, and all must sum to 1.
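As an illustration of the constraint on the parameters, the following minimal sketch checks whether a list of numbers is a valid parameter vector and draws one outcome by inverting the cumulative probabilities. The function names `is_valid_parameter` and `sample_label`, the tolerance `tol`, and the choice of inverse-CDF sampling are assumptions made for this example; they are not part of the definition above.

```python
import math
import random
from itertools import accumulate


def is_valid_parameter(p, tol=1e-9):
    """Return True if every entry of p lies in [0, 1] and the entries sum to 1."""
    in_range = all(0.0 <= p_i <= 1.0 for p_i in p)
    return in_range and math.isclose(sum(p), 1.0, abs_tol=tol)


def sample_label(p, u=None):
    """Draw a label in 1..K by inverting the cumulative distribution.

    u is a number in [0, 1); if omitted it is drawn with random.random().
    """
    if not is_valid_parameter(p):
        raise ValueError("p must lie in [0, 1] elementwise and sum to 1")
    if u is None:
        u = random.random()
    for label, cumulative in enumerate(accumulate(p), start=1):
        if u < cumulative:
            return label
    return len(p)  # guard against rounding when u is very close to 1
```

Passing `u` explicitly makes the draw reproducible, for example `sample_label([0.2, 0.3, 0.5], u=0.25)` falls in the second interval. In practice the standard library's `random.choices(population, weights=p)` performs the same kind of weighted draw.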
Note that, in some fields, such as natural language processing, the categorical and multinomial distributions are conflated, and it is common to speak of a "multinomial distribution" when a categorical distribution is actually meant. This stems from the fact that it is sometimes convenient to express the outcome of a categorical distribution as a "1-of-K" vector (a vector with one element containing a 1 and all other elements containing a 0) rather than as an integer in the range 1 to K; in this form, a categorical distribution is equivalent to a multinomial distribution for a single observation (see below).
Introduction
A categorical distribution is a discrete probability distribution whose sample space is a set of n individually identified items (n plays the role of K in the preceding paragraphs). It generalizes the Bernoulli distribution to a categorical random variable with more than two possible outcomes.
In one formulation of the distribution, the sample space is taken to be a finite sequence of integers. The items might be encoded as {0, 1, ..., n-1} or {1, 2, ..., n}, for example; the latter is used here, although the former is used for the Bernoulli distribution. In this case, the probability mass function f is:
$f(x = i \mid \boldsymbol{p}) = p_i,$
where $p_i$ represents the probability of seeing element i and
$\sum_{i=1}^{n} p_i = 1.$
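The integer-label formulation can be sketched in a few lines of Python. This is only an illustration: the function name `categorical_pmf` and the `first_label` argument (used to switch between the {1, ..., n} and {0, ..., n-1} encodings) are hypothetical, and the parameter values are made up.

```python
def categorical_pmf(i, p, first_label=1):
    """Probability mass of integer label i for labels first_label, ..., first_label + n - 1."""
    index = i - first_label
    if 0 <= index < len(p):
        return p[index]
    return 0.0  # labels outside the sample space carry no mass


p = [0.5, 0.25, 0.25]                                   # illustrative parameter vector, n = 3
masses = [categorical_pmf(i, p) for i in (1, 2, 3)]     # mass of each label in {1, 2, 3}
```

With `first_label=0` the same function covers the {0, 1, ..., n-1} encoding, so `categorical_pmf(0, p, first_label=0)` returns the first entry of `p`.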
In another formulation, the categorical distribution is a special case of the multinomial distribution in which the number-of-trials parameter of the multinomial distribution is fixed at 1. In this formulation, the sample space can be considered to be the set of 1-of-N encoded (also known as 1-of-K encoded)[1] random vectors x of dimension n having the property that exactly one element has the value 1 and the others have the value 0. The probability mass function f in this formulation is:
$f(\mathbf{x} \mid \boldsymbol{p}) = \prod_{i=1}^{n} p_i^{x_i},$
where $p_i$ represents the probability of seeing element i and
$\sum_{i=1}^{n} p_i = 1.$
This is the formulation adopted by Bishop.[1][nb 1]
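The vector formulation can be illustrated in the same spirit. In this sketch the helper names `one_hot` and `categorical_pmf_one_hot` are assumptions introduced for the example; the product form of the probability mass function simply picks out the single $p_i$ whose $x_i$ equals 1.

```python
def one_hot(i, n):
    """Encode label i in 1..n as a vector with a single 1 in position i."""
    if not 1 <= i <= n:
        raise ValueError("label out of range")
    return [1 if j == i else 0 for j in range(1, n + 1)]


def categorical_pmf_one_hot(x, p):
    """Evaluate the product of p_i ** x_i for a vector x with exactly one 1."""
    if len(x) != len(p) or sorted(x) != [0] * (len(x) - 1) + [1]:
        raise ValueError("x must contain exactly one 1 and otherwise 0s")
    mass = 1.0
    for x_i, p_i in zip(x, p):
        mass *= p_i ** x_i  # p_i ** 0 is 1, so only the selected entry contributes
    return mass
```

For any label i, `categorical_pmf_one_hot(one_hot(i, n), p)` gives the same value as reading off the i-th entry of `p`, which is why the two formulations are interchangeable.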
Properties
- The distribution is completely given by the probabilities associated with each number k: $p_k = P(X = x_k)$, k = 1, ..., n, where $\sum_{k=1}^{n} p_k = 1$. The possible probabilities are exactly the standard (n − 1)-dimensional simplex; for n = 2 this reduces to the possible probabilities of the Bernoulli distribution being the 1-simplex, $p_1 + p_2 = 1$, $0 \le p_1, p_2 \le 1$.
- The distribution is a special case of a "multivariate Bernoulli distribution"[2] in which exactly one of the n 0-1 variables takes the value one.
- In the 1-of-n vector formulation, the expected value is $\mathbb{E}[\mathbf{x}] = \boldsymbol{p}$.
- Let X be the realisation from a categorical distribution. Define the random vector Y as composed of the elements $Y_i = I(X = i)$, i = 1, ..., n, where I is the indicator function. Then Y has a distribution which is a special case of the multinomial distribution with a single trial. The sum of m independent and identically distributed such random vectors Y, constructed from a categorical distribution with parameter $\boldsymbol{p}$, is multinomially distributed with parameters m and $\boldsymbol{p}$.
- The sufficient statistic from m independent observations is the set of counts (or, equivalently, proportions) of observations in each category, where the total number of trials (= m) is fixed.
- The conjugate prior is the Dirichlet distribution.
- The indicator function of an observation taking the value $x_k$, $I(X = x_k)$, is Bernoulli distributed with parameter $p_k$.
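Two of the properties listed above, the counts as a sufficient statistic and the Dirichlet conjugate prior, can be sketched together. This is a minimal illustration under stated assumptions: the function names are hypothetical, the observations are made up, and the uniform prior with all concentration parameters equal to 1 is chosen only for the example. The update shown is the standard conjugate rule, in which the observed counts are added to the prior's concentration parameters.

```python
from collections import Counter


def category_counts(observations, n):
    """Count each label 1..n; this equals the elementwise sum of the 1-of-n vectors."""
    tally = Counter(observations)
    return [tally[k] for k in range(1, n + 1)]


def dirichlet_posterior(alpha, counts):
    """Conjugate update: a Dirichlet(alpha) prior plus counts gives Dirichlet(alpha + counts)."""
    return [a + c for a, c in zip(alpha, counts)]


def dirichlet_mean(alpha):
    """Mean of a Dirichlet(alpha) distribution, usable as a point estimate of p."""
    total = sum(alpha)
    return [a / total for a in alpha]


observations = [1, 3, 3, 2, 3, 1]                          # six made-up draws, n = 3 categories
counts = category_counts(observations, 3)                  # multinomial count vector
posterior = dirichlet_posterior([1.0, 1.0, 1.0], counts)   # assumed uniform prior
estimate = dirichlet_mean(posterior)
```

Only `counts` is needed for the update, not the order of the individual observations, which is what it means for the counts to be sufficient.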
Notes
- [nb 1] However, Bishop does not explicitly use the term "categorical distribution".
References
- [1] Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer. ISBN 0-387-31073-8.
- [2] Johnson, N.L., Kotz, S., Balakrishnan, N. (1997). Discrete Multivariate Distributions. Wiley. ISBN 0-471-12844-9 (p. 105).

