Dirichlet process

From Wikipedia, the free encyclopedia

Revision as of 17:20, 6 June 2013

In probability theory, a Dirichlet process is a random process whose realizations are themselves probability distributions; that is, it is a probability distribution over probability distributions.

Given a Dirichlet process <math>\operatorname{DP}(H, \alpha)</math>, where <math>H</math> (the base distribution or base measure) is an arbitrary distribution and <math>\alpha</math> (the concentration parameter) is a positive real number, a draw from <math>\operatorname{DP}(H, \alpha)</math> will return a random distribution (the output distribution) over values drawn from <math>H</math>. That is, the support of the output distribution is the same as that of the base distribution. The output distribution will be discrete, meaning that individual values drawn from the output distribution will sometimes repeat themselves even if the base distribution is continuous (i.e., if two different draws from the base distribution will be distinct with probability one). The extent to which values will repeat is determined by <math>\alpha</math>, with higher values causing less repetition. If the base distribution is continuous, so that separate draws from it almost surely return distinct values, then the infinite set of probabilities corresponding to the frequency of each possible value that the output distribution can return is distributed according to a stick-breaking process.

Note that the Dirichlet process is a stochastic process, meaning that technically speaking it is an infinite sequence of random variables, rather than a single random distribution. The relation between the two is as follows. Consider the Dirichlet process as defined above, as a distribution over random distributions, and call this process <math>\operatorname{DP}_1()</math>. We can call this the distribution-centered view of the Dirichlet process. First, draw a random output distribution from this process, and then consider an infinite sequence of random variables representing values drawn from this distribution. Conditioned on the output distribution, the variables are independent and identically distributed. Now consider instead the distribution of the random variables that results from marginalizing out (integrating over) the random output distribution. This makes the variables dependent on each other. However, they are still exchangeable: their joint distribution does not change when the variables are reordered, so in particular every variable has the same marginal distribution. That is, they are "identically distributed" but not "independent". The resulting infinite sequence of random variables with the given marginal distributions is another view onto the Dirichlet process, denoted here <math>\operatorname{DP}_2()</math>. We can call this the process-centered view of the Dirichlet process. The conditional distribution of one variable given all the others, or given all previous variables, is defined by the Chinese restaurant process (see below).

Another way to think of a Dirichlet process is as an infinite-dimensional generalization of the Dirichlet distribution. The Dirichlet distribution returns a finite-dimensional set of probabilities (for some size <math>K</math>, specified by the parameters of the distribution), all of which sum to 1. This can be thought of as a finite-dimensional discrete distribution; i.e. a Dirichlet distribution can be thought of as a distribution over <math>K</math>-dimensional discrete distributions. Imagine generalizing a symmetric Dirichlet distribution, defined by a dimension <math>K</math> and concentration parameter <math>\alpha/K</math>, to an infinite set of probabilities; the resulting distribution over infinite-dimensional discrete distributions is called the stick-breaking process (see below). Imagine then using this set of probabilities to create an infinite-dimensional mixture model, with each separate probability from the set associated with a mixture component, and the value of each component drawn separately from a base distribution <math>H</math>; then draw an infinite number of samples from this mixture model. The infinite set of random variables corresponding to the marginal distribution of these samples is a Dirichlet process with parameters <math>H</math> and <math>\alpha</math>.

The Dirichlet process was formally introduced by Thomas Ferguson in 1973.[1]

Introduction

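Consider a simple mixture model:

<math>
\begin{array}{lcl}
\theta_{1,\dots,K} &\sim& H \\
\boldsymbol\beta &\sim& \operatorname{Dirichlet}(\alpha/K,\dots,\alpha/K) \\
z_{1,\dots,N} &\sim& \operatorname{Categorical}(\boldsymbol\beta) \\
\phi_i &=& \theta_{z_i} \\
x_{i=1,\dots,N} &\sim& F(\phi_i)
\end{array}
</math>

This is a basic generative model where the observations <math>x_1,\dots,x_N</math> are distributed according to a mixture of <math>K</math> components, where each component is distributed according to a single parametric family <math>F(\theta)</math> but where different components have different values of <math>\theta</math>, which is drawn in turn from a distribution <math>H</math>. Typically, <math>H</math> will be the conjugate prior distribution of <math>F</math>. In addition, the prior probability of each component is specified by <math>\boldsymbol\beta</math>, which is a size-<math>K</math> vector of probabilities, all of which add up to 1. The <math>\phi_i</math> involve no additional randomness: each is a deterministic function of the other variables, and its purpose is to record the value of <math>\theta</math> that is associated with observation <math>i</math>. As <math>K\to\infty</math>, the distribution of the vector <math>\phi_{1,\dots,N}</math> (where <math>\theta,\boldsymbol\beta,z</math> have been marginalized out) becomes the Dirichlet process.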

For example, if the observations are apartment prices and the <math>K</math> components represent different neighborhoods, then <math>F</math> might be a Gaussian distribution with unknown mean and unknown variance, with the mean and variance specifying the distribution of prices in that neighborhood. Then the parameter <math>\theta</math> will be a vector of two values, a mean drawn from a Gaussian distribution and a variance drawn from an inverse gamma distribution, which are the conjugate priors of the mean and variance, respectively, of a Gaussian distribution.

Meanwhile, if the observations are words and the <math>K</math> components represent different topics, then <math>F</math> might be a categorical distribution over a vocabulary of size <math>V</math>, with unknown frequencies of each word in the vocabulary, specifying the distribution of words in each particular topic. Then the parameter <math>\theta</math> will be a vector of <math>V</math> values, each representing a probability and all summing to one, drawn from a Dirichlet distribution, which is the conjugate prior of the categorical distribution.
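The finite mixture model can be simulated directly, which may help make the roles of the variables concrete. The following is a minimal sketch, not a reference implementation: the function name `sample_finite_mixture`, the choice of a unit-variance Gaussian for the component family, and the Gaussian prior on the component means are all assumptions made for illustration. The symmetric Dirichlet draw is obtained by normalizing independent Gamma variates.

```python
import random

def sample_finite_mixture(n, k, alpha, rng):
    """Draw n observations from a K-component mixture.

    Illustrative assumptions: the component family is a unit-variance
    Gaussian and the base distribution is N(0, 10^2) over the mean only.
    """
    # theta_k ~ H: one parameter per component
    theta = [rng.gauss(0.0, 10.0) for _ in range(k)]
    # beta ~ Dirichlet(alpha/K, ..., alpha/K) via normalized Gamma draws
    gammas = [rng.gammavariate(alpha / k, 1.0) for _ in range(k)]
    total = sum(gammas)
    beta = [g / total for g in gammas]
    # z_i ~ Categorical(beta)
    z = rng.choices(range(k), weights=beta, k=n)
    # phi_i = theta_{z_i}: a deterministic record of the parameter used
    phi = [theta[zi] for zi in z]
    # x_i ~ F(phi_i)
    x = [rng.gauss(p, 1.0) for p in phi]
    return theta, beta, z, phi, x

theta, beta, z, phi, x = sample_finite_mixture(50, 5, 2.0, random.Random(0))
```

Note that `phi` is computed from `theta` and `z` without any further random draw, matching its role as a bookkeeping variable.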

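Now imagine we consider the limit as <math>K\to\infty</math>. Conceptually this means that we have no idea how many components are present. The result will be as follows:

<math>
\begin{array}{lcl}
\theta_{1,\dots,\infty} &\sim& H \\
\boldsymbol\beta &\sim& \operatorname{Stick}(1,\alpha) \\
z_{1,\dots,N} &\sim& \operatorname{Categorical}(\boldsymbol\beta) \\
x_{i=1,\dots,N} &\sim& F(\theta_{z_i})
\end{array}
</math>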

In this model, conceptually speaking there are an infinite number of components, each with a separate parameter value, and a correspondingly infinite number of prior probabilities, one for each component, drawn from a stick-breaking process (see section below). Note that a practical application of such a model would not actually store an infinite number of components. Instead, it would generate the component prior probabilities one at a time from the stick-breaking process, which by construction tends to return the largest probability values first. As each component probability is drawn, a corresponding parameter value is also drawn from <math>H</math>. At any one time, some of the prior probability mass will be assigned to components and some unassigned. To generate a new observation, a random number between 0 and 1 is drawn uniformly, and if it lands in the unassigned mass, new components are drawn as necessary (each one reducing the amount of unassigned mass) until enough mass has been allocated to place this number in an existing component.
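The on-demand procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the class name `LazyStickBreakingMixture` and the `base_sampler` callback (standing in for a draw from the base distribution) are hypothetical, and each new component takes a Beta(1, α) fraction of the unassigned mass, as in the stick-breaking construction.

```python
import random

class LazyStickBreakingMixture:
    """Generates mixture components only when a draw needs them."""

    def __init__(self, alpha, base_sampler, rng):
        self.alpha = alpha
        self.base_sampler = base_sampler  # hypothetical: draws a parameter from H
        self.rng = rng
        self.weights = []     # component probabilities generated so far
        self.params = []      # parameter value attached to each component
        self.remaining = 1.0  # probability mass not yet assigned

    def _add_component(self):
        # Break off a Beta(1, alpha) fraction of the unassigned mass.
        frac = self.rng.betavariate(1.0, self.alpha)
        self.weights.append(frac * self.remaining)
        self.params.append(self.base_sampler(self.rng))
        self.remaining *= 1.0 - frac

    def draw_parameter(self):
        u = self.rng.random()
        # If u lands in the unassigned mass, create components until it does not.
        while u >= 1.0 - self.remaining:
            self._add_component()
        cumulative = 0.0
        for weight, param in zip(self.weights, self.params):
            cumulative += weight
            if u < cumulative:
                return param
        return self.params[-1]  # guard against floating-point rounding

mixture = LazyStickBreakingMixture(1.0, lambda r: r.gauss(0.0, 1.0), random.Random(0))
draws = [mixture.draw_parameter() for _ in range(200)]
```

Only as many components are stored as the draws so far have required, and repeated parameter values appear among the draws even though the base distribution here is continuous.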

Sometimes, the stick-breaking process is denoted as <math>\operatorname{GEM}()</math>, after the authors of this process, instead of <math>\operatorname{Stick}()</math>.

Another view of this model comes from looking back at the finite-dimensional mixture model with mixing probabilities <math>\boldsymbol\beta</math> drawn from a Dirichlet distribution and considering the distribution of a particular component assignment <math>z_i</math> conditioned on all previous assignments, with the mixing probabilities <math>\boldsymbol\beta</math> integrated out. This distribution is a Dirichlet-multinomial distribution. Note that, conditioned on a particular value of <math>\boldsymbol\beta</math>, each <math>z_i</math> is independent of the others, but marginalizing over <math>\boldsymbol\beta</math> introduces dependencies among the component assignments. It can be shown (see the Dirichlet-multinomial distribution article) that

<math>
p(z_i = k \mid z_{1,\dots,i-1}, \alpha, K) = \frac{n_k + \alpha/K}{i - 1 + \alpha}
</math>

where <math>k</math> is a particular value of <math>z_i</math> and <math>n_k</math> is the number of times a component assignment in the set <math>\{z_1,\dots,z_{i-1}\}</math> has the value <math>k</math>, i.e. the probability of assigning an observation to a particular component is roughly proportional to the number of previous observations already assigned to this component.

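Now consider the limit as <math>K\to\infty</math>. For a particular previously observed component <math>k</math>,

<math>
p(z_i = k \mid z_{1,\dots,i-1}, \alpha) = \lim_{K\to\infty} \frac{n_k + \alpha/K}{i - 1 + \alpha} = \frac{n_k}{i - 1 + \alpha}
</math>

That is, the probability of seeing a previously observed component is directly proportional to the number of times the component has already been seen. This is often described as "the rich get richer".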

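For an unseen component <math>k</math>, <math>n_k = 0</math>, and as <math>K\to\infty</math> the probability of seeing this component goes to 0. However, the number of unseen components approaches infinity. Consider instead the set of all unseen components <math>\mathbf{Q}</math>. Note that, if there are <math>L</math> components seen so far, the number of unseen components is <math>|\mathbf{Q}| = K - L</math>. Then, consider the probability of seeing any of these components:

<math>
p(z_i \in \mathbf{Q} \mid z_{1,\dots,i-1}, \alpha) = \lim_{K\to\infty} \sum_{u \in \mathbf{Q}} \frac{\alpha/K}{i - 1 + \alpha} = \frac{\alpha}{i - 1 + \alpha} \lim_{K\to\infty} \frac{K - L}{K} = \frac{\alpha}{i - 1 + \alpha}
</math>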

In other words:

  1. The probability of seeing an already-seen component is proportional to the number of times that component has been seen.
  2. The probability of seeing any unseen component is proportional to the concentration parameter <math>\alpha</math>.
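These two rules, and the finite-model probabilities they arise from, can be checked numerically. The sketch below assumes the finite-model conditional probability of a component with count n_k is (n_k + α/K) / (i − 1 + α), where i − 1 is the number of previous observations; the function names `finite_predictive` and `crp_predictive` are hypothetical.

```python
def finite_predictive(counts, alpha, k):
    """Conditional probabilities in the K-component model.

    counts lists n_k for the components seen so far; the remaining
    K - len(counts) components have n_k = 0.
    Returns (probability of each seen component, total probability of unseen ones).
    """
    previous = sum(counts)  # i - 1
    seen = [(n + alpha / k) / (previous + alpha) for n in counts]
    unseen_total = (k - len(counts)) * (alpha / k) / (previous + alpha)
    return seen, unseen_total

def crp_predictive(counts, alpha):
    """Limit of the probabilities above as K goes to infinity."""
    previous = sum(counts)
    seen = [n / (previous + alpha) for n in counts]
    new = alpha / (previous + alpha)
    return seen, new

# Two components seen 3 times and once, with alpha = 2.
limit_seen, limit_new = crp_predictive([3, 1], 2.0)
finite_seen, finite_unseen = finite_predictive([3, 1], 2.0, 10**6)
```

With counts 3 and 1 and α = 2, the limiting probabilities are 3/6, 1/6 and 2/6 for the two seen components and a new component respectively, and the finite-model values with a large K agree with them closely.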

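This process is termed a Chinese restaurant process (CRP). In terms of the CRP, the infinite-dimensional model can equivalently be written:

<math>
\begin{array}{lcl}
\theta_{1,\dots,\infty} &\sim& H \\
z_{1,\dots,N} &\sim& \operatorname{CRP}(\alpha) \\
x_{i=1,\dots,N} &\sim& F(\theta_{z_i})
\end{array}
</math>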

Note that we have marginalized out the mixing probabilities <math>\boldsymbol\beta</math>, and thereby produced a more compact representation of the model.

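Now imagine further that we also marginalize out the component assignments <math>z_i</math>, and instead we look directly at the distribution of <math>\phi_i = \theta_{z_i}</math>. Then, we can write the model directly in terms of the Dirichlet process:

<math>
\begin{array}{lcl}
G &\sim& \operatorname{DP}_1(H,\alpha) \\
\phi_{1,\dots,N} &\sim& G \\
x_{i=1,\dots,N} &\sim& F(\phi_i)
\end{array}
</math>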

<math>\operatorname{DP}_1</math> represents one view (the distribution-centered view) of the Dirichlet process as producing a random, infinite-dimensional discrete distribution <math>G</math> with values drawn from <math>H</math>.

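An alternative view of the Dirichlet process (the process-centered view), adhering more closely to its definition as a stochastic process, sees it as directly producing an infinite stream of <math>\phi</math> values. Notating this view as <math>\operatorname{DP}_2</math>, we can write the model as

<math>
\begin{array}{lcl}
\phi_{1,\dots,\infty} &\sim& \operatorname{DP}_2(H,\alpha) \\
x_{i=1,\dots,N} &\sim& F(\phi_i)
\end{array}
</math>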

In this view, although the Dirichlet process generates an infinite stream of parameter values, we only care about the first <math>N</math> values. Note that some of these values will be the same as previously seen values, in a "rich get richer" scheme, as determined by the Chinese restaurant process.

Formal definition

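A Dirichlet process over a set <math>S</math> is a stochastic process whose sample path (i.e. an infinite-dimensional set of random variates drawn from the process) is a probability distribution on <math>S</math>. The finite-dimensional distributions are Dirichlet distributions: if <math>H</math> is a finite measure on <math>S</math>, <math>\alpha</math> is a positive real number and <math>X</math> is a sample path drawn from a Dirichlet process, written as

<math>
X \sim \operatorname{DP}(\alpha, H)
</math>

then for any measurable partition of <math>S</math>, say <math>\{B_i\}_{i=1}^{n}</math>, we have that

<math>
(X(B_1),\dots,X(B_n)) \sim \operatorname{Dirichlet}(\alpha H(B_1),\dots,\alpha H(B_n))
</math>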
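As a worked special case, consider what the Dirichlet form of the finite-dimensional distributions, with parameters α H(B_1), ..., α H(B_n), implies for a partition into just two sets, B and its complement. This is a sketch under the assumption, not required by the definition above, that H is normalized so that H(S) = 1. A two-dimensional Dirichlet distribution is a Beta distribution, which gives the mean and variance of the random mass X(B):

```latex
\begin{aligned}
\bigl(X(B),\, X(S \setminus B)\bigr) &\sim \operatorname{Dirichlet}\bigl(\alpha H(B),\ \alpha (1 - H(B))\bigr) \\
X(B) &\sim \operatorname{Beta}\bigl(\alpha H(B),\ \alpha (1 - H(B))\bigr) \\
\operatorname{E}[X(B)] &= \frac{\alpha H(B)}{\alpha H(B) + \alpha (1 - H(B))} = H(B) \\
\operatorname{Var}[X(B)] &= \frac{H(B)\,(1 - H(B))}{\alpha + 1}
\end{aligned}
```

Under this assumption the base measure sets the average of the random distribution, and a larger concentration parameter makes the random mass of any set vary less around that average.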

The Chinese restaurant process

As shown above, a simple distribution, the so-called Chinese restaurant process, results from considering the conditional distribution of one component assignment given all previous ones in a Dirichlet distribution mixture model with <math>K</math> components, and then taking the limit as <math>K</math> goes to infinity. It can be shown, using the above formal definition of the Dirichlet process and considering the process-centered view of the process, that the conditional distribution of the component assignment of one sample from the process given all previous samples follows a Chinese restaurant process.

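Suppose that <math>J</math> samples, <math>\{\theta_j\}_{j=1}^{J}</math>, have already been obtained. According to the Chinese restaurant process, the <math>(J+1)^\mathrm{th}</math> sample should be drawn from

<math>
\theta_{J+1} \sim \frac{1}{H(S) + J} \left( H + \sum_{j=1}^{J} \delta_{\theta_j} \right)
</math>

where <math>\delta_\theta</math> is an atomic distribution centered on <math>\theta</math>. Interpreting this, two properties are clear: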

  1. Even if <math>S</math> is an uncountable set, there is a nonzero probability that two samples will have exactly the same value. Samples from a Dirichlet process are therefore discrete.
  2. The Dirichlet process exhibits a self-reinforcing property; the more often a given value has been sampled in the past, the more likely it is to be sampled again.

The name "Chinese restaurant process" is derived from the following analogy: imagine an infinitely large restaurant containing an infinite number of tables, and able to serve an infinite number of dishes. The restaurant in question operates a somewhat unusual seating policy whereby new diners are seated either at a currently occupied table with probability proportional to the number of guests already seated there, or at an empty table with probability proportional to a constant. Guests who sit at an occupied table must order the same dish as those currently seated, whereas guests allocated a new table are served a new dish at random. The distribution of dishes after <math>J</math> guests are served is a sample drawn as described above. The Chinese restaurant process is related to the Polya urn sampling scheme for finite Dirichlet distributions.
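The seating policy in this analogy can be simulated in a few lines. The following is a minimal sketch: the function name `chinese_restaurant_process` is hypothetical, and the "constant" for an empty table is taken to be the concentration parameter, here called `alpha`. It tracks only which table each guest joins; assigning a dish to each table would require an additional draw from the base distribution per new table.

```python
import random

def chinese_restaurant_process(num_guests, alpha, rng):
    """Seat guests one at a time; return (seating, table_sizes)."""
    table_sizes = []  # table_sizes[t] = number of guests at table t
    seating = []      # seating[i] = index of the table chosen by guest i
    for _ in range(num_guests):
        # Occupied table t has weight table_sizes[t]; a new table has weight alpha.
        weights = table_sizes + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if choice == len(table_sizes):
            table_sizes.append(1)      # open a new table
        else:
            table_sizes[choice] += 1   # join an occupied table
        seating.append(choice)
    return seating, table_sizes

seating, table_sizes = chinese_restaurant_process(100, 1.5, random.Random(1))
```

Tables are numbered in order of first occupation, so the first guest always sits at table 0, and large tables tend to keep growing because their weight increases with every guest who joins.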

The stick-breaking process

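A third approach to the Dirichlet process is provided by the so-called stick-breaking process, which can be used to provide a constructive algorithm (the stick-breaking construction) for generating a Dirichlet process. Let <math>\{\beta'_k\}_{k=1}^{\infty}</math> be a set of random variables such that

<math>
\beta'_k \sim \operatorname{Beta}(1, \alpha)
</math>

where <math>\alpha</math> is the normalisation constant for the measure <math>H</math>, so that <math>H = \alpha H_\mathrm{norm}</math>. Define <math>\{\beta_k\}_{k=1}^{\infty}</math> according to

<math>
\beta_k = \beta'_k \cdot \prod_{i=1}^{k-1} \left( 1 - \beta'_i \right)
</math>

and let <math>\{\theta_k\}_{k=1}^{\infty}</math> be a set of samples from <math>H_\mathrm{norm}</math>. The distribution given by the density <math>f(\theta) = \sum_{k=1}^{\infty} \beta_k \cdot \delta_{\theta_k}(\theta)</math> (where <math>\delta_{\theta_k}</math> is the Dirac delta measure, here used as an indicator function which evaluates to <math>0</math> except for <math>\delta_{\theta_k}(\theta_k) = 1</math>) is then a sample from the corresponding Dirichlet process. This method provides an explicit construction of the non-parametric sample, and makes clear the fact that the samples are discrete.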

The name 'stick-breaking' comes from the interpretation of <math>\beta_k</math> as the length of the piece of a unit-length stick assigned to the k-th value. After the first k − 1 values have their portions assigned, the length of the remainder of the stick, <math>\prod_{i=1}^{k-1} \left( 1 - \beta'_i \right)</math>, is broken according to a sample <math>\beta'_k</math> from a beta distribution. In this analogy, <math>\beta'_k</math> indicates the portion of the remainder to be assigned to the k-th value. The smaller <math>\alpha</math> is, the less of the stick will be left for subsequent values (on average).
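A truncated version of the stick-breaking construction is straightforward to sketch. The code below is illustrative only: it stops after a fixed number of pieces (the true construction is infinite), the function name `stick_breaking` and the `base_sampler` callback standing in for the normalized base measure are hypothetical, and each break takes a Beta(1, α) fraction of the remaining stick.

```python
import random

def stick_breaking(alpha, num_atoms, base_sampler, rng):
    """Return (weights, atoms, remainder) for the first num_atoms pieces."""
    weights, atoms = [], []
    remainder = 1.0  # length of stick not yet assigned
    for _ in range(num_atoms):
        fraction = rng.betavariate(1.0, alpha)  # portion of the remainder to break off
        weights.append(fraction * remainder)    # length of this piece
        atoms.append(base_sampler(rng))         # value attached to this piece
        remainder *= 1.0 - fraction
    return weights, atoms, remainder

rng = random.Random(0)
weights, atoms, remainder = stick_breaking(2.0, 50, lambda r: r.gauss(0.0, 1.0), rng)

def mean_remainder(alpha, pieces=5, trials=300):
    """Average stick length left after a few breaks, for a given alpha."""
    total = 0.0
    for _ in range(trials):
        total += stick_breaking(alpha, pieces, lambda r: r.random(), rng)[2]
    return total / trials
```

The weights together with the leftover remainder always sum to one. Since the expected fraction broken off is 1 / (1 + α), comparing `mean_remainder(0.5)` with `mean_remainder(5.0)` illustrates that a smaller concentration parameter leaves less of the stick for later values.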

The Polya urn scheme

Yet another way to visualize the Dirichlet process and Chinese restaurant process is as a modified Polya urn scheme. Imagine that we start with an urn filled with <math>\alpha</math> black balls. Then we proceed as follows:

  1. Each time we need an observation, we draw a ball from the urn.
  2. If the ball is black, we generate a new (non-black) color uniformly, label a new ball this color, drop the new ball into the urn along with the ball we drew, and return the color we generated.
  3. Otherwise, label a new ball with the color of the ball we drew, drop the new ball into the urn along with the ball we drew, and return the color we observed.

The resulting distribution over colors is the same as the distribution over tables in the Chinese restaurant process. Furthermore, when we draw a black ball, if rather than generating a new color, we instead pick a random value from a base distribution and use that value to label the new ball, the resulting distribution over labels will be the same as the distribution over values in a Dirichlet process.
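The urn scheme with labels drawn from a base distribution can be sketched as below. This is an illustration under assumptions: the black balls are represented only by their total weight `alpha` (which need not be an integer here), `base_sampler` is a hypothetical callback standing in for a draw from the base distribution, and the function name `polya_urn_draws` is likewise hypothetical.

```python
import random

def polya_urn_draws(num_draws, alpha, base_sampler, rng):
    """Return the sequence of labels observed from the modified urn."""
    labelled_balls = []  # one entry per non-black ball currently in the urn
    observed = []
    for _ in range(num_draws):
        # Black balls have total weight alpha; each labelled ball has weight 1.
        if rng.random() * (alpha + len(labelled_balls)) < alpha:
            label = base_sampler(rng)          # black ball: generate a fresh label
        else:
            label = rng.choice(labelled_balls)  # copy the label of the drawn ball
        # The drawn ball goes back, and one new ball with this label is added.
        labelled_balls.append(label)
        observed.append(label)
    return observed

values = polya_urn_draws(300, 2.0, lambda r: r.gauss(0.0, 1.0), random.Random(3))
distinct = len(set(values))
```

Because the black weight stays fixed while the number of labelled balls grows by one per draw, fresh labels become rarer over time and `distinct` is typically far smaller than the number of draws.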

Applications of the Dirichlet process

Dirichlet processes are frequently used in Bayesian nonparametric statistics. "Nonparametric" here does not mean a parameter-less model, but rather a model whose representation grows as more data are observed. Bayesian nonparametric models have gained considerable popularity in the field of machine learning because of this flexibility, especially in unsupervised learning. In a Bayesian nonparametric model, the prior and posterior distributions are not parametric distributions, but stochastic processes.[2] The fact that the Dirichlet distribution is a probability distribution on the simplex of non-negative numbers that sum to one makes it a good candidate to model distributions of distributions or distributions of functions. Additionally, the non-parametric nature of this model makes it well suited to clustering problems where the number of distinct clusters is unknown beforehand.

As draws from a Dirichlet process are discrete, an important use is as a prior probability in infinite mixture models. In this case, <math>S</math> is the parametric set of component distributions. The generative process is therefore that a sample is drawn from a Dirichlet process, and for each data point in turn a value is drawn from this sample distribution and used as the component distribution for that data point. The fact that there is no limit to the number of distinct components which may be generated makes this kind of model appropriate for the case when the number of mixture components is not well-defined in advance. An example is the infinite mixture of Gaussians model.[3]
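As a hedged illustration of such an infinite mixture, the sketch below generates data from a Dirichlet process mixture of Gaussians. It does not follow any particular published algorithm: instead of drawing the random distribution explicitly, it uses the equivalent sequential scheme in which each data point either reuses the parameters of an earlier data point or receives fresh parameters from the base distribution. The base distribution chosen here (a Gaussian over the mean and an inverse gamma over the variance) and the function name `sample_dp_gaussian_mixture` are assumptions for illustration.

```python
import random

def sample_dp_gaussian_mixture(num_points, alpha, rng):
    """Return (params, data); params[i] is the (mean, std) used for data[i]."""
    params = []
    data = []
    for i in range(num_points):
        # With i earlier points, a new component has weight alpha out of alpha + i.
        if rng.random() * (alpha + i) < alpha:
            mean = rng.gauss(0.0, 10.0)                  # assumed Gaussian prior on the mean
            variance = 1.0 / rng.gammavariate(3.0, 1.0)  # assumed inverse gamma(3, 1) prior
            component = (mean, variance ** 0.5)
        else:
            # Reuse an earlier point's parameters, in proportion to their usage.
            component = rng.choice(params)
        params.append(component)
        data.append(rng.gauss(component[0], component[1]))
    return params, data

params, data = sample_dp_gaussian_mixture(200, 1.0, random.Random(7))
num_components = len(set(params))
```

The number of distinct components is not fixed in advance; it is determined by the data-generating process and grows slowly as more points are generated.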

The infinite nature of these models also lends them to natural language processing applications, where it is often desirable to treat the vocabulary as an infinite, discrete set.

References

  1. ^ Ferguson, Thomas (1973). "Bayesian analysis of some nonparametric problems". Annals of Statistics. 1 (2): 209–230. doi:10.1214/aos/1176342360. MR 0350949.
  2. ^ Hjort, Nils Lid; Holmes, Chris; Müller, Peter; Walker, Stephen G. (2010). Bayesian Nonparametrics. Cambridge University Press. ISBN 0-521-51346-4.
  3. ^ Rasmussen, Carl (2000). "The Infinite Gaussian Mixture Model" (PDF). Advances in Neural Information Processing Systems (NIPS). 12: 554–560.