Dirichlet process

In probability theory, a Dirichlet process is a way of assigning a probability distribution over probability distributions. That is, a Dirichlet process is a probability distribution whose domain is itself a set of probability distributions. The probability distributions in the domain are almost surely discrete and may be infinite dimensional. Assigning an arbitrary probability distribution over a domain of infinite dimensional probability distributions would require an infinite amount of computational resources. The main function of the Dirichlet process is that it allows the specification of a distribution over infinite dimensional distributions in a way that uses only finite resources.

Given a Dirichlet process $\mathrm{DP}\left(H, \alpha\right)$, where $H$ (the base distribution or base measure) is an arbitrary distribution and $\alpha$ (the concentration parameter) is a positive real number, a draw from $\mathbf{DP}$ will return a random distribution over some of the values that can be drawn from $H$. That is, the support of each draw of the output distribution is always a subset of the support of base distribution. The output distribution will be discrete, meaning that individual values drawn from the output distribution will sometimes repeat themselves even if the base distribution is continuous. The extent to which values will repeat is determined by $\alpha$, with higher values causing less repetition.

A Dirichlet process is most easily defined implicitly by the distribution it induces on certain finite dimensional statistics. A common model for data is that observations $X_{1},X_{2},\dots$ are assumed to be identically and independently distributed according to some unknown distribution $P$. We suppose that the unknown distribution $P$ is itself drawn randomly according to $\mathrm{DP}\left(H,\alpha\right)$. If we wish to simulate observations $X_{1},\dots X_{n}$ a straightforward algorithm for doing so is:

1. Draw a distribution $P$ from $\mathrm{DP}\left(H,\alpha\right)$
2. Draw observations $X_{1},\dots,X_{n}$ independently from $P$.

The issue is that in many problems of interest the distribution $P$ may require an infinite number of parameters to specify. To avoid this we can simulate observations according the algorithm:

1. Draw $X_{1}$ from measure $H\left(x\right)$.
2. For $n>1$:
1. With probability $\frac{\alpha}{\alpha+n-1}$ draw $X_{n}$ from $H\left(x\right)$.
2. For each distinct $x$ in $\left\{ X_{1},\dots,X_{n-1}\right\}$ let $n_{x}$ be the number of $X_{j}$ such that $X_{j}=x$.
3. Let $X_{n}=x$ with probability $\frac{n_{x}}{\alpha+n-1}$.

It can be shown that observations $X_{1},X_{2},\dots$ drawn according to this process are exchangeable and thus by de Finetti’s representation theorem the $X_{1},X_{2},\dots$ are identically and independently distributed according to some unknown distribution $P$. This second algorithm is thus equivalent to first drawing a random distribution $P$ and then drawing observations $X_{1},\dots,X_{n}$ independently from $P$. The Dirichlet process with parameters $H$ and $\alpha$, $\mathrm{DP}\left(H,\alpha\right)$, is defined to be the distribution over distributions from which $P$ is drawn; i.e. $\mathrm{DP}\left(H,\alpha\right)$ is the distribution over distributions which makes the two algorithms for simulating $X_{1},\dots,X_{n}$ equivalent. Notice that by defining the Dirichlet process implicitly in this fashion we have sidestepped the impossible problem of explicitly defining a distribution over the infinite dimensional distributions $P$.

Dirichlet processes frequently appear in the context of Bayesian Non-parametric statistics where a typical task is to learn distributions on function spaces, which involve effectively infinitely many parameters. The key insight is that in many applications the infinite dimensional distributions appear only as an intermediary computational device and are in particular not required for either the initial specification of prior beliefs or for the statement of the final inference. The Dirichlet process can be used to circumvent infinite computational requirements as described above. A particularly important application of the Dirichlet process is as a prior probability in infinite mixture models; this is discussed in detail below.

There are several equivalent views of the Dirichlet process. It can be formally defined as a stochastic process over the sigma algebra generated by $X_{1}$, where the induced marginal distributions over any finite partition of the domain of $X_{1}$ are Dirichlet distributions. Alternatively the Dirichlet process may be defined implicitly through de Finetti’s theorem as described above; this is often called the Chinese restaurant process. The stick-breaking process defines the Dirichlet process constructively by writing a distribution sampled from the process as $f\left(x\right)=\sum_{k=1}^{\infty}\beta_{k}\delta_{x_{k}}\left(x\right)$, where $\left\{ x_{k}\right\} _{k=1}^{\infty}$ are samples from the base distribution $H\left(x\right)$ and the $\beta_{k}$ are defined by a recursive scheme that repeatedly samples from a $\mathrm{Beta}\left(1,\alpha\right)$ distribution.

The Dirichlet process was formally introduced by Thomas Ferguson in 1973.[1]

Dirichlet mixture models

Simulation of 1000 observations drawn from a Dirichlet mixture model. Each observation within a cluster is drawn iid from $N(\mu_{k},1/4)$ where the cluster means $\mu_{k}$ are drawn from a distribution G drawn from a Dirichlet Process with with concentration parameter $\alpha=0.5$ and base distribution $H=N(2,16)$. Each row is a new simulation.

To understand what Dirichlet processes are and the problem they solve we consider the example of clustered data. It is a common situation that data points might be distributed in a hierarchical fashion where each data point belongs to a (random) cluster and is further distributed randomly within that cluster. For example, we might be interested in how people will vote on a number of propositions in an upcoming election. A reasonable model for this situation might be to classify each voter as a liberal, a conservative or a moderate and then model the event that a voter says “Yes” to any particular item as a Bernoulli random variable with probability dependent on which political cluster they belong to. By looking at how votes were cast in previous years on similar pieces of legislation one could fit a reasonable predictive model using an ad-hoc technique such as k-means clustering. However, this depends critically on knowing in advance the number of clusters that generated the data. In many situations it is not possible to determine this ahead of time, and even when there is a reasonable model for the clusters we would still like to be able to check our assumptions. For example, in the voting example above the division into liberal, conservative and moderate might not be finely tuned enough; considerations such as a religion, class or race could also be critical for understanding voter behavior. As another example, we might be interested in modeling the velocities of galaxies using a simple model assuming that the velocities are clustered, for instance by assuming each velocity is distributed as $v_i\sim N(\mu_k,\sigma^2)$, where the $i$th observation belongs to the $k$th cluster of galaxies with common expected velocity. In this case it is far from obvious how to determine a priori how many clusters (of common velocities) there should be and any model for this would be highly suspect and should be checked against the data. By using a Dirichlet process prior for the clusters we circumvent the need to explicitly specify ahead of time how many clusters there are.

We consider the last example in more detail. A first naive model is to presuppose that there are $K$ clusters of normally distributed velocities with common known fixed variance $\sigma^{2}$. Denoting the event that the $i$th observation is in the $k$th cluster as $z_{i}=k$ we can write this model as:

\begin{align} v_i\mid z_i=k,\mu_k & \sim N(\mu_k,\sigma^2) \\ \mathrm{P} (z_i=k) & = \pi_k \\ \boldsymbol{\pi}\mid \alpha & \sim \mathrm{Dir}(\alpha/K\mathrm{1}_K) \\ \mu_k & \sim H(\lambda) \end{align}

That is, we assume that the data belongs to $K$ distinct clusters with means $\mu_{k}$ and that $\pi_{k}$ is the (unknown) probability of a data point belonging to the $k$th cluster. We assume that we have no initial information distinguishing the clusters, which is captured by the symmetric prior $\mathrm{Dir}\left(\alpha/K\mathrm{1}_{K}\right)$. We further assign independent and identical prior distributions $H(\lambda)$ to each of the cluster means. Here the hyper-parameters $\alpha$ and $\lambda$ are taken to be known fixed constants, chosen to reflect our prior beliefs about the system. To understand the connection to Dirichlet process priors we rewrite this model in an equivalent but more suggestive form:

\begin{align} v_i \mid \tilde{\mu}_i & \sim N(\tilde{\mu}_i,\sigma^2) \\ \tilde{\mu}_i & \sim G=\sum_{k=1}^K \pi_k \delta_{\mu_k} (\tilde{\mu}_i) \\ \boldsymbol{\pi}\mid \alpha & \sim \mathrm{Dir} (\alpha/K\mathrm{1}_K) \\ \mu_k & \sim H(\lambda) \end{align}

Instead of imagining that each data point is first assigned a cluster and then drawn from the distribution associated to that cluster we now think of each observation being associated with parameter $\tilde{\mu}_{i}$ drawn from some discrete distribution $G$ with support on the $K$ means. That is, we are now treating the $\tilde{\mu}_{i}$ as being drawn from the random distribution $G$ and our prior information is incorporated into the model by the distribution over distributions $G$.

We would now like to extend this model to work without pre-specifying a fixed number of clusters $K$. Mathematically, this means we would like to select a random prior distribution $G=\sum_{k=1}^{\infty}\pi_k \delta_{\mu_k}(\tilde{\mu}_i)$ where the values of the clusters means $\mu_{k}$ are again i.i.d. $H\left(\lambda\right)$ and the distribution over $\pi_k$ is symmetric over the infinite set of clusters. This is exactly what is accomplished by the model:

\begin{align} v_i \mid \tilde{\mu}_i & \sim N(\tilde{\mu}_i,\sigma^2)\\ \tilde{\mu}_i & \sim G \\ G & \sim \mathrm{DP}(H(\lambda),\alpha) \end{align}

With this in hand we can better understand the computational merits of the Dirichlet Process. Suppose that we wanted to draw $n$ observations from the naive model with exactly $K$ clusters. A simple algorithm for doing this would be to draw $K$ values of $\mu_k$ from $H(\lambda)$, a distribution $\pi$ from $\mathrm{Dir}(\alpha/K\mathrm{1}_{K})$ and then for each observation independently sample the cluster $k$ with probability $\pi_{k}$ and the value of the observation according to $N\left(\mu_{k},\sigma^{2}\right)$. It is easy to see that this sort of algorithm can not be extended to case where we allow infinite clusters because this would require sampling an infinite dimensional parameter $\boldsymbol{\pi}$. However, as described above it is still possible to sample observations $v_{i}$ using the Chinese Restaurant algorithm, which makes use of de Finetti’s representation theorem to avoid having to explicitly specify $\boldsymbol{\pi}$.

Fitting the model described above based on observed data $D$ means finding the posterior distribution $p\left(\boldsymbol{\pi},\boldsymbol{\mu}\mid D\right)$ over cluster probabilities and their associated means. In the infinite dimensional case it is obviously impossible to write down the posterior explicitly. However, because it is nevertheless possible to sample data points $v_{i}$ it is possible to sample from the posterior distribution using a modified Gibbs sampler.[2] This is the critical fact that makes the Dirichlet process prior useful for inference.

Formal definition

A Dirichlet process over a set S is a stochastic process whose sample path (i.e. an infinite-dimensional set of random variates drawn from the process) is a probability distribution on S. The finite-dimensional distributions are from the Dirichlet distribution: If H is a probability distribution on S, $\alpha$ is a positive real number and X is a sample path drawn from a Dirichlet process, written as

$X \sim \mathrm{DP}\left(\alpha, H\right)$

then for any measureable partition of S, say $\left\{B_i\right\}_{i=1}^{n}$, we have that

$\left(X\left(B_1\right),\dots,X\left(B_n\right)\right) \sim \mathrm{Dirichlet}\left(\alpha H\left(B_1\right),\dots, \alpha H\left(B_n\right)\right).$

The Chinese restaurant process

As shown above, a simple distribution, the so-called Chinese restaurant process, results from considering the conditional distribution of one component assignment given all previous ones in a Dirichlet distribution mixture model with $K$ components, and then taking the limit as $K$ goes to infinity. It can be shown, using the above formal definition of the Dirichlet process and considering the process-centered view of the process, that the conditional distribution of the component assignment of one sample from the process given all previous samples follows a Chinese restaurant process.

Suppose that $J$ samples, $\left\{\theta_j\right\}_{j=1}^{J}$ have already been obtained. According to the Chinese Restaurant Process, the $\left(J+1\right)^{\mathrm{th}}$ sample should be drawn from

$\theta_{J+1} \sim \frac{1}{H\left(S\right)+J} \left( H + \sum_{j=1}^{J}\delta_{\theta_j}\right)$

where $\delta_{\theta}$ is an atomic distribution centered on $\theta$. Interpreting this, two properties are clear:

1. Even if $S$ is a countable set, there is a finite probability that two samples will have exactly the same value. Samples from a Dirichlet process are therefore discrete.
2. The Dirichlet process exhibits a self-reinforcing property; the more often a given value has been sampled in the past, the more likely it is to be sampled again.

The name "Chinese restaurant process" is derived from the following analogy: imagine an infinitely large restaurant containing an infinite number of tables, and able to serve an infinite number of dishes. The restaurant in question operates a somewhat unusual seating policy whereby new diners are seated either at a currently occupied table with probability proportional to the number of guests already seated there, or at an empty table with probability proportional to a constant. Guests who sit at an occupied table must order the same dish as those currently seated, whereas guests allocated a new table are served a new dish at random. The distribution of dishes after $J$ guests are served is a sample drawn as described above. The Chinese Restaurant Process is related to the Polya Urn sampling scheme for finite Dirichlet distributions.

The stick-breaking process

A third approach to the Dirichlet process is provided by the so-called stick-breaking process, which can be used to provide a constructive algorithm (the stick-breaking construction) for generating a Dirichlet process. Let $\left\{\beta'_k\right\}_{k=1}^\infty$ be a set of random variables such that

$\beta'_k \sim \mathrm{Beta}\left(1,\alpha\right)$

Define $\left\{\beta_k\right\}_{k=1}^\infty$ according to

$\beta_k = \beta'_k\cdot\prod_{i=1}^{k-1}\left(1-\beta'_i\right)$

and let $\left\{\theta_k\right\}_{k=1}^{\infty}$ be a set of samples from $H$. The distribution given by the density $f(\theta) = \sum_{k=1}^{\infty}\beta_k\cdot\delta_{\theta_k}(\theta)$ (where $\delta$ is the Dirac delta measure, here used as an indicator function which evaluates to $0$ except for $\delta_{\theta_k}(\theta_k)=1$), is then a sample from the corresponding Dirichlet process. This method provides an explicit construction of the non-parametric sample, and makes clear the fact that the samples are discrete.

The name 'stick-breaking' comes from the interpretation of $\beta_k$ as the length of the piece of a unit-length stick assigned to the kth value. After the first k − 1 values have their portions assigned, the length of the remainder of the stick, $\prod_{i=1}^{k-1}\left(1-\beta'_i\right)$, is broken according to a sample $\beta'_k$ from a beta distribution. In this analogy, $\beta'_k$ indicates the portion of the remainder to be assigned to the k-th value. The smaller $\alpha$ is, the less of the stick will be left for subsequent values (on average).

The Polya urn scheme

Yet another way to visualize the Dirichlet process and Chinese restaurant process is as a modified Polya urn scheme. Imagine that we start with an urn filled with $\alpha$ black balls. Then we proceed as follows:

1. Each time we need an observation, we draw a ball from the urn.
2. If the ball is black, we generate a new (non-black) color uniformly, label a new ball this color, drop the new ball into the urn along with the ball we drew, and return the color we generated.
3. Otherwise, label a new ball with the color of the ball we drew, drop the new ball into the urn along with the ball we drew, and return the color we observed.

The resulting distribution over colors is the same as the distribution over tables in the Chinese restaurant process. Furthermore, when we draw a black ball, if rather than generating a new color, we instead pick a random value from a base distribution $H$ and use that value to label the new ball, the resulting distribution over labels will be the same as the distribution over values in a Dirichlet process.

Applications of the Dirichlet process

Dirichlet processes are frequently used in Bayesian nonparametric statistics. "Nonparametric" here does not mean a parameter-less model, rather a model in which representations grow as more data are observed. Bayesian nonparametric models have gained considerable popularity in the field of machine learning because of the above-mentioned flexibility, especially in unsupervised learning. In a Bayesian nonparametric model, the prior and posterior distributions are not parametric distributions, but stochastic processes.[3] The fact that the Dirichlet distribution is a probability distribution on the simplex of sets of non-negative numbers that sum to one makes it a good candidate to model distributions of distributions or distributions of functions. Additionally, the non-parametric nature of this model makes it an ideal candidate for clustering problems where the distinct number of clusters is unknown beforehand.

As draws from a Dirichlet process are discrete, an important use is as a prior probability in infinite mixture models. In this case, $S$ is the parametric set of component distributions. The generative process is therefore that a sample is drawn from a Dirichlet process, and for each data point in turn a value is drawn from this sample distribution and used as the component distribution for that data point. The fact that there is no limit to the number of distinct components which may be generated makes this kind of model appropriate for the case when the number of mixture components is not well-defined in advance. For example, the infinite mixture of Gaussians model.[4]

The infinite nature of these models also lends them to natural language processing applications, where it is often desirable to treat the vocabulary as an infinite, discrete set.

The Dirichlet Process can also be used for nonparametric hypothesis testing, i.e., to develop Bayesian nonparametric versions of the classical nonparametric hypothesis tests, e.g., sign test, Wilcoxon rank sum test, Wilcoxon signed-rank test etc.. For instance, Bayesian nonparametric versions of the Wilcoxon rank sum test and the Wilcoxon signed-rank test have been developed by using the Imprecise Dirichlet process, a prior ignorance Dirichlet process.