User:Skinnerd/Simplex Point Picking

Copied from [1]

Random sampling[edit]

(Also called Simplex Point Picking) There are at least two efficient ways to generate uniform random samples from the unit simplex.

The first method is based on the fact that sampling from the K-dimensional unit simplex is equivalent to sampling from a Dirichlet distribution with parameters α = (α₁, ..., α_K) all equal to one. The exact procedure would be as follows:

Generate K unit-exponential distributed random draws x₁, ..., x_K.
- This can be done by generating K uniform random draws y_i from the open interval (0,1) and setting x_i=-ln(y_i).
Set S to be the sum of all the x_i.
The K coordinates t₁, ..., t_K of the final point on the unit simplex are given by t_i=x_i/S.

The second method to generate a random point on the unit simplex is based on the order statistics of the uniform distribution on the unit interval (see Devroye, p. 568). The algorithm is as follows:

Set p₀ = 0 and p_K=1.
Generate K-1 uniform random draws p_i from the open interval (0,1).
Sort into ascending order the K+1 points p₀, ..., p_K.
The K coordinates t₁, ..., t_K of the final point on the unit simplex are given by t_i=p_i-p_i-1.

Although Smith and Tromble^[1] have raised some concerns regarding the validity of this algorithm, these can be ignored in computer implementations, as they manifest themselves only if the pseudo-random number generator used to generate the y_i generates an exact zero or two identical numbers - both events with such low probability that they can be ignored in virtually all implementations.

Random walk[edit]

Sometimes, rather than picking a point on the simplex at random we need to perform a uniform random walk on the simplex. Such random walks are frequently required for Monte Carlo method computations such as Markov chain Monte Carlo over the simplex domain. ^[2]

An efficient algorithm to do the walk can be derived from the fact that the normalized sum of K unit-exponential random variables is distributed uniformly over the simplex. We begin by defining a univariate function that "walks" a given sample over the positive real line such that the stationary distribution of its samples is the unit-exponential distribution. The function makes use of the Metropolis-Hastings algorithm to sample the new point given the old point. Such a function can be written as the following, where h is the relative step-size:

next_point <- function(x_old)
{
    repeat {
        x_new <- x_old * exp( Random_Normal(0,h) )
        metropolis_ratio <- exp(-x_new) / exp(-x_old)
        hastings_ratio <- ( x_new / x_old )
        acceptance_probability <- min( 1 , metropolis_ratio * hastings_ratio )
        if ( acceptance_probability > Random_Uniform(0,1) ) break
    }
    return(x_new)
}

Then to perform a random walk over the simplex:

Begin by drawing each element x_i, i= 1, 2, ..., K, from a unit-exponential distribution.
For each i= 1, 2, ..., K
- x_i ← next_point(x_i)
Set S to the sum of all the x_i
Set t_i = x_i /S for all i= 1, 2, ..., K

The set of t_i will be restricted to the simplex, and will walk ergodically over the domain with a uniform stationary density. Note that it is important not to re-normalize the x_i at each step; doing so will result in a non-uniform stationary distribution. Instead, think of the x_i as "hidden" parameters, with the simplex coordinates given by the set of t_i.

This procedure effectively samples x_new from a gamma random variable with mean of x_old and standard deviation of h*x_old. If library routines are available to generate the requisite gamma variate directly, they may be used instead. The Hastings ratio for the MCMC step (which is different and independent of the Hastings-ratio in the next_point function) can then be computed given the gamma density function. Although it is theoretically possible to sample from a gamma density directly, experience shows that doing so is numerically unstable. In contrast, the next_point function is numerically stable even after many, many iterations.

Notes[edit]

^ Noah A. Smith; Roy W. Tromble (August 2004), Sampling uniformly from the unit simplex (PDF), Department of computer science, Center for Language and Speech processing, Johns Hopkins University, retrieved 2009-11-20
^ Andrew D. Fernandes; William R. Atchley (October 1, 2008), Site-specific evolutionary rates in proteins are better modeled as non-independent and strictly relative, retrieved 2010-3-28 {{citation}}: Check date values in: |accessdate= (help)

[1] Noah A. Smith; Roy W. Tromble (August 2004), Sampling uniformly from the unit simplex (PDF), Department of computer science, Center for Language and Speech processing, Johns Hopkins University, retrieved 2009-11-20

[2] Andrew D. Fernandes; William R. Atchley (October 1, 2008), Site-specific evolutionary rates in proteins are better modeled as non-independent and strictly relative, retrieved 2010-3-28 {{citation}}: Check date values in: |accessdate= (help)

[1]

[2]