User:Skinnerd/Simplex Point Picking

From Wikipedia, the free encyclopedia

Copied from [1]

Random sampling[edit]

(Also called Simplex Point Picking) There are at least two efficient ways to generate uniform random samples from the unit simplex.

The first method is based on the fact that sampling from the K-dimensional unit simplex is equivalent to sampling from a Dirichlet distribution with parameters α = (α1, ..., αK) all equal to one. The exact procedure would be as follows:

  • Generate K unit-exponential distributed random draws x1, ..., xK.
    • This can be done by generating K uniform random draws yi from the open interval (0,1) and setting xi=-ln(yi).
  • Set S to be the sum of all the xi.
  • The K coordinates t1, ..., tK of the final point on the unit simplex are given by ti=xi/S.

The second method to generate a random point on the unit simplex is based on the order statistics of the uniform distribution on the unit interval (see Devroye, p. 568). The algorithm is as follows:

  • Set p0 = 0 and pK=1.
  • Generate K-1 uniform random draws pi from the open interval (0,1).
  • Sort into ascending order the K+1 points p0, ..., pK.
  • The K coordinates t1, ..., tK of the final point on the unit simplex are given by ti=pi-pi-1.

Although Smith and Tromble[1] have raised some concerns regarding the validity of this algorithm, these can be ignored in computer implementations, as they manifest themselves only if the pseudo-random number generator used to generate the yi generates an exact zero or two identical numbers - both events with such low probability that they can be ignored in virtually all implementations.

Random walk[edit]

Sometimes, rather than picking a point on the simplex at random we need to perform a uniform random walk on the simplex. Such random walks are frequently required for Monte Carlo method computations such as Markov chain Monte Carlo over the simplex domain. [2]

An efficient algorithm to do the walk can be derived from the fact that the normalized sum of K unit-exponential random variables is distributed uniformly over the simplex. We begin by defining a univariate function that "walks" a given sample over the positive real line such that the stationary distribution of its samples is the unit-exponential distribution. The function makes use of the Metropolis-Hastings algorithm to sample the new point given the old point. Such a function can be written as the following, where h is the relative step-size:

next_point <- function(x_old)
{
    repeat {
        x_new <- x_old * exp( Random_Normal(0,h) )
        metropolis_ratio <- exp(-x_new) / exp(-x_old)
        hastings_ratio <- ( x_new / x_old )
        acceptance_probability <- min( 1 , metropolis_ratio * hastings_ratio )
        if ( acceptance_probability > Random_Uniform(0,1) ) break
    }
    return(x_new)
}

Then to perform a random walk over the simplex:

  • Begin by drawing each element xi, i= 1, 2, ..., K, from a unit-exponential distribution.
  • For each i= 1, 2, ..., K
    • xi ← next_point(xi)
  • Set S to the sum of all the xi
  • Set ti = xi /S for all i= 1, 2, ..., K

The set of ti will be restricted to the simplex, and will walk ergodically over the domain with a uniform stationary density. Note that it is important not to re-normalize the xi at each step; doing so will result in a non-uniform stationary distribution. Instead, think of the xi as "hidden" parameters, with the simplex coordinates given by the set of ti.

This procedure effectively samples x_new from a gamma random variable with mean of x_old and standard deviation of h*x_old. If library routines are available to generate the requisite gamma variate directly, they may be used instead. The Hastings ratio for the MCMC step (which is different and independent of the Hastings-ratio in the next_point function) can then be computed given the gamma density function. Although it is theoretically possible to sample from a gamma density directly, experience shows that doing so is numerically unstable. In contrast, the next_point function is numerically stable even after many, many iterations.

Notes[edit]

  1. ^ Noah A. Smith; Roy W. Tromble (August 2004), Sampling uniformly from the unit simplex (PDF), Department of computer science, Center for Language and Speech processing, Johns Hopkins University, retrieved 2009-11-20
  2. ^ Andrew D. Fernandes; William R. Atchley (October 1, 2008), Site-specific evolutionary rates in proteins are better modeled as non-independent and strictly relative, retrieved 2010-3-28 {{citation}}: Check date values in: |accessdate= (help)