In statistics and in statistical physics, the Metropolis–Hastings algorithm is a Markov chain Monte Carlo (MCMC) method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult. This sequence can be used to approximate the distribution (i.e., to generate a histogram), or to compute an integral (such as an expected value). Metropolis–Hastings and other MCMC algorithms are generally used for sampling from multi-dimensional distributions, especially when the number of dimensions is high. For single-dimensional distributions, other methods are usually available (e.g. adaptive rejection sampling) that can directly return independent samples from the distribution, and are free from the problem of auto-correlated samples that is inherent in MCMC methods.
Another strategy is to consider a sequence of interacting Metropolis-Hasting algorithms associated with an interpolating sequence of target probability distributions with an increasing level of sampling complexity (including increasing constraint levels, decreasing temperature schedule, and others). In contrast to traditional MCMC methods, the precision parameter is only related to the number of interacting Metropolis-Hasting samplers. These advanced particle methodologies belong to the class of Feynman-Kac mean field particle models, also called Sequential Monte Carlo or particle filter methods. They can also be interpreted as a simple mutation-selection genetic algorithm with Metropolis-Hastings mutations.
The algorithm was named after Nicholas Metropolis, who was an author along with Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller of the 1953 paper Equation of State Calculations by Fast Computing Machines which first proposed the algorithm for the specific case of the canonical ensemble; and W. K. Hastings who extended it to the more general case in 1970. There is controversy over the credit for discovery of the algorithm. Edward Teller states in his memoirs that the five authors of the 1953 paper worked together for "days (and nights)". M. Rosenbluth, in an oral history recorded shortly before his death credits E. Teller with posing the original problem, himself with solving it, and A.W. Rosenbluth (his wife) with programming the computer. According to M. Rosenbluth, neither Metropolis nor A.H. Teller participated in any way. Rosenbluth's account of events is supported by other contemporary recollections. According to Roy Glauber and Emilio Segrè, the original algorithm was invented by Enrico Fermi and reinvented by Stan Ulam.
The Metropolis–Hastings algorithm can draw samples from any probability distribution P(x), provided you can compute the value of a function f(x) which is proportional to the density of P. The lax requirement that f(x) should be merely proportional to the density, rather than exactly equal to it, makes the Metropolis–Hastings algorithm particularly useful, because calculating the necessary normalization factor is often extremely difficult in practice.
The Metropolis–Hastings algorithm works by generating a sequence of sample values in such a way that, as more and more sample values are produced, the distribution of values more closely approximates the desired distribution, P(x). These sample values are produced iteratively, with the distribution of the next sample being dependent only on the current sample value (thus making the sequence of samples into a Markov chain). Specifically, at each iteration, the algorithm picks a candidate for the next sample value based on the current sample value. Then, with some probability, the candidate is either accepted (in which case the candidate value is used in the next iteration) or rejected (in which case the candidate value is discarded, and current value is reused in the next iteration)−the probability of acceptance is determined by comparing the values of the function f(x) of the current and candidate sample values with respect to the desired distribution P(x).
For the purpose of illustration, the Metropolis algorithm, a special case of the Metropolis–Hastings algorithm where the proposal function is symmetric, is described below.
Metropolis algorithm (symmetric proposal distribution)
Let f(x) be a function that is proportional to the desired probability distribution P(x) (a.k.a. a target distribution).
- Initialization: Choose an arbitrary point x0 to be the first sample, and choose an arbitrary probability density which suggests a candidate for the next sample value x, given the previous sample value y. For the Metropolis algorithm, Q must be symmetric; in other words, it must satisfy . A usual choice is to let be a Gaussian distribution centered at y, so that points closer to y are more likely to be visited next—making the sequence of samples into a random walk. The function Q is referred to as the proposal density or jumping distribution.
- For each iteration t:
- Generate a candidate x' for the next sample by picking from the distribution .
- Calculate the acceptance ratio α = f(x')/f(xt), which will be used to decide whether to accept or reject the candidate. Because f is proportional to the density of P, we have that α = f(x')/f(xt) = P(x')/P(xt).
- If α ≥ 1, then the candidate is more likely than xt; automatically accept the candidate by setting xt+1 = x'. Otherwise, accept the candidate with probability α; if the candidate is rejected, set xt+1 = xt, instead.
This algorithm proceeds by randomly attempting to move about the sample space, sometimes accepting the moves and sometimes remaining in place. Note that the acceptance ratio indicates how probable the new proposed sample is with respect to the current sample, according to the distribution . If we attempt to move to a point that is more probable than the existing point (i.e. a point in a higher-density region of ), we will always accept the move. However, if we attempt to move to a less probable point, we will sometimes reject the move, and the more the relative drop in probability, the more likely we are to reject the new point. Thus, we will tend to stay in (and return large numbers of samples from) high-density regions of , while only occasionally visiting low-density regions. Intuitively, this is why this algorithm works, and returns samples that follow the desired distribution .
Compared with an algorithm like adaptive rejection sampling that directly generates independent samples from a distribution, Metropolis–Hastings and other MCMC algorithms have a number of disadvantages:
- The samples are correlated. Even though over the long term they do correctly follow , a set of nearby samples will be correlated with each other and not correctly reflect the distribution. This means that if we want a set of independent samples, we have to throw away the majority of samples and only take every nth sample, for some value of n (typically determined by examining the auto-correlation between adjacent samples). Auto-correlation can be reduced by increasing the jumping width (the average size of a jump, which is related to the variance of the jumping distribution), but this will also increase the likelihood of rejection of the proposed jump. Too large or too small a jumping size will lead to a slow-mixing Markov chain, i.e. a highly correlated set of samples, so that a very large number of samples will be needed to get a reasonable estimate of any desired property of the distribution.
- Although the Markov chain eventually converges to the desired distribution, the initial samples may follow a very different distribution, especially if the starting point is in a region of low density. As a result, a burn-in period is typically necessary, where an initial number of samples (e.g. the first 1,000 or so) are thrown away.
On the other hand, most simple rejection sampling methods suffer from the "curse of dimensionality", where the probability of rejection increases exponentially as a function of the number of dimensions. Metropolis–Hastings, along with other MCMC methods, do not have this problem to such a degree, and thus are often the only solutions available when the number of dimensions of the distribution to be sampled is high. As a result, MCMC methods are often the methods of choice for producing samples from hierarchical Bayesian models and other high-dimensional statistical models used nowadays in many disciplines.
In multivariate distributions, the classic Metropolis–Hastings algorithm as described above involves choosing a new multi-dimensional sample point. When the number of dimensions is high, finding the right jumping distribution to use can be difficult, as the different individual dimensions behave in very different ways, and the jumping width (see above) must be "just right" for all dimensions at once to avoid excessively slow mixing. An alternative approach that often works better in such situations, known as Gibbs sampling, involves choosing a new sample for each dimension separately from the others, rather than choosing a sample for all dimensions at once. This is especially applicable when the multivariate distribution is composed out of a set of individual random variables in which each variable is conditioned on only a small number of other variables, as is the case in most typical hierarchical models. The individual variables are then sampled one at a time, with each variable conditioned on the most recent values of all the others. Various algorithms can be used to choose these individual samples, depending on the exact form of the multivariate distribution: some possibilities are adaptive rejection sampling, a one-dimensional Metropolis–Hastings step, or slice sampling.
Formal derivation of the Metropolis-Hastings algorithm
The purpose of the Metropolis–Hastings algorithm is to generate a collection of states according to a desired distribution P(x). To accomplish this, the algorithm uses a Markov process which asymptotically reaches a unique stationary distribution π(x) such that π(x)=P(x) .
A Markov process is uniquely defined by its transition probabilities, the probability of transitioning between any two states x to x'. It has a unique stationary distribution π(x) when the following two conditions are met:
- existence of stationary distribution: there must exist a stationary distribution π(x). A sufficient but not necessary condition is detailed balance which requires that each transition x→x' is reversible: for every pair of states x, x', the probability of being in state x and transit to the state x' must be equal to the probability of being in state x' and transit to the state x, .
- uniqueness of stationary distribution: the stationary distribution π(x) must be unique. This is guaranteed by ergodicity of the Markov process, which requires that every state must (1) be aperiodic—the system does not return to the same state at fixed intervals; and (2) be positive recurrent—the expected number of steps for returning to the same state is finite.
Metropolis–Hastings algorithm resides in designing a Markov process (by constructing transition probabilities) which fulfils the two above conditions, such that its stationary distribution π(x) is chosen to be P(x). The derivation of the algorithm starts with the condition of detailed balance:
which is re-written as
The approach is to separate the transition in two sub-steps; the proposal and the acceptance-rejection. The proposal distribution is the conditional probability of proposing a state x' given x, and the acceptance distribution the conditional probability to accept the proposed state x'. The transition probability can be written as the product of them:
(Note that strictly speaking, this is not a proper transition probability since it does not sum to 1 over all x'. This is easily fixed by setting the remaining probability to propose x'=x.)
Inserting this relation the previous equation, we have
The next step in the derivation is to choose an acceptance that fulfills detailed balance. One common choice is the Metropolis choice:
i.e., we always accept when the acceptance is bigger than 1, and we reject accordingly when the acceptance is smaller than 1. This constitutes the required quantity for implementing the algorithm.
The Metropolis–Hastings algorithm thus consists in the following:
- Initialisation: pick an initial state x at random;
- randomly pick a state x' according to ;
- accept the state according to . If not accepted, that means that x' = x, and so there is no need to update anything. Else, the system transits to x';
- go to 2 until T states were generated;
- save the state x, go to 2.
The saved states are in principle drawn from the distribution , as step 4 ensures they are de-correlated. The value of T must be chosen according to different factors such as the proposal distribution and, formally, it has to be of the order of the autocorrelation time of the Markov process.
It is important to notice that it is not clear, in a general problem, which distribution one should use; it is a free parameter of the method which has to be adjusted to the particular problem in hand.
Suppose the most recent value sampled is . To follow the Metropolis–Hastings algorithm, we next draw a new proposal state with probability density , and calculate a value
is the probability (e.g., Bayesian posterior) ratio between the proposed sample and the previous sample , and
is the ratio of the proposal density in two directions (from to and vice versa). This is equal to 1 if the proposal density is symmetric. Then the new state is chosen according to the following rules.
The Markov chain is started from an arbitrary initial value and the algorithm is run for many iterations until this initial state is "forgotten". These samples, which are discarded, are known as burn-in. The remaining set of accepted values of represent a sample from the distribution .
The algorithm works best if the proposal density matches the shape of the target distribution from which direct sampling is difficult, that is . If a Gaussian proposal density is used the variance parameter has to be tuned during the burn-in period. This is usually done by calculating the acceptance rate, which is the fraction of proposed samples that is accepted in a window of the last samples. The desired acceptance rate depends on the target distribution, however it has been shown theoretically that the ideal acceptance rate for a one-dimensional Gaussian distribution is approx 50%, decreasing to approx 23% for an -dimensional Gaussian target distribution.
If is too small the chain will mix slowly (i.e., the acceptance rate will be high but successive samples will move around the space slowly and the chain will converge only slowly to ). On the other hand, if is too large the acceptance rate will be very low because the proposals are likely to land in regions of much lower probability density, so will be very small and again the chain will converge very slowly.
Interacting Metropolis-Hastings methodology
The central idea is to consider an interpolating sequence of target distributions with an increasing complexity of sampling and such that . We let be a function that is proportional to , for each . In this notation, for any k we clearly have the product formulafor some normalizing constants When the state space is finite, for any we have
For probability densities w.r.t. the Lebesgue measure dx on , for some dimension , we have
We assume that it is easy to sample independent random variables with common probability, and the functions take values in and they can be computed easily for any state x.
We illustrate these rather abstract models with a couple of examples:
- Boltzmann-Gibbs distributions associated with an energy function on some finite set and some inverse temperature parameter are defined byIn this situation, we can choose the interpolating sequence By construction, we have
- Let be the probability density of some real valued random variable . Let be some subset such that and set with the indicator function of a set . In this situation, we can choose the interpolating sequence By construction, we have
For any denote by the transition probability of the Metropolis–Hastings algorithm with target invariant measure . To simplify the presentation, we assume that the state space s finite.
- Initially, we sample a large number of independent random variables with common probability distribution . By the law of large numbers, we have
- Acceptance-rejection with recycling: The objective is to design a new population of samples with distribution . With a probability we accept the state and we set . Otherwise, we sample a new indidual in the current population with the probability measure
and we set Using the fact that we have the estimates
- Proposal-exploration: The objective is to design a new population with distribution using the Metropolis-Hastings transitions . From each selected states we sample independently a transition By the law of large numbers, we haveThe next acceptance-rejection transition with recycling is defined as above by replacing by ; and the second proposal-exploration is defined by moving the selected states with the Metropolis-Hastings transition , and so on.
The interacting Metropolis-Hastings defined above belong to the class of mean field interacting particle methods and Feynman-Kac particle models (a.k.a. Sequential Monte Carlo methodologies). In contrast to traditional Markov Chain Monte Carlo methods relying on the stability of Markov chain, the precision interacting Metropolis-Hastings only depends on the size of the system, in the sense that for any time we have
In addition, we have the unbiased estimates of the normalizing constantsIn addition, for any function f(.) on S we have the Feynman-Kac formulae
where stands for a non homogenous Markov chain with initial distribution and Markov transitions .
The analysis of the convergence of Feynman-Kac particle methods has been started in 1996 and in 2000 in the book and the series of articles. More recent developments can be found in the books. Applications of these Feynman-Kac methodologies to interacting Metropolis–Hastings algorithms are discussed in the series of articles. The unbiased properties of the particle estimate of the ratio are proved in in the context of filtering problems. Online adaptive choices of the temperature parameters and the decreasing subsets in the above examples can also be found in. In Operation Research, the interacting MCMC algorithm associated with target distributions defined in terms of indicator functions is sometimes referred as subset simulation.
- Simulated annealing
- Detailed balance
- Multiple-try Metropolis
- Metropolis light transport
- Gibbs sampling
- Parallel tempering
- Sequential Monte Carlo
- Genetic algorithms
- Mean field particle methods
- Metropolis, N.; Rosenbluth, A.W.; Rosenbluth, M.N.; Teller, A.H.; Teller, E. (1953). "Equations of State Calculations by Fast Computing Machines". Journal of Chemical Physics 21 (6): 1087–1092. Bibcode:1953JChPh..21.1087M. doi:10.1063/1.1699114.
- Rosenthal, Jeffrey (March 2004). "W.K. Hastings, Statistician and Developer of the Metropolis-Hastings Algorithm". Retrieved 2009-06-02.
- Hastings, W.K. (1970). "Monte Carlo Sampling Methods Using Markov Chains and Their Applications". Biometrika 57 (1): 97–109. doi:10.1093/biomet/57.1.97. JSTOR 2334940. Zbl 0219.65008.
- Teller, Edward. Memoirs: A Twentieth-Century Journey in Science and Politics. Perseus Publishing, 2001, p. 328
- Rosenbluth, Marshall. "Oral History Transcript". American Institute of Physics
- J.E. Gubernatis (2005). "Marshall Rosenbluth and the Metropolis Algorithm". Physics of Plasmas 12 (5): 057303. Bibcode:2005PhPl...12e7303G. doi:10.1063/1.1887186.
- Robert, Christian; Casella, George (2004). Monte Carlo Statistical Methods. Springer. ISBN 0387212396.
- Newman, M. E. J.; Barkema, G. T. (1999). Monte Carlo Methods in Statistical Physics. USA: Oxford University Press. ISBN 0198517971.
- Roberts, G.O.; Gelman, A.; Gilks, W.R. (1997). "Weak convergence and optimal scaling of random walk Metropolis algorithms". Ann. Appl. Probab. 7 (1): 110–120. doi:10.1214/aoap/1034625254.
- Del Moral, Pierre (2013). Mean field simulation for Monte Carlo integration. Chapman & Hall/CRC Press. p. 626.
Monographs on Statistics & Applied Probability
- Del Moral, Pierre (2004). Feynman-Kac formulae. Genealogical and interacting particle approximations. Springer. p. 575.
Series: Probability and Applications
- Del Moral, Pierre; Miclo, Laurent (2000). Branching and Interacting Particle Systems Approximations of Feynman-Kac Formulae with Applications to Non-Linear Filtering. (PDF). Lecture Notes in Mathematics 1729. pp. 1–145. doi:10.1007/bfb0103798.
- "Sequential Monte Carlo samplers - P. Del Moral - A. Doucet - A. Jasra - 2006 - Journal of the Royal Statistical Society: Series B (Statistical Methodology) - Wiley Online Library". onlinelibrary.wiley.com. Retrieved 2015-06-11.
- Del Moral, Pierre (1996). "Non Linear Filtering: Interacting Particle Solution." (PDF). Markov Processes and Related Fields 2 (4): 555–580.
- Del Moral, Pierre (1998). "Measure Valued Processes and Interacting Particle Systems. Application to Non Linear Filtering Problems". Annals of Applied Probability (Publications du Laboratoire de Statistique et Probabilités, 96-15 (1996) ed.) 8 (2): 438–495. doi:10.1214/aoap/1028903535.
- Crisan, Dan; Del Moral, Pierre; Lyons, Terry (1999). "Discrete filtering using branching and interacting particle systems" (PDF). Markov Processes and Related Fields 5 (3): 293–318.
- Del Moral, Pierre; Guionnet, Alice (1999). "On the stability of Measure Valued Processes with Applications to filtering". C.R. Acad. Sci. Paris 39 (1): 429–434.
- Del Moral, Pierre; Guionnet, Alice (2001). "On the stability of interacting processes with applications to filtering and genetic algorithms". Annales de l'Institut Henri Poincaré, 37 (2): 155–194. doi:10.1016/s0246-0203(00)01064-5.
- Del Moral, P.; Guionnet, A. (1999). "Central limit theorem for nonlinear filtering and interacting particle systems". The Annals of Applied Probability 9 (2): 275–297. doi:10.1214/aoap/1029962742. ISSN 1050-5164.
- Del Moral, Pierre; Miclo, Laurent (2001). "Genealogies and Increasing Propagation of Chaos For Feynman-Kac and Genetic Models". The Annals of Applied Probability 11 (4): 1166–1198. doi:10.1214/aoap/1015345399. ISSN 1050-5164.
- Del Moral, Pierre; Rio, Emmanuel (2011). "Concentration inequalities for mean field particle models". The Annals of Applied Probability 21 (3): 1017–1052. doi:10.1214/10-AAP716. ISSN 1050-5164.
- Del Moral, Pierre (2004). Feynman-Kac formulae. Genealogical and interacting particle approximations. http://www.springer.com/gp/book/9780387202686: Springer. Series: Probability and Applications. p. 556. ISBN 978-0-387-20268-6.
- Del Moral, Pierre; Hu, Peng; Wu, Liming (2012). On the Concentration Properties of Interacting Particle Processes. Hanover, MA, USA: Now Publishers Inc. ISBN 1601985126.
- Del Moral, Pierre; Doucet, Arnaud; Jasra, Ajay (2012). "On Adaptive Resampling Procedures for Sequential Monte Carlo Methods" (PDF). Bernoulli 18 (1): 252–278. doi:10.3150/10-bej335.
- Bernd A. Berg. Markov Chain Monte Carlo Simulations and Their Statistical Analysis. Singapore, World Scientific, 2004.
- Siddhartha Chib and Edward Greenberg: "Understanding the Metropolis–Hastings Algorithm". American Statistician, 49(4), 327–335, 1995
- David D. L. Minh and Do Le Minh. "Understanding the Hastings Algorithm." Communications in Statistics - Simulation and Computation, 44:2 332-349, 2015
- Bolstad, William M. (2010) Understanding Computational Bayesian Statistics, John Wiley & Sons ISBN 0-470-04609-0