# Talk:Metropolis–Hastings algorithm

WikiProject Statistics (Rated C-class, Low-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.


## x prime or x_t?

Here are two conflicting comments on the case a>1, which I have cut and pasted for readability. --Rinconsoleao 15:12, 16 July 2007 (UTC)

There seems to be a typo in the last display. If a>1, we would accept the proposal, x', not x_t. --Youyifong 06:38, 28 July 2006 (UTC)
Hi, is there a mistake here? In the step-by-step instructions,
${\displaystyle x^{t+1}=\left\{{\begin{matrix}x'~({\mbox{or }}x^{t}?)&{\mbox{if }}a>1\\x'{\mbox{ with probability }}a,&{\mbox{if }}a<1\end{matrix}}\right.}$
(Postdoc 02:30, 16 July 2007 (UTC))

The algorithm always accepts if a>1. That is, ${\displaystyle x^{t+1}=x'}$ when a>1. Notice that this is consistent with the statements about the uniform random variable ${\displaystyle u}$ in the previous paragraph. I have rewritten the "step-by-step instructions" to make this clearer. --Rinconsoleao 15:12, 16 July 2007 (UTC)

Please check the page again. The page mis-typed x' as x_t, as Youyifong mentioned above. Also, shouldn't it be "a ≥ 1"? (Postdoc 05:38, 30 July 2007 (UTC))
Thanks! --Rinconsoleao 15:44, 31 July 2007 (UTC)
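For concreteness, the rule agreed on above (accept unconditionally when a ≥ 1, otherwise accept with probability a) can be sketched in Python; the function name is illustrative, not from the article:

```python
import random

def mh_accept(a):
    """Return True if the proposal x' should be accepted, given the ratio a.

    Accept unconditionally when a >= 1; otherwise accept with probability a.
    Equivalently: accept iff u < a for u ~ Uniform(0, 1), which is why the
    two cases can also be written as a single comparison against u.
    """
    return a >= 1 or random.random() < a
```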

## Gibbs sampling

The article says that the Gibbs-sampler is a special case of the MH-algorithm. As far as I know it is a complementary MCMC-method which works quite differently. --Smeyen 01:07, 24 February 2007 (UTC)

In the Gibbs sampler, the proposal distribution is the full conditional distribution of some part of the parameter space given the rest, which always results in an acceptance probability of 1. See [http://www.stat.purdue.edu/~lebanon/notes/metropolis.pdf] (Postdoc 05:46, 30 July 2007 (UTC))
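To spell out why the acceptance probability is 1 (notation here is mine, for illustration): if the proposal updates component ${\displaystyle i}$ by drawing from its full conditional, ${\displaystyle Q(x'|x)=P(x'_{i}|x_{-i})}$ with ${\displaystyle x'_{-i}=x_{-i}}$, then factoring ${\displaystyle P(x)=P(x_{i}|x_{-i})\,P(x_{-i})}$ gives

${\displaystyle {\frac {P(x')\,Q(x|x')}{P(x)\,Q(x'|x)}}={\frac {P(x'_{i}|x_{-i})\,P(x_{-i})\,P(x_{i}|x_{-i})}{P(x_{i}|x_{-i})\,P(x_{-i})\,P(x'_{i}|x_{-i})}}=1,}$

so the Metropolis–Hastings rule accepts every such proposal.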

## Complexity

The Markov chain is started from a random initial value ${\displaystyle x^{0}}$ and the algorithm is run for many iterations until this initial state is "forgotten"

No bounds are known on the number of iterations needed, e.g. the mixing time of the Markov chain? Can we really claim in the introduction that this algorithm allows us to compute something if we don't know for how long it should be iterated? Levochik (talk) 07:43, 26 August 2009 (UTC)

## Example

First of all, thank you for this article. It's the best I could find on the web about Monte Carlo simulation. But unfortunately, I still don't get it. How can I compute a=a_1a_2 if I don't know P(x)? Perhaps you could additionally include an example: something where one can see the step-by-step instructions applied to a simple case, like a record of a few steps from a real simulation. —Preceding unsigned comment added by Askamaster (talkcontribs) 10:30, 25 March 2010 (UTC)

You don't need P(x), you only need a function proportional to P(x), as the normalizations cancel in the ratio. It is quite often the case that you can easily determine a function proportional to P(x), but the integral to normalize it is very difficult to perform. This comes up quite often in Bayesian computations of the posterior distribution. Rjljr2 (talk) 15:49, 11 April 2010 (UTC)
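To illustrate the point above, here is a minimal Python sketch (a hypothetical random-walk chain targeting a standard normal that is known only up to its normalizing constant); note that only a function proportional to P(x) ever enters the acceptance ratio:

```python
import math
import random

def metropolis(f, x0, n_steps, step=1.0, seed=0):
    """Sample from a density proportional to f with a random-walk Metropolis chain.

    Only f (an unnormalized density) is needed: the unknown normalizing
    constant cancels in the ratio a = f(x') / f(x).
    """
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        x_prop = x + rng.gauss(0.0, step)   # symmetric proposal Q
        a = f(x_prop) / f(x)                # normalization cancels here
        if a >= 1 or rng.random() < a:      # accept with probability min(1, a)
            x = x_prop
        samples.append(x)
    return samples

def unnormalized(x):
    # Standard normal up to a constant; the 1/sqrt(2*pi) factor is omitted.
    return math.exp(-0.5 * x * x)

chain = metropolis(unnormalized, x0=0.0, n_steps=20000)
mean = sum(chain) / len(chain)
```

Despite never normalizing, the chain's empirical mean and variance approach those of the standard normal.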

## Proposal density

Clearly the Markov chain doesn't converge for all proposal densities Q. Certainly Q > 0 is sufficient, but I don't believe it is necessary. What condition is required for the proposal density? — Preceding unsigned comment added by WuTheFWasThat (talkcontribs) 19:24, 16 May 2012 (UTC)

## Overview

The section Overview tries to give a formal derivation of the algorithm. The foregoing sections are so well written that they were very easy to understand for me, even though I am not an expert in the field. Bringing the section Overview to the same didactic level would be an asset. The questions I would like to have answered are: How can I see that the equation is still redundant, and what does the term redundant mean in this context? How does one come to the idea of defining ${\displaystyle A(x\rightarrow x')}$ as the specified minimum function?

Some other proposals for improvements: The meaning of the variable ${\displaystyle x'}$ seems to be different from the one in the section Intuition. Would it be an option to use another notation? Furthermore, the section Overview repeats the definition of the algorithm from the section Intuition. However, at the same time it lists the steps of skipping samples to mitigate correlations. I suggest putting the complete list of steps into the previous section. Then a proper name for the section Overview would be, e.g., Formal Derivation.

Sorry, I am new to wikipedia and do not want to just poke your text. Please let me know if you wish me to do some of the mentioned improvements. —Contribo (talk) 20:20, 3 December 2012 (UTC)

I agree with you, "Overview" is not the proper name; maybe "Formal derivation", as you suggested, sounds better. Nevertheless, the Intuition section does not present the modern view of the Metropolis–Hastings algorithm, in the sense that it does not use a general proposal (it is always assumed symmetrical, which is not the general case). However, I think the Intuition part is meant to convey the idea of the method; that is why it skips the fact that in the general case the proposal also has to be considered in the acceptance.
Redundant in the sense that there can be other choices for A which also fulfill the equation you mention (maybe it is under-determined). I don't recall my statistical physics lectures, but I know there is at least one other choice of A which also fulfills it and yet does not use the "min". Jorgecarleitao (talk) 20:53, 3 December 2012 (UTC)
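For later readers: the equation in question is the detailed balance condition

${\displaystyle P(x)\,Q(x\rightarrow x')\,A(x\rightarrow x')=P(x')\,Q(x'\rightarrow x)\,A(x'\rightarrow x)}$

and, if I recall correctly, the alternative choice that fulfills it without using the "min" is Barker's (1965) acceptance function

${\displaystyle A(x\rightarrow x')={\frac {P(x')\,Q(x'\rightarrow x)}{P(x)\,Q(x\rightarrow x')+P(x')\,Q(x'\rightarrow x)}}}$

as can be checked by substituting it into both sides of the balance condition.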
Thank you for your explanations, I did not expect to get an answer so quickly. I bring these issues into the discussion not just because I would like to understand the matter, but also to motivate you to work towards improving this article together with me. In my role as the learner I offer to ask you questions, and you as the author could modify the article so as to answer these questions. Of course I offer to do modifications as well, as far as I am qualified. Does this sound attractive?
In this sense let me point out some further aspects. You write the proposal also has to be considered in the acceptance in the general case. I am not sure if I get you right. Isn't this considered in the section Step-by-step Instructions? This brings me to another point: the Step-by-step Instructions should contain the skipping of T states to make the resulting sequence uncorrelated.
I noticed some more points regarding the section Overview: What is the role of ergodicity in this explanation? The text says that it can be shown that ergodicity is a requirement for the process to converge asymptotically. A reference to the proof would be great. How is it ensured that the process is ergodic? Another point: the Markov matrices in the wiki-reference have finite dimensions. However, the Markov matrix for the process under discussion is infinite-dimensional; this difference should at least be mentioned. --Contribo (talk) 21:24, 3 December 2012 (UTC)

The link which is labeled "in practice" and leads to Bayesian statistics should be rephrased so as to not surprise the reader, as per WP:EASTEREGG. There should probably be some brief remark explaining what Bayesian statistics are and how they relate, but I don't know how to write such a remark. — Preceding unsigned comment added by 132.65.153.93 (talk) 18:03, 12 November 2013 (UTC)

## Interacting Metropolis-Hastings methodology

(This was in the main article; even though it has a lot of information, it does not belong in Metropolis–Hastings. It is about sequential Monte Carlo and Markov chain Monte Carlo, as the text below itself claims. I'm storing it here so it is not lost in the history.)

The central idea is to consider an interpolating sequence of target distributions ${\displaystyle P_{k}(x)~,k=0,\ldots ,n}$ of increasing sampling complexity, such that ${\displaystyle P_{n}(x)=P(x)}$. We let ${\displaystyle h_{k}(x)}$ be a function that is proportional to ${\displaystyle P_{k}(x)/P_{k-1}(x)}$ for each ${\displaystyle k\geq 1}$. In this notation, for any k we clearly have the product formula

${\displaystyle P_{k}(x)={\frac {1}{{\mathcal {Z}}_{k}}}~\left\{\prod _{1\leq l\leq k}h_{l}(x)\right\}~P_{0}(x)}$
for some normalizing constants ${\displaystyle {\mathcal {Z}}_{k}.}$ When the state space ${\textstyle S}$ is finite, for any ${\displaystyle k\geq 1}$ we have

${\displaystyle P_{k}(x)=\displaystyle {\frac {h_{k}(x)~P_{k-1}(x)}{\displaystyle \sum _{y\in S}h_{k}(y)~P_{k-1}(y)}}\quad {\mbox{and}}\quad {\frac {{\mathcal {Z}}_{k}}{{\mathcal {Z}}_{k-1}}}=\sum _{y\in S}h_{k}(y)~P_{k-1}(y)}$
For probability densities w.r.t. the Lebesgue measure dx on ${\textstyle S=\mathbb {R} ^{d}}$, for some dimension ${\displaystyle d\geq 1}$, we have

${\displaystyle P_{k}(x)=\displaystyle {\frac {h_{k}(x)~P_{k-1}(x)}{\displaystyle \int _{y\in \mathbb {R} ^{d}}h_{k}(y)~P_{k-1}(y)~dy}}\quad {\mbox{and}}\quad {\frac {{\mathcal {Z}}_{k}}{{\mathcal {Z}}_{k-1}}}=\int _{y\in \mathbb {R} ^{d}}h_{k}(y)~P_{k-1}(y)~dy}$
We assume that it is easy to sample independent random variables with common probability distribution ${\displaystyle P_{0}(x)}$, and that the functions ${\displaystyle h_{k}(x)}$ take values in ${\displaystyle [0,1]}$ and can be computed easily for any state x.

We illustrate these rather abstract models with a couple of examples:

• Boltzmann-Gibbs distributions associated with an energy function ${\displaystyle V~:~x\in S\mapsto V(x)\in [0,\infty ]}$ on some finite set ${\displaystyle S}$ and some inverse temperature parameter ${\displaystyle \beta >0}$ are defined by
${\displaystyle P(x)={\frac {1}{\mathcal {Z}}}~e^{-\beta V(x)}\quad {\mbox{with}}\quad {\mathcal {Z}}=\sum _{y\in S}~e^{-\beta V(y)}}$
In this situation, we can choose the interpolating sequence
${\displaystyle P_{k}(x)={\frac {1}{{\mathcal {Z}}_{k}}}~e^{-\beta _{k}V(x)}\quad {\mbox{with}}\quad {\mathcal {Z}}_{k}=\sum _{y\in S}~e^{-\beta _{k}V(y)}\quad {\mbox{and some parameters}}\quad \beta _{0}=0<\beta _{1}<\ldots <\beta _{n}=\beta }$
By construction, we have
${\displaystyle P_{k}(x)={\frac {1}{{\mathcal {Z}}_{k}}}~\left\{\prod _{1\leq l\leq k}h_{l}(x)\right\}~P_{0}(x)\quad {\mbox{with}}\quad h_{l}(x)=e^{-(\beta _{l}-\beta _{l-1})V(x)}}$
• Let ${\textstyle P_{0}(x)}$ be the probability density of some real valued random variable ${\displaystyle X}$. Let ${\textstyle A\subset \mathbb {R} }$ be some subset such that ${\displaystyle \mathbb {P} (X\in A)>0}$ and set
${\displaystyle P(x)={\frac {1}{\mathcal {Z}}}~1_{A}(x)~P_{0}(x)\quad {\mbox{with}}\quad {\mathcal {Z}}=\int _{y\in \mathbb {R} }~1_{A}(y)~P_{0}(y)~dy}$
with the indicator function ${\displaystyle 1_{A}(.)}$ of a set ${\textstyle A}$. In this situation, we can choose the interpolating sequence
${\displaystyle P_{k}(x)={\frac {1}{{\mathcal {Z}}_{k}}}~1_{A_{k}}(x)~P_{0}(x)\quad {\mbox{with}}\quad {\mathcal {Z}}_{k}=\int _{y\in \mathbb {R} }~1_{A_{k}}(y)~P_{0}(y)~dy>0\quad {\mbox{and some decreasing subsets}}\quad A_{0}=\mathbb {R} \supset A_{1}\supset \ldots \supset A_{n}=A}$
By construction, we have
${\displaystyle P_{k}(x)={\frac {1}{{\mathcal {Z}}_{k}}}~\left\{\prod _{1\leq l\leq k}h_{l}(x)\right\}~P_{0}(x)\quad {\mbox{with}}\quad h_{l}(x)=1_{A_{l}}(x)}$
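The Boltzmann-Gibbs example lends itself to a quick numerical check; the following Python sketch (the toy state space, energy and temperature ladder are made up for illustration) verifies the product formula above:

```python
import math

# Hypothetical toy instance of the Boltzmann-Gibbs interpolation:
# a small finite state space S, an energy V, and a temperature ladder.
S = [0, 1, 2, 3]
V = {0: 0.0, 1: 0.5, 2: 1.5, 3: 3.0}
betas = [0.0, 0.3, 0.7, 1.0]          # beta_0 = 0 < ... < beta_n = beta

def gibbs(beta):
    """Normalized Boltzmann-Gibbs distribution at inverse temperature beta."""
    w = {x: math.exp(-beta * V[x]) for x in S}
    Z = sum(w.values())
    return {x: w[x] / Z for x in S}

def h(l, x):
    """h_l(x) = exp(-(beta_l - beta_{l-1}) V(x)), proportional to P_l / P_{l-1}."""
    return math.exp(-(betas[l] - betas[l - 1]) * V[x])

# Check the product formula: prod_l h_l(x) * P_0(x), once renormalized,
# recovers P_k(x) exactly (the exponents telescope to -beta_k V(x)).
k = len(betas) - 1
P0, Pk = gibbs(betas[0]), gibbs(betas[k])
unnorm = {x: math.prod(h(l, x) for l in range(1, k + 1)) * P0[x] for x in S}
Z = sum(unnorm.values())
assert all(abs(unnorm[x] / Z - Pk[x]) < 1e-12 for x in S)
```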

For any ${\displaystyle k\geq 0}$ denote by ${\displaystyle P_{k}(x\mapsto x')}$ the transition probability of the Metropolis–Hastings algorithm with target invariant measure ${\textstyle P_{k}(x)}$. To simplify the presentation, we assume that the state space is finite.

• Initially, we sample a large number ${\displaystyle N}$ of independent random variables ${\textstyle \xi _{0}=\left(\xi _{0}^{i}\right)_{1\leq i\leq N}}$ with common probability distribution ${\displaystyle P_{0}(x)}$. By the law of large numbers, we have
${\displaystyle P_{0}^{N}(x):={\frac {1}{N}}\sum _{1\leq i\leq N}1_{\xi _{0}^{i}}(x)~\approx _{N\uparrow \infty }~P_{0}(x)\qquad {\mbox{and}}\qquad \sum _{y\in S}h_{1}(y)~P_{0}^{N}(y)={\frac {1}{N}}\sum _{1\leq i\leq N}h_{1}(\xi _{0}^{i})\approx _{N\uparrow \infty }~{\frac {{\mathcal {Z}}_{1}}{{\mathcal {Z}}_{0}}}}$
• Acceptance-rejection with recycling: The objective is to design a new population of samples ${\textstyle \xi _{0}~\longrightarrow ~{\widehat {\xi }}_{0}=\left({\widehat {\xi }}_{0}^{i}\right)_{1\leq i\leq N}}$ with distribution ${\textstyle P_{1}(x)}$. With probability ${\displaystyle h_{1}(\xi _{0}^{i})}$ we accept the state ${\displaystyle \xi _{0}^{i}}$ and we set ${\displaystyle {\widehat {\xi }}_{0}^{i}=\xi _{0}^{i}}$. Otherwise, we sample a new individual ${\displaystyle {\widetilde {\xi }}_{0}^{i}}$ from the current population with the probability measure

${\displaystyle Proba\left({\widetilde {\xi }}_{0}^{i}=\xi _{0}^{j}~|~\xi _{0}\right)={\frac {h_{1}(\xi _{0}^{j})}{\sum _{1\leq k\leq N}h_{1}(\xi _{0}^{k})}},~{\mbox{for each}}~j\in \{1,\ldots ,N\}}$
and we set ${\textstyle {\widehat {\xi }}_{0}^{i}={\widetilde {\xi }}_{0}^{i}.}$ Using the fact that
${\displaystyle \displaystyle {\frac {h_{1}(x)~P_{0}^{N}(x)}{\displaystyle \sum _{y\in S}h_{1}(y)~P_{0}^{N}(y)}}=\sum _{1\leq i\leq N}{\frac {h_{1}(\xi _{0}^{i})}{\sum _{1\leq j\leq N}h_{1}(\xi _{0}^{j})}}~1_{\xi _{0}^{i}}(x)~\approx _{N\uparrow \infty }P_{1}(x)}$
we have the estimates
${\displaystyle {\widehat {P}}_{0}^{N}(x):={\frac {1}{N}}\sum _{1\leq i\leq N}1_{{\widehat {\xi }}_{0}^{i}}(x)~\approx _{N\uparrow \infty }~P_{1}(x)}$

• Proposal-exploration: The objective is to design a new population ${\displaystyle {\widehat {\xi }}_{0}~\longrightarrow ~\xi _{1}=\left(\xi _{1}^{i}\right)_{1\leq i\leq N}}$ with distribution ${\textstyle P_{1}(x)}$ using the Metropolis-Hastings transitions ${\displaystyle P_{1}(x\mapsto x')}$. From each selected state ${\displaystyle {\widehat {\xi }}_{0}^{i}}$ we sample independently a transition
${\displaystyle {\widehat {\xi }}_{0}^{i}~\longrightarrow ~\xi _{1}^{i}~\sim ~P_{1}({\widehat {\xi }}_{0}^{i}\mapsto x')}$
By the law of large numbers, we have
${\displaystyle P_{1}^{N}(x^{\prime })={\frac {1}{N}}\sum _{1\leq i\leq N}1_{\xi _{1}^{i}}(x^{\prime })~\approx _{N\uparrow \infty }~\sum _{x\in S}P_{1}(x)~P_{1}(x\mapsto x^{\prime })=P_{1}(x^{\prime })\qquad {\mbox{and}}\qquad \sum _{y\in S}h_{2}(y)~P_{1}^{N}(y)={\frac {1}{N}}\sum _{1\leq i\leq N}h_{2}(\xi _{1}^{i})\approx _{N\uparrow \infty }~{\frac {{\mathcal {Z}}_{2}}{{\mathcal {Z}}_{1}}}}$
The next acceptance-rejection transition with recycling ${\textstyle \xi _{1}~\longrightarrow ~{\widehat {\xi }}_{1}=\left({\widehat {\xi }}_{1}^{i}\right)_{1\leq i\leq N}}$ is defined as above by replacing ${\textstyle h_{1}(.)}$ by ${\textstyle h_{2}(.)}$; and the second proposal-exploration ${\displaystyle {\widehat {\xi }}_{1}~\longrightarrow ~\xi _{2}=\left(\xi _{2}^{i}\right)_{1\leq i\leq N}}$ is defined by moving the selected states with the Metropolis-Hastings transition ${\displaystyle P_{2}(x\mapsto x')}$, and so on.
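For concreteness, a minimal Python sketch of the acceptance-rejection step with recycling described above (function and variable names are illustrative, not from the cited references):

```python
import random

def recycle_step(particles, h, rng):
    """One acceptance-rejection step with recycling, as described above.

    Each particle x is kept with probability h(x) (h takes values in [0, 1]);
    otherwise it is replaced by a particle drawn from the current population
    with probability proportional to h.
    """
    weights = [h(x) for x in particles]
    total = sum(weights)
    new = []
    for x, w in zip(particles, weights):
        if rng.random() < w:
            new.append(x)                    # accept the current state
        else:
            # Resample from the population, proportionally to h.
            u, acc = rng.random() * total, 0.0
            for y, wy in zip(particles, weights):
                acc += wy
                if u <= acc:
                    new.append(y)
                    break
    return new
```

A proposal-exploration step would then move each recycled particle independently with the relevant Metropolis-Hastings kernel.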

The interacting Metropolis-Hastings samplers defined above belong to the class of mean field interacting particle methods[1] and Feynman-Kac particle models[2][3] (a.k.a. sequential Monte Carlo methodologies[4]). In contrast with traditional Markov chain Monte Carlo methods, which rely on the stability of the Markov chain, the precision of interacting Metropolis-Hastings only depends on the size ${\displaystyle N}$ of the system, in the sense that for any time ${\displaystyle k}$ we have

${\displaystyle P_{k}^{N}(x)={\frac {1}{N}}\sum _{1\leq i\leq N}1_{\xi _{k}^{i}}(x)~\approx _{N\uparrow \infty }~P_{k}(x)}$
In addition, we have the unbiased estimates of the normalizing constants
${\displaystyle \forall 1\leq k\leq n\qquad {\frac {{\mathcal {Z}}_{k}}{{\mathcal {Z}}_{0}}}=\prod _{1\leq l\leq k}{\frac {{\mathcal {Z}}_{l}}{{\mathcal {Z}}_{l-1}}}~\approx _{N\uparrow \infty }~\prod _{1\leq l\leq k}{\frac {1}{N}}\sum _{1\leq i\leq N}h_{l}(\xi _{l-1}^{i})}$
Moreover, for any function f(.) on S we have the Feynman-Kac formulae

${\displaystyle \sum _{x\in S}~f(x)~P_{k}(x)={\frac {E\left(f(X_{k})~\prod _{1\leq l\leq k}h_{l}(X_{l-1})\right)}{E\left(\prod _{1\leq l\leq k}h_{l}(X_{l-1})\right)}}~\qquad {\mbox{and}}\qquad ~{{\mathcal {Z}}_{k}}/{{\mathcal {Z}}_{0}}=E\left(\prod _{1\leq l\leq k}h_{l}(X_{l-1})\right)}$
where ${\displaystyle X_{k}}$ stands for a non homogenous Markov chain with initial distribution ${\displaystyle P_{0}(x)}$ and Markov transitions ${\displaystyle \mathbb {P} \left(X_{k}=x'~|~X_{k-1}=x\right):=P_{k}(x\mapsto x')}$.

The analysis of the convergence of Feynman-Kac particle methods was started in 1996[5][6] and continued in 2000 in the book[3] and a series of articles.[7][8][9][10][11][12] More recent developments can be found in the books.[1][13] Applications of these Feynman-Kac methodologies to interacting Metropolis–Hastings algorithms are discussed in the series of articles.[4][14] The unbiased properties of the particle estimate of the ratio ${\displaystyle {\mathcal {Z}}_{k}/{\mathcal {Z}}_{0}}$ are proved in[5] in the context of filtering problems. Online adaptive choices of the temperature parameters and the decreasing subsets in the above examples can also be found in.[15] In operations research, the interacting MCMC algorithm associated with target distributions defined in terms of indicator functions is sometimes referred to as subset simulation.

1. ^ a b Del Moral, Pierre (2013). Mean field simulation for Monte Carlo integration. Chapman & Hall/CRC Press. p. 626. Monographs on Statistics & Applied Probability
2. ^ Del Moral, Pierre (2004). Feynman-Kac formulae. Genealogical and interacting particle approximations. Springer. p. 575. Series: Probability and Applications
3. ^ a b Del Moral, Pierre; Miclo, Laurent (2000). Branching and Interacting Particle Systems Approximations of Feynman-Kac Formulae with Applications to Non-Linear Filtering (PDF). Lecture Notes in Mathematics. 1729. pp. 1–145. doi:10.1007/bfb0103798.
4. ^ a b "Sequential Monte Carlo samplers - P. Del Moral - A. Doucet - A. Jasra - 2006 - Journal of the Royal Statistical Society: Series B (Statistical Methodology) - Wiley Online Library". onlinelibrary.wiley.com. Retrieved 2015-06-11.
5. ^ a b Del Moral, Pierre (1996). "Non Linear Filtering: Interacting Particle Solution." (PDF). Markov Processes and Related Fields. 2 (4): 555–580.
6. ^ Del Moral, Pierre (1998). "Measure Valued Processes and Interacting Particle Systems. Application to Non Linear Filtering Problems". Annals of Applied Probability (Publications du Laboratoire de Statistique et Probabilités, 96-15 (1996) ed.). 8 (2): 438–495. doi:10.1214/aoap/1028903535.
7. ^ Crisan, Dan; Del Moral, Pierre; Lyons, Terry (1999). "Discrete filtering using branching and interacting particle systems" (PDF). Markov Processes and Related Fields. 5 (3): 293–318.
8. ^ Del Moral, Pierre; Guionnet, Alice (1999). "On the stability of Measure Valued Processes with Applications to filtering". C.R. Acad. Sci. Paris. 39 (1): 429–434.
9. ^ Del Moral, Pierre; Guionnet, Alice (2001). "On the stability of interacting processes with applications to filtering and genetic algorithms". Annales de l'Institut Henri Poincaré. 37 (2): 155–194. doi:10.1016/s0246-0203(00)01064-5.
10. ^ Del Moral, P.; Guionnet, A. (1999). "Central limit theorem for nonlinear filtering and interacting particle systems". The Annals of Applied Probability. 9 (2): 275–297. doi:10.1214/aoap/1029962742. ISSN 1050-5164.
11. ^ Del Moral, Pierre; Miclo, Laurent (2001). "Genealogies and Increasing Propagation of Chaos For Feynman-Kac and Genetic Models". The Annals of Applied Probability. 11 (4): 1166–1198. doi:10.1214/aoap/1015345399. ISSN 1050-5164.
12. ^ Del Moral, Pierre; Rio, Emmanuel (2011). "Concentration inequalities for mean field particle models". The Annals of Applied Probability. 21 (3): 1017–1052. doi:10.1214/10-AAP716. ISSN 1050-5164.
13. ^ Del Moral, Pierre (2004). Feynman-Kac formulae. Genealogical and interacting particle approximations. Springer. Series: Probability and Applications. p. 556. ISBN 978-0-387-20268-6.
14. ^ Del Moral, Pierre; Hu, Peng; Wu, Liming (2012). On the Concentration Properties of Interacting Particle Processes. Hanover, MA, USA: Now Publishers Inc. ISBN 1601985126.
15. ^ Del Moral, Pierre; Doucet, Arnaud; Jasra, Ajay (2012). "On Adaptive Resampling Procedures for Sequential Monte Carlo Methods" (PDF). Bernoulli. 18 (1): 252–278. doi:10.3150/10-bej335.