# Pseudo-marginal Metropolis–Hastings algorithm

In computational statistics, the pseudo-marginal Metropolis–Hastings algorithm[1] is a Monte Carlo method for sampling from a probability distribution. It is a variant of the popular Metropolis–Hastings algorithm that extends its use to cases where the target density is not available analytically. It relies on the fact that the Metropolis–Hastings algorithm still samples from the correct target distribution if the target density in the acceptance ratio is replaced by a non-negative unbiased estimate. It is especially popular in Bayesian statistics, where it is applied when the likelihood function is intractable (see the example below).

## Algorithm description

The aim is to simulate from some probability density function ${\displaystyle \pi (\theta )}$. The algorithm follows the same steps as the standard Metropolis–Hastings algorithm except that the evaluation of the target density is replaced by a non-negative and unbiased estimate. For comparison, the main steps of a Metropolis–Hastings algorithm are outlined below.

### Metropolis–Hastings algorithm

Given a current state ${\displaystyle \theta _{n}}$ the Metropolis–Hastings algorithm proposes a new state according to some density ${\displaystyle \theta '\sim Q(\cdot \mid \theta _{n})}$. The algorithm then sets ${\displaystyle \theta _{n+1}=\theta '}$ with probability

${\displaystyle a(\theta _{n},\theta ')=\min \left(1,{\frac {\pi (\theta ')}{\pi (\theta _{n})}}{\frac {Q(\theta _{n}\mid \theta ')}{Q(\theta '\mid \theta _{n})}}\right)}$

otherwise the old state is kept, that is, ${\displaystyle \theta _{n+1}=\theta _{n}}$.
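The steps above can be sketched in code. The following is an illustrative implementation, not taken from the source: it assumes a hypothetical standard normal target (known only up to an additive constant) and a symmetric Gaussian random-walk proposal, for which the Q-ratio in the acceptance probability cancels.

```python
import math
import random

def metropolis_hastings(log_pi, n_steps, theta0, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings. The Gaussian proposal is symmetric,
    so the Q-ratio in the acceptance probability cancels and only the
    target-density ratio remains."""
    rng = random.Random(seed)
    theta = theta0
    samples = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step)
        # Accept with probability min(1, pi(proposal) / pi(theta)),
        # computed on the log scale for numerical stability.
        if math.log(rng.random()) < log_pi(proposal) - log_pi(theta):
            theta = proposal
        samples.append(theta)
    return samples

# Sample from a standard normal target; log_pi is known up to a constant.
samples = metropolis_hastings(lambda t: -0.5 * t * t, 20000, theta0=0.0)
```

Because the acceptance ratio only involves a ratio of target densities, the normalising constant of the target never needs to be known.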

### Pseudo-marginal Metropolis–Hastings algorithm

If the density ${\displaystyle \pi }$ is not available analytically, the above algorithm cannot be employed. The pseudo-marginal Metropolis–Hastings algorithm instead assumes only the existence of a non-negative estimator ${\displaystyle {\hat {\pi }}_{\theta }}$ that is unbiased, ${\displaystyle \mathbb {E} [{\hat {\pi }}_{\theta }]=\pi (\theta ).}$ Given ${\displaystyle \theta _{n}}$ and the corresponding estimate ${\displaystyle {\hat {\pi }}_{\theta _{n}}}$, the algorithm proposes a new state according to some density ${\displaystyle \theta '\sim Q(\cdot \mid \theta _{n})}$. Next, compute an estimate ${\displaystyle {\hat {\pi }}_{\theta '}}$ and set ${\displaystyle \theta _{n+1}=\theta '}$ with probability

${\displaystyle a(\theta _{n},\theta ')=\min \left(1,{\frac {{\hat {\pi }}_{\theta '}}{{\hat {\pi }}_{\theta _{n}}}}{\frac {Q(\theta _{n}\mid \theta ')}{Q(\theta '\mid \theta _{n})}}\right)}$

otherwise the old state is kept, that is, ${\displaystyle \theta _{n+1}=\theta _{n}}$.
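The modification can be sketched as follows (again an illustrative implementation, with a hypothetical noisy target: the estimator multiplies a standard normal density by lognormal noise with mean one, so it is non-negative and unbiased). One detail worth emphasising is that the estimate for the current state is stored and reused across iterations; only the proposed state receives a fresh estimate.

```python
import math
import random

def pseudo_marginal_mh(log_pi_hat, n_steps, theta0, step=1.0, seed=1):
    """Pseudo-marginal Metropolis-Hastings: pi in the acceptance ratio is
    replaced by a non-negative unbiased estimate. The estimate attached to
    the current state is recycled, never recomputed in place; this is what
    keeps the invariant distribution exactly pi."""
    rng = random.Random(seed)
    theta, log_est = theta0, log_pi_hat(theta0, rng)
    samples = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step)   # symmetric proposal: Q-ratio cancels
        log_est_prop = log_pi_hat(proposal, rng)  # fresh estimate for the proposal only
        if math.log(rng.random()) < log_est_prop - log_est:
            theta, log_est = proposal, log_est_prop
        samples.append(theta)
    return samples

# Hypothetical noisy target: pi_hat = pi(theta) * W with W lognormal and
# E[W] = 1, so the estimator is non-negative and unbiased for a N(0, 1) target.
def log_pi_hat(theta, rng, sigma=0.5):
    log_w = rng.gauss(-0.5 * sigma ** 2, sigma)  # chosen so E[exp(log_w)] = 1
    return -0.5 * theta * theta + log_w

samples = pseudo_marginal_mh(log_pi_hat, 20000, theta0=0.0)
```

Despite the noise in the acceptance ratio, the chain leaves the exact target invariant; noisier estimators only slow the mixing.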

## Application to Bayesian statistics

In Bayesian statistics the target of inference is the posterior distribution

${\displaystyle p(\theta \mid y)={\frac {p_{\theta }(y)p(\theta )}{p(y)}},}$

where ${\displaystyle p_{\theta }}$ denotes the likelihood function, ${\displaystyle p(\theta )}$ is the prior and ${\displaystyle p(y)}$ is the marginal likelihood (the prior predictive density evaluated at the observed data). Since the posterior is often not available in closed form, one relies on Monte Carlo methods to sample from it instead. These methods typically require the likelihood ${\displaystyle p_{\theta }(y)}$ to be evaluable at every parameter value ${\displaystyle \theta }$. In some cases, however, the likelihood has no analytic expression. An example of such a case is outlined below.

### Example: Latent variable model[1][2]

Consider a model consisting of i.i.d. latent real-valued random variables ${\displaystyle Z_{1},\ldots ,Z_{n}}$ with ${\displaystyle Z_{i}\sim f_{\theta }(\cdot )}$ and suppose these variables can only be observed through some additional noise ${\displaystyle Y_{i}\mid Z_{i}=z\sim g_{\theta }(\cdot \mid z)}$ for some conditional density ${\displaystyle g_{\theta }}$. (This could be due to measurement error, for instance.) We are interested in Bayesian analysis of this model based on some observed data ${\displaystyle y_{1},\ldots ,y_{n}}$, and therefore introduce a prior distribution ${\displaystyle p(\theta )}$ on the parameter. In order to compute the posterior distribution

${\displaystyle p(\theta \mid y_{1},\ldots ,y_{n})\propto p_{\theta }(y_{1},\ldots ,y_{n})p(\theta )}$

we need to find the likelihood function ${\displaystyle p_{\theta }(y_{1},\ldots ,y_{n})}$. The likelihood contribution of any observed data point ${\displaystyle y}$ is then

${\displaystyle p_{\theta }(y)=\int g_{\theta }(y\mid z)f_{\theta }(z)\,dz}$

and the joint likelihood of the observed data ${\displaystyle y_{1},\ldots ,y_{n}}$ is

${\displaystyle p_{\theta }(y_{1},\ldots ,y_{n})=\prod _{i=1}^{n}p_{\theta }(y_{i})=\prod _{i=1}^{n}\int g_{\theta }(y_{i}\mid z_{i})f_{\theta }(z_{i})\,dz_{i}.}$

If the integral on the right-hand side is not analytically available, importance sampling can be used to estimate the likelihood. Introduce an auxiliary distribution ${\displaystyle q}$ such that ${\displaystyle g_{\theta }(y\mid z)f_{\theta }(z)>0\Rightarrow q(z)>0}$ for all ${\displaystyle z}$. Then

${\displaystyle {\hat {p}}_{\theta }(y_{i})={\frac {1}{N}}\sum _{k=1}^{N}{\frac {g_{\theta }(y_{i}\mid Z_{k})f_{\theta }(Z_{k})}{q(Z_{k})}},\qquad Z_{k}{\overset {i.i.d.}{\sim }}q(\cdot )}$

is an unbiased estimator of ${\displaystyle p_{\theta }(y_{i})}$ and the joint likelihood can be estimated unbiasedly by

${\displaystyle {\hat {p}}_{\theta }(y_{1},\ldots ,y_{n})=\prod _{i=1}^{n}{\hat {p}}_{\theta }(y_{i})=\prod _{i=1}^{n}{\frac {1}{N}}\sum _{k=1}^{N}{\frac {g_{\theta }(y_{i}\mid Z_{i,k})f_{\theta }(Z_{i,k})}{q(Z_{i,k})}},\qquad Z_{i,k}{\overset {i.i.d.}{\sim }}q(\cdot ).}$
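This estimator can be sketched for a concrete model. The model below is hypothetical, chosen purely for illustration: ${\displaystyle Z\sim N(\theta ,1)}$, ${\displaystyle Y\mid Z=z\sim N(z,1)}$, with proposal ${\displaystyle q=N(\theta ,4)}$. For this choice the integral is available in closed form, ${\displaystyle p_{\theta }(y)=N(y;\theta ,2)}$, so the importance-sampling estimate can be checked against the exact value.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def likelihood_hat(theta, ys, N, rng=random.Random(2)):
    """Unbiased importance-sampling estimate of the joint likelihood
    prod_i p_theta(y_i) for the hypothetical latent variable model
    Z ~ N(theta, 1), Y | Z = z ~ N(z, 1), with proposal q = N(theta, 4)."""
    total = 1.0
    for y in ys:
        zs = [theta + 2.0 * rng.gauss(0.0, 1.0) for _ in range(N)]  # draws from q
        weights = [
            normal_pdf(y, z, 1.0) * normal_pdf(z, theta, 1.0)  # g_theta * f_theta
            / normal_pdf(z, theta, 4.0)                        # divided by q
            for z in zs
        ]
        total *= sum(weights) / N  # unbiased estimate of p_theta(y_i)
    return total

# Exact value for comparison in this toy model: p_theta(y) = normal_pdf(y, theta, 2.0).
```

Plugging an estimator of this form into the acceptance ratio of the pseudo-marginal algorithm above yields a chain whose invariant distribution is the exact posterior, regardless of the number of importance samples ${\displaystyle N}$.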

## References

1. ^ a b Andrieu, Christophe; Roberts, Gareth O. (2009). "The pseudo-marginal approach for efficient Monte Carlo computations". Annals of Statistics. 37 (2): 697–725 – via https://projecteuclid.org/euclid.aos/1236693147.
2. ^ "Pseudo-marginal MCMC – Building Intelligent Probabilistic Systems". hips.seas.harvard.edu. Retrieved 2018-02-08.