Marginal likelihood

A marginal likelihood is a likelihood function that has been integrated over the parameter space. In Bayesian statistics, it represents the probability of generating the observed sample when the parameters are drawn from the prior, that is, the likelihood averaged over all parameter values weighted by the prior. It is therefore often referred to as model evidence or simply evidence.

Concept

Consider a set of independent, identically distributed data points $\mathbf {X} =(x_{1},\ldots ,x_{n})$, where each $x_{i}\sim p(x\mid \theta )$ follows a probability distribution parameterized by $\theta$. Suppose that $\theta$ is itself a random variable described by a distribution, i.e. $\theta \sim p(\theta \mid \alpha )$. The marginal likelihood is then the probability $p(\mathbf {X} \mid \alpha )$ of the data with $\theta$ marginalized out (integrated out):

$$p(\mathbf {X} \mid \alpha )=\int _{\theta }p(\mathbf {X} \mid \theta )\,p(\theta \mid \alpha )\ \operatorname {d} \!\theta$$
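
As a worked illustration of this integral (a sketch under assumptions not stated in the text above), suppose the data are binary, $x_{i}\in \{0,1\}$ with $x_{i}\sim \operatorname {Bernoulli} (\theta )$, the prior is $\theta \sim \operatorname {Beta} (a,b)$ so that $\alpha =(a,b)$, and $k=\sum _{i}x_{i}$ is the number of ones. Here $a$, $b$ and $k$ are symbols introduced only for this example, and $B(\cdot ,\cdot )$ denotes the beta function. The integral then has a closed form:

$$p(\mathbf {X} \mid a,b)=\int _{0}^{1}\theta ^{k}(1-\theta )^{n-k}\,{\frac {\theta ^{a-1}(1-\theta )^{b-1}}{B(a,b)}}\ \operatorname {d} \!\theta ={\frac {B(a+k,\,b+n-k)}{B(a,b)}}$$

For a uniform prior ($a=b=1$) and $k=7$ ones among $n=10$ observations, this gives $B(8,4)/B(1,1)=1/1320\approx 7.6\times 10^{-4}$ for that particular sequence of observations.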

The above definition is phrased in the context of Bayesian statistics, in which case $p(\theta \mid \alpha )$ is called the prior density and $p(\mathbf {X} \mid \theta )$ is the likelihood. The marginal likelihood quantifies the agreement between data and prior in a geometric sense made precise in de Carvalho et al. (2019). In classical (frequentist) statistics, the concept of marginal likelihood occurs instead in the context of a joint parameter $\theta =(\psi ,\lambda )$, where $\psi$ is the actual parameter of interest and $\lambda$ is a nuisance parameter of no direct interest. If there exists a probability distribution for $\lambda$, it is often desirable to consider the likelihood function only in terms of $\psi$, by marginalizing out $\lambda$:

$${\mathcal {L}}(\psi ;\mathbf {X} )=p(\mathbf {X} \mid \psi )=\int _{\lambda }p(\mathbf {X} \mid \lambda ,\psi )\,p(\lambda \mid \psi )\ \operatorname {d} \!\lambda$$
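
The nuisance-parameter integral can be sketched numerically. The example below is illustrative only and rests on assumptions not made in the text: a single observation $x\sim {\mathcal {N}}(\psi ,1/\lambda )$, where the precision $\lambda$ is the nuisance parameter with a $\operatorname {Gamma}$ distribution (shape and rate chosen arbitrarily) that does not depend on $\psi$. All function names are hypothetical. The integral over $\lambda$ is approximated with a midpoint rule and compared with the closed form available for this particular pair of distributions (a Student-t density).

```python
import math

def normal_pdf(x, mean, precision):
    """Density of N(mean, 1/precision) at x."""
    return math.sqrt(precision / (2.0 * math.pi)) * math.exp(
        -0.5 * precision * (x - mean) ** 2
    )

def gamma_pdf(lam, shape, rate):
    """Density of Gamma(shape, rate) at lam > 0 (assumed nuisance distribution)."""
    log_density = (
        shape * math.log(rate)
        + (shape - 1.0) * math.log(lam)
        - rate * lam
        - math.lgamma(shape)
    )
    return math.exp(log_density)

def marginal_likelihood_numeric(x, psi, shape, rate, upper=60.0, steps=20_000):
    """Midpoint-rule approximation of the integral over lambda on (0, upper)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h  # midpoint of the i-th subinterval
        total += normal_pdf(x, psi, lam) * gamma_pdf(lam, shape, rate)
    return total * h

def marginal_likelihood_closed_form(x, psi, shape, rate):
    """Closed form of the same integral for this normal-gamma pair."""
    log_value = (
        shape * math.log(rate)
        + math.lgamma(shape + 0.5)
        - math.lgamma(shape)
        - 0.5 * math.log(2.0 * math.pi)
        - (shape + 0.5) * math.log(rate + 0.5 * (x - psi) ** 2)
    )
    return math.exp(log_value)

# Likelihood of psi = 0 given one observation x = 1.3, with lambda integrated out.
numeric = marginal_likelihood_numeric(1.3, 0.0, shape=2.0, rate=2.0)
closed = marginal_likelihood_closed_form(1.3, 0.0, shape=2.0, rate=2.0)
print(numeric, closed)
```

Evaluating either function over a grid of $\psi$ values traces out the marginal likelihood ${\mathcal {L}}(\psi ;x)$ as a function of the parameter of interest alone.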

Unfortunately, marginal likelihoods are generally difficult to compute. Exact solutions are known for a small class of distributions, particularly when the distribution of the marginalized-out parameter is a conjugate prior for the distribution of the data. In other cases, some kind of numerical integration or approximation method is needed: either a general method such as Gaussian quadrature or a Monte Carlo method, or a method specialized to statistical problems such as the Laplace approximation, Gibbs/Metropolis sampling, or the EM algorithm.
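
The contrast between an exact conjugate solution and a Monte Carlo estimate can be sketched as follows. This is a minimal illustration, assuming Bernoulli data with a Beta prior (a conjugate pair chosen for this example, not taken from the text); the function names are hypothetical. The Monte Carlo estimator simply averages the likelihood over draws from the prior, which is the simplest such scheme and can be inefficient when the prior and likelihood overlap poorly.

```python
import math
import random

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def exact_marginal_likelihood(k, n, a, b):
    """Closed form for a Bernoulli sequence with k ones out of n, Beta(a, b) prior."""
    return math.exp(log_beta(a + k, b + n - k) - log_beta(a, b))

def monte_carlo_marginal_likelihood(k, n, a, b, num_samples=100_000, seed=0):
    """Average the likelihood over parameter values drawn from the prior."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        theta = rng.betavariate(a, b)  # draw theta from the prior
        total += theta**k * (1.0 - theta) ** (n - k)  # likelihood of the data
    return total / num_samples

# Assumed example: 7 ones among 10 observations, uniform Beta(1, 1) prior.
exact = exact_marginal_likelihood(7, 10, 1.0, 1.0)
estimate = monte_carlo_marginal_likelihood(7, 10, 1.0, 1.0)
print(exact, estimate)
```

In this conjugate case the exact value equals $1/1320$, so the Monte Carlo output can be checked directly; for non-conjugate models no such reference value is generally available.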

It is also possible to apply the above considerations to a single random variable (data point) $x$ , rather than a set of observations. In a Bayesian context, this is equivalent to the prior predictive distribution of a data point.
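
The single-observation case can be illustrated by simulation. The sketch below assumes a Bernoulli observation with a $\operatorname {Beta} (a,b)$ prior on its success probability (an assumption made for this example, with hypothetical function names). Drawing a parameter from the prior and then a data point from the likelihood yields samples from the prior predictive distribution, whose probability of a one is $a/(a+b)$ in this model.

```python
import random

def sample_prior_predictive(a, b, rng):
    """Draw theta from the Beta(a, b) prior, then one Bernoulli(theta) data point."""
    theta = rng.betavariate(a, b)
    return 1 if rng.random() < theta else 0

rng = random.Random(0)
a, b = 2.0, 3.0  # assumed prior hyperparameters
draws = [sample_prior_predictive(a, b, rng) for _ in range(50_000)]

frequency = sum(draws) / len(draws)  # simulated estimate of p(x = 1 | a, b)
closed_form = a / (a + b)            # marginal likelihood of x = 1 in this model
print(frequency, closed_form)
```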

Applications

Bayesian model comparison

In Bayesian model comparison, the marginalized variables $\theta$ are the parameters of a particular type of model, and the remaining variable $M$ is the identity of the model itself. In this case, the marginal likelihood is the probability of the data given the model type, not assuming any particular values of the model parameters. The marginal likelihood for the model $M$ is

$$p(\mathbf {X} \mid M)=\int p(\mathbf {X} \mid \theta ,M)\,p(\theta \mid M)\,\operatorname {d} \!\theta$$

It is in this context that the term model evidence is normally used. This quantity is important because the posterior odds ratio for a model $M_{1}$ against another model $M_{2}$ involves a ratio of marginal likelihoods, the so-called Bayes factor:

$${\frac {p(M_{1}\mid \mathbf {X} )}{p(M_{2}\mid \mathbf {X} )}}={\frac {p(M_{1})}{p(M_{2})}}\,{\frac {p(\mathbf {X} \mid M_{1})}{p(\mathbf {X} \mid M_{2})}}$$

which can be stated schematically as

posterior odds = prior odds × Bayes factor
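
The relation above can be made concrete with a small numerical sketch. The setup is an assumption made for illustration: binary data with 7 ones among 10 observations, a model $M_{1}$ that fixes the success probability at 0.5 (so there is nothing to integrate), and a model $M_{2}$ that places a uniform $\operatorname {Beta} (1,1)$ prior on it. Function names are hypothetical, and equal prior probabilities for the two models are assumed.

```python
import math

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def evidence_fixed_theta(k, n, theta):
    """Marginal likelihood when the model fixes theta: just the likelihood."""
    return theta**k * (1.0 - theta) ** (n - k)

def evidence_beta_prior(k, n, a, b):
    """Marginal likelihood of a Bernoulli sequence under a Beta(a, b) prior."""
    return math.exp(log_beta(a + k, b + n - k) - log_beta(a, b))

k, n = 7, 10  # assumed data: 7 ones out of 10 observations

evidence_m1 = evidence_fixed_theta(k, n, 0.5)      # p(X | M1)
evidence_m2 = evidence_beta_prior(k, n, 1.0, 1.0)  # p(X | M2)

bayes_factor = evidence_m1 / evidence_m2
prior_odds = 1.0  # assumed: both models equally probable a priori
posterior_odds = prior_odds * bayes_factor
posterior_prob_m1 = posterior_odds / (1.0 + posterior_odds)

print(bayes_factor, posterior_odds, posterior_prob_m1)
```

With these assumed numbers the two evidences are $1/1024$ and $1/1320$, so the Bayes factor is $1320/1024\approx 1.29$, a slight preference for $M_{1}$ even though 7 of the 10 observations are ones: the flexible model $M_{2}$ spreads its prior probability over many parameter values that fit the data poorly.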

References

Bos, Charles S. (2002). "A comparison of marginal likelihood computation methods". In Härdle, W.; Ronz, B. (eds.). COMPSTAT 2002: Proceedings in Computational Statistics. pp. 111–117. (Available as a preprint on the web.)
de Carvalho, Miguel; Page, Garritt; Barney, Bradley (2019). "On the geometry of Bayesian inference". Bayesian Analysis. 14 (4): 1013–1036. (Available as a preprint on the web.)
Lambert, Ben (2018). "The devil is in the denominator". A Student's Guide to Bayesian Statistics. Sage. pp. 109–120. ISBN 978-1-4739-1636-4.
MacKay, David J.C. Information Theory, Inference, and Learning Algorithms. (Online textbook.)
