An expectation-maximization (EM) algorithm is used in statistics for finding maximum likelihood estimates of parameters in probabilistic models, where the model depends on unobserved latent variables. EM alternates between performing an expectation (E) step, which computes an expectation of the likelihood by including the latent variables as if they were observed, and a maximization (M) step, which computes the maximum likelihood estimates of the parameters by maximizing the expected likelihood found on the E step. The parameters found on the M step are then used to begin another E step, and the process is repeated.
In psychometrics, EM is almost indispensable for estimating item parameters and latent abilities of item response theory models.
Specification of the EM procedure
Let $\mathbf{y}$ denote the incomplete data consisting of values of observable variables, and let $\mathbf{z}$ denote the missing data. Together, $\mathbf{y}$ and $\mathbf{z}$ form the complete data. $\mathbf{z}$ can either be actual missing measurements or a hidden variable that would make the problem easier if its value were known. For instance, in mixture models, the likelihood formula would be much more convenient if the mixture components that "generated" the samples were known (see the example below).
Denote by $p(\mathbf{y}, \mathbf{z} \mid \theta)$ the joint distribution of the complete data given a parameter vector $\theta$; viewed as a function of $\theta$, this is the complete-data likelihood. The conditional distribution of the missing data given the observations can then be expressed as

$$p(\mathbf{z} \mid \mathbf{y}, \theta) = \frac{p(\mathbf{y}, \mathbf{z} \mid \theta)}{p(\mathbf{y} \mid \theta)} = \frac{p(\mathbf{y} \mid \mathbf{z}, \theta)\, p(\mathbf{z} \mid \theta)}{\int p(\mathbf{y} \mid \hat{\mathbf{z}}, \theta)\, p(\hat{\mathbf{z}} \mid \theta)\, d\hat{\mathbf{z}}}$$

when using the Bayes rule and the law of total probability. This formulation only requires knowledge of the observation likelihood given the unobservable data, $p(\mathbf{y} \mid \mathbf{z}, \theta)$, as well as the probability of the unobservable data, $p(\mathbf{z} \mid \theta)$.
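As a small illustration of this computation, the following Python sketch evaluates the posterior $p(\mathbf{z} \mid \mathbf{y}, \theta)$ for a discrete hidden variable with two possible values. The Gaussian observation likelihoods, the prior over $\mathbf{z}$, and the numerical values are assumptions made purely for the example, not part of the algorithm's specification.

```python
import numpy as np

def posterior_z(y, means, variances, prior_z):
    """p(z | y, theta) = p(y | z, theta) p(z | theta) / sum_z' p(y | z', theta) p(z' | theta)."""
    # Observation likelihood p(y | z, theta): here an assumed 1-D Gaussian per value of z.
    likelihood = np.exp(-0.5 * (y - means) ** 2 / variances) / np.sqrt(2 * np.pi * variances)
    joint = likelihood * prior_z          # p(y, z | theta) for each value of z
    return joint / joint.sum()            # divide by p(y | theta) (law of total probability)

# Two possible latent values with assumed parameters.
print(posterior_z(y=1.0,
                  means=np.array([0.0, 3.0]),
                  variances=np.array([1.0, 1.0]),
                  prior_z=np.array([0.5, 0.5])))
```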
Maximize expected log-likelihood for the complete dataset
An EM algorithm will then iteratively improve an initial estimate $\theta_0$ and construct new estimates $\theta_1, \theta_2, \ldots, \theta_n, \ldots$.
An individual re-estimation step that derives $\theta_{n+1}$ from $\theta_n$ takes the following form:

$$\theta_{n+1} = \arg\max_{\theta}\; \mathrm{E}_{\mathbf{z}}\!\left[ \log p(\mathbf{y}, \mathbf{z} \mid \theta) \,\middle|\, \mathbf{y}, \theta_n \right],$$
where $\mathrm{E}_{\mathbf{z}}[\cdot]$ denotes the conditional expectation of $\log p(\mathbf{y}, \mathbf{z} \mid \theta)$, taken with the parameter of the conditional distribution of $\mathbf{z}$ given $\mathbf{y}$ fixed at $\theta_n$. The log-likelihood $\log p(\mathbf{y}, \mathbf{z} \mid \theta)$ is often used instead of the true likelihood $p(\mathbf{y}, \mathbf{z} \mid \theta)$ because it leads to easier formulas but still attains its maximum at the same point as the likelihood.
In other words, $\theta_{n+1}$ is the value that maximizes (M) the conditional expectation (E) of the complete-data log-likelihood given the observed variables under the previous parameter value.
This expectation is usually denoted $Q(\theta)$. In the continuous case, it is given by

$$Q(\theta) = \mathrm{E}_{\mathbf{z}}\!\left[ \log p(\mathbf{y}, \mathbf{z} \mid \theta) \,\middle|\, \mathbf{y}, \theta_n \right] = \int_{-\infty}^{\infty} p(\mathbf{z} \mid \mathbf{y}, \theta_n)\, \log p(\mathbf{y}, \mathbf{z} \mid \theta)\, d\mathbf{z}.$$
Speaking of an expectation (E) step is a bit of a misnomer.
What is calculated in the first step are the fixed, data-dependent parameters of the function Q.
Once the parameters of Q are known, it is fully determined and is maximized in the second (M) step of an EM algorithm.
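For concreteness, here is a short Python sketch of this iteration on the well-known multinomial example discussed in Dempster, Laird and Rubin (1977): observed counts $(125, 18, 20, 34)$ with cell probabilities $(1/2 + t/4,\ (1-t)/4,\ (1-t)/4,\ t/4)$, where the first cell is viewed as the sum of two latent cells with probabilities $1/2$ and $t/4$. This is only a sketch under those assumptions; the E step fixes Q's data-dependent parameter (the expected latent count) and the M step maximizes Q in closed form.

```python
# Observed counts from the multinomial example in Dempster, Laird & Rubin (1977).
y1, y2, y3, y4 = 125, 18, 20, 34

def em_linkage(t, n_iter=20):
    for _ in range(n_iter):
        # E step: expected latent part of the first cell, E[z | y, t_n],
        # under a binomial split of y1 into probabilities 1/2 and t/4.
        z = y1 * (t / 4) / (1 / 2 + t / 4)
        # M step: maximize the expected complete-data log-likelihood in t;
        # for the multinomial this is a closed-form proportion.
        t = (z + y4) / (z + y2 + y3 + y4)
    return t

print(em_linkage(t=0.5))  # converges to roughly 0.6268
```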
Properties
It can be shown that an EM iteration does not decrease the observed data likelihood function. However, there is no guarantee that the sequence converges to a maximum likelihood estimator. For multimodal distributions, this means that an EM algorithm will converge to a local maximum (or saddle point) of the observed data likelihood function, depending on the starting values. There are a variety of heuristic approaches for escaping a local maximum, such as restarting from several different random initial estimates $\theta_0$, or applying simulated annealing.
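The random-restart heuristic can be sketched as follows. The model (a one-dimensional two-component Gaussian mixture with known unit variances and equal weights) and the synthetic data are illustrative assumptions; the point is only that each run is scored by the observed-data log-likelihood and the best run is kept.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data from two well-separated components (illustrative only).
y = np.concatenate([rng.normal(-4.0, 1.0, 150), rng.normal(4.0, 1.0, 150)])

def observed_loglik(y, mu):
    """Observed-data log-likelihood under equal weights and unit variances."""
    dens = np.exp(-0.5 * (y[:, None] - mu[None, :]) ** 2) / np.sqrt(2 * np.pi)
    return np.log(0.5 * dens.sum(axis=1)).sum()

def run_em(y, mu, n_iter=100):
    """EM for the two component means only (variances and weights held fixed)."""
    for _ in range(n_iter):
        dens = np.exp(-0.5 * (y[:, None] - mu[None, :]) ** 2)
        resp = dens / dens.sum(axis=1, keepdims=True)            # E step
        mu = (resp * y[:, None]).sum(axis=0) / resp.sum(axis=0)  # M step
    return mu

# Several random initial estimates; keep the run with the highest observed-data likelihood.
candidates = [run_em(y, rng.normal(0.0, 5.0, size=2)) for _ in range(10)]
best = max(candidates, key=lambda mu: observed_loglik(y, mu))
print(best)
```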
EM is particularly useful when maximum likelihood estimation of a complete data model is easy.
If closed-form estimators exist, the M step is often trivial.
A classic example is maximum likelihood estimation of a finite mixture of Gaussians, where each component of the mixture can be estimated trivially if the mixing distribution is known.
"Expectation-maximization" is a description of a class of related algorithms, not a specific algorithm; EM is a recipe or meta-algorithm which is used to devise particular algorithms.
The Baum-Welch algorithm is an example of an EM algorithm applied to hidden Markov models.
Another example is the EM algorithm for fitting a mixture density model.
An EM algorithm can also find maximum a posteriori (MAP) estimates, by performing MAP estimation in the M step, rather than maximum likelihood.
There are other methods for finding maximum likelihood estimates, such as gradient descent, conjugate gradient or variations of the Gauss-Newton method. Unlike EM, such methods typically require the evaluation of first and/or second derivatives of the likelihood function.
Incremental versions
The classic EM procedure is to replace both Q and θ with their optimal possible (argmax) values at each iteration. However, it can be shown (see Neal & Hinton, 1999) that simply finding Q and θ to give some improvement over their current values will also ensure successful convergence.
For example, to improve Q, we could restrict the space of possible functions to a computationally simple distribution such as a factorial distribution,

$$Q = \prod_i Q_i.$$
Thus at each E step we compute the variational approximation of Q.
To improve θ, we could use any hill-climbing method, and not worry about finding the optimal θ, just some improvement. This method is also known as Generalized EM (GEM).
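A minimal sketch of such a GEM update, reusing the multinomial toy example given earlier: instead of maximizing Q exactly, the M step takes a single gradient-ascent step on Q, which is enough to improve it. The step size is an arbitrary illustrative choice.

```python
# Same assumed counts as in the earlier multinomial sketch.
y1, y2, y3, y4 = 125, 18, 20, 34

def gem_linkage(t, n_iter=300, step=0.001):
    for _ in range(n_iter):
        # E step: the expected latent count fixes Q's data-dependent parameter.
        z = y1 * (t / 4) / (1 / 2 + t / 4)
        # Partial M step: Q(t) = (z + y4) log t + (y2 + y3) log(1 - t) + const,
        # so take one gradient-ascent step on Q instead of the exact argmax.
        grad = (z + y4) / t - (y2 + y3) / (1 - t)
        t += step * grad
    return t

print(gem_linkage(t=0.5))  # approaches the same fixed point, roughly 0.6268
```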
Relation to variational Bayes methods
EM is a partially non-Bayesian, maximum likelihood method. Its final result gives a probability distribution over the latent variables (in the Bayesian style) together with a point estimate for θ (either a maximum likelihood estimate or a posterior mode). We may want a fully Bayesian version of this, giving a probability distribution over θ as well as the latent variables. In fact the Bayesian approach to inference is simply to treat θ as another latent variable. In this paradigm, the distinction between the E and M steps disappears. If we use the factorized Q approximation as described above (variational Bayes), we may iterate over each latent variable (now including θ) and optimize them one at a time. There are now k steps per iteration, where k is the number of latent variables. For graphical models this is easy to do as each variable's new Q depends only on its Markov blanket, so local message passing can be used for efficient inference.
Example: Gaussian mixture
Assume that the samples $x_1, \ldots, x_m$, where $x_j \in \mathbb{R}^d$, are drawn from a mixture of $n$ Gaussians, such that the density of component $i$ is

$$p(x_j \mid \mu_i, \Sigma_i) = \mathcal{N}(x_j;\, \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\!\Big(\!-\tfrac{1}{2}(x_j - \mu_i)^\top \Sigma_i^{-1} (x_j - \mu_i)\Big).$$
The model you are trying to estimate is the parameter vector

$$\theta = \big(\mu_1, \Sigma_1, P(1),\ \ldots,\ \mu_n, \Sigma_n, P(n)\big),$$

where $P(i)$ is the class probability (mixing weight) of component $i$.
E-step:
Estimate, for each unobserved event (which Gaussian generated sample $x_j$), its probability conditioned on the observation, using the parameter values from the last maximization step:

$$p(z_{ij}) \equiv P(i \mid x_j, \theta^t) = \frac{p(x_j \mid \mu_i^t, \Sigma_i^t)\, P^t(i)}{\sum_{k=1}^{n} p(x_j \mid \mu_k^t, \Sigma_k^t)\, P^t(k)}.$$
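In numpy, this E step amounts to normalizing the class-weighted component densities over the components. The sketch below makes that explicit; the sample values and current parameters are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density N(x; mu, Sigma), evaluated for each row of X."""
    d = X.shape[1]
    diff = X - mu
    quad = np.einsum('jk,kl,jl->j', diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def e_step(X, mus, Sigmas, weights):
    """Return the responsibilities p(z_ij), shape (n_components, n_samples)."""
    joint = np.array([w * gaussian_pdf(X, mu, S)
                      for mu, S, w in zip(mus, Sigmas, weights)])
    return joint / joint.sum(axis=0, keepdims=True)   # normalize over components

# Illustrative 2-D samples and current parameter values theta^t.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
resp = e_step(X,
              mus=[np.array([0.0, 0.0]), np.array([1.0, 1.0])],
              Sigmas=[np.eye(2), np.eye(2)],
              weights=[0.5, 0.5])
print(resp.sum(axis=0))  # each column sums to 1
```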
M-step:
You want to maximize the expected log-likelihood of the joint event:

$$Q(\theta) = \mathrm{E}_{z}\!\left[ \log \prod_{j=1}^{m} p(x_j, z_j \mid \theta) \,\middle|\, x_1, \ldots, x_m, \theta^t \right] = \sum_{j=1}^{m} \sum_{i=1}^{n} p(z_{ij}) \log\!\big( p(x_j \mid \mu_i, \Sigma_i)\, P(i) \big).$$
If we expand the probability of the joint event, we get

$$Q(\theta) = \sum_{j=1}^{m} \sum_{i=1}^{n} p(z_{ij}) \left( \log P(i) - \tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma_i| - \tfrac{1}{2}(x_j - \mu_i)^\top \Sigma_i^{-1} (x_j - \mu_i) \right).$$
To find the new estimate $\theta^{t+1}$, you find a maximum where $\partial Q(\theta)/\partial \theta = 0$.
New estimate for the mean (using some differentiation rules from matrix calculus):

$$\mu_i^{t+1} = \frac{\sum_{j=1}^{m} p(z_{ij})\, x_j}{\sum_{j=1}^{m} p(z_{ij})}.$$
New estimate for the covariance:

$$\Sigma_i^{t+1} = \frac{\sum_{j=1}^{m} p(z_{ij})\, \big(x_j - \mu_i^{t+1}\big)\big(x_j - \mu_i^{t+1}\big)^{\top}}{\sum_{j=1}^{m} p(z_{ij})}.$$
New estimate for the class probability: the constraint $\sum_{i=1}^{n} P(i) = 1$ is handled with a Lagrange multiplier $\lambda$, i.e. we maximize $Q(\theta) + \lambda\big(\sum_{i=1}^{n} P(i) - 1\big)$. Setting the derivative with respect to $P(i)$ to zero gives

$$\frac{\sum_{j=1}^{m} p(z_{ij})}{P(i)} + \lambda = 0 \quad\Longrightarrow\quad P(i) = -\frac{1}{\lambda}\sum_{j=1}^{m} p(z_{ij}).$$
Inserting into the constraint:

$$\sum_{i=1}^{n} P(i) = -\frac{1}{\lambda}\sum_{i=1}^{n}\sum_{j=1}^{m} p(z_{ij}) = -\frac{m}{\lambda} = 1 \quad\Longrightarrow\quad \lambda = -m.$$
Inserting into our estimate:

$$P^{t+1}(i) = \frac{1}{m}\sum_{j=1}^{m} p(z_{ij}).$$
These estimates now become our $\theta^{t+1}$, to be used in the next estimation step.
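A self-contained numpy sketch of the whole procedure, implementing the update formulas above, might look as follows. The synthetic data, the initial values, and the iteration count are illustrative choices; a practical implementation would also evaluate the densities in log space and guard against near-singular covariances.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density N(x; mu, Sigma) for each row of X."""
    d = X.shape[1]
    diff = X - mu
    quad = np.einsum('jk,kl,jl->j', diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def em_gmm(X, mus, Sigmas, P, n_iter=100):
    m, d = X.shape
    for _ in range(n_iter):
        # E step: responsibilities p(z_ij), shape (n_components, m).
        joint = np.array([P[i] * gaussian_pdf(X, mus[i], Sigmas[i])
                          for i in range(len(P))])
        resp = joint / joint.sum(axis=0, keepdims=True)
        # M step: the closed-form updates derived above.
        Nk = resp.sum(axis=1)                               # sum_j p(z_ij)
        mus = [resp[i] @ X / Nk[i] for i in range(len(P))]  # mu_i^{t+1}
        Sigmas = [(resp[i][:, None] * (X - mus[i])).T @ (X - mus[i]) / Nk[i]
                  for i in range(len(P))]                   # Sigma_i^{t+1}
        P = Nk / m                                          # P^{t+1}(i)
    return mus, Sigmas, P

# Synthetic 2-D data from two Gaussians, then EM from rough initial guesses.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-3, 0], 1.0, size=(200, 2)),
               rng.normal([3, 2], 1.0, size=(200, 2))])
mus, Sigmas, P = em_gmm(X,
                        mus=[np.array([-1.0, 0.0]), np.array([1.0, 0.0])],
                        Sigmas=[np.eye(2), np.eye(2)],
                        P=np.array([0.5, 0.5]))
print(np.round(mus, 2), np.round(P, 2))
```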
References
Arthur Dempster, Nan Laird, and Donald Rubin. "Maximum likelihood from incomplete data via the EM algorithm". Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
Robert Hogg, Joseph McKean and Allen Craig. Introduction to Mathematical Statistics, pp. 359–364. Upper Saddle River, NJ: Pearson Prentice Hall, 2005.
Radford Neal, Geoffrey Hinton. "A view of the EM algorithm that justifies incremental, sparse, and other variants". In Michael I. Jordan (editor), Learning in Graphical Models, pp. 355–368. Cambridge, MA: MIT Press, 1999.