# Variational autoencoder

In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling. It belongs to the families of probabilistic graphical models and variational Bayesian methods.

Variational autoencoders are often associated with the autoencoder model because of their architectural affinity, but they differ significantly in goal and mathematical formulation. Variational autoencoders allow statistical inference problems (such as inferring the value of one random variable from another random variable) to be rewritten as statistical optimization problems (i.e. finding the parameter values that minimize some objective function). They are meant to map the input variable to a multivariate latent distribution. Although this type of model was initially designed for unsupervised learning, its effectiveness has been proven for semi-supervised learning and supervised learning.

## Architecture

In a VAE the input data is sampled from a parametrized distribution (the prior, in Bayesian inference terms), and the encoder and decoder are trained jointly such that the output minimizes a reconstruction error in the sense of the Kullback–Leibler divergence between the true posterior and its parametric approximation (the so-called "variational posterior").

## Formulation

The basic scheme of a variational autoencoder: the model receives $x$ as input. The encoder compresses it into the latent space. The decoder receives as input the information sampled from the latent space and produces $x'$ as similar as possible to $x$.

From a formal perspective, given an input dataset $x$ characterized by an unknown probability distribution $P(x)$ , the objective is to model or approximate the data's true distribution $P$ using a parametrized distribution $p_{\theta }$ having parameters $\theta$ . Let $z$ be a random vector jointly-distributed with $x$ . Conceptually $z$ will represent a latent encoding of $x$ . Marginalizing over $z$ gives

$p_{\theta }(x)=\int _{z}p_{\theta }({x,z})\,dz,$ where $p_{\theta }({x,z})$ represents the joint distribution under $p_{\theta }$ of the observable data $x$ and its latent representation or encoding $z$ . According to the chain rule, the equation can be rewritten as

$p_{\theta }(x)=\int _{z}p_{\theta }({x|z})p_{\theta }(z)\,dz$ In the vanilla variational autoencoder, $z$ is usually taken to be a finite-dimensional vector of real numbers, and $p_{\theta }({x|z})$ to be a Gaussian distribution. Then $p_{\theta }(x)$ is a mixture of Gaussian distributions.
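This marginal can be sampled from by ancestral sampling: draw $z$ from the prior, then draw $x$ from the likelihood. A minimal sketch in NumPy, where the "decoder" is a hypothetical placeholder function standing in for a neural network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D decoder: maps a latent z to the mean of a Gaussian over x.
def decoder_mean(z):
    return 2.0 * z + 1.0  # placeholder for a neural network

sigma_x = 0.1  # fixed observation noise (an assumption of this sketch)

# Ancestral sampling from p(x) = ∫ p(x|z) p(z) dz:
# draw z ~ p(z) = N(0, 1), then x ~ p(x|z) = N(decoder_mean(z), sigma_x^2).
z = rng.standard_normal(10_000)
x = rng.normal(decoder_mean(z), sigma_x)

# The marginal is a continuous mixture of Gaussians indexed by z; its moments
# follow from the laws of total expectation and total variance.
print(x.mean(), x.var())
```

With this linear decoder the marginal has mean $1$ and variance $2^2 + \sigma_x^2 = 4.01$, which the sample statistics approximate.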

It is now possible to define the set of the relationships between the input data and its latent representation as

• Prior: $p_{\theta }(z)$
• Likelihood: $p_{\theta }(x|z)$
• Posterior: $p_{\theta }(z|x)$

Unfortunately, the computation of $p_{\theta }(x)$ is expensive and in most cases intractable. To make the computation feasible, it is necessary to introduce a further function approximating the posterior distribution:

$q_{\phi }({z|x})\approx p_{\theta }({z|x})$ with $\phi$ defined as the set of real values that parametrize $q$ . This is sometimes called amortized inference, since by "investing" in finding a good $q_{\phi }$ , one can later infer $z$ from $x$ quickly without doing any integrals.
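The amortization idea can be sketched concretely: a single network evaluation maps any $x$ to the parameters of $q_{\phi}(z|x)$, with no per-datapoint optimization. The linear layer below is a hypothetical stand-in for a deep encoder; its weights play the role of $\phi$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical amortized encoder: a single linear layer mapping an input x to
# the parameters (mean, log-variance) of a diagonal Gaussian q_phi(z|x).
# In practice this would be a deep network; the weights here stand in for phi.
d_x, d_z = 4, 2
W_mu, b_mu = rng.standard_normal((d_z, d_x)), np.zeros(d_z)
W_lv, b_lv = rng.standard_normal((d_z, d_x)), np.zeros(d_z)

def encode(x):
    """Amortized inference: one forward pass yields q_phi(z|x) directly,
    replacing a per-datapoint integral or optimization."""
    mu = W_mu @ x + b_mu
    log_var = W_lv @ x + b_lv
    return mu, log_var

x = rng.standard_normal(d_x)
mu, log_var = encode(x)
print(mu.shape, log_var.shape)
```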

In this way, the problem is of finding a good probabilistic autoencoder, in which the conditional likelihood distribution $p_{\theta }(x|z)$ is computed by the probabilistic decoder, and the approximated posterior distribution $q_{\phi }(z|x)$ is computed by the probabilistic encoder.

## Evidence lower bound (ELBO)

As in every deep learning problem, it is necessary to define a differentiable loss function in order to update the network weights through backpropagation.

For variational autoencoders, the idea is to jointly optimize the generative model parameters $\theta$ to reduce the reconstruction error between the input and the output, and $\phi$ to make $q_{\phi }({z|x})$ as close as possible to $p_{\theta }(z|x)$ . As reconstruction loss, mean squared error and cross entropy are often used.
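A minimal sketch of such a loss, assuming a standard-normal prior and a diagonal-Gaussian encoder (so the KL term has a closed form) and a mean-squared-error reconstruction term:

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ) -- the
    regularizer that arises when the prior p(z) is a standard Gaussian."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def neg_elbo(x, x_recon, mu, log_var):
    """Negative ELBO with a mean-squared-error reconstruction term
    (a Gaussian likelihood with fixed unit variance, up to constants)."""
    recon = 0.5 * np.sum((x - x_recon) ** 2)
    return recon + gaussian_kl(mu, log_var)

# If q already equals the prior and reconstruction is perfect, the loss is 0.
x = np.array([0.5, -1.0])
print(neg_elbo(x, x, np.zeros(2), np.zeros(2)))  # → 0.0
```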

As distance loss between the two distributions the reverse Kullback–Leibler divergence $D_{KL}(q_{\phi }({z|x})\parallel p_{\theta }({z|x}))$ is a good choice to squeeze $q_{\phi }({z|x})$ under $p_{\theta }(z|x)$ .

The distance loss just defined is expanded as

${\begin{aligned}D_{KL}(q_{\phi }({z|x})\parallel p_{\theta }({z|x}))&=\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\ln {\frac {q_{\phi }(z|x)}{p_{\theta }(z|x)}}\right]\\&=\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\ln {\frac {q_{\phi }({z|x})p_{\theta }(x)}{p_{\theta }(x,z)}}\right]\\&=\ln p_{\theta }(x)+\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\ln {\frac {q_{\phi }({z|x})}{p_{\theta }(x,z)}}\right]\end{aligned}}$

Now define the evidence lower bound (ELBO):

$L_{\theta ,\phi }(x):=\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\ln {\frac {p_{\theta }(x,z)}{q_{\phi }({z|x})}}\right]=\ln p_{\theta }(x)-D_{KL}(q_{\phi }({\cdot |x})\parallel p_{\theta }({\cdot |x}))$ Maximizing the ELBO
$\theta ^{*},\phi ^{*}={\underset {\theta ,\phi }{\operatorname {argmax} }}\,L_{\theta ,\phi }(x)$ is equivalent to simultaneously maximizing $\ln p_{\theta }(x)$ and minimizing $D_{KL}(q_{\phi }({z|x})\parallel p_{\theta }({z|x}))$ . That is, maximizing the log-likelihood of the observed data, and minimizing the divergence of the approximate posterior $q_{\phi }(\cdot |x)$ from the exact posterior $p_{\theta }(\cdot |x)$ .

For a more detailed derivation and further interpretations of the ELBO and its maximization, see the main article on the evidence lower bound.

## Reparameterization

The scheme of the reparameterization trick: the randomness variable $\varepsilon$ is injected into the latent space $z$ as an external input. In this way, it is possible to backpropagate the gradient without involving a stochastic variable in the update.

To efficiently search for

$\theta ^{*},\phi ^{*}={\underset {\theta ,\phi }{\operatorname {argmax} }}\,L_{\theta ,\phi }(x)$ the typical method is gradient descent.

It is straightforward to find

$\nabla _{\theta }\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\ln {\frac {p_{\theta }(x,z)}{q_{\phi }({z|x})}}\right]=\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\nabla _{\theta }\ln {\frac {p_{\theta }(x,z)}{q_{\phi }({z|x})}}\right]$ However,
$\nabla _{\phi }\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\ln {\frac {p_{\theta }(x,z)}{q_{\phi }({z|x})}}\right]$ does not allow one to put the $\nabla _{\phi }$ inside the expectation, since $\phi$ appears in the probability distribution itself. The reparameterization trick (also known as stochastic backpropagation) bypasses this difficulty.

The most important example is when $z\sim q_{\phi }(\cdot |x)$ is normally distributed, as ${\mathcal {N}}(\mu _{\phi }(x),\Sigma _{\phi }(x))$ .

This can be reparametrized by letting ${\boldsymbol {\varepsilon }}\sim {\mathcal {N}}(0,{\boldsymbol {I}})$ be a "standard random number generator", and constructing $z$ as $z=\mu _{\phi }(x)+L_{\phi }(x)\epsilon$ . Here, $L_{\phi }(x)$ is obtained by the Cholesky decomposition:

$\Sigma _{\phi }(x)=L_{\phi }(x)L_{\phi }(x)^{T}$ Then we have
$\nabla _{\phi }\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\ln {\frac {p_{\theta }(x,z)}{q_{\phi }({z|x})}}\right]=\mathbb {E} _{\epsilon }\left[\nabla _{\phi }\ln {\frac {p_{\theta }(x,\mu _{\phi }(x)+L_{\phi }(x)\epsilon )}{q_{\phi }(\mu _{\phi }(x)+L_{\phi }(x)\epsilon |x)}}\right]$ and so we obtained an unbiased estimator of the gradient, allowing stochastic gradient descent.
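The construction $z=\mu _{\phi }(x)+L_{\phi }(x)\epsilon$ can be checked numerically. A minimal sketch, with $\mu$ and $\Sigma$ fixed to illustrative values standing in for the encoder's output on one input $x$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters of q_phi(z|x) = N(mu, Sigma) for one fixed input x.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# Reparameterization: sample eps ~ N(0, I), then set z = mu + L eps, where
# Sigma = L L^T (Cholesky factor). All randomness lives in eps, so gradients
# with respect to mu and L (hence phi) pass through deterministically.
L = np.linalg.cholesky(Sigma)
eps = rng.standard_normal((100_000, 2))
z = mu + eps @ L.T  # each row is mu + L @ eps_i

# Empirically, z has the intended mean and covariance.
print(z.mean(axis=0))           # ~ mu
print(np.cov(z, rowvar=False))  # ~ Sigma
```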

Since we reparametrized $z$ , we need to find $q_{\phi }(z|x)$ . Let $q_{0}$ be the probability density function for $\epsilon$ , then

$\ln q_{\phi }(z|x)=\ln q_{0}(\epsilon )-\ln |\det(\partial _{\epsilon }z)|$ where $\partial _{\epsilon }z$ is the Jacobian matrix of $z$ with respect to $\epsilon$ . Since $z=\mu _{\phi }(x)+L_{\phi }(x)\epsilon$ , this is
$\ln q_{\phi }(z|x)=-{\frac {1}{2}}\|\epsilon \|^{2}-\ln |\det L_{\phi }(x)|-{\frac {n}{2}}\ln(2\pi )$

## Variations

Many applications and extensions of variational autoencoders have been developed to adapt the architecture to other domains and improve its performance.

$\beta$ -VAE is an implementation with a weighted Kullback–Leibler divergence term to automatically discover and interpret factorised latent representations. With this implementation, it is possible to force manifold disentanglement for $\beta$ values greater than one. This architecture can discover disentangled latent factors without supervision.
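In objective terms, $\beta$-VAE only rescales the KL term. A minimal sketch, reusing the closed-form Gaussian KL (standard-normal prior, diagonal-Gaussian encoder assumed):

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """beta-VAE objective: reconstruction + beta * KL. Setting beta > 1
    pressures the posterior toward the isotropic prior, which encourages
    disentangled latent factors; beta = 1 recovers the standard VAE."""
    recon = 0.5 * np.sum((x - x_recon) ** 2)
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return recon + beta * kl

# With a nonzero KL term, increasing beta strictly increases the penalty.
x = np.array([0.5, -1.0])
mu, log_var = np.array([0.3, 0.0]), np.array([-0.1, 0.2])
print(beta_vae_loss(x, x, mu, log_var, beta=1.0))
print(beta_vae_loss(x, x, mu, log_var, beta=4.0))
```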

The conditional VAE (CVAE) inserts label information in the latent space to force a deterministic constrained representation of the learned data.

Some structures directly deal with the quality of the generated samples or implement more than one latent space to further improve the representation learning.

Some architectures mix VAE and generative adversarial networks to obtain hybrid models.