Flow-based generative model

A flow-based generative model is a generative model used in machine learning that explicitly models a probability distribution by leveraging normalizing flow,^[1] which is a statistical method using the change-of-variable law of probabilities to transform a simple distribution into a complex one.

The direct modeling of likelihood provides many advantages. For example, the negative log-likelihood can be directly computed and minimized as the loss function. Additionally, novel samples can be generated by sampling from the initial distribution, and applying the flow transformation.

In contrast, many alternative generative modeling methods, such as the variational autoencoder (VAE) and the generative adversarial network (GAN), do not explicitly represent the likelihood function.

Method

Scheme for normalizing flows

Let $z_{0}$ be a (possibly multivariate) random variable with distribution $p_{0}(z_{0})$ .

For $i=1,...,K$ , let $z_{i}=f_{i}(z_{i-1})$ be a sequence of random variables transformed from $z_{0}$ . The functions $f_{1},...,f_{K}$ should be invertible, i.e. the inverse function $f_{i}^{-1}$ exists. The final output $z_{K}$ models the target distribution.

The log likelihood of $z_{K}$ is (see derivation):

\log p_{K}(z_{K})=\log p_{0}(z_{0})-\sum _{i=1}^{K}\log \left|\det {\frac {df_{i}(z_{i-1})}{dz_{i-1}}}\right|

To compute the log likelihood efficiently, each function $f_{1},...,f_{K}$ should (1) be easy to invert and (2) have a Jacobian determinant that is easy to compute. In practice, the functions $f_{1},...,f_{K}$ are modeled using deep neural networks, and are trained to minimize the negative log-likelihood of data samples from the target distribution. These architectures are usually designed such that only the forward pass of the neural network is required in both the inverse and the Jacobian determinant calculations. Examples of such architectures include NICE,^[2] RealNVP,^[3] and Glow.^[4]
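The scheme can be sketched in a few lines of NumPy. The following is a minimal illustration, not an implementation from the cited papers: the layer choices (`affine`, `sinh_layer`) and the helper `push_forward` are hypothetical names introduced here, and the base distribution is assumed to be a standard normal.

```python
import numpy as np

def affine(a, b):
    """Elementwise map z -> a*z + b (a != 0) with its log|det Jacobian|."""
    forward = lambda z: a * z + b
    # The Jacobian is a * I, so log|det| = dim * log|a|.
    log_det = lambda z: z.size * np.log(abs(a))
    return forward, log_det

def sinh_layer():
    """Elementwise map z -> sinh(z); its derivative cosh(z) is always positive."""
    forward = lambda z: np.sinh(z)
    log_det = lambda z: np.sum(np.log(np.cosh(z)))
    return forward, log_det

def log_p0(z):
    """Log-density of the standard normal base distribution (an assumption)."""
    return -0.5 * np.sum(z**2) - 0.5 * z.size * np.log(2 * np.pi)

def push_forward(z0, layers):
    """Return z_K and log p_K(z_K) for a point z0 drawn from p_0."""
    z, log_p = z0, log_p0(z0)
    for forward, log_det in layers:
        log_p -= log_det(z)   # subtract log|det df_i/dz_{i-1}| at z_{i-1}
        z = forward(z)
    return z, log_p

layers = [affine(2.0, 1.0), sinh_layer(), affine(0.5, -3.0)]
z0 = np.array([0.3, -1.2])
zK, log_pK = push_forward(z0, layers)
```

Each layer returns both the transformed point and its log-determinant, so the running sum in `push_forward` mirrors the formula term by term.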

Derivation of log likelihood

Consider $z_{1}$ and $z_{0}$ . Note that $z_{0}=f_{1}^{-1}(z_{1})$ .

By the change of variable formula, the distribution of $z_{1}$ is:

p_{1}(z_{1})=p_{0}(z_{0})\left|\det {\frac {df_{1}^{-1}(z_{1})}{dz_{1}}}\right|

Where $\det {\frac {df_{1}^{-1}(z_{1})}{dz_{1}}}$ is the determinant of the Jacobian matrix of $f_{1}^{-1}$ .

By the inverse function theorem:

p_{1}(z_{1})=p_{0}(z_{0})\left|\det \left({\frac {df_{1}(z_{0})}{dz_{0}}}\right)^{-1}\right|

By the identity $\det(A^{-1})=\det(A)^{-1}$ (where $A$ is an invertible matrix), we have:

p_{1}(z_{1})=p_{0}(z_{0})\left|\det {\frac {df_{1}(z_{0})}{dz_{0}}}\right|^{-1}

The log likelihood is thus:

\log p_{1}(z_{1})=\log p_{0}(z_{0})-\log \left|\det {\frac {df_{1}(z_{0})}{dz_{0}}}\right|

In general, the above applies to any $z_{i}$ and $z_{i-1}$ . Since $\log p_{i}(z_{i})$ equals $\log p_{i-1}(z_{i-1})$ minus a non-recursive term, it follows by induction that:

\log p_{K}(z_{K})=\log p_{0}(z_{0})-\sum _{i=1}^{K}\log \left|\det {\frac {df_{i}(z_{i-1})}{dz_{i-1}}}\right|
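As a sanity check of the single-step formula, consider a scalar affine map (an illustrative choice, not taken from the cited sources): let $f_{1}(z)=az+b$ with $a\neq 0$ , and let $p_{0}$ be the standard normal density. Then $df_{1}/dz_{0}=a$ and $z_{0}=(z_{1}-b)/a$ , so

\log p_{1}(z_{1})=-{\frac {(z_{1}-b)^{2}}{2a^{2}}}-{\frac {1}{2}}\log(2\pi )-\log |a|

which is the log-density of $N(b,a^{2})$ , as expected for an affine transformation of a standard normal variable.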

Training method

Flow-based models are generally trained by maximum likelihood. Pseudocode is as follows:^[5]

INPUT. dataset $x_{1:n}$ , normalizing flow model $f_{\theta }(\cdot ),p_{0}$ .
SOLVE. $\max _{\theta }\sum _{j}\ln p_{\theta }(x_{j})$ by gradient ascent (equivalently, minimize the negative log-likelihood by gradient descent)
RETURN. ${\hat {\theta }}$
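The pseudocode can be made concrete with a toy example. The sketch below assumes a one-dimensional affine flow $x=\mu +e^{s}z$ with a standard normal base distribution, synthetic data, and hand-derived gradients; the learning rate and step count are arbitrary illustrative choices, not values from the cited review.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=5000)   # synthetic dataset (assumed)

mu, s = 0.0, 0.0      # theta = (mu, s); the flow is x = mu + exp(s) * z
lr = 0.1              # illustrative learning rate
for _ in range(2000):
    z = (x - mu) * np.exp(-s)              # inverse flow applied to the data
    # Per-sample negative log-likelihood: 0.5*z**2 + 0.5*log(2*pi) + s
    grad_mu = np.mean(-z * np.exp(-s))     # d NLL / d mu
    grad_s = np.mean(1.0 - z**2)           # d NLL / d s
    mu -= lr * grad_mu                     # gradient descent on the mean NLL
    s -= lr * grad_s

nll = np.mean(0.5 * ((x - mu) * np.exp(-s))**2 + 0.5 * np.log(2 * np.pi) + s)
```

For this model the maximum-likelihood solution is known in closed form (the sample mean and standard deviation), which makes it a convenient check that the optimization behaves as intended.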

Variants

Planar Flow

Planar flow is the earliest example.^[1] Fix some activation function $h$ , and let $\theta =(u,w,b)$ with the appropriate dimensions; then

x=f_{\theta }(z)=z+uh(\langle w,z\rangle +b)

The inverse $f_{\theta }^{-1}$ has no closed-form solution in general.

The absolute value of the Jacobian determinant is $|\det(I+h'(\langle w,z\rangle +b)uw^{T})|=|1+h'(\langle w,z\rangle +b)\langle u,w\rangle |$ .

For the map to be invertible everywhere, the Jacobian determinant must be nonzero everywhere. For example, $h=\tanh$ and $\langle u,w\rangle >-1$ satisfies the requirement.
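The determinant identity above can be checked numerically. This is a small sketch with $h=\tanh$ and randomly drawn parameters; the function names are hypothetical, and the random draw does not enforce $\langle u,w\rangle >-1$ , since only the identity itself is being tested.

```python
import numpy as np

def planar_forward(z, u, w, b):
    """x = z + u * tanh(<w, z> + b)."""
    return z + u * np.tanh(w @ z + b)

def planar_log_abs_det(z, u, w, b):
    """log|1 + h'(<w, z> + b) <u, w>| with h = tanh."""
    h_prime = 1.0 - np.tanh(w @ z + b) ** 2
    return np.log(np.abs(1.0 + h_prime * (u @ w)))

rng = np.random.default_rng(0)
d = 4
z, u, w = rng.normal(size=(3, d))   # three random vectors of dimension d
b = 0.5

# Explicit d x d Jacobian I + h'(.) u w^T, built only to check the formula
h_prime = 1.0 - np.tanh(w @ z + b) ** 2
J = np.eye(d) + h_prime * np.outer(u, w)
full = np.log(np.abs(np.linalg.det(J)))
```

The closed form costs $O(d)$ operations, whereas the explicit determinant costs $O(d^{3})$ , which is the practical reason for the rank-one structure.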

Nonlinear Independent Components Estimation (NICE)

Let $x,z\in \mathbb {R} ^{2n}$ be even-dimensional, and split them in the middle.^[2] Then the normalizing flow functions are

x={\begin{bmatrix}x_{1}\\x_{2}\end{bmatrix}}=f_{\theta }(z)={\begin{bmatrix}z_{1}\\z_{2}\end{bmatrix}}+{\begin{bmatrix}0\\m_{\theta }(z_{1})\end{bmatrix}}

where $m_{\theta }$ is any neural network with weights $\theta$ .

$f_{\theta }^{-1}$ is just $z_{1}=x_{1},z_{2}=x_{2}-m_{\theta }(x_{1})$ , and the Jacobian determinant is 1; that is, the flow is volume-preserving.

When $n=1$ , this can be seen as a curved shear along the $x_{2}$ direction.
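A minimal sketch of the additive coupling layer follows. The network `m` is a hypothetical stand-in for $m_{\theta }$ (a tiny two-layer network with random weights); any function of $z_{1}$ would work, since the inverse never needs to invert `m` itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
W1, W2 = rng.normal(size=(8, n)), rng.normal(size=(n, 8))

def m(z1):
    """Hypothetical stand-in for m_theta: a tiny two-layer network."""
    return W2 @ np.tanh(W1 @ z1)

def nice_forward(z):
    """x1 = z1, x2 = z2 + m(z1)."""
    z1, z2 = z[:n], z[n:]
    return np.concatenate([z1, z2 + m(z1)])

def nice_inverse(x):
    """z1 = x1, z2 = x2 - m(x1); only a forward pass of m is needed."""
    x1, x2 = x[:n], x[n:]
    return np.concatenate([x1, x2 - m(x1)])

z = rng.normal(size=2 * n)
x = nice_forward(z)
z_rec = nice_inverse(x)
```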

Real Non-Volume Preserving (Real NVP)

The Real Non-Volume Preserving model generalizes the NICE model by:^[3]

x={\begin{bmatrix}x_{1}\\x_{2}\end{bmatrix}}=f_{\theta }(z)={\begin{bmatrix}z_{1}\\e^{s_{\theta }(z_{1})}\odot z_{2}\end{bmatrix}}+{\begin{bmatrix}0\\m_{\theta }(z_{1})\end{bmatrix}}

Its inverse is $z_{1}=x_{1},z_{2}=e^{-s_{\theta }(x_{1})}\odot (x_{2}-m_{\theta }(x_{1}))$ , and its Jacobian determinant is $\prod _{i=1}^{n}e^{s_{\theta }(z_{1})_{i}}$ , where $s_{\theta }(z_{1})_{i}$ is the $i$ -th component of $s_{\theta }(z_{1})$ . The NICE model is recovered by setting $s_{\theta }=0$ . Since the Real NVP map keeps the first and second halves of the vector $x$ separate, a permutation $(x_{1},x_{2})\mapsto (x_{2},x_{1})$ is usually added after every Real NVP layer.
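The affine coupling layer and its log-determinant can be sketched as follows. The functions `s` and `m` are hypothetical stand-ins for $s_{\theta }$ and $m_{\theta }$ , and the finite-difference Jacobian is included only to check that the log-determinant equals the sum of the components of $s_{\theta }(z_{1})$ .

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
Ws, Wm = rng.normal(size=(2, n, n)) * 0.5

def s(z1):
    """Hypothetical stand-in for s_theta."""
    return np.tanh(Ws @ z1)

def m(z1):
    """Hypothetical stand-in for m_theta."""
    return Wm @ z1

def realnvp_forward(z):
    z1, z2 = z[:n], z[n:]
    x = np.concatenate([z1, np.exp(s(z1)) * z2 + m(z1)])
    log_det = np.sum(s(z1))      # log of prod_i exp(s(z1)_i)
    return x, log_det

def realnvp_inverse(x):
    x1, x2 = x[:n], x[n:]
    return np.concatenate([x1, np.exp(-s(x1)) * (x2 - m(x1))])

def numerical_jacobian(f, z, eps=1e-6):
    """Central finite differences; column j approximates df/dz_j."""
    cols = [(f(z + eps * e) - f(z - eps * e)) / (2 * eps) for e in np.eye(z.size)]
    return np.stack(cols, axis=1)

z = rng.normal(size=2 * n)
x, log_det = realnvp_forward(z)
J = numerical_jacobian(lambda v: realnvp_forward(v)[0], z)
```

Setting `Ws` to zero makes `s` vanish and reduces the layer to the additive coupling of NICE.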

Generative Flow (Glow)

In the generative flow (Glow) model,^[4] each layer has three parts:

channel-wise affine transform $y_{cij}=s_{c}(x_{cij}+b_{c})$ with Jacobian determinant $\prod _{c}s_{c}^{HW}$ .
invertible 1x1 convolution $z_{cij}=\sum _{c'}K_{cc'}y_{c'ij}$ with Jacobian determinant $\det(K)^{HW}$ . Here $K$ is any invertible matrix.
Real NVP, with Jacobian determinant as described in Real NVP.

The idea of using the invertible 1x1 convolution is to learn a general mixing of all channels, instead of merely swapping the first and second halves, as in Real NVP.
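The determinant of the 1x1 convolution can be verified on a small tensor. The sketch below assumes a channels-first layout $(C,H,W)$ and a randomly drawn $K$ ; the helper `conv1x1` is a hypothetical name, and the explicit Kronecker-product Jacobian is built only for checking.

```python
import numpy as np

rng = np.random.default_rng(2)
C, H, W = 3, 2, 2
K = rng.normal(size=(C, C))       # random matrix, invertible with probability 1
y = rng.normal(size=(C, H, W))

def conv1x1(y, K):
    """Apply z_cij = sum_c' K_cc' y_c'ij and return log|det Jacobian|."""
    _, h, w = y.shape
    z = np.einsum("cd,dij->cij", K, y)
    _, log_abs_det_K = np.linalg.slogdet(K)
    return z, h * w * log_abs_det_K

z, log_det = conv1x1(y, K)
y_rec, _ = conv1x1(z, np.linalg.inv(K))   # the inverse is a 1x1 conv with K^-1

# Explicit (C*H*W) x (C*H*W) Jacobian on the flattened tensor: kron(K, I_HW)
full = np.kron(K, np.eye(H * W))
```

The same matrix $K$ acts at each of the $HW$ spatial positions, which is why its determinant appears to the power $HW$ .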

Masked autoregressive flow (MAF)

An autoregressive model of a distribution on $\mathbb {R} ^{n}$ is defined as the following stochastic process:^[6]

{\begin{aligned}x_{1}\sim &N(\mu _{1},\sigma _{1}^{2})\\x_{2}\sim &N(\mu _{2}(x_{1}),\sigma _{2}(x_{1})^{2})\\&\cdots \\x_{n}\sim &N(\mu _{n}(x_{1:n-1}),\sigma _{n}(x_{1:n-1})^{2})\\\end{aligned}}

where $\mu _{i}:\mathbb {R} ^{i-1}\to \mathbb {R}$ and $\sigma _{i}:\mathbb {R} ^{i-1}\to (0,\infty )$ are fixed functions that define the autoregressive model.

By the reparametrization trick, the autoregressive model is generalized to a normalizing flow:

{\begin{aligned}x_{1}=&\mu _{1}+\sigma _{1}z_{1}\\x_{2}=&\mu _{2}(x_{1})+\sigma _{2}(x_{1})z_{2}\\&\cdots \\x_{n}=&\mu _{n}(x_{1:n-1})+\sigma _{n}(x_{1:n-1})z_{n}\\\end{aligned}}

The autoregressive model is recovered by setting $z\sim N(0,I_{n})$ .

The forward mapping is slow (because it's sequential), but the backward mapping is fast (because it's parallel).

The Jacobian matrix is lower-triangular, so its determinant is $\sigma _{1}\sigma _{2}(x_{1})\cdots \sigma _{n}(x_{1:n-1})$ .

Reversing the two maps $f_{\theta }$ and $f_{\theta }^{-1}$ of MAF results in Inverse Autoregressive Flow (IAF), which has fast forward mapping and slow backward mapping.^[7]
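The asymmetry between the two directions of MAF can be seen in a short sketch. Here $\mu _{i}$ and $\sigma _{i}$ are modeled by strictly lower-triangular linear maps, a simplifying assumption standing in for the masked networks of the original paper; the names `maf_forward` and `maf_inverse` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
# Strictly lower-triangular weights: output i depends only on x_1..x_{i-1}
A = np.tril(rng.normal(size=(n, n)), k=-1)
B = np.tril(rng.normal(size=(n, n)), k=-1)

def mu(x):
    return A @ x

def sigma(x):
    return np.exp(np.tanh(B @ x))   # strictly positive by construction

def maf_forward(z):
    """z -> x, one coordinate at a time (sequential)."""
    x = np.zeros(n)
    for i in range(n):
        # mu(x)[i] and sigma(x)[i] read only x[:i], which is already filled in
        x[i] = mu(x)[i] + sigma(x)[i] * z[i]
    return x

def maf_inverse(x):
    """x -> z for all coordinates at once (parallel)."""
    return (x - mu(x)) / sigma(x)

z = rng.normal(size=n)
x = maf_forward(z)
log_det = np.sum(np.log(sigma(x)))   # log of sigma_1 * ... * sigma_n
```

Swapping the roles of `maf_forward` and `maf_inverse` gives the IAF arrangement, in which sampling is parallel and density evaluation of arbitrary points is sequential.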

Continuous Normalizing Flow (CNF)

Instead of constructing a flow by function composition, another approach is to formulate the flow as continuous-time dynamics.^[8] Let $z_{0}$ be the latent variable with distribution $p(z_{0})$ . Map this latent variable to data space with the following flow function:

x=F(z_{0})=z_{T}=z_{0}+\int _{0}^{T}f(z_{t},t)dt

where $f$ is an arbitrary function and can be modeled with e.g. neural networks.

The inverse function is then naturally:^[8]

z_{0}=F^{-1}(x)=z_{T}+\int _{T}^{0}f(z_{t},t)dt

And the log-likelihood of $x$ can be found as:^[8]

\log(p(x))=\log(p(z_{0}))-\int _{0}^{T}{\text{Tr}}\left[{\frac {\partial f}{\partial z_{t}}}\right]dt
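The flow and its log-likelihood can be integrated jointly as one augmented ODE. The sketch below assumes a linear vector field $f(z,t)=Az$ (chosen because its exact solution is known), a standard normal latent distribution, and a hand-written fixed-step Runge-Kutta integrator; practical CNF implementations typically use adaptive solvers and a learned $f$ .

```python
import numpy as np

a, omega, T = -0.3, 2.0, 1.5
A = np.array([[a, -omega], [omega, a]])   # contraction plus rotation

def f(z, t):
    """Hypothetical linear vector field f(z, t) = A z."""
    return A @ z

def trace_jacobian(z, t):
    """Tr[df/dz]; for the linear field this is tr(A) everywhere."""
    return np.trace(A)

def augmented(state, t):
    """Joint dynamics: dz/dt = f, d(log p)/dt = -Tr[df/dz]."""
    z = state[:-1]
    return np.append(f(z, t), -trace_jacobian(z, t))

def rk4(state, T, steps=200):
    """Classical fourth-order Runge-Kutta from time 0 to T."""
    dt, t = T / steps, 0.0
    for _ in range(steps):
        k1 = augmented(state, t)
        k2 = augmented(state + 0.5 * dt * k1, t + 0.5 * dt)
        k3 = augmented(state + 0.5 * dt * k2, t + 0.5 * dt)
        k4 = augmented(state + dt * k3, t + dt)
        state = state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += dt
    return state

z0 = np.array([1.0, -0.5])
log_p_z0 = -0.5 * z0 @ z0 - np.log(2 * np.pi)   # standard normal in 2-D
final = rk4(np.append(z0, log_p_z0), T)
x, log_p_x = final[:-1], final[-1]
```

For this field the trace is constant, so the log-density changes by exactly $-T\operatorname {tr} (A)$ along every trajectory.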

Since the trace depends only on the diagonal of the Jacobian $\partial _{z_{t}}f$ , this allows a "free-form" Jacobian.^[9] Here, "free-form" means that there is no restriction on the Jacobian's form. This contrasts with earlier discrete normalizing flow models, where the Jacobian is carefully designed to be upper- or lower-triangular so that its determinant can be evaluated efficiently.

The trace can be estimated by "Hutchinson's trick":^[10]^[11]

Given any matrix $W\in \mathbb {R} ^{n\times n}$ , and any random $u\in \mathbb {R} ^{n}$ with $E[uu^{T}]=I$ , we have $E[u^{T}Wu]=tr(W)$ . (Proof: expand the expectation directly.)

Usually, the random vector is sampled from $N(0,I)$ (normal distribution) or $\{\pm 1\}^{n}$ (Rademacher distribution), both of which satisfy $E[uu^{T}]=I$ .
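The estimator is easy to try on a random matrix. The sketch below compares Gaussian probe vectors with Rademacher probe vectors whose entries are $\pm 1$ (so that $E[uu^{T}]=I$ holds); the sample count and the helper name `hutchinson` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, num_samples = 6, 20000
W = rng.normal(size=(n, n))

def hutchinson(W, U):
    """Average of u^T W u over the rows u of U."""
    return np.mean(np.einsum("ki,ij,kj->k", U, W, U))

U_gauss = rng.normal(size=(num_samples, n))                 # u ~ N(0, I)
U_rad = rng.choice([-1.0, 1.0], size=(num_samples, n))      # entries +1 or -1

est_gauss = hutchinson(W, U_gauss)
est_rad = hutchinson(W, U_rad)
exact = np.trace(W)
```

In a CNF the product $u^{T}Wu$ is computed with vector-Jacobian products, so the full Jacobian $W=\partial f/\partial z_{t}$ is never materialized.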

When $f$ is implemented as a neural network, neural ODE methods^[12] would be needed. Indeed, CNF was first proposed in the same paper that proposed neural ODE.

CNF has two main deficiencies. First, a continuous flow must be a homeomorphism, and thus preserves orientation and ambient isotopy (for example, it is impossible to turn a left hand into a right hand by continuously deforming space, to turn a sphere inside out, or to undo a knot). Second, the learned flow $f$ might be ill-behaved due to degeneracy (that is, infinitely many possible $f$ solve the same problem).

By adding extra dimensions, the CNF gains enough freedom to reverse orientation and go beyond ambient isotopy (just like how one can pick up a polygon from a desk and flip it around in 3-space, or unknot a knot in 4-space), yielding the "augmented neural ODE".^[13]

Any homeomorphism of $\mathbb {R} ^{n}$ can be approximated by a neural ODE operating on $\mathbb {R} ^{2n+1}$ , a result proved by combining the Whitney embedding theorem for manifolds and the universal approximation theorem for neural networks.^[14]

To regularize the flow $f$ , one can impose regularization losses. Finlay et al.^[10] proposed the following regularization loss based on optimal transport theory:

\lambda _{K}\int _{0}^{T}\left\|f(z_{t},t)\right\|^{2}dt+\lambda _{J}\int _{0}^{T}\left\|\nabla _{z}f(z_{t},t)\right\|_{F}^{2}dt

where $\lambda _{K},\lambda _{J}>0$ are hyperparameters. The first term penalizes the kinetic energy of the flow, that is, large velocities along trajectories, and the second term penalizes variation of the flow field over space.

Applications

Flow-based generative models have been applied to a variety of modeling tasks, including:

Audio generation^[15]
Image generation^[4]
Molecular graph generation^[16]
Point-cloud modeling^[17]
Video generation^[18]

References

1. Rezende, Danilo Jimenez; Mohamed, Shakir (2015). "Variational Inference with Normalizing Flows". arXiv:1505.05770 [stat.ML].
2. Dinh, Laurent; Krueger, David; Bengio, Yoshua (2014). "NICE: Non-linear Independent Components Estimation". arXiv:1410.8516 [cs.LG].
3. Dinh, Laurent; Sohl-Dickstein, Jascha; Bengio, Samy (2016). "Density estimation using Real NVP". arXiv:1605.08803 [cs.LG].
4. Kingma, Diederik P.; Dhariwal, Prafulla (2018). "Glow: Generative Flow with Invertible 1x1 Convolutions". arXiv:1807.03039 [stat.ML].
5. Kobyzev, Ivan; Prince, Simon J.D.; Brubaker, Marcus A. (November 2021). "Normalizing Flows: An Introduction and Review of Current Methods". IEEE Transactions on Pattern Analysis and Machine Intelligence. 43 (11): 3964–3979. arXiv:1908.09257. doi:10.1109/TPAMI.2020.2992934. ISSN 1939-3539. PMID 32396070. S2CID 208910764.
6. Papamakarios, George; Pavlakou, Theo; Murray, Iain (2017). "Masked Autoregressive Flow for Density Estimation". Advances in Neural Information Processing Systems. 30. Curran Associates, Inc. arXiv:1705.07057.
7. Kingma, Durk P; Salimans, Tim; Jozefowicz, Rafal; Chen, Xi; Sutskever, Ilya; Welling, Max (2016). "Improved Variational Inference with Inverse Autoregressive Flow". Advances in Neural Information Processing Systems. 29. Curran Associates, Inc. arXiv:1606.04934.
8. Grathwohl, Will; Chen, Ricky T. Q.; Bettencourt, Jesse; Sutskever, Ilya; Duvenaud, David (2018). "FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models". arXiv:1810.01367 [cs.LG].
9. Grathwohl, Will; Chen, Ricky T. Q.; Bettencourt, Jesse; Sutskever, Ilya; Duvenaud, David (2018-10-22). "FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models". arXiv:1810.01367 [cs.LG].
10. Finlay, Chris; Jacobsen, Joern-Henrik; Nurbekyan, Levon; Oberman, Adam (2020-11-21). "How to Train Your Neural ODE: the World of Jacobian and Kinetic Regularization". International Conference on Machine Learning. PMLR: 3154–3164. arXiv:2002.02798.
11. Hutchinson, M.F. (January 1989). "A Stochastic Estimator of the Trace of the Influence Matrix for Laplacian Smoothing Splines". Communications in Statistics - Simulation and Computation. 18 (3): 1059–1076. doi:10.1080/03610918908812806. ISSN 0361-0918.
12. Chen, Ricky T. Q.; Rubanova, Yulia; Bettencourt, Jesse; Duvenaud, David (2018). "Neural Ordinary Differential Equations". arXiv:1806.07366 [cs.LG].
13. Dupont, Emilien; Doucet, Arnaud; Teh, Yee Whye (2019). "Augmented Neural ODEs". Advances in Neural Information Processing Systems. 32. Curran Associates, Inc.
14. Zhang, Han; Gao, Xi; Unterman, Jacob; Arodz, Tom (2019-07-30). "Approximation Capabilities of Neural ODEs and Invertible Residual Networks". arXiv:1907.12998 [cs.LG].
15. Ping, Wei; Peng, Kainan; Zhao, Kexin; Song, Zhao (2019). "WaveFlow: A Compact Flow-based Model for Raw Audio". arXiv:1912.01219 [cs.SD].
16. Shi, Chence; Xu, Minkai; Zhu, Zhaocheng; Zhang, Weinan; Zhang, Ming; Tang, Jian (2020). "GraphAF: A Flow-based Autoregressive Model for Molecular Graph Generation". arXiv:2001.09382 [cs.LG].
17. Yang, Guandao; Huang, Xun; Hao, Zekun; Liu, Ming-Yu; Belongie, Serge; Hariharan, Bharath (2019). "PointFlow: 3D Point Cloud Generation with Continuous Normalizing Flows". arXiv:1906.12320 [cs.CV].
18. Kumar, Manoj; Babaeizadeh, Mohammad; Erhan, Dumitru; Finn, Chelsea; Levine, Sergey; Dinh, Laurent; Kingma, Durk (2019). "VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation". arXiv:1903.01434 [cs.CV].
