Diffusion model
In machine learning, diffusion models, also known as diffusion probabilistic models or score-based generative models, are a class of latent variable generative models. A diffusion model consists of three major components: the forward process, the reverse process, and the sampling procedure.[1] The goal of diffusion models is to learn a diffusion process that generates the probability distribution of a given dataset. They learn the latent structure of a dataset by modeling the way in which data points diffuse through their latent space.[2]
In the case of computer vision, diffusion models can be applied to a variety of tasks, including image denoising, inpainting, super-resolution, and image generation. This typically involves training a neural network to sequentially denoise images blurred with Gaussian noise.[2][3] The model is trained to reverse the process of adding noise to an image. After training to convergence, it can be used for image generation by starting with an image composed of random noise for the network to iteratively denoise. Announced on 13 April 2022, OpenAI's text-to-image model DALL-E 2 is an example that uses diffusion models for both the model's prior (which produces an image embedding given a text caption) and the decoder that generates the final image.[4]
Diffusion models are typically formulated as Markov chains and trained using variational inference.[5] Examples of generic diffusion modeling frameworks used in computer vision are denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations.[6]
Denoising diffusion model
Non-equilibrium thermodynamics
Diffusion models were introduced in 2015 as a method to learn a model that can sample from a highly complex probability distribution. They used techniques from non-equilibrium thermodynamics, especially diffusion.[7]
Consider, for example, how one might model the distribution of all naturally-occurring photos. Each image is a point in the space of all images, and the distribution of naturally-occurring photos is a "cloud" in that space, which, by repeatedly adding noise to the images, diffuses out to the rest of the image space, until the cloud becomes all but indistinguishable from a Gaussian distribution $N(0, I)$. A model that can approximately undo the diffusion can then be used to sample from the original distribution. This is studied in "non-equilibrium" thermodynamics, as the starting distribution is not in equilibrium, unlike the final distribution.
The equilibrium distribution is the Gaussian distribution $N(0, I)$, with pdf $\rho(x) \propto e^{-\frac{1}{2}\|x\|^2}$. This is just the Boltzmann distribution of particles in a potential well at temperature 1. The initial distribution, being very much out of equilibrium, would diffuse towards the equilibrium distribution, making biased random steps that are a sum of pure randomness (like a Brownian walker) and gradient descent down the potential well. The randomness is necessary: if the particles were to undergo only gradient descent, then they would all fall to the origin, collapsing the distribution.
Denoising Diffusion Probabilistic Model (DDPM)
The 2020 paper by Ho et al. proposed the Denoising Diffusion Probabilistic Model (DDPM), which improved on the earlier method by using variational inference.[5]
Forward diffusion
To present the model, we need some notation.
- $\beta_1, \ldots, \beta_T \in (0, 1)$ are fixed constants.
- $N(\mu, \Sigma)$ is the normal distribution with mean $\mu$ and variance $\Sigma$, and $N(x \mid \mu, \Sigma)$ is the probability density at $x$.
- A vertical bar denotes conditioning.
A forward diffusion process starts at some starting point $x_0 \sim q$, where $q$ is the probability distribution to be learned, then repeatedly adds noise to it by
$$x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, z_t,$$
where $z_1, \ldots, z_T$ are IID samples from $N(0, I)$. This is designed so that, whatever the distribution of $x_0$, the distribution of $x_t$ conditional on $x_0$ converges to $N(0, I)$ as $t$ grows.
The entire diffusion process then satisfies
$$q(x_{0:T}) = q(x_0)\, q(x_1 \mid x_0) \cdots q(x_T \mid x_{T-1}) = q(x_0) \prod_{t=1}^T N\!\left(x_t \mid \sqrt{1-\beta_t}\, x_{t-1},\; \beta_t I\right).$$
In particular, $x_{1:T} \mid x_0$ is a gaussian process, which affords considerable freedom in how it is reparametrized.
For example, since
$$x_2 = \sqrt{1-\beta_2}\, x_1 + \sqrt{\beta_2}\, z_2 = \sqrt{(1-\beta_2)(1-\beta_1)}\, x_0 + \sqrt{(1-\beta_2)\beta_1}\, z_1 + \sqrt{\beta_2}\, z_2,$$
we know $x_1 \mid x_0$ is a gaussian, and $x_2 \mid x_1$ is another gaussian. We also know that the noises $z_1, z_2$ are independent. Thus we can perform a reparametrization: writing $\alpha_t := 1 - \beta_t$ and $\bar\alpha_t := \alpha_1 \cdots \alpha_t$,
$$x_1 = \sqrt{\bar\alpha_1}\, x_0 + \sqrt{1-\bar\alpha_1}\, z_1', \qquad x_2 = \sqrt{\bar\alpha_2}\, x_0 + \sqrt{1-\bar\alpha_2}\, z_2',$$
where $z_1', z_2'$ are IID standard gaussians.
There are 5 variables $x_0, x_1, x_2, z_1', z_2'$ and two linear equations. The two sources of randomness are $z_1, z_2$, which can be reparametrized by rotation, since the IID gaussian distribution is rotationally symmetric.
By plugging in the equations, we can solve for the first reparametrization:
$$z_2' = \frac{\sqrt{\alpha_2 \beta_1}\, z_1 + \sqrt{\beta_2}\, z_2}{\sqrt{1-\bar\alpha_2}},$$
which is a standard gaussian, since $\alpha_2 \beta_1 + \beta_2 = 1 - \bar\alpha_2$.
To find the second one, we complete the rotational matrix:
$$\begin{pmatrix} z_2' \\ z_1'' \end{pmatrix} = \frac{1}{\sqrt{1-\bar\alpha_2}}\begin{pmatrix} \sqrt{\alpha_2\beta_1} & \sqrt{\beta_2} \\ ? & ? \end{pmatrix}\begin{pmatrix} z_1 \\ z_2 \end{pmatrix}.$$
Since rotational matrices are all of the form $\begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}$, we know the matrix must be
$$\begin{pmatrix} z_2' \\ z_1'' \end{pmatrix} = \frac{1}{\sqrt{1-\bar\alpha_2}}\begin{pmatrix} \sqrt{\alpha_2\beta_1} & \sqrt{\beta_2} \\ -\sqrt{\beta_2} & \sqrt{\alpha_2\beta_1} \end{pmatrix}\begin{pmatrix} z_1 \\ z_2 \end{pmatrix},$$
so that $z_1''$ is a standard gaussian independent of $z_2'$.
Plugging back, and simplifying, we have
$$x_1 = \sqrt{\alpha_1}\, x_0 + \frac{\sqrt{\alpha_2}\,\beta_1}{\sqrt{1-\bar\alpha_2}}\, z_2' - \frac{\sqrt{\beta_1\beta_2}}{\sqrt{1-\bar\alpha_2}}\, z_1''.$$
Since $z_2'$ is determined by $x_2$ and $x_0$, this shows that $x_1 \mid x_2, x_0$ is a gaussian with mean $\frac{\sqrt{\alpha_2}\beta_1\, x_2 + \sqrt{\alpha_1}\beta_2\, x_0}{1-\bar\alpha_2}$ and variance $\frac{\beta_1\beta_2}{1-\bar\alpha_2} I$. More generally, for all $t \geq 1$,
$$x_t \mid x_0 \sim N\!\left(\sqrt{\bar\alpha_t}\, x_0,\; (1-\bar\alpha_t) I\right), \qquad x_{t-1} \mid x_t, x_0 \sim N\!\left(\tilde\mu_t(x_t, x_0),\; \tilde\sigma_t^2 I\right),$$
where $\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})\, x_t + \sqrt{\bar\alpha_{t-1}}\,\beta_t\, x_0}{1-\bar\alpha_t}$ and $\tilde\sigma_t^2 = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\beta_t$.
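Because $x_t \mid x_0$ has this closed form, the forward process can be simulated in a single step, without iterating over intermediate timesteps. The following is a minimal PyTorch sketch; the schedule endpoints and the helper name q_sample are illustrative choices, not prescribed by the sources:

```python
import torch

# Linear beta schedule; the endpoints below are a common choice (Ho et al., 2020),
# but any monotone schedule with values in (0, 1) works.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # \bar{\alpha}_t = \prod_{s<=t} \alpha_s

def q_sample(x0, t, noise=None):
    """Sample x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I) in one step."""
    if noise is None:
        noise = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over the batch
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

# Example: noise a batch of 8 images at random timesteps.
x0 = torch.randn(8, 3, 32, 32)
t = torch.randint(0, T, (8,))
xt = q_sample(x0, t)
```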
Backward diffusion
The key idea of DDPM is to use a neural network parametrized by $\theta$. The network takes in two arguments $x_t, t$, and outputs a vector $\mu_\theta(x_t, t)$ and a matrix $\Sigma_\theta(x_t, t)$, such that each step in the forward diffusion process can be approximately undone by $x_{t-1} \sim N(\mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$. This then gives us a backward diffusion process defined by
$$p_\theta(x_T) = N(x_T \mid 0, I), \qquad p_\theta(x_{t-1} \mid x_t) = N\!\left(x_{t-1} \mid \mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\right).$$
The goal is now to learn the parameters $\theta$ such that $p_\theta(x_0)$ is as close to $q(x_0)$ as possible, which is done by variational inference.
Variational inference
The ELBO inequality states that $\ln p_\theta(x_0) \geq E_{x_{1:T} \sim q(\cdot \mid x_0)}\!\left[\ln p_\theta(x_{0:T}) - \ln q(x_{1:T} \mid x_0)\right]$, and taking one more expectation, we get
$$E_{x_0 \sim q}[\ln p_\theta(x_0)] \geq E_{x_{0:T} \sim q}\!\left[\ln p_\theta(x_{0:T}) - \ln q(x_{1:T} \mid x_0)\right].$$
We see that maximizing the quantity on the right would give us a lower bound on the likelihood of observed data. This allows us to perform variational inference.
Define the loss function
$$L(\theta) := -E_{x_{0:T} \sim q}\!\left[\ln p_\theta(x_{0:T}) - \ln q(x_{1:T} \mid x_0)\right]$$
and now the goal is to minimize the loss by stochastic gradient descent. Up to terms that do not depend on $\theta$, the loss decomposes into a sum over time steps of KL divergences between $q(x_{t-1} \mid x_t, x_0)$ and $p_\theta(x_{t-1} \mid x_t)$, each of which is a divergence between two gaussians and so has a closed form.
Noise prediction network
Since $x_{t-1} \mid x_t, x_0 \sim N(\tilde\mu_t(x_t, x_0), \tilde\sigma_t^2 I)$, this suggests that we should use $\mu_\theta(x_t, t) = \tilde\mu_t(x_t, x_0)$; however, the network does not have access to $x_0$, and so it has to estimate it instead. Now, since $x_t \mid x_0 \sim N(\sqrt{\bar\alpha_t}\, x_0, (1-\bar\alpha_t) I)$, we may write $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, z$, where $z$ is some unknown gaussian noise. Now we see that estimating $x_0$ is equivalent to estimating $z$.
Therefore, let the network output a noise vector $\epsilon_\theta(x_t, t)$, and let it predict
$$\mu_\theta(x_t, t) = \tilde\mu_t\!\left(x_t,\; \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}\right) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right).$$
With this (and taking $\Sigma_\theta(x_t, t) = \tilde\sigma_t^2 I$), the loss simplifies to
$$L_t = \frac{\beta_t^2}{2\alpha_t (1-\bar\alpha_t)\,\tilde\sigma_t^2}\; E_{x_0 \sim q,\; z \sim N(0, I)}\!\left[\left\|\epsilon_\theta(x_t, t) - z\right\|^2\right] + C,$$
where $C$ does not depend on $\theta$, and so the loss may be minimized by stochastic gradient descent. Ho et al. found empirically that discarding the time-dependent weight, i.e. minimizing the simplified loss $E\!\left[\|\epsilon_\theta(x_t, t) - z\|^2\right]$ with $t$ sampled uniformly, gave better-quality samples.[5]
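Concretely, the simplified objective amounts to regressing the injected noise. A minimal sketch of one training step follows; eps_model is a placeholder for any network taking $(x_t, t)$, such as a U-Net:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0, alpha_bars):
    """One Monte Carlo draw of the simplified DDPM loss E[ || eps_theta(x_t, t) - z ||^2 ]."""
    t = torch.randint(0, len(alpha_bars), (x0.shape[0],), device=x0.device)
    z = torch.randn_like(x0)                                  # the noise the network must recover
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))
    xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * z               # forward diffusion in closed form
    return F.mse_loss(eps_model(xt, t), z)

# Training loop sketch:
#   loss = ddpm_loss(model, batch, alpha_bars); loss.backward(); opt.step(); opt.zero_grad()
```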
Score-based generative model
Score-based generative models are another formulation of diffusion modelling. They are also called noise conditional score networks (NCSN) or score-matching with Langevin dynamics (SMLD).[9][10]
Score matching
The idea of score functions
Consider the problem of image generation. Let $x$ represent an image, and let $q(x)$ be the probability distribution over all possible images. If we have $q(x)$ itself, then we can say for certain how likely a certain image is. However, this is intractable in general.
Most often, we are uninterested in knowing the absolute probability of a certain image. Instead, we are usually only interested in knowing how likely a certain image is compared to its immediate neighbors — e.g. how much more likely is an image of a cat compared to some small variants of it? Is it more likely if the image contains two whiskers, or three, or with some Gaussian noise added?
Consequently, we are actually quite uninterested in $q(x)$ itself, but rather in its gradient with respect to the image, $\nabla_x \ln q(x)$. This has two major effects:
- One, we no longer need to normalize $q(x)$, but can use any $\tilde q(x) = C\, q(x)$, where $C$ is any unknown constant that is of no concern to us, since $\nabla_x \ln \tilde q(x) = \nabla_x \ln q(x)$.
- Two, we are comparing $q(x)$ with its neighbors $q(x + dx)$, by $\ln \frac{q(x + dx)}{q(x)} \approx \nabla_x \ln q(x) \cdot dx$.
Let the score function be $s(x) := \nabla_x \ln q(x)$; then consider what we can do with $s(x)$.
As it turns out, $s(x)$ allows us to sample from $q(x)$ using thermodynamics. Specifically, if we have a potential energy function $U(x) = -\ln q(x)$, and a lot of particles in the potential well, then the distribution at thermodynamic equilibrium is the Boltzmann distribution $\propto e^{-U(x)/k_B T}$. At temperature $k_B T = 1$, the Boltzmann distribution is exactly $q(x)$.
Therefore, to model $q(x)$, we may start with a particle sampled at any convenient distribution (such as the standard gaussian distribution), then simulate the motion of the particle forwards according to the Langevin equation
$$dx_t = \nabla_{x_t} \ln q(x_t)\, dt + \sqrt{2}\, dW_t,$$
where $W_t$ is Brownian motion; the Boltzmann distribution, here $q$, is the stationary distribution of this dynamics, so the distribution of $x_t$ converges to $q$ as $t \to \infty$.
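To make this concrete, the following sketch runs discretized (unadjusted) Langevin dynamics on a toy two-dimensional Gaussian mixture whose score is available in closed form; the mixture, step size, and function names are illustrative choices, not taken from the cited papers:

```python
import torch

def score(x):
    """Score grad log q(x) of a toy equal-weight mixture of two unit-variance Gaussians in 2D."""
    means = torch.tensor([[-2.0, 0.0], [2.0, 0.0]])
    d = x[:, None, :] - means[None, :, :]           # (batch, component, dim)
    w = torch.softmax(-0.5 * (d ** 2).sum(-1), 1)   # posterior responsibility of each component
    return -(w[..., None] * d).sum(1)               # sum_k w_k * (mu_k - x)

def langevin_sample(n=1000, steps=500, eps=0.05):
    x = torch.randn(n, 2)                           # start from any convenient distribution
    for _ in range(steps):
        x = x + eps * score(x) + (2 * eps) ** 0.5 * torch.randn_like(x)
    return x                                        # approximately distributed as q for small eps and many steps

samples = langevin_sample()
```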
Learning the score function
Given a density $q$, we wish to learn a score function approximation $f_\theta \approx \nabla \ln q$. This is score matching.[11] Typically, score matching is formalized as minimizing the Fisher divergence $E_q\!\left[\|f_\theta(x) - \nabla \ln q(x)\|^2\right]$. By expanding the integral, and performing an integration by parts,
$$E_q\!\left[\|f_\theta(x) - \nabla \ln q(x)\|^2\right] = E_q\!\left[\|f_\theta(x)\|^2 + 2\,\nabla \cdot f_\theta(x)\right] + C,$$
where $C$ does not depend on $\theta$, giving an objective that can be estimated from samples of $q$ alone, without knowing $\nabla \ln q$ itself.
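A minimal sketch of estimating this objective on a batch of samples, assuming score_model maps a batch of points to their estimated scores (a placeholder, not a specific library API); the exact Jacobian trace used here is only affordable in low dimension, and sliced or denoising variants are used at scale:

```python
import torch

def implicit_score_matching_loss(score_model, x):
    """Estimate E[ ||f_theta(x)||^2 + 2 * div f_theta(x) ] on a batch of samples from q."""
    x = x.detach().requires_grad_(True)
    s = score_model(x)                                            # (batch, dim)
    div = torch.zeros(x.shape[0], device=x.device)
    for i in range(x.shape[1]):                                   # trace of the Jacobian, one coordinate at a time
        grad_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0][:, i]
        div = div + grad_i
    return (s.pow(2).sum(dim=1) + 2.0 * div).mean()
```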
Annealing the score function
Suppose we need to model the distribution of images, and we want to start sampling from $x_0 \sim N(0, I)$, a white-noise image. Now, most white-noise images do not look like real images, so $q(x_0) \approx 0$ for large swaths of $x_0 \sim N(0, I)$. This presents a problem for learning the score function, because if there are no samples around a certain point, then we can't learn the score function at that point. If we do not know the score function $\nabla_{x_t} \ln q(x_t)$ at that point, then we cannot impose the time-evolution equation on a particle:
$$dx_t = \nabla_{x_t} \ln q(x_t)\, dt + \sqrt{2}\, dW_t.$$
To deal with this problem, the distribution is annealed: Gaussian noise of gradually increasing magnitude is added to the images, so that the blurred distributions cover the whole space, a score function is learned for every noise level, and sampling proceeds from high noise down to low noise.
Continuous diffusion processes
Forward diffusion process
Consider again the forward diffusion process, but this time in continuous time: taking the limit $\beta_t \to \beta(t)\, dt$, $\sqrt{\beta_t}\, z_t \to \sqrt{\beta(t)}\, dW_t$ of the discrete update $x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, z_t$ gives the stochastic differential equation
$$dx_t = -\tfrac{1}{2}\beta(t)\, x_t\, dt + \sqrt{\beta(t)}\, dW_t,$$
where $W_t$ is a Wiener process (multidimensional Brownian motion).
Now, the equation is exactly a special case of the overdamped Langevin equation
$$dx_t = -\frac{D}{k_B T}\,\nabla_x U(x_t)\, dt + \sqrt{2D}\, dW_t,$$
where $D$ is the diffusion coefficient, $T$ the temperature, and $U$ the potential energy field; here $D = \tfrac{1}{2}\beta(t)$, $k_B T = 1$, and $U(x) = \tfrac{1}{2}\|x\|^2$.
Now the above equation is for the stochastic motion of a single particle. Suppose we have a cloud of particles distributed according to $q$ at time $t = 0$; then after a long time, the cloud of particles would settle into the stable distribution of $N(0, I)$. Let $\rho_t$ be the density of the cloud of particles at time $t$; then we have
$$\rho_0 = q, \qquad \rho_T \approx N(0, I) \text{ for large } T.$$
By the Fokker-Planck equation, the density of the cloud evolves according to
$$\partial_t \rho_t = \tfrac{1}{2}\beta(t)\,\nabla \cdot \left(x\, \rho_t + \nabla \rho_t\right).$$
Backward diffusion process
If we have solved $\rho_t$ for all times $t \in [0, T]$, then we can exactly reverse the evolution of the cloud. Suppose we start with another cloud of particles with density $\nu_0 = \rho_T$, and let the particles in the cloud evolve according to
$$dy_t = \tfrac{1}{2}\beta(T-t)\, y_t\, dt + \beta(T-t)\,\nabla_{y_t} \ln \rho_{T-t}(y_t)\, dt + \sqrt{\beta(T-t)}\, dW_t;$$
then, plugging into the Fokker-Planck equation shows that $\nu_t = \rho_{T-t}$. That is, the new cloud retraces the forward diffusion backwards, and at time $T$ it has density $\nu_T = \rho_0 = q$.
Noise conditional score network (NCSN)
At the continuous limit,
$$\bar\alpha_t = (1-\beta_1)\cdots(1-\beta_t) \;\to\; e^{-\int_0^t \beta(s)\, ds},$$
and so
$$x_t \mid x_0 \sim N\!\left(e^{-\frac{1}{2}\int_0^t \beta(s)\, ds}\, x_0,\; \left(1 - e^{-\int_0^t \beta(s)\, ds}\right) I\right).$$
In particular, we can directly sample any point of the continuous diffusion process without going through the intermediate steps, by first sampling $x_0 \sim q$, $z \sim N(0, I)$, then computing $x_t$ from the formula above.
Now, define a certain probability distribution $\gamma$ over $[0, \infty)$; then the score-matching loss function is defined as the expected Fisher divergence:
$$L(\theta) = E_{t \sim \gamma,\; x_t \sim \rho_t}\!\left[\|f_\theta(x_t, t)\|^2 + 2\,\nabla \cdot f_\theta(x_t, t)\right].$$
After training, $f_\theta(x_t, t) \approx \nabla \ln \rho_t(x_t)$, so we can perform the backwards diffusion process by first sampling $x_T \sim N(0, I)$, then integrating the reverse-time equation with $f_\theta$ in place of the exact score.
The name "noise conditional score network" is explained thus:
- "network", because is implemented as a neural network.
- "score", because the output of the network is interpreted as approximating the score function .
- "noise conditional", because is equal to blurred by an added gaussian noise that increases with time, and so the score function depends on the amount of noise added.
Their equivalence
DDPM and score-based generative models are equivalent.[13] This means that a network trained using DDPM can be used as an NCSN, and vice versa.
We know that $x_t \mid x_0 \sim N(\sqrt{\bar\alpha_t}\, x_0, (1-\bar\alpha_t) I)$, so by Tweedie's formula, we have
$$\nabla_{x_t} \ln q(x_t) = \frac{\sqrt{\bar\alpha_t}\, E[x_0 \mid x_t] - x_t}{1 - \bar\alpha_t}.$$
Since the DDPM network predicts the noise in $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, z$, an optimally trained network satisfies $\epsilon_\theta(x_t, t) \approx E[z \mid x_t]$, and substituting into Tweedie's formula gives
$$\nabla_{x_t} \ln q(x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar\alpha_t}},$$
so the noise-prediction network is, up to a known scaling, a score network.
Now, the continuous limit of the backward equation of DDPM, $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) + \sqrt{\beta_t}\, z_t$, recovers precisely the reverse-time equation of the score-based model, with $-\epsilon_\theta(x_t, t)/\sqrt{1-\bar\alpha_t}$ playing the role of the score; thus the two formulations describe the same generative process.
Main variants
Denoising Diffusion Implicit Model (DDIM)
The original DDPM method for generating images is slow, since the forward diffusion process usually takes $T \sim 1000$ steps to make the distribution of $x_T$ appear close to gaussian. However, this means the backward diffusion process also takes 1000 steps. Unlike the forward diffusion process, which can skip steps since $x_t \mid x_0$ is gaussian for all $t \geq 1$, the backward diffusion process does not allow skipping steps. For example, to sample $x_{t-2} \mid x_t$ requires the model to first sample $x_{t-1}$. Attempting to directly sample $x_{t-2} \mid x_t$ would require us to marginalize out $x_{t-1}$, which is generally intractable.
DDIM[14] is a method to take any model trained on DDPM loss, and use it to sample with some steps skipped, sacrificing an adjustable amount of quality. The original DDPM is a special case of DDIM.
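As a sketch, one DDIM update from timestep t to an earlier timestep t_prev (possibly many steps earlier) can be written as follows; eps_model and the timestep handling are placeholders, and eta interpolates between the deterministic DDIM sampler (eta = 0) and DDPM-like stochasticity (eta = 1):

```python
import torch

@torch.no_grad()
def ddim_step(eps_model, x_t, t, t_prev, alpha_bars, eta=0.0):
    """One DDIM update x_t -> x_{t_prev}, where t_prev < t may skip many steps."""
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t_prev]
    eps = eps_model(x_t, t)
    x0_pred = (x_t - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()        # predicted clean image
    sigma = eta * ((1 - ab_prev) / (1 - ab_t)).sqrt() * (1 - ab_t / ab_prev).sqrt()
    dir_xt = (1 - ab_prev - sigma ** 2).sqrt() * eps               # direction pointing towards x_t
    return ab_prev.sqrt() * x0_pred + dir_xt + sigma * torch.randn_like(x_t)
```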
Latent diffusion model (LDM)
Since the diffusion model is a general method for modelling probability distributions, if one wants to model a distribution over images, one can first encode the images into a lower-dimensional space by an encoder, then use a diffusion model to model the distribution over encoded images. Then to generate an image, one can sample from the diffusion model, then use a decoder to decode it into an image.[15]
The encoder-decoder pair is most often a variational autoencoder (VAE).
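Schematically, generation with a latent diffusion model chains three pretrained components. The sketch below shows only the control flow; vae, unet, text_encoder, the latent shape, and the sampler update are placeholders under stated assumptions, not a specific implementation (ddim_step refers to the sketch in the DDIM section above):

```python
import torch

@torch.no_grad()
def generate(vae, unet, text_encoder, prompt, timesteps, alpha_bars):
    """Sample a latent with a diffusion model conditioned on `prompt`, then decode it to pixels."""
    cond = text_encoder(prompt)                             # conditioning vectors fed to cross-attention
    eps_model = lambda x, t: unet(x, t, cond)               # close over the conditioning
    z = torch.randn(1, 4, 64, 64)                           # latent shape is model-specific; illustrative here
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):    # a decreasing sequence of timesteps
        z = ddim_step(eps_model, z, t, t_prev, alpha_bars)  # any sampler update works here
    return vae.decode(z)                                    # decoder maps the latent back to an image
```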
Classifier guidance
Suppose we wish to sample not from the entire distribution of images, but conditional on the image description. We don't want to sample a generic image, but an image that fits the description "black cat with red eyes". Generally, we want to sample from the distribution $p(x \mid y)$, where $x$ ranges over images, and $y$ ranges over classes of images (a description "black cat with red eyes" is just a very detailed class, and a class "cat" is just a very vague description).
Taking the perspective of the noisy channel model, we can understand the process as follows: To generate an image conditional on description $y$, we imagine that the requester really had in mind an image $x$, but the image is passed through a noisy channel and came out garbled as $y$. Image generation is then nothing but inferring which $x$ the requester had in mind.
In other words, conditional image generation is simply "translating from a textual language into a pictorial language". Then, as in the noisy-channel model, we use Bayes' theorem to get
$$p(x \mid y) \propto p(y \mid x)\, p(x), \qquad \text{i.e.} \qquad \nabla_x \ln p(x \mid y) = \nabla_x \ln p(y \mid x) + \nabla_x \ln p(x);$$
that is, a score model of images and a classifier $p(y \mid x)$ together give a score for conditional image generation.
With temperature
The classifier-guided diffusion model samples from $p(x \mid y)$, which is concentrated around the maximum a posteriori estimate $\arg\max_x p(x \mid y)$. If we want to force the model to move towards the maximum likelihood estimate $\arg\max_x p(y \mid x)$, we can use
$$p_\gamma(x \mid y) \propto p(y \mid x)^\gamma\, p(x),$$
where $\gamma > 0$ is often called the guidance scale; larger $\gamma$ pushes samples further towards the maximum likelihood estimate.
This can be done simply by SGLD (stochastic gradient Langevin dynamics) with
$$\nabla_x \ln p_\gamma(x \mid y) = \nabla_x \ln p(x) + \gamma\, \nabla_x \ln p(y \mid x).$$
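In score form, classifier guidance therefore only requires adding the scaled gradient of a classifier's log-probability to the unconditional score. A minimal sketch, assuming score_model(x, t) estimates the unconditional score and classifier_log_prob(x, t, y) is a classifier trained on noisy inputs (both placeholders):

```python
import torch

def guided_score(score_model, classifier_log_prob, x, t, y, gamma=3.0):
    """Classifier-guided score: grad log p(x_t) + gamma * grad log p(y | x_t)."""
    x = x.detach().requires_grad_(True)
    log_p_y = classifier_log_prob(x, t, y).sum()         # sum over the batch before differentiating
    grad_cls = torch.autograd.grad(log_p_y, x)[0]
    return score_model(x, t).detach() + gamma * grad_cls
```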
Classifier-free guidance (CFG)
If we do not have a classifier $p(y \mid x)$, we could still extract one out of the image model itself:[17]
$$\nabla_x \ln p_\gamma(x \mid y) = (1-\gamma)\, \nabla_x \ln p(x) + \gamma\, \nabla_x \ln p(x \mid y).$$
Such a model is usually trained by randomly replacing the conditioning $y$ with a null token during training, so that the same network estimates both the conditional score $\nabla_x \ln p(x \mid y)$ and the unconditional score $\nabla_x \ln p(x)$.
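In terms of the noise-prediction network, classifier-free guidance is a single line: the conditional and unconditional predictions are combined with a guidance scale. A minimal sketch (the null-conditioning convention and the default scale are illustrative assumptions):

```python
def cfg_epsilon(eps_model, x_t, t, cond, guidance_scale=7.5):
    """Classifier-free guidance on the predicted noise.

    Assumes eps_model was trained with the conditioning randomly dropped (passed as None here),
    so it can predict both the conditional and the unconditional noise.
    """
    eps_cond = eps_model(x_t, t, cond)
    eps_uncond = eps_model(x_t, t, None)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```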
Samplers
Given a diffusion model, one may regard it either as a continuous process and sample from it by integrating an SDE, or regard it as a discrete process and sample from it by iterating the discrete steps. The choice of the "noise schedule" can also affect the quality of samples. In the DDPM perspective, one can use the DDPM itself (with noise), or DDIM (with an adjustable amount of noise). The case where one adds noise is sometimes called ancestral sampling.[18] One can interpolate between noise and no noise. The amount of noise is denoted $\eta$ ("eta value") in the DDIM paper, with $\eta = 0$ denoting no noise (as in deterministic DDIM), and $\eta = 1$ denoting full noise (as in DDPM).
In the perspective of SDE, one can use any of the numerical integration methods, such as Euler–Maruyama method, Heun's method, linear multistep methods, etc. Just as in the discrete case, one can add an adjustable amount of noise during the integration.
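For instance, Euler–Maruyama integration of the reverse-time SDE of the variance-preserving process looks as follows; score_model and beta are placeholders for the learned score and the noise schedule:

```python
import torch

@torch.no_grad()
def reverse_sde_euler_maruyama(score_model, beta, x, n_steps=1000):
    """Integrate dx = [-beta(t)/2 * x - beta(t) * score(x, t)] dt + sqrt(beta(t)) dW backwards from t=1 to t=0."""
    dt = 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = i / n_steps
        drift = -0.5 * beta(t) * x - beta(t) * score_model(x, t)
        x = x - drift * dt + (beta(t) * dt) ** 0.5 * torch.randn_like(x)   # the injected noise can be reduced or omitted
    return x
```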
A survey and comparison of samplers in the context of image generation is given by Karras et al.[19]
Choice of architecture
Diffusion model
For generating images by DDPM, we need a neural network that takes a time $t$ and a noisy image $x_t$, and predicts the noise $\epsilon_\theta(x_t, t)$ from it. Since predicting the noise is the same as predicting the denoised image and then subtracting it (suitably scaled) from $x_t$, denoising architectures tend to work well. For example, the most common architecture is U-Net, which is also good at denoising images.[20]
For non-image data, we can use other architectures. For example, the Human Motion Diffusion Model[21] models human motion trajectories by DDPM. Each human motion trajectory is a sequence of poses, represented by either joint rotations or positions. It uses a Transformer network to generate a less noisy trajectory out of a noisy one.
Conditioning
The base diffusion model can only generate unconditionally from the whole distribution. For example, a diffusion model learned on ImageNet would generate images that look like a random image from ImageNet. To generate images from just one category, one would need to impose the condition. Whatever condition one wants to impose, one needs to first convert the conditioning into a vector of floating point numbers, then feed it into the underlying diffusion model neural network. However, one has freedom in choosing how to convert the conditioning into a vector.
Stable Diffusion, for example, imposes conditioning in the form of cross-attention mechanism, where the query is an intermediate representation of the image in the U-Net, and both key and value are the conditioning vectors.[22] The conditioning can be selectively applied to only parts of an image, and new kinds of conditionings can be finetuned upon the base model, as used in ControlNet.[22]
As a particularly simple example, consider image inpainting. The conditions are $y$, the reference image, and $m$, the inpainting mask (here taken to be 1 on the pixels to keep from the reference and 0 on the region to be generated). The conditioning is imposed at each step of the backward diffusion process, by first sampling $y_t$, a noisy version of $y$ at the same noise level as $x_t$, then replacing $x_t$ with $(1 - m) \odot x_t + m \odot y_t$, where $\odot$ means elementwise multiplication.[23]
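A sketch of this replacement step, using the mask convention above (names are illustrative, not taken from the cited paper):

```python
import torch

def inpaint_condition(x_t, y, m, t, alpha_bars):
    """Impose the known pixels of reference image y on the sample x_t at noise level t."""
    ab = alpha_bars[t]
    y_t = ab.sqrt() * y + (1 - ab).sqrt() * torch.randn_like(y)   # noisy version of the reference image
    return (1 - m) * x_t + m * y_t                                # keep known pixels from y_t, generated ones from x_t
```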
Conditioning is not limited to just generating images from a specific category, or according to a specific caption (as in text-to-image). For example, the Human Motion Diffusion Model[21] demonstrated generating human motion conditioned on an audio clip of human walking (allowing syncing motion to a soundtrack), on a video of human running, or on a text description of human motion.
Upscaling
As generating an image takes a long time, one can try to generate a small image by a base diffusion model, then upscale it by other models. Upscaling can be done by GAN,[24] Transformer,[25] or signal processing methods like Lanczos resampling.
Diffusion models themselves can be used to perform upscaling. A cascading diffusion model stacks multiple diffusion models one after another, in the style of Progressive GAN. The lowest level is a standard diffusion model that generates a 32x32 image; the image is then upscaled by a diffusion model specifically trained for upscaling, and the process repeats.[20]
Examples
This section collects some notable diffusion models, and briefly describes their architecture.
OpenAI
The DALL-E series by OpenAI are text-conditional models for image generation; the later members of the series use diffusion models.
The first version of DALL-E (2021) is not actually a diffusion model. Instead, it uses a Transformer that generates a sequence of tokens, which is then converted to an image by the decoder of a discrete VAE. Released with DALL-E was CLIP, which was used by DALL-E to rank generated images according to how closely the image fits the text.
GLIDE (2022-03)[26] is a 3.5-billion-parameter diffusion model, and a small version was released publicly.[4] Soon after, DALL-E 2 was released (2022-04).[27] DALL-E 2 is a 3.5-billion-parameter cascaded diffusion model that generates images from text by "inverting the CLIP image encoder", a technique its authors termed "unCLIP".
Stability AI
Stable Diffusion (2022-08), released by Stability AI, consists of a latent diffusion model (860 million parameters), a VAE, and a text encoder. The diffusion model is a U-Net, with cross-attention blocks to allow for conditional image generation.[28][15]
Others
Google Imagen[29] and Imagen Video[30] are two cascaded diffusion models for generating images and videos, respectively.[31] They use T5-XXL, a Transformer-based language model, to encode text for text-conditional generation.
Make-a-video by Meta AI[32] generates videos from text.
DreamFusion[33][34] generates 3D models from text.
Further reading
- Guidance: a cheat code for diffusion models. Overview of classifier guidance and classifier-free guidance, light on mathematical details.
- Mathematical details omitted in the article.
- "Power of Diffusion Models". AstraBlog. 2022-09-25. Retrieved 2023-09-25.
- Weng, Lilian (2021-07-11). "What are Diffusion Models?". lilianweng.github.io. Retrieved 2023-09-25.
References
- ^ Chang, Ziyi; Koulieris, George Alex; Shum, Hubert P. H. (2023). "On the Design Fundamentals of Diffusion Models: A Survey". arXiv:2306.04542 [cs.LG].
- ^ a b Song, Yang; Sohl-Dickstein, Jascha; Kingma, Diederik P.; Kumar, Abhishek; Ermon, Stefano; Poole, Ben (2021-02-10). "Score-Based Generative Modeling through Stochastic Differential Equations". arXiv:2011.13456 [cs.LG].
- ^ Gu, Shuyang; Chen, Dong; Bao, Jianmin; Wen, Fang; Zhang, Bo; Chen, Dongdong; Yuan, Lu; Guo, Baining (2021). "Vector Quantized Diffusion Model for Text-to-Image Synthesis". arXiv:2111.14822 [cs.CV].
- ^ a b GLIDE, OpenAI, 2023-09-22, retrieved 2023-09-24
- ^ a b Ho, Jonathan; Jain, Ajay; Abbeel, Pieter (2020). "Denoising Diffusion Probabilistic Models". Advances in Neural Information Processing Systems. Curran Associates, Inc. 33: 6840–6851.
- ^ Croitoru, Florinel-Alin; Hondru, Vlad; Ionescu, Radu Tudor; Shah, Mubarak (2023). "Diffusion Models in Vision: A Survey". IEEE Transactions on Pattern Analysis and Machine Intelligence. 45 (9): 10850–10869. arXiv:2209.04747. doi:10.1109/TPAMI.2023.3261988. PMID 37030794. S2CID 252199918.
- ^ Sohl-Dickstein, Jascha; Weiss, Eric; Maheswaranathan, Niru; Ganguli, Surya (2015-06-01). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" (PDF). Proceedings of the 32nd International Conference on Machine Learning. PMLR. 37: 2256–2265.
- ^ Weng, Lilian (2021-07-11). "What are Diffusion Models?". lilianweng.github.io. Retrieved 2023-09-24.
- ^ "Generative Modeling by Estimating Gradients of the Data Distribution | Yang Song". yang-song.net. Retrieved 2023-09-24.
- ^ Song, Yang; Sohl-Dickstein, Jascha; Kingma, Diederik P.; Kumar, Abhishek; Ermon, Stefano; Poole, Ben (2021-02-10). "Score-Based Generative Modeling through Stochastic Differential Equations". arXiv:2011.13456 [cs.LG].
- ^ "Sliced Score Matching: A Scalable Approach to Density and Score Estimation | Yang Song". yang-song.net. Retrieved 2023-09-24.
- ^ Anderson, Brian D.O. (May 1982). "Reverse-time diffusion equation models". Stochastic Processes and Their Applications. 12 (3): 313–326. doi:10.1016/0304-4149(82)90051-5. ISSN 0304-4149.
- ^ Luo, Calvin (2022). "Understanding Diffusion Models: A Unified Perspective". arXiv:2208.11970.
- ^ Song, Jiaming; Meng, Chenlin; Ermon, Stefano (2020). "Denoising Diffusion Implicit Models". arXiv:2010.02502.
- ^ a b Rombach, Robin; Blattmann, Andreas; Lorenz, Dominik; Esser, Patrick; Ommer, Björn (2022). "High-Resolution Image Synthesis With Latent Diffusion Models": 10684–10695. arXiv:2112.10752.
- ^ Dhariwal, Prafulla; Nichol, Alex (2021-06-01). "Diffusion Models Beat GANs on Image Synthesis". arXiv:2105.05233 [cs.LG].
- ^ Ho, Jonathan; Salimans, Tim (2022-07-25). "Classifier-Free Diffusion Guidance". arXiv:2207.12598 [cs.LG].
- ^ Yang, Ling; Zhang, Zhilong; Song, Yang; Hong, Shenda; Xu, Runsheng; Zhao, Yue; Zhang, Wentao; Cui, Bin; Yang, Ming-Hsuan (2022). "Diffusion Models: A Comprehensive Survey of Methods and Applications". arXiv:2209.00796.
- ^ Karras, Tero; Aittala, Miika; Aila, Timo; Laine, Samuli (2022). "Elucidating the Design Space of Diffusion-Based Generative Models". arXiv:2206.00364.
- ^ a b Ho, Jonathan; Saharia, Chitwan; Chan, William; Fleet, David J.; Norouzi, Mohammad; Salimans, Tim (2022-01-01). "Cascaded diffusion models for high fidelity image generation". The Journal of Machine Learning Research. 23 (1): 47:2249–47:2281. arXiv:2106.15282. ISSN 1532-4435.
- ^ a b Tevet, Guy; Raab, Sigal; Gordon, Brian; Shafir, Yonatan; Cohen-Or, Daniel; Bermano, Amit H. (2022). "Human Motion Diffusion Model". arXiv:2209.14916.
- ^ a b Zhang, Lvmin; Rao, Anyi; Agrawala, Maneesh (2023). "Adding Conditional Control to Text-to-Image Diffusion Models". arXiv:2302.05543.
- ^ Lugmayr, Andreas; Danelljan, Martin; Romero, Andres; Yu, Fisher; Timofte, Radu; Van Gool, Luc (2022). "RePaint: Inpainting Using Denoising Diffusion Probabilistic Models": 11461–11471.
- ^ Wang, Xintao; Xie, Liangbin; Dong, Chao; Shan, Ying (2021). "Real-ESRGAN: Training Real-World Blind Super-Resolution With Pure Synthetic Data": 1905–1914. arXiv:2107.10833.
- ^ Liang, Jingyun; Cao, Jiezhang; Sun, Guolei; Zhang, Kai; Van Gool, Luc; Timofte, Radu (2021). "SwinIR: Image Restoration Using Swin Transformer": 1833–1844. arXiv:2108.10257.
- ^ Nichol, Alex; Dhariwal, Prafulla; Ramesh, Aditya; Shyam, Pranav; Mishkin, Pamela; McGrew, Bob; Sutskever, Ilya; Chen, Mark (2022-03-08). "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models". arXiv:2112.10741 [cs.CV].
- ^ Ramesh, Aditya; Dhariwal, Prafulla; Nichol, Alex; Chu, Casey; Chen, Mark (2022-04-12). "Hierarchical Text-Conditional Image Generation with CLIP Latents". arXiv:2204.06125 [cs.CV].
- ^ Alammar, Jay. "The Illustrated Stable Diffusion". jalammar.github.io. Retrieved 2022-10-31.
- ^ Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily; Ghasemipour, Seyed Kamyar Seyed; Ayan, Burcu Karagol; Mahdavi, S. Sara; Lopes, Rapha Gontijo; Salimans, Tim; Ho, Jonathan; Fleet, David J.; Norouzi, Mohammad (2022-05-23). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding". arXiv:2205.11487 [cs.CV].
- ^ Ho, Jonathan; Chan, William; Saharia, Chitwan; Whang, Jay; Gao, Ruiqi; Gritsenko, Alexey; Kingma, Diederik P.; Poole, Ben; Norouzi, Mohammad; Fleet, David J.; Salimans, Tim (2022). "Imagen Video: High Definition Video Generation with Diffusion Models". arXiv:2210.02303.
- ^ "Imagen: Text-to-Image Diffusion Models". imagen.research.google. Retrieved 2023-09-24.
- ^ Singer, Uriel; Polyak, Adam; Hayes, Thomas; Yin, Xi; An, Jie; Zhang, Songyang; Hu, Qiyuan; Yang, Harry; Ashual, Oron; Gafni, Oran; Parikh, Devi; Gupta, Sonal; Taigman, Yaniv (2022). "Make-A-Video: Text-to-Video Generation without Text-Video Data". arXiv:2209.14792.
- ^ Poole, Ben; Jain, Ajay; Barron, Jonathan T.; Mildenhall, Ben (2022). "DreamFusion: Text-to-3D using 2D Diffusion". arXiv:2209.14988.
- ^ Poole, Ben; Jain, Ajay; Barron, Jonathan T.; Mildenhall, Ben (2022). DreamFusion: Text-to-3D using 2D Diffusion. arXiv:2209.14988. Retrieved 2023-09-24.