Stable Diffusion

From Wikipedia, the free encyclopedia
Stable Diffusion
Developer(s): CompVis group, LMU Munich; Runway
Initial release: August 22, 2022
Stable release: 2.1 (model)[1] / December 7, 2022
Written in: Python
Operating system: Any that support CUDA kernels
Type: Text-to-image model
License: Creative ML OpenRAIL-M

Stable Diffusion is a deep learning, text-to-image model released in 2022. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt.[2]

Stable Diffusion is a latent diffusion model, a kind of deep generative neural network developed by the CompVis group at LMU Munich[3] and Runway[4]. The model has been released by a collaboration of CompVis LMU, Runway, and Stability AI with support from EleutherAI and LAION.[5][6][7]

Stable Diffusion's code and model weights have been released publicly,[8] and it can run on most consumer hardware equipped with a modest GPU with at least 8 GB of VRAM. This marked a departure from previous proprietary text-to-image models such as DALL-E and Midjourney, which were accessible only via cloud services.[9][10]


Diagram of the latent diffusion architecture used by Stable Diffusion
The denoising process used by Stable Diffusion. The model generates images by iteratively denoising random noise until a configured number of steps have been reached, guided by the CLIP text encoder pretrained on concepts along with the attention mechanism, resulting in the desired image depicting a representation of the trained concept.


Stable Diffusion uses a kind of diffusion model (DM) called a latent diffusion model (LDM).[6] Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise from training images, which can be thought of as a sequence of denoising autoencoders. Stable Diffusion consists of three parts: the variational autoencoder (VAE), the U-Net, and an optional text encoder.[11] The VAE encoder compresses the image from pixel space to a lower-dimensional latent space, capturing a more fundamental semantic meaning of the image.[12] During forward diffusion, Gaussian noise is iteratively applied to the compressed latent representation.[11] The U-Net block, composed of a ResNet backbone, reverses this process by denoising the output of forward diffusion to recover a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space.[11]

The denoising step can be flexibly conditioned on a string of text, an image, or another modality. The encoded conditioning data is exposed to the denoising U-Net via a cross-attention mechanism.[11] For conditioning on text, the fixed, pretrained CLIP ViT-L/14 text encoder is used to transform text prompts into an embedding space.[6] Researchers point to increased computational efficiency for training and generation as an advantage of LDMs.[13][14]
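The relationship between forward noising and denoising described above can be illustrated with a minimal sketch. It assumes a linear noise schedule, the commonly used closed-form expression for a noised sample, and a 64×64×4 latent for a 512×512 RGB image; none of these values are stated in this article, and all function names are hypothetical. A short list of numbers stands in for a VAE-encoded latent, and the exact noise is reused in place of a trained U-Net's prediction.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(latent, alpha_bar, rng):
    """Forward diffusion: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
        for x, e in zip(latent, noise)
    ]
    return noisy, noise


def remove_noise(noisy, predicted_noise, alpha_bar):
    """Invert add_noise given a noise estimate (the quantity a U-Net predicts)."""
    return [
        (x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
        for x, e in zip(noisy, predicted_noise)
    ]


# Size comparison: pixel space versus the assumed latent space.
pixel_values = 512 * 512 * 3
latent_values = 64 * 64 * 4
compression_ratio = pixel_values / latent_values

alpha_bars = make_alpha_bars(1000)
rng = random.Random(0)
latent = [1.0, -0.5, 0.25, 2.0]  # toy stand-in for a VAE-encoded latent
noisy, noise = add_noise(latent, alpha_bars[500], rng)
recovered = remove_noise(noisy, noise, alpha_bars[500])
```

Under these assumptions the latent holds 48 times fewer values than the pixel image, which is the source of the efficiency gain attributed to LDMs. In a real model the noise estimate is imperfect, so denoising proceeds over many small steps rather than a single inversion.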

Training data

Stable Diffusion was trained on pairs of images and captions taken from LAION-5B, a publicly available dataset derived from Common Crawl data scraped from the web. Its 5 billion image-text pairs were classified by language and filtered into separate datasets by resolution, predicted likelihood of containing a watermark, and predicted "aesthetic" score (e.g. subjective visual quality).[15] The dataset was created by LAION, a German non-profit which receives funding from Stability AI.[15][16] The Stable Diffusion model was trained on three subsets of LAION-5B: laion2B-en, laion-high-resolution, and laion-aesthetics v2 5+.[15] A third-party analysis of the model's training data found that, in a sample of 12 million images taken from the wider dataset, approximately 47% of the images came from 100 different domains, with Pinterest accounting for 8.5% of the sample, followed by websites such as WordPress, Blogspot, Flickr, DeviantArt and Wikimedia Commons.[17][15]

Training procedures

The model was initially trained on the laion2B-en and laion-high-resolution subsets, with the last few rounds of training done on LAION-Aesthetics v2 5+, a subset of 600 million captioned images for which the LAION-Aesthetics Predictor V2 predicted that humans would, on average, give a score of at least 5 out of 10 when asked to rate how much they liked them.[18][15][19] The LAION-Aesthetics v2 5+ subset also excluded low-resolution images and images which LAION-5B-WatermarkDetection identified as carrying a watermark with greater than 80% probability.[15] Final rounds of training additionally dropped 10% of text conditioning to improve classifier-free diffusion guidance.[20]
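The filtering and conditioning-dropout steps described above might be sketched as follows. This is an illustration only: the record field names, the 512-pixel minimum side length, and the use of an empty string to represent dropped conditioning are assumptions, not details given in this article. Only the score threshold of 5, the 80% watermark probability, and the 10% drop rate come from the text.

```python
import random


def keep_for_aesthetics_subset(record, min_score=5.0,
                               max_watermark_prob=0.8, min_side=512):
    """Return True if an image-caption record passes all three filters."""
    return (
        record["aesthetic_score"] >= min_score
        and record["watermark_prob"] <= max_watermark_prob
        and min(record["width"], record["height"]) >= min_side
    )


def drop_captions(captions, drop_prob=0.1, seed=0):
    """Blank out roughly drop_prob of the captions to give unconditional samples."""
    rng = random.Random(seed)
    return ["" if rng.random() < drop_prob else caption for caption in captions]


# Hypothetical records: only the first one passes every filter.
records = [
    {"aesthetic_score": 6.2, "watermark_prob": 0.05, "width": 768, "height": 512},
    {"aesthetic_score": 4.1, "watermark_prob": 0.05, "width": 768, "height": 512},
    {"aesthetic_score": 6.2, "watermark_prob": 0.95, "width": 768, "height": 512},
    {"aesthetic_score": 6.2, "watermark_prob": 0.05, "width": 200, "height": 200},
]
kept = [r for r in records if keep_for_aesthetics_subset(r)]

captions = ["a photo of a cat"] * 10_000
dropped_fraction = drop_captions(captions).count("") / len(captions)
```

Training on a mix of captioned and uncaptioned samples lets a single network produce both conditional and unconditional predictions, which classifier-free guidance later combines at generation time.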

The model was trained using 256 Nvidia A100 GPUs on Amazon Web Services for a total of 150,000 GPU-hours, at a cost of $600,000.[21][22][23]
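As a rough check on these figures, the following arithmetic derives the implied wall-clock time and hourly rate. It assumes, purely for illustration, that all 256 GPUs ran concurrently for the whole training run.

```python
gpu_count = 256
total_gpu_hours = 150_000
total_cost_usd = 600_000

# Wall-clock duration if every GPU was busy for the entire run (assumption).
wall_clock_hours = total_gpu_hours / gpu_count
wall_clock_days = wall_clock_hours / 24

# Implied price of one GPU-hour.
cost_per_gpu_hour = total_cost_usd / total_gpu_hours

print(f"{wall_clock_hours:.1f} hours, about {wall_clock_days:.1f} days")
print(f"${cost_per_gpu_hour:.2f} per GPU-hour")
```

Under that assumption the run corresponds to roughly 586 hours, or a little over 24 days, at $4 per GPU-hour.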


Stable Diffusion has issues with degradation and inaccuracies in certain scenarios. Initial releases of the model were trained on a dataset consisting of 512×512 resolution images, so the quality of generated images noticeably degrades when user specifications deviate from this "expected" 512×512 resolution;[24] the version 2.0 update of the Stable Diffusion model later introduced the ability to natively generate images at 768×768 resolution.[25] Another challenge is generating human limbs, owing to the poor quality of limb data in the LAION database.[26] The model is insufficiently trained to understand human limbs and faces because the database lacks representative features, and prompting the model to generate such images can confound it.[27]

Accessibility for individual developers can also be a problem. To customize the model for new use cases that are not included in the dataset, such as generating anime characters ("waifu diffusion"),[28] new data and further training are required. Fine-tuned adaptations of Stable Diffusion created through additional retraining have been used for a variety of use cases, from medical imaging[29] to algorithmically generated music.[30] However, this fine-tuning process is sensitive to the quality of the new data; low-resolution images, or images whose resolution differs from that of the original data, can cause the model not only to fail to learn the new task but also to degrade in overall performance. Even when the model is additionally trained on high-quality images, it is difficult for individuals to run the training on consumer electronics. For example, the training process for waifu-diffusion requires a minimum of 30 GB of VRAM,[31] which exceeds the usual resources provided by consumer GPUs, such as Nvidia's GeForce 30 series, which has around 12 GB.[32]

The creators of Stable Diffusion acknowledge the potential for algorithmic bias, as the model was primarily trained on images with English descriptions.[22] As a result, generated images reinforce social biases and reflect a Western perspective; the creators note that the model lacks data from other communities and cultures. The model gives more accurate results for prompts written in English than for those written in other languages, with Western or white cultures often being the default representation.[22]

End-user fine tuning

To address the limitations of the model's initial training, end-users may opt to implement additional training to fine-tune generation outputs to match more specific use cases. There are three methods by which user-accessible fine-tuning can be applied to a Stable Diffusion model checkpoint:

  • An "embedding" can be trained from a collection of user-provided images, and allows the model to generate visually similar images whenever the name of the embedding is used within a generation prompt.[33] Embeddings are based on the "textual inversion" concept developed by researchers from Tel Aviv University in 2022 with support from Nvidia, where vector representations for specific tokens used by the model's text encoder are linked to new pseudo-words. Embeddings can be used to reduce biases within the original model, or mimic visual styles.[34]
  • A "hypernetwork" is a small pre-trained neural network that is applied to various points within a larger neural network. The term refers to a technique created by NovelAI developer Kurumuz in 2021, originally intended for text-generation transformer models. Hypernetworks steer results in a particular direction, allowing Stable Diffusion-based models to imitate the art style of specific artists, even if the artist is not recognised by the original model; they process the image by finding key areas of importance such as hair and eyes, and then patch these areas in secondary latent space.[35]
  • DreamBooth is a deep learning generation model developed by researchers from Google Research and Boston University in 2022 which can fine-tune the model to generate precise, personalised outputs that depict a specific subject, following training via a set of images which depict the subject.[36]
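The core idea behind the first method in the list above, learning a vector for a new pseudo-word while everything else stays frozen, can be shown with a deliberately simplified toy. It is not the textual inversion training procedure itself: the frozen "model" here is a hypothetical element-wise scaling, the loss is a plain mean squared error against a made-up target, and the token name and all numbers are invented for illustration.

```python
def train_pseudo_word(embedding_table, new_token, frozen_weights, target,
                      steps=200, learning_rate=0.1):
    """Learn a vector for new_token while every other entry stays frozen."""
    table = dict(embedding_table)  # copy so the caller's table is untouched
    vector = [0.0] * len(target)
    for _ in range(steps):
        # Frozen toy "model": element-wise scaling of the embedding.
        prediction = [w * v for w, v in zip(frozen_weights, vector)]
        # Gradient of the mean squared error with respect to the vector only.
        gradient = [
            2.0 * w * (p - t) / len(target)
            for w, p, t in zip(frozen_weights, prediction, target)
        ]
        vector = [v - learning_rate * g for v, g in zip(vector, gradient)]
    table[new_token] = vector
    return table


base_table = {"cat": [0.2, 0.7], "dog": [0.9, 0.1]}
tuned_table = train_pseudo_word(
    base_table, "<my-style>", frozen_weights=[1.0, 2.0], target=[1.0, 4.0]
)
```

Only the entry for the new token is optimised; the existing entries and the frozen weights are never updated, which is why such an embedding can be distributed as a small file separate from the model checkpoint.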


The Stable Diffusion model supports the ability to generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output.[6] Existing images can be re-drawn by the model to incorporate new elements described by a text prompt (a process known as "guided image synthesis"[37]) through its diffusion-denoising mechanism.[6] In addition, the model also allows the use of prompts to partially alter existing images via inpainting and outpainting, when used with an appropriate user interface that supports such features, of which numerous different open source implementations exist.[38]

Stable Diffusion is recommended to be run with 10 GB or more of VRAM; however, users with less VRAM may opt to load the weights in float16 precision instead of the default float32, trading some model performance for lower VRAM usage.[24]
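The memory saving from half precision follows directly from the size of each stored value. The sketch below uses a hypothetical round figure of one billion parameters, which is not taken from this article, and counts only the weights; activations and other buffers also consume VRAM.

```python
BYTES_PER_VALUE = {"float32": 4, "float16": 2}


def weight_memory_gib(num_parameters, dtype):
    """Memory needed to hold the weights alone, in GiB."""
    return num_parameters * BYTES_PER_VALUE[dtype] / 2**30


hypothetical_parameters = 1_000_000_000  # illustrative, not the real count
fp32_gib = weight_memory_gib(hypothetical_parameters, "float32")
fp16_gib = weight_memory_gib(hypothetical_parameters, "float16")

print(f"float32: {fp32_gib:.2f} GiB, float16: {fp16_gib:.2f} GiB")
```

Whatever the true parameter count, storing weights in float16 halves the memory they occupy compared with float32.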

Text to image generation

Demonstration of the effect of negative prompts on image generation
  • Top: no negative prompt
  • Centre: "green trees"
  • Bottom: "round stones, round rocks"

The text to image sampling script within Stable Diffusion, known as "txt2img", consumes a text prompt in addition to assorted option parameters covering sampling types, output image dimensions, and seed values. The script outputs an image file based on the model's interpretation of the prompt.[6] Generated images are tagged with an invisible digital watermark to allow users to identify an image as generated by Stable Diffusion,[6] although this watermark loses its efficacy if the image is resized or rotated.[39]

Each txt2img generation involves a specific seed value which affects the output image. Users may opt to randomize the seed in order to explore different generated outputs, or use the same seed to obtain the same image output as a previously generated image.[24] Users are also able to adjust the number of inference steps for the sampler; a higher value takes longer, whereas a smaller value may result in visual defects.[24] Another configurable option, the classifier-free guidance scale value, allows the user to adjust how closely the output image adheres to the prompt.[20] More experimental use cases may opt for a lower scale value, while use cases aiming for more specific outputs may use a higher value.[24]
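Two of these options can be illustrated with a minimal sketch: a seed that makes the starting noise reproducible, and a guidance scale that blends an unconditional and a prompt-conditioned noise prediction in the manner of classifier-free guidance. The function names are hypothetical and short lists of numbers stand in for real model outputs.

```python
import random


def seeded_noise(seed, size):
    """Starting noise that is identical whenever the same seed is used."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def guided_prediction(unconditional, conditional, guidance_scale):
    """Classifier-free guidance: push the estimate towards the prompt."""
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(unconditional, conditional)
    ]


same_a = seeded_noise(42, 4)
same_b = seeded_noise(42, 4)
different = seeded_noise(43, 4)

# Toy predictions: scale 1.0 returns the conditional estimate unchanged,
# larger scales exaggerate the difference made by the prompt.
plain = guided_prediction([0.0, 0.0], [1.0, -1.0], 1.0)
strong = guided_prediction([0.0, 0.0], [1.0, -1.0], 7.5)
```

A scale of 0 ignores the prompt entirely, while increasing the scale moves the estimate further in the direction the prompt suggests.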

Additional txt2img features are provided by front-end implementations of Stable Diffusion, which allow users to modify the weight given to specific parts of the text prompt. Emphasis markers allow users to add or reduce emphasis on keywords by enclosing them in brackets.[40] An alternative method of adjusting the weight given to parts of the prompt is the "negative prompt". Negative prompts are a feature included in some front-end implementations, including Stability AI's own DreamStudio cloud service, and allow the user to specify prompts which the model should avoid during image generation. The specified prompts may be undesirable image features that would otherwise be present within image outputs due to the positive prompts provided by the user, or due to how the model was originally trained, with mangled human hands being a common example.[38][1]
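Emphasis markers of this kind can be parsed with a short routine such as the one below. The syntax is an assumption made for illustration, since front ends differ: here round brackets multiply a keyword's weight by 1.1 and square brackets divide it by 1.1, with nesting compounding the effect. Negative prompts are not covered by this sketch.

```python
def parse_emphasis(prompt, factor=1.1):
    """Split a prompt into (text, weight) pieces using bracket markers."""
    pieces = []
    depth_up = 0    # number of currently open "(" markers
    depth_down = 0  # number of currently open "[" markers
    buffer = ""

    def flush():
        nonlocal buffer
        if buffer:
            weight = factor ** (depth_up - depth_down)
            pieces.append((buffer, round(weight, 4)))
            buffer = ""

    for ch in prompt:
        if ch in "()[]":
            flush()  # text before a marker keeps the previous weight
            if ch == "(":
                depth_up += 1
            elif ch == ")":
                depth_up = max(depth_up - 1, 0)
            elif ch == "[":
                depth_down += 1
            else:
                depth_down = max(depth_down - 1, 0)
        else:
            buffer += ch
    flush()
    return pieces


weighted = parse_emphasis("a (red) [blurry] cat")
nested = parse_emphasis("((sky))")
```

A front end would then scale the text-encoder output for each piece by its weight before passing the conditioning to the model.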

Image modification

Stable Diffusion also includes another sampling script, "img2img", which consumes a text prompt, a path to an existing image, and a strength value between 0.0 and 1.0. The script outputs a new image based on the original image that also features elements provided within the text prompt. The strength value denotes the amount of noise added to the input image. A higher strength value produces more variation within the image but may produce an image that is not semantically consistent with the original image.[6]
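One way to interpret the strength value, used here as an assumption rather than a description of the script's exact code, is as the fraction of the denoising schedule that is actually run. The input image is noised to the matching level, and only the remaining steps are then denoised. The function name is hypothetical.

```python
def img2img_schedule(num_inference_steps, strength):
    """Return (start_step, steps_to_run) for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    start_step = num_inference_steps - steps_to_run
    return start_step, steps_to_run


half = img2img_schedule(50, 0.5)       # moderate change to the input
full = img2img_schedule(50, 1.0)       # input is fully replaced by noise
untouched = img2img_schedule(50, 0.0)  # no noise added, nothing to denoise
```

With a strength of 1.0 the whole schedule is run from pure noise, so little of the input survives; with a strength of 0.0 no steps are run and the input is returned essentially unchanged.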

The ability of img2img to add noise to the original image makes it potentially useful for data anonymization and data augmentation, in which the visual features of image data are changed and anonymized.[41] The same process may also be useful for image upscaling, in which the resolution of an image is increased, with more detail potentially being added to the image.[41] Additionally, Stable Diffusion has been experimented with as a tool for image compression. Compared to JPEG and WebP, the recent methods used for image compression in Stable Diffusion face limitations in preserving small text and faces.[42]

Additional use cases for image modification via img2img are offered by numerous front-end implementations of the Stable Diffusion model. Inpainting involves selectively modifying a portion of an existing image delineated by a user-provided layer mask; the masked space is filled with newly generated content based on the provided prompt.[38] A dedicated model specifically fine-tuned for inpainting use cases was created by Stability AI alongside the release of Stable Diffusion 2.0.[25] Conversely, outpainting extends an image beyond its original dimensions, filling the previously empty space with content generated based on the provided prompt.[38]
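The role of the mask in both operations can be sketched with plain nested lists standing in for images. This is a simplified illustration with hypothetical function names: real implementations work on latents or pixel arrays and generate the new content with the model, whereas here the "generated" values are supplied by hand.

```python
def blend_with_mask(original, generated, mask):
    """Inpainting: take generated values where mask is 1, original elsewhere."""
    return [
        [g if m else o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


def outpaint_canvas(image, pad, fill=0):
    """Outpainting: enlarge the canvas and mark the new border for generation."""
    height, width = len(image), len(image[0])
    canvas, mask = [], []
    for y in range(height + 2 * pad):
        inside_y = pad <= y < pad + height
        canvas_row, mask_row = [], []
        for x in range(width + 2 * pad):
            inside = inside_y and pad <= x < pad + width
            canvas_row.append(image[y - pad][x - pad] if inside else fill)
            mask_row.append(0 if inside else 1)
        canvas.append(canvas_row)
        mask.append(mask_row)
    return canvas, mask


inpainted = blend_with_mask(
    original=[[1, 1], [1, 1]],
    generated=[[9, 9], [9, 9]],
    mask=[[0, 1], [0, 0]],
)
canvas, border_mask = outpaint_canvas([[5]], pad=1)
```

Seen this way, outpainting is inpainting on an enlarged canvas whose mask covers exactly the newly added border.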

A depth-guided model, named "depth2img", was introduced with the release of Stable Diffusion 2.0 on November 24, 2022; this model infers the depth of the provided input image, and generates a new output image based on both the text prompt and the depth information, which allows the coherence and depth of the original input image to be maintained in the generated output.[25]

Usage and controversy

Stable Diffusion claims no rights on generated images and freely gives users the rights of usage to any generated images from the model provided that the image content is not illegal or harmful to individuals. The freedom provided to users over image usage has caused controversy over the ethics of ownership, as Stable Diffusion and other generative models are trained from copyrighted images without the owner’s consent.[43]

As visual styles and compositions are not subject to copyright, it is often interpreted that users of Stable Diffusion who generate images of artworks should not be considered to be infringing upon the copyright of visually similar works.[44] However, individuals depicted in generated images may be protected by personality rights if their likeness is used,[44] and intellectual property such as recognizable brand logos remains protected by copyright. Nonetheless, visual artists have expressed concern that widespread usage of image synthesis software such as Stable Diffusion may eventually lead to human artists, along with photographers, models, cinematographers, and actors, gradually losing commercial viability against AI-based competitors.[45]

Stable Diffusion is notably more permissive in the types of content users may generate, such as violent or sexually explicit imagery, in comparison to other commercial products based on generative AI.[46] Addressing the concerns that the model may be used for abusive purposes, the CEO of Stability AI, Emad Mostaque, explains that "[it is] peoples' responsibility as to whether they are ethical, moral, and legal in how they operate this technology",[10] and that putting the capabilities of Stable Diffusion into the hands of the public would result in the technology providing a net benefit, in spite of the potential negative consequences.[10] In addition, Mostaque argues that the intention behind the open availability of Stable Diffusion is to end the control and dominance over such technologies by corporations, which have previously only developed closed AI systems for image synthesis.[10][46] This is reflected by the fact that any restrictions Stability AI places on the content that users may generate can easily be bypassed due to the availability of the source code.[43]


In January 2023, three artists, Sarah Andersen, Kelly McKernan, and Karla Ortiz, filed a copyright infringement lawsuit against Stability AI, Midjourney, and DeviantArt, claiming that these companies have infringed the rights of millions of artists by training AI tools on five billion images scraped from the web without the consent of the original artists.[47] The same month, Stability AI was also sued by Getty Images for using its images in the training data.[48]


Unlike models such as DALL-E, Stable Diffusion makes its source code available,[49][6] along with the model (pretrained weights). It applies the Creative ML OpenRAIL-M license, a form of Responsible AI License (RAIL), to the model (M).[50] The license prohibits certain use cases, including crime, libel, harassment, doxing, "exploiting ... minors", giving medical advice, automatically creating legal obligations, producing legal evidence, and "discriminating against or harming individuals or groups based on ... social behavior or ... personal or personality characteristics ... [or] legally protected characteristics or categories".[51][52] The user owns the rights to their generated output images, and is free to use them commercially.[53]

References


  1. "Stable Diffusion v2.1 and DreamStudio Updates 7-Dec 22". Archived from the original on December 10, 2022.
  2. "Diffuse The Rest - a Hugging Face Space by huggingface". Archived from the original on 2022-09-05. Retrieved 2022-09-05.
  3. Rombach; Blattmann; Lorenz; Esser; Ommer (June 2022). High-Resolution Image Synthesis with Latent Diffusion Models (PDF). International Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, LA. pp. 10684–10695. arXiv:2112.10752.
  4. "High-Resolution Image Synthesis with Latent Diffusion Models | Runway Research". Runway. Retrieved 2023-03-19.
  5. "Stable Diffusion Launch Announcement". Stability.Ai. Archived from the original on 2022-09-05. Retrieved 2022-09-06.
  6. "Stable Diffusion Repository on GitHub". CompVis - Machine Vision and Learning Research Group, LMU Munich. 17 September 2022. Retrieved 17 September 2022.
  7. "Revolutionizing image generation by AI: Turning text into images". LMU Munich. Retrieved 17 September 2022.
  8. Stable Diffusion, CompVis - Machine Vision and Learning LMU Munich, 2022-11-04, retrieved 2022-11-04.
  9. "The new killer app: Creating AI art will absolutely crush your PC". PCWorld. Archived from the original on 2022-08-31. Retrieved 2022-08-31.
  10. Vincent, James (15 September 2022). "Anyone can use this AI art generator — that's the risk". The Verge.
  11. Alammar, Jay. "The Illustrated Stable Diffusion". Retrieved 2022-10-31.
  12. "High-Resolution Image Synthesis with Latent Diffusion Models". Machine Vision & Learning Group. Retrieved 2022-11-04.
  13. "Stable Diffusion launch announcement". Stability.Ai. Retrieved 2022-11-02.
  14. Rombach; Blattmann; Lorenz; Esser; Ommer (June 2022). High-Resolution Image Synthesis with Latent Diffusion Models (PDF). International Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, LA. pp. 10684–10695. arXiv:2112.10752.
  15. Baio, Andy (2022-08-30). "Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion's Image Generator". Retrieved 2022-11-02.
  16. "This artist is dominating AI-generated art. And he's not happy about it". MIT Technology Review. Retrieved 2022-11-02.
  17. Ivanovs, Alex (2022-09-08). "Stable Diffusion: Tutorials, Resources, and Tools". Stack Diary. Retrieved 2022-11-02.
  18. Schuhmann, Christoph (2022-11-02), CLIP+MLP Aesthetic Score Predictor, retrieved 2022-11-02.
  19. "LAION-Aesthetics | LAION". Archived from the original on 2022-08-26. Retrieved 2022-09-02.
  20. Ho, Jonathan; Salimans, Tim (2022-07-25). "Classifier-Free Diffusion Guidance". arXiv:2207.12598 [cs.LG].
  21. Mostaque, Emad (August 28, 2022). "Cost of construction". Twitter. Archived from the original on 2022-09-06. Retrieved 2022-09-06.
  22. "CompVis/stable-diffusion-v1-4 · Hugging Face". Retrieved 2022-11-02.
  23. Wiggers, Kyle (2022-08-12). "A startup wants to democratize the tech behind DALL-E 2, consequences be damned". TechCrunch. Retrieved 2022-11-02.
  24. "Stable Diffusion with 🧨 Diffusers". Retrieved 2022-10-31.
  25. "Stable Diffusion 2.0 Release". Archived from the original on December 10, 2022.
  26. "LAION". Retrieved 2022-10-31.
  27. "Generating images with Stable Diffusion". Paperspace Blog. 2022-08-24. Retrieved 2022-10-31.
  28. "hakurei/waifu-diffusion · Hugging Face". Retrieved 2022-10-31.
  29. Chambon, Pierre; Bluethgen, Christian; Langlotz, Curtis P.; Chaudhari, Akshay (2022-10-09). "Adapting Pretrained Vision-Language Foundational Models to Medical Imaging Domains". arXiv:2210.04133 [cs.CV].
  30. Seth Forsgren; Hayk Martiros. "Riffusion - Stable diffusion for real-time music generation". Riffusion. Archived from the original on December 16, 2022.
  31. Mercurio, Anthony (2022-10-31), Waifu Diffusion, retrieved 2022-10-31.
  32. Smith, Ryan. "NVIDIA Quietly Launches GeForce RTX 3080 12GB: More VRAM, More Power, More Money". Retrieved 2022-10-31.
  33. Dave James (October 28, 2022). "I thrashed the RTX 4090 for 8 hours straight training Stable Diffusion to paint like my uncle Hermann". PC Gamer. Archived from the original on November 9, 2022.
  34. Gal, Rinon; Alaluf, Yuval; Atzmon, Yuval; Patashnik, Or; Bermano, Amit H.; Chechik, Gal; Cohen-Or, Daniel (2022-08-02). "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion". arXiv:2208.01618 [cs.CV].
  35. "NovelAI Improvements on Stable Diffusion". NovelAI. October 11, 2022. Archived from the original on October 27, 2022.
  36. Yuki Yamashita (September 1, 2022). "愛犬の合成画像を生成できるAI 文章で指示するだけでコスプレ 米Googleが開発". ITmedia Inc. (in Japanese). Archived from the original on August 31, 2022.
  37. Meng, Chenlin; He, Yutong; Song, Yang; Song, Jiaming; Wu, Jiajun; Zhu, Jun-Yan; Ermon, Stefano (August 2, 2021). "SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations". arXiv:2108.01073 [cs.CV].
  38. "Stable Diffusion web UI". GitHub. 10 November 2022.
  39. invisible-watermark, Shield Mountain, 2022-11-02, retrieved 2022-11-02.
  40. "stable-diffusion-tools/emphasis at master · JohannesGaessler/stable-diffusion-tools". GitHub. Retrieved 2022-11-02.
  41. Luzi, Lorenzo; Siahkoohi, Ali; Mayer, Paul M.; Casco-Rodriguez, Josue; Baraniuk, Richard (2022-10-21). "Boomerang: Local sampling on image manifolds using diffusion models". arXiv:2210.12100 [cs.CV].
  42. Bühlmann, Matthias (2022-09-28). "Stable Diffusion Based Image Compression". Medium. Retrieved 2022-11-02.
  43. Cai, Kenrick. "Startup Behind AI Image Generator Stable Diffusion Is In Talks To Raise At A Valuation Up To $1 Billion". Forbes. Retrieved 2022-10-31.
  44. "高性能画像生成AI「Stable Diffusion」無料リリース。「kawaii」までも理解し創造する画像生成AI". Automaton Media (in Japanese). August 24, 2022.
  45. Heikkilä, Melissa (16 September 2022). "This artist is dominating AI-generated art. And he's not happy about it". MIT Technology Review.
  46. Ryo Shimizu (August 26, 2022). "Midjourneyを超えた? 無料の作画AI「 #StableDiffusion 」が「AIを民主化した」と断言できる理由". Business Insider Japan (in Japanese).
  47. Vincent, James (16 January 2023). "AI art tools Stable Diffusion and Midjourney targeted with copyright lawsuit". The Verge.
  48. Korn, Jennifer (2023-01-17). "Getty Images suing the makers of popular AI art tool for allegedly stealing photos". CNN. Retrieved 2023-01-22.
  49. "Stable Diffusion Public Release". Stability.Ai. Archived from the original on 2022-08-30. Retrieved 2022-08-31.
  50. "From RAIL to Open RAIL: Topologies of RAIL Licenses". Responsible AI Licenses (RAIL). Retrieved 2023-02-20.
  51. "Ready or not, mass video deepfakes are coming". The Washington Post. 2022-08-30. Archived from the original on 2022-08-31. Retrieved 2022-08-31.
  52. "License - a Hugging Face Space by CompVis". Archived from the original on 2022-09-04. Retrieved 2022-09-05.
  53. Katsuo Ishida (August 26, 2022). "言葉で指示した画像を凄いAIが描き出す「Stable Diffusion」 ~画像は商用利用も可能". Impress Corporation (in Japanese).

External links