
Stable Diffusion

From Wikipedia, the free encyclopedia


Stable Diffusion
Developer(s): Stability AI
Initial release: August 22, 2022
Stable release: 1.5 (model)[1] / August 31, 2022
Repository: github.com/CompVis/stable-diffusion
Written in: Python
Operating system: Any that support CUDA kernels
Type: Text-to-image model
License: Creative ML OpenRAIL-M
Website: stability.ai

Stable Diffusion is a deep learning, text-to-image model released by the startup Stability AI in 2022. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt.[2]

Stable Diffusion is a latent diffusion model, a kind of deep generative neural network developed by researchers at LMU Munich. The model was created by Stability AI in collaboration with LMU Munich and Runway, with support from EleutherAI and LAION.[3][4][5] As of September 2022, Stability AI was in talks to raise capital at a valuation of up to one billion dollars.[6]

Stable Diffusion's code and model weights have been released publicly, and it can run on most consumer hardware equipped with a modest GPU. This marked a departure from previous proprietary text-to-image models such as DALL-E and Midjourney, which were accessible only via cloud services.[7]

Architecture

Diagram of the latent diffusion architecture used by Stable Diffusion.

Stable Diffusion is a form of diffusion model (DM). Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise to training images, and can be thought of as a sequence of denoising autoencoders. Stable Diffusion uses a variant known as a "latent diffusion model" (LDM) developed by researchers at LMU Munich. Rather than learning to denoise image data (in "pixel space"), an autoencoder is trained to transform images into a lower-dimensional latent space. The process of adding and removing noise is applied to this latent representation, with the final denoised output then decoded into pixel space. Each denoising step is accomplished by a U-Net architecture. The researchers point to reduced computational requirements for training and generation as an advantage of LDMs.[3][8]
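
The following toy sketch illustrates the latent-diffusion workflow described above. It is not the actual Stable Diffusion implementation: the stand-in modules, the simple noise step, and the ten-step loop are assumptions chosen only to make the encode, add-noise, denoise, decode flow concrete.

    # Toy stand-ins for the VAE autoencoder and denoising U-Net; only the
    # 4-channel 64x64 latent shape mirrors Stable Diffusion, everything else
    # here is simplified.
    import torch
    import torch.nn as nn

    encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)           # pixels -> latent
    decoder = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)  # latent -> pixels
    unet = nn.Conv2d(4, 4, kernel_size=3, padding=1)             # "predicts" noise

    image = torch.randn(1, 3, 512, 512)   # a fake 512x512 RGB training image
    latent = encoder(image)               # work in a 64x64, 4-channel latent space

    noise = torch.randn_like(latent)
    noisy_latent = latent + noise         # forward process: add Gaussian noise

    # Reverse process: repeatedly subtract the predicted noise, then decode the
    # final latent back into pixel space. Real models use a proper noise schedule
    # and far more steps during training.
    z = noisy_latent
    for t in reversed(range(10)):
        z = z - unet(z)
    reconstruction = decoder(z)
    print(reconstruction.shape)           # torch.Size([1, 3, 512, 512])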

The denoising step may be conditioned on a string of text, an image, or some other data. An encoding of the conditioning data is exposed to the denoising U-Nets via a cross-attention mechanism. For conditioning on text, a transformer language model was trained to encode text prompts.[8]
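
A minimal sketch of that cross-attention step, assuming made-up tensor sizes; in the real model the text encoding comes from a pretrained transformer and the cross-attention layers sit inside the U-Net rather than standing alone.

    import torch
    import torch.nn as nn

    d_model = 64                                        # illustrative embedding width
    text_encoding = torch.randn(1, 77, d_model)         # encoded text prompt (77 tokens)
    latent_features = torch.randn(1, 64 * 64, d_model)  # flattened U-Net feature map

    cross_attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=4,
                                            batch_first=True)

    # Queries come from the image latent, keys and values from the text encoding,
    # so each spatial location can attend to the relevant words of the prompt.
    conditioned, _ = cross_attention(query=latent_features,
                                     key=text_encoding,
                                     value=text_encoding)
    print(conditioned.shape)                            # torch.Size([1, 4096, 64])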

Usage

Demonstration of how different text prompts affect the output of images generated by the Stable Diffusion model, when instructed to draw the same subject. Each individual row represents a different prompt fed into the model, and the variation of art style between each row directly correlates with the presence or absence of certain phrases and keywords.[a]

Stable Diffusion can generate new images from scratch from a text prompt describing elements to be included or omitted from the output,[4] and it can redraw existing images to incorporate new elements described in a text prompt (a process commonly known as guided image synthesis[9]) by means of its diffusion-denoising mechanism.[4] The model also allows prompts to partially alter existing images via inpainting and outpainting, when used with one of the many open-source user interfaces that support such features.[10]

Stable Diffusion is recommended to be run with 10 GB or more of VRAM; however, users with less VRAM may opt to load the weights in float16 precision instead of the default float32 to reduce VRAM usage.[11]
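
A minimal sketch of this half-precision loading, using the Hugging Face diffusers library described in the cited blog post; the checkpoint name "CompVis/stable-diffusion-v1-4" and the prompt are illustrative choices, and argument names may differ between library versions.

    import torch
    from diffusers import StableDiffusionPipeline

    # Load a published checkpoint in float16 instead of the default float32,
    # roughly halving GPU memory use.
    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        torch_dtype=torch.float16,
    )
    pipe = pipe.to("cuda")

    image = pipe("a photograph of an astronaut riding a horse").images[0]
    image.save("astronaut.png")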

Text to image generation

Demonstration of the effect of negative prompts on image generation.
  • Top: No negative prompt
  • Centre: "green trees"
  • Bottom: "round stones, round rocks"

The text-to-image sampling script within Stable Diffusion, known as "txt2img", consumes a text prompt along with assorted optional parameters covering sampling types, output image dimensions, and seed values, and outputs an image file based on the model's interpretation of the prompt.[4] Generated images are tagged with an invisible digital watermark to allow users to identify an image as generated by Stable Diffusion,[4] although this watermark loses its effectiveness if the image is resized or rotated.[12] The Stable Diffusion model is trained on a dataset of 512×512-resolution images,[4] meaning that txt2img outputs are best generated at 512×512 resolution as well; deviating from this size can result in poor-quality outputs.[11]

Each txt2img generation involves a specific seed value which affects the output image; users may opt to randomise the seed to explore different generated outputs, or reuse the same seed to obtain the same output as a previously generated image.[11] Users can also adjust the number of inference steps for the sampler; a higher value takes longer, while a smaller value may result in visual defects.[11] Another configurable option, the classifier-free guidance scale value, allows the user to adjust how closely the output image adheres to the prompt;[13] more experimental or creative use cases may opt for a lower value, while use cases aiming for more specific outputs may use a higher value.[11]
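
The diffusers pipeline used in the earlier sketch exposes the same parameters discussed here (output dimensions, seed, number of inference steps, and guidance scale) as per-call arguments; the values below are only examples.

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
    pipe = pipe.to("cuda")

    # A fixed seed makes the output reproducible; reusing it yields the same image.
    generator = torch.Generator("cuda").manual_seed(42)

    image = pipe(
        "a castle on a hill at sunset",
        height=512, width=512,     # the resolution the model was trained at
        num_inference_steps=50,    # more steps take longer; too few can cause defects
        guidance_scale=7.5,        # higher values follow the prompt more closely
        generator=generator,
    ).images[0]
    image.save("castle.png")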

Negative prompts are a feature included in some user-interface implementations of Stable Diffusion that let the user specify prompts the model should avoid during image generation; they are used when undesirable features would otherwise appear in the output because of the positive prompts provided by the user or because of how the model was originally trained.[10] Compared with emphasis markers, an alternative method of weighting parts of a prompt used by some open-source implementations in which brackets are added to keywords to increase or reduce emphasis, the use of negative prompts has a highly statistically significant effect on decreasing the frequency of unwanted outputs.[14]
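
Where an implementation exposes negative prompts, usage looks roughly like the following sketch, which relies on the negative_prompt argument offered by later versions of the diffusers pipeline; graphical front-ends expose the same feature through a separate text field.

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
    pipe = pipe.to("cuda")

    # The prompt describes what should appear; the negative prompt lists features
    # the sampler should steer away from during denoising.
    image = pipe(
        "a landscape with a river and mountains",
        negative_prompt="green trees, round stones, round rocks",
    ).images[0]
    image.save("landscape.png")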

Image modification

Stable Diffusion includes another sampling script, "img2img", which consumes a text prompt, a path to an existing image, and a strength value between 0.0 and 1.0, and outputs a new image based on the original that also features elements provided in the text prompt; the strength value denotes the amount of noise added to the output image, with a higher value producing more variation but potentially yielding results that are not semantically consistent with the prompt provided.[4] Image upscaling is one potential use case of img2img, among others.[4]
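
The original repository's img2img script is a command-line tool; the sketch below shows the same idea through the diffusers img2img pipeline, with an illustrative input file, prompt, and strength value, and with argument names that may differ between library versions.

    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
    pipe = pipe.to("cuda")

    init_image = Image.open("sketch.png").convert("RGB").resize((512, 512))

    # strength (0.0-1.0) controls how much noise is layered over the input image:
    # higher values allow more variation but drift further from the original.
    image = pipe(
        "a detailed oil painting of a fantasy castle",
        image=init_image,
        strength=0.75,
    ).images[0]
    image.save("castle_painting.png")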

Inpainting and outpainting

Additional use-cases for image modification via img2img are offered by numerous different front-end implementations of the Stable Diffusion model. Inpainting involves selectively modifying a portion of an existing image delineated by a user-provided mask, which fills the masked space with newly generated content based on the provided prompt.[10] Conversely, outpainting extends an image beyond its original dimensions, filling the previously empty space with content generated based on the provided prompt.[10]
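
A sketch of inpainting through the diffusers inpainting pipeline, assuming an inpainting-capable checkpoint ("runwayml/stable-diffusion-inpainting" is one published example) and illustrative file names; graphical front-ends expose the same operation through a mask-drawing tool, as in the step-by-step demonstration below.

    import torch
    from PIL import Image
    from diffusers import StableDiffusionInpaintPipeline

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting"
    )
    pipe = pipe.to("cuda")

    init_image = Image.open("portrait.png").convert("RGB").resize((512, 512))
    # White pixels in the mask are regenerated; black pixels are left untouched.
    mask_image = Image.open("mask.png").convert("RGB").resize((512, 512))

    image = pipe(
        "a realistic arm resting on a table",
        image=init_image,
        mask_image=mask_image,
    ).images[0]
    image.save("portrait_inpainted.png")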

Demonstration of inpainting and outpainting techniques using img2img within Stable Diffusion
Step 1: An image is generated from scratch using txt2img.
Step 2: Via outpainting, the bottom of the image is extended by 512 pixels and filled with AI-generated content.
Step 3: In preparation for inpainting, a makeshift arm is drawn using the paintbrush in GIMP.
Step 4: An inpainting mask is applied over the makeshift arm, and img2img generates a new arm while leaving the remainder of the image untouched.

License

Unlike models like DALL-E, Stable Diffusion makes its source code available,[15][4] along with pre-trained weights. Its license prohibits certain use cases, including crime, libel, harassment, doxxing, "exploiting ... minors", giving medical advice, automatically creating legal obligations, producing legal evidence, and "discriminating against or harming individuals or groups based on ... social behavior or ... personal or personality characteristics ... [or] legally protected characteristics or categories".[16][17]

Training

Stable Diffusion was trained on pairs of images and captions taken from LAION-5B, a publicly available dataset derived from Common Crawl data scraped from the web. The dataset was created by LAION, a German non-profit which receives funding from Stability AI.[18][19] The model was initially trained on a large subset of LAION-5B, with the final rounds of training done on "LAION-Aesthetics v2 5+", a subset of 600 million captioned images which an AI predicted humans would rate at least 5 out of 10 when asked how much they liked them.[18][20] This final subset also excluded low-resolution images and images which an AI identified as carrying a watermark.[18]

The model was trained using 256 Nvidia A100 GPUs on Amazon Web Services for a total of 150,000 GPU-hours, at a cost of $600,000.[21][22][23]
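
For scale, 150,000 GPU-hours spread across 256 GPUs works out to roughly 150,000 / 256 ≈ 586 hours, or about 24 days of wall-clock time if all GPUs ran concurrently; this is an estimate derived from the cited figures rather than a duration reported by the sources.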

Explanatory notes

  1. ^ Partial snippets of prompts are as follows:
    • First row: art style of artgerm and greg rutkowski
    • Second row: art style of makoto shinkai and akihiko yoshida and hidari and wlop
    • Third row: art style of Michael Garmash
    • Fourth row: Charlie Bowater and Lilia Alvarado and Sophie Gengembre Anderson and Franz Xaver Winterhalter, by Konstantin Razumov, by Jessica Rossier, by Albert Lynch
    • Fifth row: art style of Jordan Grimmer, Charlie Bowater and Artgerm
    • Sixth row: art style of ROSSDRAWS, very detailed deep eyes by ilya kuvshinov
    • Seventh row: game cg japanese anime Jock Sturges Kyoto Animation Alexandre Cabanel Granblue Fantasy light novel pixiv
    • Eighth row: art style of Sophie Anderson, and greg rutkowski, and albert lynch
    • Ninth row: art style of Konstantin Razumov, and Jessica Rossier, and Albert Lynch
    • Tenth row: hyper realistic anime painting sophie anderson Atelier meruru josei isekai by Krenz cushart by Kyoto Animation official art
    • Eleventh row: art style of wlop and michael garmash
    • Twelfth row: art style of greg rutkowski and alphonse mucha
    • Thirteenth row: by Donato Giancola, Sophie Anderson, Albert Lynch

References

  1. ^ Mostaque, Emad (2022-06-06). "Stable Diffusion 1.5 beta now available to try via API and #DreamStudio, let me know what you think. Much more tomorrow…". Twitter. Archived from the original on 2022-09-27.
  2. ^ "Diffuse The Rest - a Hugging Face Space by huggingface". huggingface.co. Archived from the original on 2022-09-05. Retrieved 2022-09-05.
  3. ^ a b "Stable Diffusion Launch Announcement". Stability.Ai. Archived from the original on 2022-09-05. Retrieved 2022-09-06.
  4. ^ a b c d e f g h i "Stable Diffusion Repository on GitHub". CompVis - Machine Vision and Learning Research Group, LMU Munich. 17 September 2022. Retrieved 17 September 2022.
  5. ^ "Revolutionizing image generation by AI: Turning text into images". LMU Munich. Retrieved 17 September 2022.
  6. ^ Cai, Kenrick. "Startup Behind AI Image Generator Stable Diffusion Is In Talks To Raise At A Valuation Up To $1 Billion". Forbes. Retrieved 2022-09-10.
  7. ^ "The new killer app: Creating AI art will absolutely crush your PC". PCWorld. Archived from the original on 2022-08-31. Retrieved 2022-08-31.
  8. ^ a b Rombach; Blattmann; Lorenz; Esser; Ommer (June 2022). High-Resolution Image Synthesis with Latent Diffusion Models (PDF). International Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, LA. pp. 10684–10695. arXiv:2112.10752.
  9. ^ Meng, Chenlin; He, Yutong; Song, Yang; Song, Jiaming; Wu, Jiajun; Zhu, Jun-Yan; Ermon, Stefano (August 2, 2021). "SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations". arXiv:2108.01073.
  10. ^ a b c d "Stable Diffusion web UI". GitHub.
  11. ^ a b c d e "Stable Diffusion with 🧨 Diffusers". Hugging Face official blog. August 22, 2022.
  12. ^ "invisible-watermark README.md". GitHub.
  13. ^ Ho, Jonathan; Salimans, Tim (July 26, 2022). "Classifier-Free Diffusion Guidance". arXiv:2207.12598.
  14. ^ Johannes Gaessler (September 11, 2022). "Emphasis". GitHub.
  15. ^ "Stable Diffusion Public Release". Stability.Ai. Archived from the original on 2022-08-30. Retrieved 2022-08-31.
  16. ^ "Ready or not, mass video deepfakes are coming". The Washington Post. Archived from the original on 2022-08-31. Retrieved 2022-08-31.
  17. ^ "License - a Hugging Face Space by CompVis". huggingface.co. Archived from the original on 2022-09-04. Retrieved 2022-09-05.
  18. ^ a b c Baio, Andy (30 August 2022). "Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion's Image Generator". Waxy.org.
  19. ^ Heikkilä, Melissa (16 September 2022). "This artist is dominating AI-generated art. And he's not happy about it". MIT Technology Review.
  20. ^ "LAION-Aesthetics | LAION". laion.ai. Archived from the original on 2022-08-26. Retrieved 2022-09-02.
  21. ^ Mostaque, Emad (August 28, 2022). "Cost of construction". Twitter. Archived from the original on 2022-09-06. Retrieved 2022-09-06.
  22. ^ "Stable Diffusion v1-4 Model Card". huggingface.co. Retrieved 2022-09-20.{{cite web}}: CS1 maint: url-status (link)
  23. ^ "This startup is setting a DALL-E 2-like AI free, consequences be damned". TechCrunch. Retrieved 2022-09-20.{{cite web}}: CS1 maint: url-status (link)