Stable Diffusion is a specific type of AI model used for generating images. These images can range from photorealistic - similar to what you'd capture with a camera - to more stylized, artistic representations akin to a professional artist's work.
While its main function is to create images from text descriptions, Stable Diffusion isn't limited to just that. You can also use it for other tasks such as inpainting, outpainting, and image-to-image generation guided by a text prompt. And it's not just about images - you can also use this model to create videos and animations.
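To make the basic text-to-image case concrete, here is a minimal sketch using the Hugging Face diffusers library, which is one of several ways to run Stable Diffusion outside ComfyUI. The model ID, prompt, and settings are illustrative placeholders, not something prescribed by this article.

```python
# Minimal text-to-image sketch with the diffusers library.
# The checkpoint name and generation settings are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed public SD 1.5 checkpoint
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

image = pipe(
    prompt="a photorealistic portrait of an astronaut, studio lighting",
    negative_prompt="blurry, low quality",
    num_inference_steps=25,
    guidance_scale=7.5,
).images[0]
image.save("astronaut.png")
```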
A base model is an official version released by Stability AI. These are the fundamental models that you can use as they are. Custom models, on the other hand, are essentially base models that have been retrained with additional data. This retraining allows them to generate images of specific styles or objects. If you want a style that falls somewhere between two existing models, you can merge them, as sketched below.
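Merging comes down to blending the two checkpoints' weights tensor by tensor. The sketch below assumes two hypothetical .safetensors files and a 50/50 ratio; tools such as ComfyUI's model-merge nodes perform a per-tensor blend along these lines.

```python
# Hedged sketch: merge two checkpoints by a weighted average of their weights.
# File names and the merge ratio are placeholders.
import torch
from safetensors.torch import load_file, save_file

model_a = load_file("model_a.safetensors")
model_b = load_file("model_b.safetensors")
ratio = 0.5  # 0.0 = pure A, 1.0 = pure B

merged = {}
for key, tensor_a in model_a.items():
    if key in model_b and model_b[key].shape == tensor_a.shape:
        # Blend matching tensors from both checkpoints.
        merged[key] = (1.0 - ratio) * tensor_a + ratio * model_b[key]
    else:
        # Keep A's tensor when the key or shape doesn't match.
        merged[key] = tensor_a

save_file(merged, "merged_model.safetensors")
```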
Stable Diffusion isn't just one large, single model. Instead, it's made up of various components and models that collaborate to generate images from text.
Model files are large .ckpt or .safetensors files obtained from repositories such as HuggingFace or CivitAI. These files contain the weights for three different models:

CLIP - a model that converts the text prompt to a compressed format that the UNET model can understand
MODEL - the main Stable Diffusion model, also known as UNET; it generates a compressed (latent) image
VAE - decodes the compressed image into a normal-looking image

In the default ComfyUI workflow, the CheckpointLoader serves as a representation of the model files. It allows users to select a checkpoint to load and displays three different outputs: MODEL, CLIP, and VAE.
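You can see that a single checkpoint really bundles all three weight sets by grouping its tensor names. The sketch below assumes an SD 1.x-style checkpoint, whose key prefixes are model.diffusion_model (UNET), cond_stage_model (CLIP), and first_stage_model (VAE); other model families organize their keys differently, and the file name is a placeholder.

```python
# Hedged sketch: count how many tensors in a checkpoint belong to each model.
# Key prefixes are those used by SD 1.x-style checkpoints.
from collections import Counter
from safetensors import safe_open

groups = Counter()
with safe_open("v1-5-pruned-emaonly.safetensors", framework="pt") as f:  # placeholder file
    for key in f.keys():
        if key.startswith("model.diffusion_model."):
            groups["MODEL (UNET)"] += 1
        elif key.startswith("cond_stage_model."):
            groups["CLIP"] += 1
        elif key.startswith("first_stage_model."):
            groups["VAE"] += 1
        else:
            groups["other"] += 1

print(groups)
```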
The CLIP model is connected to CLIPTextEncode nodes. CLIP, acting as a text encoder, converts text to a format understandable by the main MODEL.
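Conceptually, this is the same operation as running a prompt through a standalone CLIP text encoder. The sketch below uses the transformers library with the openai/clip-vit-large-patch14 encoder, the one bundled in SD 1.x checkpoints; in ComfyUI the encoder comes from the CLIP output of the CheckpointLoader instead.

```python
# Hedged sketch of what CLIPTextEncode does: prompt text -> token embeddings.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(
    "a cozy cabin in a snowy forest",
    padding="max_length",
    max_length=tokenizer.model_max_length,  # 77 tokens for CLIP
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(embeddings.shape)  # torch.Size([1, 77, 768]) for SD 1.x's CLIP
```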
In Stable Diffusion, image generation involves a sampler, represented by the sampler node in ComfyUI. The sampler takes the main Stable Diffusion MODEL, positive and negative prompts encoded by CLIP, and a Latent Image as inputs. The Latent Image is an empty image since we are generating an image from text (txt2img).
The sampler adds noise to the input latent image and denoises it using the main MODEL. Gradual denoising, guided by encoded prompts, is the process through which Stable Diffusion generates images.
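Stripped of ComfyUI's node wrapper, that sampling loop looks roughly like the sketch below. It uses diffusers components; the repo ID, step count, and guidance scale are illustrative, and the cond/uncond tensors stand in for the positive and negative CLIP embeddings produced in the earlier encoding step.

```python
# Hedged sketch of the denoising loop behind the sampler node.
import torch
from diffusers import UNet2DConditionModel, DDIMScheduler

repo = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint repo
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
scheduler = DDIMScheduler.from_pretrained(repo, subfolder="scheduler")

cond = torch.randn(1, 77, 768)    # stand-in for the positive prompt embedding
uncond = torch.randn(1, 77, 768)  # stand-in for the negative prompt embedding

guidance_scale = 7.5
latents = torch.randn(1, 4, 64, 64)           # "Empty Latent Image" filled with noise
scheduler.set_timesteps(25)
latents = latents * scheduler.init_noise_sigma

for t in scheduler.timesteps:
    # One copy of the latent per prompt (negative, positive).
    latent_in = scheduler.scale_model_input(torch.cat([latents, latents]), t)
    with torch.no_grad():
        noise_pred = unet(latent_in, t, encoder_hidden_states=torch.cat([uncond, cond])).sample
    noise_uncond, noise_cond = noise_pred.chunk(2)
    # Classifier-free guidance: steer towards the positive prompt, away from the negative.
    noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
    # One denoising step: remove a little of the predicted noise.
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```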
The third model used in Stable Diffusion is the VAE, responsible for translating an image from latent space to pixel space. Latent space is the format understood by the main MODEL, while pixel space is the format recognizable by image viewers.
The VAEDecode node takes the latent image from the sampler as input and outputs a regular image. This image is then saved to a PNG file using the SaveImage node.
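The same decode-and-save step can be written directly against a diffusers-style VAE. The sketch below assumes latents is the final output of the sampling loop above; the 0.18215 scaling factor is the latent scale used by SD 1.x VAEs.

```python
# Hedged sketch of VAEDecode + SaveImage: latent space -> pixel space -> PNG.
import torch
from diffusers import AutoencoderKL
from PIL import Image

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

with torch.no_grad():
    pixels = vae.decode(latents / 0.18215).sample  # [1, 3, 512, 512], values roughly in [-1, 1]

pixels = (pixels / 2 + 0.5).clamp(0, 1)            # map to [0, 1]
array = (pixels[0].permute(1, 2, 0).numpy() * 255).round().astype("uint8")
Image.fromarray(array).save("output.png")          # what the SaveImage node does
```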