Stable Diffusion is a specific type of AI model used for generating images. These images can range from photorealistic - similar to what you'd capture with a camera - to more stylized, artistic representations akin to a professional artist's work.
While its main function is to create images from text descriptions, Stable Diffusion isn't limited to just that. You can also use it for other tasks such as inpainting, outpainting, and image-to-image generation guided by a text prompt. And it's not just about images - you can also use this model to create videos and animations.
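To make the basic text-to-image case concrete, here is a minimal sketch using the Hugging Face diffusers library, which is one of several ways to run Stable Diffusion outside ComfyUI. The model ID, prompt, and settings are illustrative placeholders, not something prescribed by this article.

```python
# Minimal text-to-image sketch with the diffusers library.
# The checkpoint name and generation settings are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed public SD 1.5 checkpoint
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

image = pipe(
    prompt="a photorealistic portrait of an astronaut, studio lighting",
    negative_prompt="blurry, low quality",
    num_inference_steps=25,
    guidance_scale=7.5,
).images[0]
image.save("astronaut.png")
```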
A base model is an official version released by Stability AI. These are the fundamental models that you can use as they are. Custom models, on the other hand, are essentially base models that have been retrained with additional data. This retraining allows them to generate images of specific styles or objects. If you want a style that falls somewhere between two existing models, you can merge them, as sketched below.
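Merging comes down to blending the two checkpoints' weights tensor by tensor. The sketch below assumes two hypothetical .safetensors files and a 50/50 ratio; tools such as ComfyUI's model-merge nodes perform a per-tensor blend along these lines.

```python
# Hedged sketch: merge two checkpoints by a weighted average of their weights.
# File names and the merge ratio are placeholders.
import torch
from safetensors.torch import load_file, save_file

model_a = load_file("model_a.safetensors")
model_b = load_file("model_b.safetensors")
ratio = 0.5  # 0.0 = pure A, 1.0 = pure B

merged = {}
for key, tensor_a in model_a.items():
    if key in model_b and model_b[key].shape == tensor_a.shape:
        # Blend matching tensors from both checkpoints.
        merged[key] = (1.0 - ratio) * tensor_a + ratio * model_b[key]
    else:
        # Keep A's tensor when the key or shape doesn't match.
        merged[key] = tensor_a

save_file(merged, "merged_model.safetensors")
```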
Stable Diffusion isn't just one large, single model. Instead, it's made up of various components and models that collaborate to generate images from text.
Model files are large .ckpt or .safetensors files obtained from repositories such as HuggingFace or CivitAI. These files contain the weights for three different models:

CLIP - a model that converts the text prompt to a compressed format that the UNET model can understand
MODEL - the main Stable Diffusion model, also known as UNET; it generates a compressed (latent) image
VAE - decodes the compressed image into a normal-looking image

In the default ComfyUI workflow, the CheckpointLoader serves as a representation of the model files. It allows users to select a checkpoint to load and displays three different outputs: MODEL, CLIP, and VAE.
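You can see that a single checkpoint really bundles all three weight sets by grouping its tensor names. The sketch below assumes an SD 1.x-style checkpoint, whose key prefixes are model.diffusion_model (UNET), cond_stage_model (CLIP), and first_stage_model (VAE); other model families organize their keys differently, and the file name is a placeholder.

```python
# Hedged sketch: count how many tensors in a checkpoint belong to each model.
# Key prefixes are those used by SD 1.x-style checkpoints.
from collections import Counter
from safetensors import safe_open

groups = Counter()
with safe_open("v1-5-pruned-emaonly.safetensors", framework="pt") as f:  # placeholder file
    for key in f.keys():
        if key.startswith("model.diffusion_model."):
            groups["MODEL (UNET)"] += 1
        elif key.startswith("cond_stage_model."):
            groups["CLIP"] += 1
        elif key.startswith("first_stage_model."):
            groups["VAE"] += 1
        else:
            groups["other"] += 1

print(groups)
```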
The CLIP model is connected to CLIPTextEncode nodes. CLIP, acting as a text encoder, converts text to a format understandable by the main MODEL.
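Conceptually, this is the same operation as running a prompt through a standalone CLIP text encoder. The sketch below uses the transformers library with the openai/clip-vit-large-patch14 encoder, the one bundled in SD 1.x checkpoints; in ComfyUI the encoder comes from the CLIP output of the CheckpointLoader instead.

```python
# Hedged sketch of what CLIPTextEncode does: prompt text -> token embeddings.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(
    "a cozy cabin in a snowy forest",
    padding="max_length",
    max_length=tokenizer.model_max_length,  # 77 tokens for CLIP
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(embeddings.shape)  # torch.Size([1, 77, 768]) for SD 1.x's CLIP
```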
In Stable Diffusion, image generation involves a sampler, represented by the sampler node in ComfyUI. The sampler takes the main Stable Diffusion MODEL, positive and negative prompts encoded by CLIP, and a Latent Image as inputs. The Latent Image is an empty image since we are generating an image from text (txt2img).
The sampler adds noise to the input latent image and denoises it using the main MODEL. Gradual denoising, guided by encoded prompts, is the process through which Stable Diffusion generates images.
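Stripped of ComfyUI's node wrapper, that sampling loop looks roughly like the sketch below. It uses diffusers components; the repo ID, step count, and guidance scale are illustrative, and the cond/uncond tensors stand in for the positive and negative CLIP embeddings produced in the earlier encoding step.

```python
# Hedged sketch of the denoising loop behind the sampler node.
import torch
from diffusers import UNet2DConditionModel, DDIMScheduler

repo = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint repo
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
scheduler = DDIMScheduler.from_pretrained(repo, subfolder="scheduler")

cond = torch.randn(1, 77, 768)    # stand-in for the positive prompt embedding
uncond = torch.randn(1, 77, 768)  # stand-in for the negative prompt embedding

guidance_scale = 7.5
latents = torch.randn(1, 4, 64, 64)           # "Empty Latent Image" filled with noise
scheduler.set_timesteps(25)
latents = latents * scheduler.init_noise_sigma

for t in scheduler.timesteps:
    # One copy of the latent per prompt (negative, positive).
    latent_in = scheduler.scale_model_input(torch.cat([latents, latents]), t)
    with torch.no_grad():
        noise_pred = unet(latent_in, t, encoder_hidden_states=torch.cat([uncond, cond])).sample
    noise_uncond, noise_cond = noise_pred.chunk(2)
    # Classifier-free guidance: steer towards the positive prompt, away from the negative.
    noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
    # One denoising step: remove a little of the predicted noise.
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```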
The third model used in Stable Diffusion is the VAE, responsible for translating an image from latent space to pixel space. Latent space is the format understood by the main MODEL, while pixel space is the format recognizable by image viewers.
The VAEDecode node takes the latent image from the sampler as input and outputs a regular image. This image is then saved to a PNG file using the SaveImage node.
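The same decode-and-save step can be written directly against a diffusers-style VAE. The sketch below assumes latents is the final output of the sampling loop above; the 0.18215 scaling factor is the latent scale used by SD 1.x VAEs.

```python
# Hedged sketch of VAEDecode + SaveImage: latent space -> pixel space -> PNG.
import torch
from diffusers import AutoencoderKL
from PIL import Image

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

with torch.no_grad():
    pixels = vae.decode(latents / 0.18215).sample  # [1, 3, 512, 512], values roughly in [-1, 1]

pixels = (pixels / 2 + 0.5).clamp(0, 1)            # map to [0, 1]
array = (pixels[0].permute(1, 2, 0).numpy() * 255).round().astype("uint8")
Image.fromarray(array).save("output.png")          # what the SaveImage node does
```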