Stable Diffusion Models

Stable Diffusion models, commonly referred to as checkpoints, are files of pre-trained weights designed to generate images in a particular style. The type of images a model can generate is heavily influenced by the images used during training. For instance, if the training data didn't include any images of cats, the model won't be able to generate cat images. Similarly, if the model was trained exclusively on cat images, it will only generate cat-centric images.

Fine-tuning

Fine-tuning is a technique where you take a model that's already been trained on a lot of data and then train it a bit more on some new data. The objective here is to adapt a general-purpose model (aka pre-trained model), which has acquired a broad spectrum of features from a large dataset, to a specific task or domain. This is helpful when you don't have a lot of data for your specific task.

Fine-tuning helps a pre-trained model consistently generate outputs with the specific, nuanced features present in the data you train it on.
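
To make the idea concrete, here is a rough sketch of what one fine-tuning step looks like in code, assuming the Hugging Face diffusers and PyTorch libraries; the model identifier is just an example, the dataloader of image tensors and captions is hypothetical, and practical details (devices, learning-rate schedules, gradient accumulation) are omitted.

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, StableDiffusionPipeline

# Start from a general-purpose base model.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
unet, vae = pipe.unet, pipe.vae
tokenizer, text_encoder = pipe.tokenizer, pipe.text_encoder
noise_scheduler = DDPMScheduler.from_config(pipe.scheduler.config)

optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

for images, captions in dataloader:  # hypothetical: image tensors in [-1, 1] plus caption strings
    # Encode the new training images into latents and add noise at a random timestep.
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # Encode the captions and ask the UNet to predict the noise that was added.
    tokens = tokenizer(list(captions), padding="max_length", truncation=True, return_tensors="pt")
    encoder_hidden_states = text_encoder(tokens.input_ids)[0]
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample

    # Standard denoising objective: push the prediction toward the true noise.
    loss = F.mse_loss(noise_pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```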

Fine-tuning Stable Diffusion models can be achieved through a variety of methods, including checkpoints, LoRAs, embeddings, and hypernetworks. Let's dive into what each of these methods entails and how they compare.

Checkpoints

Checkpoint models form the backbone of Stable Diffusion. They are comprehensive files, typically 2 – 7 GB in size, that contain all the information needed to generate an image. No additional files are necessary when working with checkpoint models.

To create a custom checkpoint model, you can use additional training or a technique known as Dreambooth. Both approaches start with a base model, such as Stable Diffusion v1.5 or XL, which is then fine-tuned with additional data. For instance, if you're interested in vintage cars, you can train the model with a dataset of vintage car images to skew the aesthetic towards this sub-genre.
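
As a minimal sketch of what using such a custom checkpoint looks like, assuming the Hugging Face diffusers library and a CUDA GPU; the file name and prompt are hypothetical:

```python
import torch
from diffusers import StableDiffusionPipeline

# A single-file checkpoint (.ckpt or .safetensors) contains everything the
# pipeline needs: UNet, VAE, and text encoder weights.
pipe = StableDiffusionPipeline.from_single_file(
    "vintage-cars-v1.safetensors",  # hypothetical fine-tuned checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a vintage car parked on a cobblestone street").images[0]
image.save("vintage_car.png")
```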

LoRAs

LoRA models are small patch files, typically 10 – 200 MB, used to modify the styles of checkpoint models. They work in conjunction with checkpoint models and cannot function independently.

LoRA models are similar to hypernetworks in that they both modify the cross-attention module. However, the modification process differs: LoRA models alter the cross-attention by adding small low-rank updates to its weights, whereas hypernetworks insert additional networks.
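
The weight change itself is easy to picture. The plain-PyTorch sketch below shows the kind of low-rank update a LoRA applies to a single cross-attention weight matrix; the layer size, rank, and strength are illustrative values, not taken from any particular model.

```python
import torch

d_model, rank = 768, 8                  # typical cross-attention width, small LoRA rank
W = torch.randn(d_model, d_model)       # frozen weight from the checkpoint model
A = torch.randn(rank, d_model) * 0.01   # LoRA "down" matrix (trained)
B = torch.zeros(d_model, rank)          # LoRA "up" matrix (trained, starts at zero)
alpha = 1.0                             # user-controlled LoRA strength

# The patched weight is the original plus a low-rank correction. Because A and B
# are tiny compared to W, the LoRA file stays small.
W_patched = W + alpha * (B @ A)
```

In practice you rarely apply the update by hand; with the diffusers library, for example, a LoRA file is typically attached to an already-loaded checkpoint pipeline with a call like pipe.load_lora_weights(...).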

Embeddings (aka Textual Inversions)

Embeddings, also known as textual inversions, are small files (usually 10 – 100 KB) that define new keywords to generate new objects or styles. They are used in tandem with a checkpoint model.

Embeddings are the product of a fine-tuning method called textual inversion. This method doesn't alter the model itself; instead, it defines new keywords to achieve certain styles. Embeddings and hypernetworks work on different parts of a Stable Diffusion model. While textual inversion creates new embeddings in the text encoder, a hypernetwork inserts a small network into the cross-attention module of the noise predictor.
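
As a minimal usage sketch, assuming the diffusers library and a CUDA GPU; the model identifier is an example, and the embedding file and its trigger token are hypothetical:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The embedding only adds a new token to the text encoder; the UNet and VAE
# weights of the checkpoint are left untouched.
pipe.load_textual_inversion("my-style.safetensors", token="<my-style>")

image = pipe("a portrait of a cat in <my-style> style").images[0]
```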

Hypernetworks

Hypernetworks are additional network modules that are added to checkpoint models. These files are typically 5 – 300 MB in size and must be used alongside a checkpoint model.

Developed by NovelAI, hypernetworks are a fine-tuning technique that alters a Stable Diffusion model's style. During training, the Stable Diffusion model itself remains frozen; only the attached hypernetwork's weights are updated. This makes training quicker and less resource-intensive, which is one of the primary benefits of hypernetworks.

It's important to note that the term 'hypernetwork' in the context of Stable Diffusion differs from its usual meaning in machine learning. In this case, it does not refer to a network that generates weights for another network.
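
Conceptually, a Stable Diffusion hypernetwork is just a small extra network that reshapes the text conditioning before it reaches a cross-attention layer. The plain-PyTorch sketch below illustrates the idea; the layer sizes are illustrative, and the residual design is one common choice rather than a specification of any particular implementation.

```python
import torch
import torch.nn as nn

class HypernetworkModule(nn.Module):
    """Small MLP inserted in front of a cross-attention layer's key/value
    projections to nudge how the model reads the text conditioning."""

    def __init__(self, dim: int = 768, hidden: int = 1536):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # Keep the original conditioning and add a learned correction, so the
        # frozen checkpoint's behaviour is only nudged, not replaced.
        return context + self.net(context)

# During hypernetwork training, only modules like this one receive gradient
# updates; the checkpoint's own weights stay frozen, which is why training is
# comparatively quick and cheap.
```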

Comparing the Methods

Each of these methods has its unique advantages and drawbacks.

  • Checkpoint models are powerful and versatile, capable of storing a wide range of styles. However, they are also large and can be resource-intensive to train.

  • LoRA models and hypernetworks are smaller and faster to train, making them more manageable. However, they only modify the cross-attention module and need to work in conjunction with checkpoint models.

  • Embeddings are the smallest and simplest to use, but it isn't always clear which base model a given embedding was trained for. They can also be challenging to use effectively, and reproducing the intended effect often takes some trial and error.

The method you choose for fine-tuning Stable Diffusion largely depends on your specific needs and resources. Whether you use checkpoints, LoRAs, embeddings, or hypernetworks, each provides a unique approach to customizing and enhancing your AI model.

Stable Diffusion Model File Formats

When you visit a model download page, you might be greeted with a variety of model file formats. This can be quite perplexing, especially when you're unsure about which one to select. Let's simplify the options for you:

Pruned vs Full vs EMA-only Models

Stable Diffusion checkpoint models usually contain two sets of weights. The first set is the weights after the final training step, and the second is the EMA (exponential moving average), an average of the weights over the last few training steps. If your objective is simply to use the model, the EMA-only model will suffice. This version is also known as the pruned model, and it contains only the weights actually used when running the model.

On the other hand, if your aim is to fine-tune the model with additional training, you'll require the full model, which comprises both sets of weights. So, to utilize the model for image generation, opt for the pruned or EMA-only model. This choice will conserve some disk space, which is always a good thing!
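
The EMA weights themselves are produced during training by a simple running average. A minimal sketch, assuming PyTorch; the decay value is a typical choice rather than one taken from any specific Stable Diffusion release:

```python
import torch

decay = 0.9999  # typical EMA decay; higher means a smoother, slower-moving average

@torch.no_grad()
def update_ema(ema_params, model_params):
    # Called after every optimizer step: nudge each EMA weight a tiny amount
    # toward the current training weight. The EMA copy is the set of weights
    # shipped in "EMA-only" / pruned checkpoints for inference.
    for ema_p, p in zip(ema_params, model_params):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)
```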

FP16 and FP32 Models

FP, or Floating Point, refers to a computer's method of storing decimal numbers, which in this case, are the model weights. FP16 uses 16 bits per number and is known as half precision, while FP32 uses 32 bits per number and is referred to as full precision. Given that the training data for deep learning models like Stable Diffusion is generally noisy, a full-precision model is seldom required. The extra precision merely stores noise! Therefore, whenever available, opt for the FP16 models. They're approximately half the size, which means you'll save a few gigabytes!
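
If you only have an FP32 copy, converting it yourself is straightforward. A minimal sketch, assuming the diffusers library; the model identifier and output directory name are examples:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the weights and cast them to half precision in one step.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)

# Saving the FP16 pipeline roughly halves the size on disk.
pipe.save_pretrained("sd15-fp16", safe_serialization=True)
```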

PyTorch vs Safetensors Models

The traditional PyTorch model format is .pt or sometimes .ckpt. However, this format has a significant drawback - it lacks security. Malicious code can be embedded within it and run on your machine when you load the model. Safetensors, with the .safetensors extension, is a safer alternative to the .pt format. It performs the same function of storing weights but without executing any code. So, always choose the .safetensors version when it's available. If it's not, ensure you download the .pt files from a reliable source.
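
The difference is visible in how the two formats are read. A minimal sketch, assuming the torch and safetensors libraries; the file names are hypothetical:

```python
import torch
from safetensors.torch import load_file

# Loading a .ckpt/.pt file unpickles it, which can execute code embedded in
# the file, so only do this with files from sources you trust.
state_dict_pickle = torch.load("model.ckpt", map_location="cpu")

# Loading a .safetensors file only reads raw tensors and metadata; nothing is
# executed, which is what makes the format safe to share.
state_dict_safe = load_file("model.safetensors")
```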

SD1.5 vs SD2.1 vs SDXL

Stable Diffusion (SD) models have seen several iterations since their inception, with SD1.5, SD2.1, and SDXL being some of the most notable versions. Here, we'll take a closer look at these versions, comparing their features, usability, and overall performance.

SD1.5 vs SD2.1

SD1.5, released in October 2022, remains widely used despite the introduction of SD2.1 in December of the same year. The continued popularity of SD1.5 is largely due to two factors: hardware compatibility and community support.

SD1.5 is known for its friendliness towards lower-end hardware, such as an RTX 3070 Ti with only 8 GB of VRAM. This makes it accessible to a broader user base. Additionally, SD1.5 boasts a wealth of community-created content, including established LoRAs and custom models. Tools like AnimateDiff and IP-Adapter also perform better on this version. As video content gains popularity, SD1.5's usability has seen a resurgence.

In contrast, SD2.1 introduced a significant change by replacing the text encoder. Instead of using OpenAI's CLIP, as SD1.5 does, SD2.1 employs OpenCLIP, an open-source version of CLIP trained on a known dataset. This dataset, a subset of LAION-5B, is carefully filtered to exclude NSFW images. According to Stability AI, this change enhances the quality of the generated images and even outperforms an unreleased version of CLIP on certain metrics.

However, SD2.1 has its downsides. The NSFW filter applied to its training data results in a reduced ability to depict humans. Furthermore, SD2.1 tends to produce watermark artifacts that are difficult to suppress with negative prompts, adding an extra step to the image generation process. Many users find that custom models built on SD1.5 perform better than the base models of SD2.1.

Drawing a comparison between the base models of SD 1.5 and SD2.1 can be challenging due to the unique default style inherent to each model, which introduces a degree of subjectivity to any direct comparisons.

SD1.5 vs SDXL

SDXL, released in July 2023, is a more advanced version that outperforms SD1.5 in terms of visual quality. However, its high hardware requirements make it less accessible to many users, as it's too demanding for most hardware setups and challenging to train on Google Colab.

While some users may prefer the aesthetic of SD1.5, SDXL excels in terms of composition and adherence to prompts. That said, the high hardware requirements of SDXL and the need to relearn prompts can make it a less attractive option for some users, especially when SD1.5 already provides reliable results and is well-understood by the community.
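
If you do want to try SDXL on modest hardware, memory-saving options can help. A minimal sketch, assuming the diffusers library and a CUDA GPU; the model identifier is an example and the prompt is arbitrary:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)

# Keep sub-models on the CPU and move them to the GPU only while they run,
# trading some speed for a much smaller VRAM footprint.
pipe.enable_model_cpu_offload()

image = pipe("a vintage car, studio lighting, high detail").images[0]
```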