This repository explains how to accelerate image generation in ComfyUI using Pruna, an inference optimization engine that makes AI models faster, smaller, cheaper, and greener. ComfyUI is a popular node-based GUI for image generation models, for which we provide two custom nodes: a compilation node and a caching node.
Both of them can be applied to Stable Diffusion (SD) and Flux models.
Here, you'll find:
Note that the Pruna Pro version is required to use the caching node or the `x_fast` compilation mode.
Navigate to the `custom_nodes` folder of your ComfyUI installation and clone this repository:

    cd <path_to_comfyui>/custom_nodes
    git clone https://github.com/PrunaAI/ComfyUI_pruna.git

Then launch ComfyUI:

    cd <path_to_comfyui> && python main.py --disable-cuda-malloc --gpu-only

The Pruna node will appear in the nodes menu in the `Pruna` category.
Important note: The compilation node requires launching ComfyUI with the `--disable-cuda-malloc` flag; otherwise, the node may not function properly. For optimal performance, we also recommend setting the `--gpu-only` flag.
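
If you start ComfyUI from a script, a small wrapper can make sure the flags mentioned above are always passed. The following is a minimal sketch, not part of this repository: the function names `build_launch_command` and `launch_comfyui` are hypothetical, and it assumes ComfyUI is started via `main.py` from its own folder, as in the launch command above.

```python
import subprocess
import sys

# Flag required by the compilation node, and flag recommended for performance.
REQUIRED_FLAGS = ["--disable-cuda-malloc"]
RECOMMENDED_FLAGS = ["--gpu-only"]


def build_launch_command(extra_args=()):
    """Return the argument list for launching ComfyUI with the Pruna flags.

    Flags already present in extra_args are not added a second time.
    """
    extra_args = list(extra_args)
    command = [sys.executable, "main.py"]
    for flag in REQUIRED_FLAGS + RECOMMENDED_FLAGS:
        if flag not in extra_args:
            command.append(flag)
    command.extend(extra_args)
    return command


def launch_comfyui(comfyui_dir, extra_args=()):
    """Run ComfyUI from its own folder (hypothetical helper, blocks until exit)."""
    return subprocess.run(build_launch_command(extra_args), cwd=comfyui_dir, check=True)
```

Calling `launch_comfyui("<path_to_comfyui>")` would then be equivalent to the manual launch command, with the placeholder replaced by your actual installation path.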
We provide two types of workflows: one using a Stable Diffusion model and another based on Flux. To these models, we apply caching, compilation or their combination.
| Optimization Technique | Stable Diffusion | Flux |
|--------------------------|-----------------|------|
| Compilation | SD Compilation (Preview) | Flux Compilation (Preview) |
| Caching | SD Caching (Preview) | Flux Caching (Preview) |
| Caching + Compilation | SD Caching + Compilation (Preview) | Flux Caching + Compilation (Preview) |
To load the workflow, click `Open` in the `Workflow` tab, as shown here, and select the file.

To run the workflow, make sure that you have first set up the desired model.
You have two options for the base model:

- SafeTensors format: place the model in `<path_to_comfyui>/models/checkpoints`.
- Diffusers format: place the model in `<path_to_comfyui>/models/diffusers` and replace the `Load Checkpoint` node with a `DiffusersLoader` node.

The node is tested using the SafeTensors format, so for the sake of reproducibility, we recommend using that format. However, we don't expect any performance differences between the two.
After loading the model, you can choose the desired workflow, and you're all set!
Note: In this example, we use the Stable Diffusion v1.4 model. However, our nodes are compatible with any other SD model — feel free to use your favorite one!
To use Flux, you must separately download all model components—including the VAE, CLIP, and diffusion model weights—and place them in the appropriate folder.
Steps to set up Flux:
For the CLIP models: Get the following files:
Move them to `<path_to_comfyui>/models/clip/`.
For the VAE model:
Get the VAE model and move it to the `<path_to_comfyui>/models/vae/` directory.
For the Flux model:
You first need to request access to the model here. Once you have access, download the weights and move them to `<path_to_comfyui>/models/diffusion_models/`.
Now, just load the workflow and you're ready to go!
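
Before loading the workflow, it can help to confirm that each Flux component actually landed in its folder. The sketch below is only an illustration: the helper name `missing_flux_components` is hypothetical, and the set of accepted file extensions is an assumption made here, since the exact file names depend on what you downloaded.

```python
from pathlib import Path

# Folders named in the setup steps above, relative to <path_to_comfyui>/models.
FLUX_FOLDERS = ("clip", "vae", "diffusion_models")

# Assumption: weight files use one of these extensions. Adjust as needed.
WEIGHT_SUFFIXES = {".safetensors", ".sft"}


def missing_flux_components(comfyui_dir):
    """Return the model subfolders that contain no weight file."""
    models_dir = Path(comfyui_dir) / "models"
    missing = []
    for name in FLUX_FOLDERS:
        folder = models_dir / name
        has_weights = folder.is_dir() and any(
            path.is_file() and path.suffix in WEIGHT_SUFFIXES
            for path in folder.iterdir()
        )
        if not has_weights:
            missing.append(name)
    return missing
```

An empty list means every folder holds at least one weight file; any returned name points to a step that still needs to be completed.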
Through the GUI, you can configure various optimization settings. Specifically:
- Compilation: We currently support two compilation modes: `x_fast` and `torch_compile`, with `x_fast` set as the default.
- Caching: Our caching mechanism supports the `adaptive` algorithm, which allows you to adjust the `threshold` and `max_skip_steps` parameters:
  - `threshold`: Acceptable values range from `0.001` to `0.2`.
  - `max_skip_steps`: Acceptable values range from `1` to `5`.

We recommend using the default values (`threshold = 0.01`, `max_skip_steps = 4`), but you can experiment with different settings to balance speed and quality. In general, increasing the threshold results in more aggressive caching, which may improve performance at the expense of image quality. Note that, if you want to change the parameters of the nodes after the first execution, you have to restart the workflow.
Note: Caching and `x_fast` compilation require access to the Pruna Pro version.
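
To build intuition for how the two caching parameters interact, here is a toy sketch. It is not Pruna's implementation. It assumes a simplified rule: a step reuses the cached result when a measured relative change falls below `threshold`, and at most `max_skip_steps` steps may be skipped in a row. The function names and the `changes` input are hypothetical; only the parameter names, ranges, and defaults come from the text above.

```python
def validate_caching_params(threshold=0.01, max_skip_steps=4):
    """Check the parameters against the ranges listed above."""
    if not 0.001 <= threshold <= 0.2:
        raise ValueError("threshold must be between 0.001 and 0.2")
    if not isinstance(max_skip_steps, int) or not 1 <= max_skip_steps <= 5:
        raise ValueError("max_skip_steps must be an integer between 1 and 5")
    return threshold, max_skip_steps


def plan_steps(changes, threshold=0.01, max_skip_steps=4):
    """Decide per step whether to compute or reuse the cache (toy model).

    changes[i] is an assumed relative change measured at step i.
    """
    validate_caching_params(threshold, max_skip_steps)
    plan = []
    consecutive_skips = 0
    for index, change in enumerate(changes):
        can_reuse = (
            index > 0  # the first step always has to be computed
            and change < threshold
            and consecutive_skips < max_skip_steps
        )
        if can_reuse:
            plan.append("reuse")
            consecutive_skips += 1
        else:
            plan.append("compute")
            consecutive_skips = 0
    return plan
```

In this toy model, raising `threshold` turns more steps into `"reuse"`, which matches the trade-off described above: more aggressive caching, with a higher risk of quality loss.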
The node was tested on an NVIDIA L40S GPU. Below, we compare the performance of the base model with that of the models optimized with Pruna's compilation and caching nodes. We run two types of experiments: one using 50 denoising steps and another using 28 steps. We compare the iterations per second (as reported by `ComfyUI`) and the end-to-end time required to generate a single image.
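
If you want to reproduce this comparison on your own hardware, the two metrics translate into a speedup factor as sketched below. The helper names are hypothetical and the code contains no measurements; you would plug in the values you observe yourself.

```python
def speedup_from_times(base_seconds, optimized_seconds):
    """Speedup from end-to-end times: values above 1 mean the optimized model is faster."""
    if base_seconds <= 0 or optimized_seconds <= 0:
        raise ValueError("times must be positive")
    return base_seconds / optimized_seconds


def speedup_from_throughput(base_its_per_s, optimized_its_per_s):
    """Speedup from iterations per second: the ratio is inverted compared to times."""
    if base_its_per_s <= 0 or optimized_its_per_s <= 0:
        raise ValueError("throughput must be positive")
    return optimized_its_per_s / base_its_per_s
```

The two factors need not agree: the end-to-end time also includes work outside the denoising loop, so it can show a smaller speedup than the iterations-per-second figure.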
Hyperparameters: For caching, we use the default hyperparameters, which are `threshold = 0.01` and `max_skip_steps = 4`.
Hyperparameters: For the SD model, when the number of denoising steps is small, the caching node with the default hyperparameters tends not to provide substantial speedups. For that reason, for the SD model only, we set the `threshold` to `0.02`.
For questions, feedback or community discussions, feel free to join our Discord.
For bug reports or technical issues, please open an issue in this repository.