ComfyUI Extension: ComfyUI-easycontrol

Authored by jax-explorer

Created 8 months ago

Updated 8 months ago

187 stars

ComfyUI EasyControl Nodes is a collection of nodes for ComfyUI that allows you to load and use EasyControl models.

ComfyUI EasyControl Nodes

ComfyUI implementation of EasyControl: https://github.com/Xiaojiu-z/EasyControl

Requires 40GB of VRAM to run (24GB with CPU offload enabled).

The Flux model is downloaded automatically (requires 50GB of disk space).

LoRA files must be downloaded to models/loras.

Supported LoRAs: https://huggingface.co/Xiaojiu-Z/EasyControl/tree/main/models
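The requirements above can be checked before the first run. The following is a minimal sketch using only the Python standard library. The function names, the constants, and the assumption that LoRA files use the `.safetensors` extension are illustrative and not part of the extension itself.

```python
import shutil
from pathlib import Path

# Figures taken from the notes above; treat them as approximate.
FLUX_DISK_GB = 50       # disk space for the auto-downloaded Flux model
VRAM_FULL_GB = 40       # VRAM needed without CPU offload
VRAM_OFFLOAD_GB = 24    # VRAM needed with CPU offload enabled


def free_disk_gb(path="."):
    """Return free disk space at `path` in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def needs_cpu_offload(vram_gb):
    """Return True if CPU offload is required for the given VRAM size.

    Raises RuntimeError if even CPU offload would not be enough.
    """
    if vram_gb >= VRAM_FULL_GB:
        return False
    if vram_gb >= VRAM_OFFLOAD_GB:
        return True
    raise RuntimeError(f"{vram_gb}GB VRAM is below the {VRAM_OFFLOAD_GB}GB minimum")


def list_loras(comfy_root):
    """List LoRA files in <comfy_root>/models/loras (hypothetical helper).

    Assumes the LoRA files use the .safetensors extension.
    """
    lora_dir = Path(comfy_root) / "models" / "loras"
    if not lora_dir.is_dir():
        return []
    return sorted(p.name for p in lora_dir.glob("*.safetensors"))


if __name__ == "__main__":
    print("enough disk for Flux:", free_disk_gb(".") >= FLUX_DISK_GB)
    print("CPU offload needed on a 24GB card:", needs_cpu_offload(24))
    print("LoRAs found:", list_loras("."))
```

The VRAM size is passed in as a plain number so that the sketch does not depend on any GPU library.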

Example results (images not reproduced here): Migrate subjects, Style Photo, and multi-view.

Run online:

Ghibli online workflow: https://www.comfyonline.app/explore/6cd58cc5-5d17-4ad8-9e10-91681085902c

Ghibli online app: https://www.comfyonline.app/explore/app/gpt-ghibli-style-image-generate

Migrate subjects online workflow: https://www.comfyonline.app/explore/02c7d12b-19f5-46e4-af3d-b8110fff0c81

style_photo online workflow: https://www.comfyonline.app/explore/125c295f-2f1f-4fbc-a1c8-b66c9b1265a3

Multi-view online workflow: https://www.comfyonline.app/explore/ad7f29a1-af00-4367-b211-0b1f23254e3b

comfyonline (https://www.comfyonline.app) is a ComfyUI cloud website where you can run ComfyUI workflows online and deploy APIs with one click. It provides an online environment for running your ComfyUI workflows and can generate APIs for AI application development.

EasyControl analysis

EasyControl: Empowering Diffusion Transformers with Efficient and Flexible Control

In recent years, AI image generation based on diffusion models has made rapid progress. From DALL-E to Stable Diffusion, and on to newer Transformer-based models (like components of Google Imagen, Alibaba's AnyText, PixArt-α, and the latest Flux.1), image quality and text adherence have improved quickly. However, generating beautiful images is often not enough; users frequently need finer control, such as specifying human poses, preserving object outlines, or controlling scene depth.

In the era of UNet-based Stable Diffusion, the emergence of ControlNet [83] was a milestone. It introduced spatial conditional control by adding a trainable adapter network while preserving the powerful generation capabilities of the original model. Subsequently, techniques like IP-Adapter [80] further enabled control over subject content. These solutions greatly enriched the Stable Diffusion ecosystem.

However, as the technological frontier shifts towards more computationally efficient and scalable Diffusion Transformer (DiT) architectures (like Flux.1 [29], SD3 [9]), migrating control capabilities efficiently and flexibly has become a new challenge. Existing control methods for DiTs often face computational bottlenecks (the quadratic complexity of Transformer's attention mechanism), difficulties in combining multiple conditions (especially with poor zero-shot performance), and compatibility issues with popular community-customized models (like various style LoRAs).
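To make the quadratic cost concrete, the sketch below counts the query-key pairs evaluated by one full self-attention layer as condition tokens are appended to the sequence. The 16-pixel-per-token patching and the 512-token text length are assumptions chosen for illustration (they roughly match Flux-style models), and the function names are hypothetical.

```python
def image_tokens(height_px, width_px, px_per_token=16):
    """Token count for an image, assuming one token per px_per_token x px_per_token patch."""
    return (height_px // px_per_token) * (width_px // px_per_token)


def attention_pairs(num_tokens):
    """Query-key pairs evaluated by full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


text = 512                          # assumed text-token count
noise = image_tokens(1024, 1024)    # tokens of the image being denoised
cond = image_tokens(1024, 1024)     # one full-resolution condition image

base = attention_pairs(text + noise)
one_cond = attention_pairs(text + noise + cond)
two_cond = attention_pairs(text + noise + 2 * cond)

print(f"no condition:   {base:,} pairs")
print(f"one condition:  {one_cond:,} pairs ({one_cond / base:.2f}x)")
print(f"two conditions: {two_cond:,} pairs ({two_cond / base:.2f}x)")
```

Under these assumptions, appending a single full-resolution condition more than triples the attention work, which is why naive token concatenation scales poorly with multiple conditions.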

It is against this backdrop that the paper "EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer" emerges, introducing a novel, efficient, and flexible unified conditional control framework for DiT models.

Limitations of Traditional Control Schemes (e.g., ControlNet)

Before delving into EasyControl, let's review the potential issues traditional schemes like ControlNet might encounter in the DiT era:

Architectural Mismatch: ControlNet was designed based on the UNet architecture. Its encoder directly copies the UNet encoder, injecting conditional features into the UNet decoder via zero-convolution layers. This design cannot be directly transplanted onto the pure Transformer architecture of DiTs.
Computational Cost: ControlNet itself has a large parameter count, because it copies the base model's encoder (on the order of hundreds of millions of parameters for an SD1.5 ControlNet, and billions for ControlNets built for Flux-scale models). During inference, it requires running two large networks, leading to significant computational overhead. Directly applying a similar approach to DiTs by introducing large adapter networks would exacerbate the cost issue.
Compatibility: While powerful, ControlNet sometimes exhibits conflicts or degraded performance when combined with numerous community fine-tuned models or LoRAs.
Resolution Sensitivity: The way ControlNet processes condition maps might lead to information loss or weakened control when the input condition resolution differs significantly from the generation resolution.
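The zero-convolution injection mentioned under Architectural Mismatch can be illustrated with a toy example. This is a simplified sketch, not ControlNet's code: a 1x1 convolution is modeled as a single scalar weight and bias applied to a plain Python list, and all names are hypothetical. It shows why zero-initialized weights leave the base model's output untouched at the start of training.

```python
def zero_conv(features, weight, bias):
    """Toy 1x1 convolution: scale each feature by `weight` and add `bias`."""
    return [weight * f + bias for f in features]


def inject(decoder_features, control_features, weight=0.0, bias=0.0):
    """Add control features to decoder features through a zero-initialized conv."""
    residual = zero_conv(control_features, weight, bias)
    return [d + r for d, r in zip(decoder_features, residual)]


decoder = [0.5, -1.0, 2.0]    # features from the frozen base model
control = [3.0, 3.0, -4.0]    # features from the trainable control branch

# At initialization (weight = bias = 0) the injection is a no-op.
print(inject(decoder, control))
# Once training moves the weight away from zero, the condition shifts the features.
print(inject(decoder, control, weight=0.5))
```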

EasyControl's Core Innovations

EasyControl cleverly leverages the characteristics of the DiT architecture, addressing the aforementioned challenges through three key innovations, achieving efficient, flexible, and plug-and-play control:

Lightweight Condition Injection LoRA Module (CIL):
- Isolated Design: The core idea of EasyControl is "isolation." It avoids modifying the main DiT backbone that processes text and noise. Instead, it introduces an independent "Condition Branch" for the conditional signal (e.g., Canny edge map, face image).
- Targeted LoRA: It does not use traditional LoRA, which acts on the model's backbone for fine-tuning. Instead, LoRA (Low-Rank Adaptation) is applied exclusively within this new condition branch, efficiently learning how to encode and align the conditional information. The weights of the original text and noise branches remain completely frozen.
- Advantages: This design offers multiple benefits:
  - Lightweight: Each condition control module has a very small parameter count (around 15M in the paper), far less than ControlNet.
  - Plug-and-Play: Since the backbone network is untouched, EasyControl modules can be easily loaded like plugins and coexist harmoniously with various customized base models or style LoRAs, minimizing conflicts.
  - Zero-Shot Multi-Condition: Most impressively, even though each condition module is trained independently, the framework supports combining multiple different types of conditions (e.g., pose + face) in a zero-shot manner for complex control, achieving stable results. This is ensured by the subsequent Causal Mutual Attention mechanism.
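The Targeted LoRA idea above can be sketched in a few lines. This is a toy illustration in plain Python under stated assumptions, not EasyControl's code: the class name, the `is_condition` flag, and the example dimensions (hidden size 3072, rank 128) are chosen for illustration. The point is that text and noise tokens pass through the frozen weights only, while condition tokens additionally receive the low-rank update.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters of one LoRA pair (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)


class BranchAwareLinear:
    """Frozen linear projection with a LoRA delta applied only to condition tokens."""

    def __init__(self, weight, lora_a, lora_b, scale=1.0):
        self.weight = weight    # frozen base weight, shape d_out x d_in
        self.lora_a = lora_a    # trainable down-projection, shape rank x d_in
        self.lora_b = lora_b    # trainable up-projection, shape d_out x rank
        self.scale = scale

    def __call__(self, token, is_condition):
        base = matvec(self.weight, token)
        if not is_condition:
            # Text and noise tokens see only the frozen backbone weights.
            return base
        delta = matvec(self.lora_b, matvec(self.lora_a, token))
        return [b + self.scale * d for b, d in zip(base, delta)]


layer = BranchAwareLinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],    # identity, standing in for a frozen projection
    lora_a=[[1.0, 1.0]],                # rank 1
    lora_b=[[0.5], [-0.5]],
)
token = [2.0, 4.0]
print(layer(token, is_condition=False))    # backbone path, unchanged by LoRA
print(layer(token, is_condition=True))     # condition path with the LoRA delta
print(lora_param_count(d_in=3072, d_out=3072, rank=128))
```

Because the backbone path never touches the LoRA matrices, removing or swapping a condition module cannot change how text and noise tokens are processed, which is the basis of the plug-and-play claim.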
Position-Aware Training Paradigm (PATP):
- Efficient Training: To reduce training costs, EasyControl downsamples the input condition images to a fixed low resolution (e.g., 512x512) during training.
- Cross-Resolution Control: How can it accurately control high-resolution generation after low-resolution training? EasyControl introduces Position-Aware Interpolation (PAI). For spatially strong conditions (like Canny, Depth), it intelligently interpolates the position embeddings based on the scaling factor between the original and resized condition maps. This ensures that even when seeing a low-resolution condition, the model understands the correct spatial location of its features in the final high-resolution output. For subject conditions (like Face), a simpler PE Offset strategy is used for distinction.
- Flexibility: PATP enables EasyControl to generate images at arbitrary resolutions and aspect ratios while maintaining good conditional control, breaking the limitation of fixed resolutions.
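The two position strategies described above can be sketched as follows. This is an illustrative reading of the description rather than the paper's code: grids are measured in tokens, the function names are hypothetical, and shifting along the column axis for the PE Offset is an assumption.

```python
def interpolated_positions(orig_h, orig_w, cond_h, cond_w):
    """Position ids for a resized condition grid, in original-grid coordinates.

    Token (i, j) of the cond_h x cond_w grid is placed at (i * s_h, j * s_w),
    where s_h = orig_h / cond_h and s_w = orig_w / cond_w are the scaling factors.
    """
    s_h = orig_h / cond_h
    s_w = orig_w / cond_w
    return [(i * s_h, j * s_w) for i in range(cond_h) for j in range(cond_w)]


def offset_positions(cond_h, cond_w, offset):
    """Position ids for a subject condition, shifted by a fixed column offset
    so they do not coincide with the positions used by the noise tokens."""
    return [(i, j + offset) for i in range(cond_h) for j in range(cond_w)]


# A 64 x 64 token target grid controlled by a 32 x 32 token condition grid.
spatial = interpolated_positions(64, 64, 32, 32)
print(spatial[:3], "...", spatial[-1])

# A 2 x 2 subject condition placed outside a 64-column target grid.
print(offset_positions(2, 2, offset=64))
```

With interpolation, the low-resolution condition tokens are spread across the full target grid, so each one still lines up with the region of the output it describes.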
Causal Attention & KV Cache:
- Accelerated Inference: The computational bottleneck in Transformers is Self-Attention. EasyControl utilizes Causal Attention mechanisms (with different masking strategies for training and inference), which decouples the computation of the condition branch from the denoising timestep.
- KV Cache: Building on this, EasyControl applies a KV Cache to the condition branches, which the paper describes as the first successful use of this strategy in conditional diffusion models. At the beginning of inference (t=0), the system computes and caches the Key and Value pairs of all condition branches once. In all subsequent denoising steps (t≥1), these cached values are reused directly, avoiding a large amount of redundant computation. This significantly reduces inference latency, and the saving grows with the number of sampling steps. According to the paper, the full EasyControl is 58% (single-condition) to 75% (dual-condition) faster than the version without PATP and KV Cache.
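The masking and caching described above can be sketched together. This is a simplified reading under stated assumptions, not the paper's implementation: text and noise tokens are treated as one segment, each condition branch may attend only to itself, and `toy_project_kv` is a hypothetical stand-in for the real key/value projections. Because condition queries never look at the noise tokens, their keys and values do not depend on the denoising step and can be computed once.

```python
def build_mask(segments):
    """Boolean attention mask; mask[q][k] is True when query q may attend to key k.

    `segments` is a list of (name, length). The first segment holds the text and
    noise tokens; every later segment is one condition branch.
    """
    labels = [idx for idx, (_, length) in enumerate(segments) for _ in range(length)]
    mask = []
    for q in labels:
        if q == 0:
            # Text/noise queries attend to everything, including all conditions.
            mask.append([True] * len(labels))
        else:
            # A condition query attends only to tokens of its own branch.
            mask.append([k == q for k in labels])
    return mask


class ConditionKVCache:
    """Compute condition keys/values once and reuse them on every later step."""

    def __init__(self, project_kv):
        self.project_kv = project_kv    # callable: condition tokens -> (keys, values)
        self.cached = None
        self.projection_calls = 0

    def get(self, condition_tokens):
        if self.cached is None:
            self.cached = self.project_kv(condition_tokens)
            self.projection_calls += 1
        return self.cached


def toy_project_kv(tokens):
    """Stand-in for the condition branch's key/value projections."""
    return [2.0 * t for t in tokens], [t + 1.0 for t in tokens]


mask = build_mask([("text+noise", 2), ("canny", 2), ("face", 1)])
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))

cache = ConditionKVCache(toy_project_kv)
for step in range(25):                      # 25 denoising steps
    keys, values = cache.get([0.5, 1.5])    # condition tokens do not change per step
print("condition projections computed:", cache.projection_calls)
```

In this sketch the two condition branches are also hidden from each other, which mirrors how independently trained modules can be combined without having been trained together.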

EasyControl vs. Traditional ControlNet Comparison

Conclusion

EasyControl introduces the first truly efficient, flexible, and plug-and-play unified control framework for Diffusion Transformers. Through clever architectural design (CIL), training strategy (PATP), and inference optimization (KV Cache), it not only solves the core technical challenges of DiT control but also keeps the parameter count and computational cost extremely low. Its excellent zero-shot multi-condition combination capability and compatibility with community-customized models herald a new era of prosperity for controllable generation within the DiT ecosystem.

Although the paper also points out limitations in handling conflicting inputs and extreme resolutions, EasyControl undoubtedly paves the way for more powerful and user-friendly controllable image generation models, marking a significant milestone in the development of DiTs.