    ComfyUI-Concept-Diffusion

    A ComfyUI custom node implementation of the paper ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features.

    These nodes let you generate high-quality saliency maps that precisely locate textual concepts within images, using the attention layers of diffusion transformers.

    Features

    • ConceptAttention Node: Extract concept embeddings from diffusion transformer attention layers
    • Saliency Map Generation: Generate precise saliency maps for textual concepts
    • Zero-shot Segmentation: Perform zero-shot semantic segmentation using concept attention
    • Multi-concept Support: Handle multiple concepts simultaneously
    • Video Support: Works with video generation models (CogVideoX)
    • Visualization Tools: Overlay attention maps on original images
    • Flexible Thresholding: Multiple threshold methods for saliency maps

    Installation

    1. Clone this repository to your ComfyUI custom_nodes folder:

      # from your ComfyUI root:
      cd custom_nodes
      git clone https://github.com/Junst/ComfyUI-Concept-Diffusion.git
      cd ComfyUI-Concept-Diffusion
      
    2. Install dependencies:

      pip install -r requirements.txt
      
    3. Restart ComfyUI

    Nodes

    ConceptAttentionNode

    Main node for generating concept attention maps from diffusion models (an interface sketch follows the lists below).

    Inputs:

    • model: Diffusion model (MODEL)
    • clip: CLIP text encoder (CLIP)
    • image: Input image (IMAGE)
    • concepts: Comma-separated list of concepts (STRING)
    • num_inference_steps: Number of inference steps (INT)
    • seed: Random seed (INT, optional)

    Outputs:

    • concept_maps: Generated concept attention maps (CONCEPT_MAPS)
    • visualized_image: Visualization of all concept maps (IMAGE)
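
    For reference, this interface maps onto ComfyUI's standard custom-node conventions. The following is a minimal sketch assembled from the input/output lists above; the default values, category name, and method body are illustrative assumptions, not the node's actual code:

      class ConceptAttentionNode:
          # Declare the inputs listed above in ComfyUI's INPUT_TYPES format.
          @classmethod
          def INPUT_TYPES(cls):
              return {
                  "required": {
                      "model": ("MODEL",),
                      "clip": ("CLIP",),
                      "image": ("IMAGE",),
                      "concepts": ("STRING", {"default": "person, car, tree"}),
                      "num_inference_steps": ("INT", {"default": 20, "min": 1}),
                  },
                  "optional": {
                      "seed": ("INT", {"default": 0}),
                  },
              }

          RETURN_TYPES = ("CONCEPT_MAPS", "IMAGE")
          RETURN_NAMES = ("concept_maps", "visualized_image")
          FUNCTION = "generate"
          CATEGORY = "ConceptAttention"  # assumed category name

          def generate(self, model, clip, image, concepts, num_inference_steps, seed=0):
              ...  # extracts per-concept attention-output embeddings; see Technical Details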

    ConceptSaliencyMapNode

    Extract individual concept saliency maps and convert them to masks (a thresholding sketch follows the lists below).

    Inputs:

    • concept_maps: Concept attention maps (CONCEPT_MAPS)
    • concept_name: Name of concept to extract (STRING)
    • threshold: Threshold for mask generation (FLOAT)

    Outputs:

    • mask: Binary mask for the concept (MASK)
    • saliency_image: Saliency map visualization (IMAGE)
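
    Producing the mask amounts to normalizing the selected concept map and binarizing it at the threshold. A minimal sketch, assuming the map arrives as a 2-D torch tensor (the function name and normalization details are illustrative, not the node's actual internals):

      import torch

      def saliency_to_mask(saliency: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
          """Normalize a saliency map to [0, 1], then binarize at `threshold`."""
          s = saliency - saliency.min()
          s = s / (s.max() + 1e-8)           # guard against division by zero on flat maps
          return (s >= threshold).float()    # 1.0 where the concept is salient, else 0.0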

    ConceptSegmentationNode

    Perform zero-shot semantic segmentation using concept attention (a per-pixel labeling sketch follows the lists below).

    Inputs:

    • concept_maps: Concept attention maps (CONCEPT_MAPS)
    • image: Original image (IMAGE)
    • concepts: List of concepts for segmentation (STRING)

    Outputs:

    • segmentation_mask: Segmentation mask (MASK)
    • segmented_image: Colored segmentation result (IMAGE)
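
    Zero-shot segmentation reduces to a per-pixel argmax over the stacked concept maps: each pixel is labeled with the concept whose saliency is highest there. A minimal sketch (a hypothetical helper, not the node's actual code):

      import torch

      def segment(concept_maps: torch.Tensor) -> torch.Tensor:
          """concept_maps: (num_concepts, H, W) stack of saliency maps.
          Returns an (H, W) label map holding the index of the winning
          concept at each pixel."""
          return concept_maps.argmax(dim=0)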

    ConceptAttentionVisualizerNode

    Visualize concept attention maps overlaid on the original image (an alpha-blending sketch follows the lists below).

    Inputs:

    • concept_maps: Concept attention maps (CONCEPT_MAPS)
    • image: Original image (IMAGE)
    • overlay_alpha: Transparency of overlay (FLOAT)

    Outputs:

    • visualized_image: Image with attention overlay (IMAGE)
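
    The overlay is plain alpha blending of a colorized attention map onto the source image. A minimal sketch, assuming both tensors are (H, W, 3) float images in [0, 1] (names are illustrative):

      import torch

      def overlay(image: torch.Tensor, heatmap: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
          """Blend a colorized heatmap over the image. `alpha` is the
          overlay_alpha input: 0.0 shows only the image, 1.0 only the heatmap."""
          return (1.0 - alpha) * image + alpha * heatmap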

    Usage Example

    1. Load an image using LoadImage node
    2. Load a diffusion model (Flux, SD3, etc.) using CheckpointLoaderSimple (or the loader appropriate for your model format)
    3. Connect the model, CLIP, and image to ConceptAttentionNode
    4. Specify concepts like "person, car, tree, sky, building"
    5. Use ConceptSaliencyMapNode to extract specific concept maps
    6. Use ConceptSegmentationNode for zero-shot segmentation
    7. Use ConceptAttentionVisualizerNode for visualization
    8. Save results using SaveImage nodes

    Example Workflow

    See example_workflow.json for a complete ComfyUI workflow example.

    Testing

    Run the test script to verify the nodes work correctly:

      python test_nodes.py
    

    Technical Details

    This implementation is based on the ConceptAttention paper, which shows that:

    1. Multi-modal diffusion transformers (DiTs) learn rich, highly interpretable representations
    2. Linear projections in the attention output space produce sharper saliency maps than commonly used cross-attention maps (see the sketch after this list)
    3. Concept embeddings can be extracted without any additional training
    4. The method works for both image and video generation models
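
    The core projection is small: dot each concept's attention-output embedding against the attention-output embeddings of the image patches, then softmax over patches to get one saliency map per concept. A minimal sketch of that step (shapes and names are illustrative, not the paper's reference implementation):

      import torch

      def concept_saliency(image_outputs: torch.Tensor, concept_outputs: torch.Tensor) -> torch.Tensor:
          """image_outputs:   (num_patches, d) attention outputs of image tokens.
          concept_outputs: (num_concepts, d) attention outputs of concept tokens.
          Returns (num_concepts, num_patches) maps, softmaxed over patches."""
          logits = concept_outputs @ image_outputs.T   # similarity in attention output space
          return torch.softmax(logits, dim=-1)         # one saliency map per concept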

    Supported Models

    • Flux (Flux1-dev, Flux1-schnell)
    • Stable Diffusion 3/3.5
    • CogVideoX (for video)
    • Other DiT-based diffusion models

    Based on

    ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features

    License

    MIT License