ComfyUI Extension: ComfyUI-JoyCaption

Authored by 1038lab

Joy Caption is a ComfyUI custom node powered by the LLaVA model for efficient, stylized image captioning. Caption Tools nodes handle batch image processing and automatic separation of caption text.

News & Updates

2025/06/07: Updated ComfyUI-JoyCaption to v1.1.1 (update.md)

2025/06/05: Updated ComfyUI-JoyCaption to v1.1.0 (update.md)

Added Caption Tools nodes

Features

Simple and user-friendly interface
Support for multiple caption types
Flexible length control
Memory optimization options
Automatic model download: models are downloaded and renamed on first use
Advanced customization options
Multiple precision options (fp32, bf16, fp16)

Installation

Clone this repository to your ComfyUI/custom_nodes directory:

cd ComfyUI/custom_nodes
git clone https://github.com/1038lab/ComfyUI-JoyCaption.git

Install the required dependencies:

cd ComfyUI/custom_nodes/ComfyUI-JoyCaption
pip install -r requirements.txt

Download Models

The models are downloaded automatically and renamed on first use. Alternatively, you can download them manually:

| Model | Link |
| ----- | ---- |
| JoyCaption Beta One | Download |
| JoyCaption Alpha Two | Download |

After downloading, place the model files in your ComfyUI/models/LLM/ directory.
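For a scripted manual download, the sketch below uses `snapshot_download` from the `huggingface_hub` package. The repository id and the target folder name are assumptions based on the default model name in this README; they are not confirmed here, and the node may expect a different folder name because it renames models on first use. Check both against the download links before relying on them.

```python
from pathlib import Path

# Assumption: Hugging Face repository id inferred from the default model name
# and the model author listed in Credits. Verify it before use.
REPO_ID = "fancyfeast/llama-joycaption-beta-one-hf-llava"


def model_dir(comfy_root, repo_id=REPO_ID):
    """Return <comfy_root>/models/LLM/<model name> (folder name is an assumption)."""
    return Path(comfy_root) / "models" / "LLM" / repo_id.split("/")[-1]


def download_model(comfy_root, repo_id=REPO_ID):
    """Fetch the model snapshot into the LLM directory and return its path."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = model_dir(comfy_root, repo_id)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `download_model("ComfyUI")` from the directory that contains your ComfyUI checkout would place the files under `ComfyUI/models/LLM/`.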

Basic Usage

Basic Node

Add the "JoyCaption" node from the 🧪AILab/📝JoyCaption category
Connect an image source to the node
Select the model file (defaults to llama-joycaption-beta-one-hf-llava)
Adjust the parameters as needed
Run the workflow

Advanced Node

Add the "JoyCaption (Advanced)" node from the 🧪AILab/📝JoyCaption category
Connect an image source to the node
Select the caption type
Adjust the parameters as needed
Run the workflow

Parameters

Basic Node

| Parameter | Description | Default | Range |
| --------- | ----------- | ------- | ----- |
| Model | The JoyCaption model to use | llama-joycaption-beta-one-hf-llava | - |
| Memory Control | Memory optimization settings | Default | Default (fp32), Balanced (8-bit), Maximum Savings (4-bit) |
| Caption Type | Caption style selection | Descriptive | Descriptive, Descriptive (Casual), Straightforward, Tags, Technical, Artistic |
| Caption Length | Output length control | medium | any, very short, short, medium, long, very long |
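If you generate workflows from a script, the defaults and allowed values in the table above can be captured in a small validation helper. This is only a sketch: the setting keys (`memory_control`, `caption_type`, `caption_length`) are hypothetical names chosen for illustration and are not the node's actual input names.

```python
# Allowed values copied from the Basic Node parameter table.
# The dictionary keys are hypothetical, not the node's real input names.
ALLOWED = {
    "memory_control": ["Default", "Balanced", "Maximum Savings"],
    "caption_type": [
        "Descriptive", "Descriptive (Casual)", "Straightforward",
        "Tags", "Technical", "Artistic",
    ],
    "caption_length": ["any", "very short", "short", "medium", "long", "very long"],
}

# Defaults from the same table.
DEFAULTS = {
    "memory_control": "Default",
    "caption_type": "Descriptive",
    "caption_length": "medium",
}


def resolve_settings(overrides=None):
    """Merge overrides into the defaults and reject values outside the table."""
    settings = {**DEFAULTS, **(overrides or {})}
    for key, value in settings.items():
        if key not in ALLOWED:
            raise ValueError(f"unknown setting: {key}")
        if value not in ALLOWED[key]:
            raise ValueError(f"{key}={value!r} is not one of {ALLOWED[key]}")
    return settings
```

For example, `resolve_settings({"caption_type": "Tags"})` keeps the other two defaults, while a misspelled caption length raises `ValueError` before the workflow is queued.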

Quantization Options

| Mode | Precision | Memory Usage | Speed | Quality | Recommended GPU |
| ---- | --------- | ------------ | ----- | ------- | --------------- |
| Default | fp32 | ~16GB | 1x | Best | 24GB+ |
| Default | bf16 | ~8GB | 1.5x | Excellent | 16GB+ |
| Default | fp16 | ~8GB | 2x | Very Good | 16GB+ |
| Balanced | 8-bit | ~4GB | 2.5x | Good | 12GB+ |
| Maximum Savings | 4-bit | ~2GB | 3x | Acceptable | 8GB+ |
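To see which rows of the table fit a given GPU, the lookup can be written as a short helper. This sketch simply encodes the "Recommended GPU" column; the function name `options_for` is hypothetical, and the figures are the table's own estimates rather than measurements.

```python
# Rows copied from the quantization table:
# (mode, precision, approximate memory in GB, recommended minimum VRAM in GB)
QUANT_ROWS = [
    ("Default", "fp32", 16, 24),
    ("Default", "bf16", 8, 16),
    ("Default", "fp16", 8, 16),
    ("Balanced", "8-bit", 4, 12),
    ("Maximum Savings", "4-bit", 2, 8),
]


def options_for(vram_gb):
    """Return the (mode, precision) pairs whose recommended VRAM fits vram_gb."""
    return [
        (mode, precision)
        for mode, precision, _memory_gb, min_vram_gb in QUANT_ROWS
        if vram_gb >= min_vram_gb
    ]
```

With a 12GB card, `options_for(12)` leaves only the Balanced and Maximum Savings rows. If PyTorch is installed, `torch.cuda.get_device_properties(0).total_memory` reports the total VRAM in bytes, which can be converted to GB and passed in.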

Advanced Node

| Parameter | Description | Default | Range |
| --------- | ----------- | ------- | ----- |
| Extra Options | Additional feature options | [] | Multiple options |
| Person Name | Name for person descriptions | "" | Any text |
| Max New Tokens | Maximum tokens to generate | 512 | 1-2048 |
| Temperature | Generation temperature | 0.6 | 0.0-2.0 |
| Top-p | Sampling parameter | 0.9 | 0.0-1.0 |
| Top-k | Top-k sampling | 0 | 0-100 |
| Precision | Use fp32 for best quality, bf16 for balanced performance, fp16 for maximum speed. 8-bit and 4-bit quantization provide significant memory savings with minimal quality impact | - | - |
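The three sampling parameters interact, so a toy illustration may help. The sketch below applies temperature, top-k, and top-p to a list of raw scores in plain Python and returns the candidate tokens that remain. It illustrates the general technique only and is not this node's code. It assumes the common convention that a top-k of 0 disables top-k filtering and that a temperature of 0 means always picking the highest-scoring token.

```python
import math


def filter_distribution(logits, temperature=0.6, top_p=0.9, top_k=0):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    if temperature <= 0:
        # Assumed convention: zero temperature puts all mass on the best token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}

    # Temperature: divide scores before softmax; lower values sharpen the peak.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    # Top-k: keep only the k most likely tokens (0 means no limit, by assumption).
    if top_k > 0:
        probs = probs[:top_k]
        kept_mass = sum(p for p, _ in probs)
        probs = [(p / kept_mass, i) for p, i in probs]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}
```

Lowering the temperature or top-p shrinks the set of candidate tokens, which is why lower values tend to give more repeatable captions and higher values more varied ones.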

Setting Tips

| Setting | Recommendation |
| ------- | -------------- |
| Memory Mode | Based on GPU VRAM: 24GB+ use Default, 12GB+ use Balanced, 8GB+ use Maximum Savings |
| Input Resolution | The model works best with images of 512x512 or higher resolution |
| Memory Usage | If you encounter memory issues, try using 4-bit mode or processing images at a lower resolution |
| Performance | For batch processing, consider reducing max_new_tokens for faster throughput |
| Temperature | Lower values (0.6-0.7) for more stable results, higher values (0.8-1.0) for more diverse results |

About Model

This implementation uses LLaVA-based image captioning models, optimized to provide fast and accurate image descriptions.

Model features:

Free and open weights
Uncensored content coverage
Broad style diversity (digital art, photoreal, anime, etc.)
Fast processing with bfloat16 precision
High-quality caption generation
Memory-efficient operation
Consistent results across various image types
Support for multiple caption styles
Optimized for training diffusion models

The models are trained on diverse datasets, ensuring:

Balanced representation across different image types
High accuracy in various scenarios
Robust performance with complex scenes
Support for both SFW and NSFW content
Equal coverage of different styles and concepts

Roadmap

Future plans include:

Support for more caption types
Additional optimization options
Improved memory management
Batch processing support

Credits

JoyCaption Model: fancyfeast
Created by: 1038lab

License

This repository's code is released under the GPL-3.0 License.