A ComfyUI extension for chatting with your images. Runs on your own system, no external services used, no filter.
Uses the LLaVA multimodal LLM so you can give instructions or ask questions in natural language. It's maybe as smart as GPT3.5, and it can see.
Try asking for:
The model is quite capable of analysing NSFW images and returning NSFW replies.
In my experience it is unlikely to return an NSFW response to an SFW image. This seems to be because (1) the model's output is strongly conditioned on the contents of the image, so it's hard to activate concepts that aren't pictured, and (2) the LLM has had a hefty dose of safety-training.
This is probably for the best in general. But you will not have much success asking NSFW questions about SFW images.
git clone https://github.com/ceruleandeep/ComfyUI-LLaVA-Captioner into your custom_nodes folder, e.g. custom_nodes\ComfyUI-LLaVA-Captioner
Change to the custom_nodes/ComfyUI-LLaVA-Captioner folder you just created, e.g. cd C:\ComfyUI_windows_portable\ComfyUI\custom_nodes\ComfyUI-LLaVA-Captioner or wherever you have it installed
Run python install.py
Put the model files in models\llama
Add the node via image -> LlavaCaptioner
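If the node cannot find a model, a quick check of the models folder can rule out a misplaced download. The following is a minimal sketch, not part of the extension: list_model_files is a hypothetical helper, and the models/llama path in the example assumes you run it from the folder that contains models, which may differ on your install.

```python
from pathlib import Path


def list_model_files(models_dir):
    """Return the sorted names of regular files in models_dir.

    Returns an empty list if the folder does not exist.
    """
    folder = Path(models_dir)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.iterdir() if p.is_file())


if __name__ == "__main__":
    # Assumption: the current directory contains models/llama; adjust as needed.
    found = list_model_files(Path("models") / "llama")
    print(found if found else "No model files found in models/llama")
```

An empty result means either the folder is missing or nothing has been placed in it yet.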
Supports tagging and outputting multiple batched inputs.
This is easy to install but getting it to use the GPU can be a saga.
GPU inference time is 4 secs per image on an RTX 4090 with 4GB of VRAM to spare, and 8 secs per image on a MacBook Pro M1. CPU inference time is 25 secs per image. If your inference times are closer to 25 than to 5, you're probably doing CPU inference.
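The rule of thumb above can be written down as a tiny check. This is only a sketch of that heuristic: likely_backend is a hypothetical helper, and the reference values 5 and 25 are the ones quoted in the text, measured on the author's hardware, so treat the answer as a hint rather than a diagnosis.

```python
GPU_REFERENCE_SECS = 5.0   # "closer to 5" in the rule of thumb above
CPU_REFERENCE_SECS = 25.0  # CPU figure quoted above


def likely_backend(seconds_per_image):
    """Guess 'cpu' or 'gpu' from whichever reference timing is nearer."""
    to_cpu = abs(seconds_per_image - CPU_REFERENCE_SECS)
    to_gpu = abs(seconds_per_image - GPU_REFERENCE_SECS)
    # Ties are reported as 'gpu'; only a clearly CPU-like timing returns 'cpu'.
    return "cpu" if to_cpu < to_gpu else "gpu"


if __name__ == "__main__":
    for secs in (4, 8, 20, 25):
        print(secs, likely_backend(secs))
```

Time a few images and feed in the average; a single slow first run may include model loading.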
Unfortunately the multimodal models in the Llama family need about a 4x larger context size than the text-only ones, so the llama.cpp promise of fast LLM inference on CPUs hasn't quite arrived yet. If you have a GPU, put it to work.
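As a rough illustration of the two knobs involved, here is a sketch that builds load options for the llama-cpp-python bindings. It assumes the model is loaded through llama_cpp.Llama, which this document does not confirm. n_ctx and n_gpu_layers are real parameters of that class, but llama_load_options is a hypothetical helper and the 2048 context size is an illustrative value, not a setting taken from this extension.

```python
def llama_load_options(model_path, use_gpu=True, n_ctx=2048):
    """Build keyword arguments for llama_cpp.Llama (llama-cpp-python).

    n_ctx=2048 is an illustrative value chosen to leave room for image
    embeddings; it is an assumption, not this extension's setting.
    """
    return {
        "model_path": model_path,
        "n_ctx": n_ctx,
        # -1 asks llama.cpp to offload every layer to the GPU; 0 keeps all on CPU.
        "n_gpu_layers": -1 if use_gpu else 0,
    }


if __name__ == "__main__":
    # Hypothetical path; with llama-cpp-python installed you would pass these
    # options as: Llama(**llama_load_options("path/to/model.gguf"))
    print(llama_load_options("path/to/model.gguf"))
```

Offloading only takes effect if llama-cpp-python was built with GPU support; otherwise the setting is ignored and inference stays on the CPU.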