ComfyUI Extension: VLM_nodes

Authored by gokayfem

Created

Updated

432 stars

Custom Nodes for Vision Language Models (VLM) , Large Language Models (LLM), Image Captioning, Automatic Prompt Generation, Creative and Consistent Prompt Suggestion, Keyword Extraction

Custom Nodes (0)

    README

    <div align="center"> <h1> 👁️ VLM Nodes</h1> <p align="center"> <b> 🔽Examples below</b> • 📙 <a href="https://github.com/gokayfem/Awesome-VLM-Architectures">Visit my other repo to learn more about Vision Language Models</a> </p> </div> <br/>

    Usage

    • For Windows and Linux
    cd custom_nodes
    git clone https://github.com/gokayfem/ComfyUI_VLM_nodes.git
    

    Acknowledgements

    If you get errors related to llama-cpp-python or if it is not using GPU.
    I recommend installing it with the right arguments provided in this link llama-cpp-python

    VLM Nodes

    Utilizes llama-cpp-python for integration of LLaVa models. You can load and use any VLM with LLaVa models in GGUF format with this nodes.
    You need to download the model similar to ggml-model-q4_k.gguf and it's clip projector similar to mmproj-model-f16.gguf from this repositories (in the files and versions).
    python=>3.9 is necessary.
    Put all of the files inside models/LLavacheckpoints
    Note that every model's clip projector is different!

    Structured Output

    Getting structured outputs can be quite challenging through prompt engineering alone.
    I've added the Structured Output node to VLM Nodes.
    Now, you can obtain your answers reliably.
    You can extract entities, numbers, classify prompts with given classes, and generate one specific prompt. These are just a few examples.
    You can add additional descriptions to fields and choose the attributes you want it to return.
    structured

    Image to Music

    Utilizes VLMs, LLMs and AudioLDM-2 to make music from images.
    Use SaveAudioNode to save the music inside output folder.
    It will automatically download the necessary files into models/LLavacheckpoints/files_for_audioldm2

    https://github.com/gokayfem/ComfyUI_VLM_nodes/assets/88277926/2c5bdcde-d637-49ad-b317-14ac0a12f7df

    LLM to Music

    Utilizes Chat Musician, an open-source LLM that integrates intrinsic musical abilities.
    ChatMusician Demo Page
    You can try prompts from this demo page.

    Download the GGUF file
    ChatMusician GGUF Files
    ChatMusician.Q5_K_M.gguf or ChatMusician.Q5_K_S.gguf recommended

    BIG BIG BIG Warning: It does NOT work perfectly, if you got errors accept the error queue prompt again with the same settings!!

    https://github.com/gokayfem/ComfyUI_VLM_nodes/assets/88277926/7f22d4f2-b998-402e-88c8-c382a730d624

    InternLM-XComposer2-VL Node

    Utilizes AutoGPTQ for integration of InternLM-XComposer2-VL Model. It will automatically download the necessary files into models/LLavacheckpoints/files_for_internlm. This is one of the best models for visual perception.
    Important Note : This model is heavy.

    Automatic Prompt Generation and Suggestion Nodes

    Get Keyword node: It can take LLava outputs and extract keywords from them.
    LLava PromptGenerator node: It can create prompts given descriptions or keywords using (input prompt could be Get Keyword or LLava output directly).
    Suggester node: It can generate 5 different prompts based on the original prompt using consistent in the options or random prompts using random in the options.

    • Works best with LLava 1.5 and 1.6.

    Play with the temperature for creative or consistent results. Higher the temperature more creative are the results.
    If you want to dive deep into LLM Settings

    Outputs are JSON looking texts, you can see them as a text using JsonToText Node.
    You can see any string output with ViewText Node
    You can set any string input using SimpleText Node
    Utilizes llama-cpp-agents for getting structured outputs.

    LLM Prompt Generation from text nodes

    LLM PromptGenerator node: Qwen 1.8B Stable Diffusion Prompt
    IF prompt MKR
    This LLM's works best for now for prompt generation.
    LLMSampler node: You can chat with any LLM in gguf format, you can use LLava models as an LLM also.

    API PromptGenerator node: You can use ChatGPT and DeepSeek API's to create prompts. https://platform.deepseek.com/ gives 10m free tokens.

    • ChatGPT-4
    • ChatGPT-3.5
    • DeepSeek You can use them for simple chat also there is an option in the node.

    UForm-Gen2 Qwen Node

    UForm-Gen2 is an extremely fast small generative vision-language model primarily designed for Image Captioning and Visual Question Answering.
    UForm-Gen2 Qwen
    It will automatically download the necessary files into models/LLavacheckpoints/files_for_uform_gen2_qwen

    Kosmos-2 Node

    Kosmos-2: Grounding Multimodal Large Language Models to the World. Kosmos-2 It will automatically download the necessary files into models/LLavacheckpoints/files_for_kosmos2

    moondream1 and moondream2 Node

    This node is designed to work with the Moondream model, a powerful small vision language model built by @vikhyatk using SigLIP, Phi-1.5, and the LLaVa training dataset. The model boasts 1.6 billion parameters and is made available for research purposes only; commercial use is not allowed.

    moondream2 is a small vision language model designed to run efficiently on edge devices.

    It will automatically download the necessary files into models/LLavacheckpoints/files_for__moondream and models/LLavacheckpoints/files_for_moondream2

    JoyTag Node

    @fpgamine's JoyTag is a state of the art AI vision model for tagging images, with a focus on sex positivity and inclusivity.
    It uses the Danbooru tagging schema, but works across a wide range of images, from hand drawn to photographic. It will automatically download the necessary files into models/LLavacheckpoints/files_for_joytagger

    Qwen2-VL Node

    Utilizes the latest Qwen2-VL series of models, which are state-of-the-art vision language models supporting various resolutions, ratios, and languages. The models excel at:

    • Understanding images of various resolutions & ratios
    • Complex visual reasoning and decision making
    • Multilingual support (English, Chinese, European languages, Japanese, Korean, Arabic, Vietnamese, etc.)

    Available models include 2B, 7B, and 72B parameter versions, with standard, AWQ, and GPTQ quantized variants. It will automatically download the necessary files into models/LLavacheckpoints/files_for_qwen2vl.

    Important Note: Larger models (7B, 72B) require significant VRAM. Choose quantized versions (AWQ, GPTQ) for reduced memory usage.

    Link to Qwen2-VL Models

    Example LLaVa Nodes

    image

    Example Image to Music

    image

    Example InternLM-XComposer Node

    image

    Example Using Automatic Prompt Generation

    image

    LLM Nodes

    VLM + LLM

    Example UForm-Gen2 Qwen Node

    image

    Example Kosmos-2 Node

    image

    Example moondream

    image

    Example Joytag

    image

    Example Prompt Generation

    image

    Example SimpleChat

    image

    Example LLava Sampler Advanced

    image