ComfyUI Extension: ComfyUI-QwenClip

Authored by isaac-mcfadyen

Created

Updated

0 stars

A variety of random text encoder tools intended for use with ComfyUI and Qwen Image/Qwen Image Edit. More (may) be added as I try out various modifications to Qwen Image.

Custom Nodes (0)

    README

    ComfyUI-QwenClip

    A variety of random text encoder tools intended for use with ComfyUI and Qwen Image/Qwen Image Edit. More (may) be added as I try out various modifications to Qwen Image.

    CLIPSetQwenImagePrompt/CLIPSetQwenImageEditPrompt

    Qwen's text encoder, Qwen2.5-VL-7B, uses a fixed prefix of Describe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background when encoding the prompt.

    This node modifies this prefix, which can be useful for guiding the model to pay attention to specific attributes and/or features of the image.

    It takes the CLIP as input as well as a template to apply. Qwen's default template is normally hardcoded in ComfyUI here and can be used for inspiration.

    Example

    nvidia/Cosmos-Reason1-7B seems to work well with Qwen Image since it's a variant of Qwen 2.5 VL.

    Using the following prefix seems to improve the model's consistency with objects in the scene (note the <think> format part is directly from NVIDIA docs and improves quality): Describe only concrete, visible attributes needed to render the image. Emphasize physical consistency: correct counts/anatomy, coherent perspective and scale, proper contact and occlusion, consistent shadows and lighting direction/color, plausible materials and reflections, gravity and rigidity. Avoid contradictions, floating objects, and narrative. Be concise but exhaustive. Answer the question in the following format: <think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>.