This is an extension for ComfyUI that makes it possible to use LLM models provided by Ollama, such as Gemma, Llava (multimodal), Llama2, Llama3, or Mistral. LLaVa (Large Language and Vision Assistant) in particular, although trained on a relatively small dataset, demonstrates exceptional capabilities in understanding images and answering questions about them, exhibiting behavior similar to multimodal models such as GPT-4 even when presented with unseen images and instructions.
Now, nodes can accept Pydantic Schemas as input, making it easier to define structured outputs.
Example use case:
To generate Pydantic schemas, you can use the Python Interpreter Node by Christian Byrne.
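For illustration, the sketch below shows one way such a schema could be produced (assuming Pydantic v2; the class and field names are made up for the example) and then used as a structured output definition, either passed directly to a node or built inside the Python Interpreter Node.

```python
from pydantic import BaseModel, Field


class ImageDescription(BaseModel):
    """Hypothetical structured output for an image description."""

    subject: str = Field(description="Main subject of the image")
    style: str = Field(description="Overall style, e.g. photo or illustration")
    tags: list[str] = Field(description="Short keywords describing the image")


# Nodes that accept a Pydantic schema can take the class itself;
# model_json_schema() also yields a plain dict usable as a
# structured output definition.
schema_dict = ImageDescription.model_json_schema()
print(schema_dict)
```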
Before installing this extension, set up Ollama by following the official Ollama installation guide.

The easiest way to install the extension is through ComfyUI Manager. To install it manually instead, clone the repository into your ComfyUI `custom_nodes` folder:

`git clone https://github.com/alisson-anjos/ComfyUI-Ollama-Describer.git`

The resulting path should be `custom_nodes\ComfyUI-Ollama-Describer`. Then install the dependencies by running `install.bat` or:

`pip install -r requirements.txt`
Node parameters:

- `model`: Select LLaVa models (7B, 13B, etc.).
- `custom_model`: Specify a custom model from Ollama's library.
- `api_host`: Define the API address (e.g., `http://localhost:11434`).
- `timeout`: Max response time before canceling the request.
- `temperature`: Controls randomness (0 = factual, 1 = creative).
- `top_k`, `top_p`, `repeat_penalty`: Fine-tune text generation.
- `max_tokens`: Maximum response length in tokens.
- `seed_number`: Set seed for reproducibility (-1 for random).
- `keep_model_alive`: Defines how long the model stays loaded after execution:
  - `0`: Unloads immediately.
  - `-1`: Stays loaded indefinitely.
  - A positive value (e.g., `10`) keeps it in memory for that number of seconds.
- `prompt`: The main instruction for the model.
- `system_context`: Provide additional context for better responses.
- `structured_output_format`: Accepts either a Python dictionary or a valid JSON string to define the expected response structure.
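As a rough reference for what these inputs correspond to, here is a minimal sketch of an equivalent direct call with the `ollama` Python client, assuming an Ollama version recent enough to support structured outputs; the node performs similar calls internally, and the model tag, image path, and schema below are placeholder values, not defaults of this extension.

```python
import ollama

# Placeholder structured output definition; structured_output_format
# accepts the same kind of dict, or the equivalent JSON string.
output_format = {
    "type": "object",
    "properties": {
        "caption": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["caption", "tags"],
}

client = ollama.Client(host="http://localhost:11434")  # api_host
response = client.chat(
    model="llava:7b",  # model / custom_model
    messages=[
        {"role": "system", "content": "You describe images."},  # system_context
        {
            "role": "user",
            "content": "Describe this image.",  # prompt
            "images": ["/path/to/image.png"],   # placeholder image path
        },
    ],
    format=output_format,  # structured_output_format
    options={
        "temperature": 0.4,
        "top_k": 40,
        "top_p": 0.9,
        "repeat_penalty": 1.1,
        "num_predict": 200,  # max_tokens
        "seed": 42,          # seed_number
    },
    keep_alive=0,  # keep_model_alive: unload immediately after the call
)
print(response["message"]["content"])
```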
Ollama Image Captioner generates `.txt` caption files saved in the output directory.

Extra caption options work in conjunction with Ollama Image Captioner to provide additional customization for generated captions:

- Allows fine-tuning of captions by enabling or disabling specific details like lighting, camera angle, composition, and aesthetic quality.
- Useful for controlling caption verbosity, accuracy, and inclusion of metadata like camera settings or image quality.
- Helps tailor the output for different applications such as dataset labeling, content creation, and accessibility enhancements.
- Helps refine style, verbosity, and accuracy based on user preferences.
| Suffix | Meaning |
| -------------- | ------------------------------------------------- |
| Q | Quantized model (smaller, faster) |
| 4, 8, etc. | Number of bits used (lower = smaller & faster) |
| K | K-means quantization (more efficient) |
| M | Medium-sized model |
| F16 / F32 | Floating-point precision (higher = more accurate) |
More details on quantization: Medium Article.
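For example (the tag below is only an illustration of Ollama's naming scheme, not a requirement of this extension), a quantized variant can be pulled ahead of time with the `ollama` Python client:

```python
import ollama

# "q4_K_M": 4-bit weights, K-means quantization, medium variant.
ollama.pull("llama3:8b-instruct-q4_K_M")
```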