ComfyUI Extension: ComfyUI-Gemini_Flash_2.0_Exp
A ComfyUI custom node that integrates Google's Gemini Flash 2.0 Experimental model, enabling multimodal analysis of text, images, video frames, and audio directly within ComfyUI workflows.
ComfyUI-Gemini_Flash_2.0_Exp
Now with image generation capabilities!
Features
- Multimodal input support:
  - Text analysis
  - Image analysis
  - Video frame analysis
  - Audio analysis
- NEW! Image generation using the gemini-2.0-flash-exp-image-generation model
- Chat mode with conversation history
- Voice chat with the smart Audio Recorder node
- Structured output option
- Temperature and token limit controls
- Proxy support
- Configurable API settings via config.json
Installation
Install via ComfyUI Manager, or clone this repository into your ComfyUI custom_nodes folder:
cd ComfyUI/custom_nodes
git clone https://github.com/ShmuelRonen/ComfyUI-Gemini_Flash_2.0_Exp.git
Install required dependencies:
# Install BOTH packages (both are required)
pip install google-genai
pip install google-generativeai
# OR
python -m pip install google-genai
python -m pip install google-generativeai
# Other dependencies
pip install pillow
pip install torchaudio
Important: on Ubuntu/Debian-based systems, install the PortAudio library:
sudo apt-get install libportaudio2
Get your free API key from Google AI Studio:
- Visit Google AI Studio
- Log in with your Google account
- Click on "Get API key" or go to settings
- Create a new API key
- Copy the API key for use in config.json
Set up your API key in the config.json file (will be created automatically on first run).
Configuration
API Key Setup
If config.json does not exist yet, create it in the node's main folder:
{
  "GEMINI_API_KEY": "your_api_key_here"
}
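As a rough illustration of how the key could be looked up, here is a minimal sketch. The function name `resolve_api_key` and the precedence (node input first, then config.json) are assumptions made for this example, not the node's confirmed behavior.

```python
import json
import os


def resolve_api_key(node_api_key="", config_path="config.json"):
    """Return the API key from the node input, else from config.json, else None.

    Hypothetical helper: the lookup order is an assumption for illustration.
    """
    # A key typed directly into the node takes precedence over the file.
    if node_api_key and node_api_key.strip():
        return node_api_key.strip()
    if os.path.isfile(config_path):
        with open(config_path, "r", encoding="utf-8") as fh:
            key = json.load(fh).get("GEMINI_API_KEY", "")
        # Treat the README placeholder as "no key configured".
        if key and key != "your_api_key_here":
            return key
    return None
```

Returning None instead of raising lets the caller decide how to report a missing key.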
WSL 2 Ubuntu Users
Note: always enter the API key directly in the Gemini Flash node (api_key input).
Node Inputs
Required Inputs:
- prompt: Main text prompt for analysis or generation
- input_type: Select from ["text", "image", "video", "audio"]
- model_version: Select model including the new image generation model
- operation_mode: Select between "analysis" or "generate_images" mode
- chat_mode: Boolean to enable/disable chat functionality
- clear_history: Boolean to reset chat history
Optional Inputs:
- Additional_Context: Additional text input for context
- images: Multiple image inputs (IMAGE type with list=True)
- video: Video frame sequence input (IMAGE type)
- audio: Audio input (AUDIO type)
- api_key: Directly enter your API key (recommended for WSL/Ubuntu)
- max_output_tokens: Set maximum output length (1-8192)
- temperature: Control response randomness (0.0-1.0)
- structured_output: Enable structured response format
- max_images: Maximum number of images to process (1-16)
- batch_count: Number of images to generate (for image generation mode)
- seed: Random seed for reproducible image generation
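For readers new to ComfyUI custom nodes, the inputs listed above could be declared roughly as follows. This is only a sketch: the class name, all default values, and the "gemini-2.0-flash-exp" entry are assumptions. Only the input names and the documented ranges come from this README.

```python
class GeminiFlashInputsSketch:
    """Hypothetical illustration of the documented inputs, not the node's real class."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "prompt": ("STRING", {"default": "", "multiline": True}),
                "input_type": (["text", "image", "video", "audio"],),
                # First entry is an assumed name for the analysis model.
                "model_version": (
                    ["gemini-2.0-flash-exp", "gemini-2.0-flash-exp-image-generation"],
                ),
                "operation_mode": (["analysis", "generate_images"],),
                "chat_mode": ("BOOLEAN", {"default": False}),
                "clear_history": ("BOOLEAN", {"default": False}),
            },
            "optional": {
                "Additional_Context": ("STRING", {"default": "", "multiline": True}),
                "images": ("IMAGE",),
                "video": ("IMAGE",),
                "audio": ("AUDIO",),
                "api_key": ("STRING", {"default": ""}),
                # Ranges below are the ones documented in this README.
                "max_output_tokens": ("INT", {"default": 1024, "min": 1, "max": 8192}),
                "temperature": ("FLOAT", {"default": 0.4, "min": 0.0, "max": 1.0, "step": 0.1}),
                "structured_output": ("BOOLEAN", {"default": False}),
                "max_images": ("INT", {"default": 6, "min": 1, "max": 16}),
                # Defaults and bounds for these two are assumed.
                "batch_count": ("INT", {"default": 1, "min": 1}),
                "seed": ("INT", {"default": 0, "min": 0}),
            },
        }
```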
Usage Examples
Basic Text Analysis:
Text Input Node -> Gemini Flash Node [input_type: "text", operation_mode: "analysis"]
Image Analysis:
Load Image Node -> Gemini Flash Node [input_type: "image", operation_mode: "analysis"]
Video Analysis:
Load Video Node -> Gemini Flash Node [input_type: "video", operation_mode: "analysis"]
Audio Analysis:
Load Audio Node -> Gemini Flash Node [input_type: "audio", operation_mode: "analysis"]
Image Generation:
Text Input Node -> Gemini Flash Node [model_version: "gemini-2.0-flash-exp-image-generation", operation_mode: "generate_images"]
Image Generation with Reference:
Load Image Node -> Gemini Flash Node [model_version: "gemini-2.0-flash-exp-image-generation", operation_mode: "generate_images"]
Chat Mode
Chat mode maintains conversation history and provides a more interactive experience:
- Enable chat mode by setting chat_mode: true
- Chat history format:
=== Chat History ===
USER: your message
ASSISTANT: Gemini's response
=== End History ===
- Use clear_history: true to start a new conversation
- Chat history persists between calls until cleared
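The history format shown above can be reproduced with a few lines of Python. The `ChatHistory` class and its method names are hypothetical. The sketch only mirrors the documented text layout and the clear-on-demand behavior.

```python
class ChatHistory:
    """Hypothetical in-memory chat log that renders in the documented format."""

    def __init__(self):
        self.turns = []  # list of (role, text) tuples

    def add(self, role, text):
        self.turns.append((role, text))

    def clear(self):
        # Corresponds to setting clear_history: true on the node.
        self.turns = []

    def render(self):
        lines = ["=== Chat History ==="]
        for role, text in self.turns:
            lines.append(f"{role.upper()}: {text}")
        lines.append("=== End History ===")
        return "\n".join(lines)


history = ChatHistory()
history.add("user", "Describe this image")
history.add("assistant", "A cat sitting on a windowsill.")
print(history.render())
```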
Chat Mode Tips:
- Works with all input types (text, image, video, audio)
- History is displayed in the output
- Maintains context across multiple interactions
- Clear history when switching topics
Video Frame Handling
When processing videos:
- Automatically samples frames evenly throughout the video
- Resizes frames for efficient processing
- Works with both chat and non-chat modes
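Even sampling of frames can be sketched as below. The function name, the `max_frames` parameter, and the exact index formula are assumptions. The node's actual sampling code may differ.

```python
def sample_frame_indices(total_frames, max_frames):
    """Pick up to max_frames indices spread evenly over a clip (hypothetical helper)."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    if max_frames == 1:
        # Single frame requested: take the middle of the clip.
        return [total_frames // 2]
    # Always include the first and last frame, with even spacing in between.
    return [i * (total_frames - 1) // (max_frames - 1) for i in range(max_frames)]


print(sample_frame_indices(10, 4))  # [0, 3, 6, 9]
```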
Image Generation
The new image generation capabilities allow you to:
- Generate images from text descriptions
- Generate variations based on reference images
- Control the generation with seed and temperature parameters
- Generate multiple images with batch_count
Image Generation Tips:
- For best results, use the "gemini-2.0-flash-exp-image-generation" model
- Use "generate_images" operation mode
- Provide clear, detailed prompts for better results
- Connect reference images for style guidance
- Use seed parameter for reproducible results
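Outside ComfyUI, a comparable request can be made with the google-genai package. The sketch below is not the node's implementation: the helper name `generate_images`, the one-seed-per-batch-item scheme, and the default values are assumptions, and whether the model honors `seed` for image output is not confirmed here.

```python
import io


def generate_images(prompt, api_key, batch_count=1, seed=0, temperature=1.0):
    """Return a list of PIL images generated for `prompt` (hypothetical helper)."""
    # Imported lazily so this module loads even without the optional packages.
    from google import genai
    from google.genai import types
    from PIL import Image

    client = genai.Client(api_key=api_key)
    images = []
    for i in range(batch_count):
        response = client.models.generate_content(
            model="gemini-2.0-flash-exp-image-generation",
            contents=prompt,
            config=types.GenerateContentConfig(
                response_modalities=["Text", "Image"],
                temperature=temperature,
                seed=seed + i,  # assumption: a distinct seed per batch item
            ),
        )
        # Candidates can be missing or empty, e.g. when a prompt is blocked.
        for candidate in response.candidates or []:
            if candidate.content is None or not candidate.content.parts:
                continue
            for part in candidate.content.parts:
                if part.inline_data is not None and part.inline_data.data:
                    images.append(Image.open(io.BytesIO(part.inline_data.data)))
    return images
```

Calling `generate_images("a watercolor fox", api_key="...", batch_count=2, seed=42)` needs a valid key and network access.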
Troubleshooting Cross-Platform Issues
Windows vs. Ubuntu/WSL Differences
- On Windows, both config file and GUI methods work well
- On Ubuntu/WSL, entering the API key directly in the GUI is more reliable
- If using lowercase filenames on Ubuntu (e.g., gemini_flash_node.py instead of Gemini_Flash_Node.py), the node will still work properly
Common Issues on Ubuntu/WSL:
- If you get "400 Bad Request" errors, try entering your API key directly in the GUI
- Make sure binary data (images, audio) is properly base64 encoded
- Check network connectivity and proxy settings
- Ensure proper file permissions for config files
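For the base64 point above, a minimal sketch of encoding raw bytes before placing them in a JSON request body is shown below. The helper name is hypothetical, and the `mime_type`/`data` field names mirror the inline-data shape used in Gemini REST examples, which is an assumption about how the payload is built here.

```python
import base64


def encode_inline_data(data, mime_type):
    """Wrap raw bytes as a base64 inline-data dict (hypothetical helper)."""
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError("data must be raw bytes, not text or a file path")
    return {
        "mime_type": mime_type,
        # b64encode returns bytes; JSON needs a str, so decode to ASCII.
        "data": base64.b64encode(bytes(data)).decode("ascii"),
    }


payload = encode_inline_data(b"hi", "image/png")
print(payload["data"])  # aGk=
```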
Error Handling
The node provides clear error messages for common issues:
- Invalid API key
- Rate limit exceeded
- Invalid input formats
- Network/proxy issues
Rate Limits
Default rate limits (from config.json):
- 10 requests per minute (RPM_LIMIT)
- 4 million tokens per minute (TPM_LIMIT)
- 1,500 requests per day (RPD_LIMIT)
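To show how such limits could be enforced on the client side, here is a sliding-window sketch covering the per-minute and per-day request limits. The class and method names are hypothetical, TPM_LIMIT is not modeled, and nothing here is taken from the node's code.

```python
import time
from collections import deque


class RequestRateLimiter:
    """Hypothetical sliding-window limiter for RPM_LIMIT and RPD_LIMIT."""

    def __init__(self, rpm_limit=10, rpd_limit=1500, clock=time.monotonic):
        self.rpm_limit = rpm_limit
        self.rpd_limit = rpd_limit
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._minute = deque()  # timestamps of requests in the last 60 s
        self._day = deque()  # timestamps of requests in the last 24 h

    def _prune(self, now):
        while self._minute and now - self._minute[0] >= 60:
            self._minute.popleft()
        while self._day and now - self._day[0] >= 86400:
            self._day.popleft()

    def try_acquire(self):
        """Record a request and return True, or return False if a limit is hit."""
        now = self._clock()
        self._prune(now)
        if len(self._minute) >= self.rpm_limit or len(self._day) >= self.rpd_limit:
            return False
        self._minute.append(now)
        self._day.append(now)
        return True
```

A caller would check `limiter.try_acquire()` before each API request and wait or report an error when it returns False.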
Audio Analysis with Smart Recording:
The package includes two nodes for audio handling:
- Audio Recorder Node: Smart audio recording with silence detection
- Gemini Flash Node: Audio content analysis
Audio Recorder Node Features:
- Live microphone recording with automatic silence detection
- Smart recording termination after detecting silence
- Configurable silence threshold and duration
- Compatible with most input devices
- Visual recording status indicator (10-second auto-reset)
- Seamless integration with Gemini Flash analysis
Audio Recording Setup:
Audio Recorder Node -> Gemini Flash Node [input_type: "audio"]
Audio Recorder Controls:
- device: Select input device (microphone)
- sample_rate: Audio quality setting (default: 44100 Hz)
- silence_threshold: Sensitivity for silence detection (0.001-0.1)
- silence_duration: Required silence duration to stop recording (0.5-5.0 seconds)
- Record Button:
- Click to start recording
- Records until silence is detected
- Button resets after 10 seconds automatically
- Visual feedback during recording (red indicator)
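The interaction between silence_threshold and silence_duration can be illustrated with a small block-based RMS detector. This is a sketch under assumptions: the function name, the `block_size` parameter, and the use of RMS are illustrative choices, and the Audio Recorder node may detect silence differently.

```python
import math


def find_silence_stop(samples, sample_rate=44100, silence_threshold=0.01,
                      silence_duration=2.0, block_size=1024):
    """Return the sample index where recording would stop, or None.

    Hypothetical helper: stops once consecutive blocks with RMS below
    silence_threshold add up to at least silence_duration seconds.
    """
    needed = int(silence_duration * sample_rate)
    silent_run = 0
    for start in range(0, len(samples), block_size):
        block = samples[start:start + block_size]
        rms = math.sqrt(sum(s * s for s in block) / len(block))
        if rms < silence_threshold:
            silent_run += len(block)
            if silent_run >= needed:
                return start + len(block)
        else:
            # Any loud block resets the silence counter.
            silent_run = 0
    return None
```

A lower silence_threshold makes the detector less likely to treat quiet speech as silence, while a longer silence_duration tolerates longer pauses before stopping.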
Using Voice Commands/Audio Analysis:
- Add Audio Recorder node to your workflow
- Connect it to Gemini Flash node
- Configure recording settings:
- Choose input device
- Adjust silence detection parameters
- Set sample rate if needed
- Click "Start Recording" to begin
- Speak your message
- Recording automatically stops after detecting silence
- The recorded audio is processed and sent to Gemini for analysis
- Recording button resets after 10 seconds, ready for next recording
Example Audio Analysis Workflow:
Audio Recorder Node [silence_duration: 2.0, silence_threshold: 0.01] ->
Gemini Flash Node [input_type: "audio", prompt: "Transcribe and analyze this audio"]
Contributing
Feel free to submit issues, fork the repository, and create pull requests for any improvements.
License
MIT License
Acknowledgments
- Google's Gemini API
- ComfyUI Community
- All contributors
Note: This node is experimental and based on the Gemini 2.0 Flash Experimental model. Features and capabilities may change as the model evolves.