ComfyUI Extension: ComfyUI-Computer-Vision

Authored by MicheleGuidi


Extension nodes for ComfyUI that improve automatic segmentation, using bounding boxes generated by Florence 2 and segmentation from Segment Anything 2 (SAM2). Currently an enhancement of nodes from Kijai.


    Evaluating the Efficiency of Context-Based Segmentation

    Introduction

    Automatic image segmentation has become a vital tool in various computer vision tasks, with models like SAM2 offering significant advancements. However, despite the performance of large models, SAM2 sometimes struggles with small or intricate objects, even when precise bounding boxes are provided. The integration of context-aware models seeks to address this limitation by improving segmentation accuracy, particularly for smaller objects. This study explores whether incorporating contextual information into the segmentation process provides measurable improvements or if alternative approaches, such as using the Base+ model, can achieve similar results with greater efficiency.

    Problem Statement

    While SAM2's large model performs well on general objects, it exhibits a tendency to misinterpret small objects or fine details, leading to inaccuracies in segmentation, especially when smaller objects appear in complex scenes. On the other hand, smaller SAM2 models tend to perform better in fine-grained tasks but fail to handle large objects adequately. To address these challenges, a context-based approach has been proposed, where the input image is cropped around the bounding box of the target object, providing the model with a more focused view.

    Methodology

    In this study, we tested the segmentation performance using various approaches:

    1. Context-Based Segmentation: Crops the input image around each bounding box so that the SAM2 model processes only the target object and its immediate surroundings, potentially improving accuracy for small objects (a sketch of the cropping and tiling steps follows this list).
    2. Tiled Segmentation: Slices the image into a regular grid of smaller tiles (via SAHI) to avoid performance degradation on large images, but without focusing on the context of specific objects.
    3. Large SAM2.1: The large model, whose higher capacity helps on general objects but which is prone to errors when small objects are present.
    4. Base+ SAM2.1: An alternative model that balances computational efficiency and segmentation quality, often without needing extensive context processing.
    5. Small SAM2.1: A version of SAM2 with fewer parameters, optimized for faster processing and lower computational demands. It performs better on small objects and fine details but may struggle with larger or more complex scenes due to its reduced capacity.
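
    As a rough illustration of the two preprocessing strategies, the sketch below crops around a box with a context margin and slices an image into overlapping tiles. This is a minimal sketch, not the extension's actual code; the margin ratio, tile size, and function names are assumptions for the example.

    ```python
    from PIL import Image

    def crop_with_context(image: Image.Image, bbox, margin: float = 0.5):
        """Crop around a bounding box, padded so SAM2 sees local context.

        bbox is (x0, y0, x1, y1) in pixels; margin is a fraction of the box
        size (an assumed default). Returns the crop and its offset so the
        predicted mask can be pasted back into the full-resolution image.
        """
        x0, y0, x1, y1 = bbox
        pad_x = (x1 - x0) * margin
        pad_y = (y1 - y0) * margin
        left = max(0, int(x0 - pad_x))
        top = max(0, int(y0 - pad_y))
        right = min(image.width, int(x1 + pad_x))
        bottom = min(image.height, int(y1 + pad_y))
        return image.crop((left, top, right, bottom)), (left, top)

    def tile_image(image: Image.Image, tile: int = 512, overlap: int = 64):
        """Slice the image into a regular grid of overlapping tiles (SAHI-style)."""
        step = tile - overlap
        for top in range(0, max(1, image.height - overlap), step):
            for left in range(0, max(1, image.width - overlap), step):
                box = (left, top, min(left + tile, image.width), min(top + tile, image.height))
                yield image.crop(box), (left, top)
    ```

    The key difference: context cropping runs SAM2 once per detected object on a view centered on it, while tiling runs it on every tile regardless of where objects are, which is why the contextual approach is expected to help small objects in particular.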

    The testing was conducted using a set of objects such as golf balls, golf clubs, human faces, and various combinations of items (e.g., shoes, hands, pants).

    Results

    | Prompt | Original Image | Florence Bboxes | Context Node | Tiled Node | Large SAM2.1 | Base+ SAM2.1 | Small SAM2.1 |
    |--------|----------------|-----------------|--------------|------------|--------------|--------------|--------------|
    | golf ball | | | | | | | |
    | golf club | | | | | | | |
    | human face | | | | | | | |
    | shoes, hands, cap, mouth, pants, ball, watch | | | | | | | |

    • Contextual and Tiled Nodes: Both utilized the SAM2.1 large model for segmentation.
    • Node Settings: Identical across all tests to ensure a fair comparison.

    Analysis

    The results indicate that Base+ consistently performed on par with or exceeded the accuracy of the context-based approach. In several test cases, Base+ even outperformed the context-based model, suggesting that the added computational complexity of context processing might not always yield substantial improvements. Additionally, Base+ provided a more efficient segmentation without sacrificing detail, especially in scenarios where context processing was not feasible or necessary.

    • Context-Based Segmentation: This approach demonstrated noticeable improvements in handling small objects by focusing exclusively on the cropped areas around the bounding boxes. However, this increased segmentation accuracy came at the cost of additional computation and image manipulation.
    • Base+: Offered an efficient alternative that maintained a high level of accuracy while avoiding the complexity of cropping and reprocessing images. In many cases, Base+ provided results comparable to context-based segmentation without the need for additional context-aware image preparation.

    Conclusion

    The evaluation suggests that while context-based segmentation models can enhance accuracy, particularly for small objects, Base+ offers a comparable or even superior alternative in many scenarios. It provides a promising approach for users who seek a balance between segmentation quality and computational efficiency.

    For researchers or developers interested in experimenting with these models, the context-based approach can still offer benefits in specific cases where detailed object isolation is crucial. However, in most practical applications, Base+ may be the preferred model for achieving high-quality segmentation without the added complexity of context processing.


    Available Nodes

    | Node | Type | Description | Image |
    |------|------|-------------|-------|
    | Sam2ContextSegmentation | ✅ Main Node | Segments objects from bounding boxes by generating crops centered on each box. Uses local context to guide SAM2 toward cleaner and more coherent results. | Sam2ContextSegmentation |
    | Sam2TiledSegmentation | ⚠️ Alternative | Segmentation via regular tiling using SAHI. Works in simple cases but is less accurate than the contextual approach. | Sam2TiledSegmentation |


    Context Process

    | Output | Description | Image |
    |--------|-------------|-------|
    | Original Image | Input image containing one or more objects to be segmented | |
    | Florence Bounding Boxes | Image annotated with bounding boxes predicted by Florence 2 for each detected object (ball, glove) | |
    | Context Tiles | Contextual crops centered on each bounding box, used as input for SAM2 | |
    | Colored Masks | Per-object segmentation masks overlaid with unique colors for visual clarity | |
    | Cleaned Mask | Composite mask with small, disconnected regions removed to reduce noise (if the setting is enabled) | |
    | Final Mask | Final mask combining all cleaned segments, used for downstream processing | |
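
    For reference, the "Cleaned Mask" step can be implemented with connected-component filtering. This is a sketch of one plausible implementation, not necessarily what the node does internally; min_area is a hypothetical threshold standing in for the node's cleaning setting.

    ```python
    import cv2
    import numpy as np

    def clean_mask(mask: np.ndarray, min_area: int = 256) -> np.ndarray:
        """Remove small, disconnected regions from a binary mask.

        mask: uint8 array where nonzero pixels are foreground.
        min_area: components smaller than this many pixels are dropped
        (an assumed threshold).
        """
        binary = (mask > 0).astype(np.uint8)
        num_labels, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
        cleaned = np.zeros_like(binary)
        # Label 0 is the background; keep only sufficiently large components.
        for label in range(1, num_labels):
            if stats[label, cv2.CC_STAT_AREA] >= min_area:
                cleaned[labels == label] = 1
        return cleaned * 255
    ```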


    Installation

    To install these custom nodes, clone or download this repository into your ComfyUI/custom_nodes/ folder.
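
    For example (repository URL assumed from the author and extension name):

    ```bash
    cd ComfyUI/custom_nodes
    git clone https://github.com/MicheleGuidi/ComfyUI-Computer-Vision.git
    ```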

    After restarting ComfyUI, the nodes Sam2ContextSegmentation and Sam2TiledSegmentation will appear and be ready to use.

    A complete example workflow is included in the workflows folder to demonstrate how to use the context-based segmentation effectively.


    Credits

    Thanks to the open-source projects this extension builds on, in particular Kijai's ComfyUI nodes, which these nodes extend.

    Images include content sourced from federgolf.it and a frame from a UEFA Champions League 2010 broadcast.