ComfyUI Extension: ComfyUI - DGLS (Dynamic GPU Layer Swapping)

Authored by obisin

Created

Updated

6 stars

Smart dynamic layer swapping between GPU and CPU for optimal inference performance with comprehensive mixed precision handling and copy-compute overlap optimization. Enables running much larger models on limited VRAM setups.

Custom Nodes (0)

    README

    DGLS — Dynamic GPU Layer Swapping for ComfyUI

    Smart dynamic layer swapping between GPU and CPU for optimal inference performance with target casting, and copy-compute overlap optimization. Enables running much larger models on limited VRAM setups.


    NOTE:

    This is still under development. I am actively bug testing this atm. It's fully working on RTX 2060 and 2080ti. I havent promoted this or written about it due to some tests I want to carry out. I am currently getting between 10-30% speed improvement compared to official node.

    Features

    • Drop‑in ComfyUI integration. Adds two nodes under the loaders category: DGLS Model Loader and DGLS Swapping Loader.
    • Architecture‑aware layer extraction. Handles Cosmos, FLUX, WAN 2.1/2.2, HunyuanVideo, Qwen, and generic transformer layouts.
    • Buffers‑on‑GPU / params‑on‑CPU design. Small buffers remain on GPU; master parameter tensors live on CPU and are rebound to GPU storage just‑in‑time per layer (via an optimized _reassign_param), avoiding redundant cloning.
    • Auto or manual GPU residency. Leave it on auto (default) or pass an explicit comma‑separated gpu_layer_indices to pin chosen layers on GPU.
    • Predictive prefetch. Chooses the next prefetch layers in ring order; optional CUDA streams and a conservative CPU helper thread can overlap H2D copies with compute.
    • CUDA Streams: Uses CUDA streams to overlap memory transfers with computation
    • Target Casting: Precision casting for specific Dtypes in mixed layer models

    Requirements

    • ComfyUI (CUDA‑enabled PyTorch).
    • GPU: Any CUDA‑capable GPU.
    • System RAM: 32GB+ recommended if you plan to enable overlap/pinning. 16GB can work, 64GB+ is best.
    • **Currently only working with diffusion models: Any workflow with 'Load Diffusion Model' official node should be fine with this swapped for it. OmniGen not working atm.

    Installation (Custom Nodes)

    1. Navigate to your ComfyUI custom nodes directory:
    cd ComfyUI/custom_nodes/
    
    1. Clone or copy the DGLS files:
    git clone https://github.com/obisin/dgls-comfyui
    # or manually place the files in a new folder
    
    1. Restart ComfyUI - the nodes should appear in the loaders category

    REMEMBER TO INSTALL THE REQUIREMENTS.TXT FOR BEST PERFORMANCE


    Nodes & Parameters

    1) DGLS Model Loader

    Title: DGLS Model Loader Returns: (MODEL, LAYERS) for the next node. Category: loaders

    Inputs (required):

    • model_name — selectable from Comfy paths (unet + diffusion_models).
    • model_type — one of: default | hunyuan | unet (use default unless hunyuan).
    • cast_dtype — one of: default | fp32 | fp16 | bf16 | fp8_e4m3fn | fp8_e4m3fn_fast | fp8_e5m2 (applied at load time).
    • clear_model_cache — boolean; force reload from disk, bypassing Comfy’s model cache.
    • verbose — boolean; prints detection/dtype/extraction info (slower when enabled).

    Inputs (optional):

    • nuke_all_caches — boolean; CAUTION: aggressively clears assorted Comfy caches (can force other nodes to reload).

    What it does

    • Loads the model and extracts a consistent LAYERS sequence across supported architectures.

    2) Dynamic Swapping Loader

    Title: Dynamic Swapping Loader Input: (MODEL, LAYERS) from the previous node. Returns: (MODEL) with smart swapping enabled. Category: loaders

    Inputs (required):

    • prefetch (int, default 1, min 0, max 100) — number of future layers to stage.
    • cpu_threading (bool, default False) — background CPU helper for async staging (can help on some systems; may add overhead on others).
    • cuda_streams (bool, default False) — enable CUDA streams/events for copy–compute overlap (needs a little VRAM headroom).
    • verbose (bool, default True) — print layer sizes, timings, and decisions. Good for debugging and getting optimal settings when manually choosing layers

    Inputs (optional):

    • gpu_layer_indices (string) — comma‑separated layer indices to keep permanently on GPU, e.g. 0,1,2,28,29. When set, this overrides auto selection. Best for models with some layers having difficulty swapping.
    • cast_target (string) — one‑time recast mapping like "f32 f16" or multiple pairs: "bf16 f16, f32 bf16". Float8/4‑bit require kernel support and are applied conservatively.

    How it swaps (overview)

    1. Setup: identify swappable layers; keep module buffers on GPU; maintain CPU master parameters.
    2. Per‑layer compute: for layer k, compute the needed set = {k, k+1…k+prefetch} in a ring order over swappables.
    3. Evict & stage: unneeded layers rebind back to CPU masters; needed layers copy CPU→GPU and _reassign_param binds the GPU storage (no extra clones when shapes/dtypes/devices already match).
    4. (Optional) overlap: CUDA streams + events coordinate; an optional CPU thread can prepare upcoming transfers.

    This design minimizes VRAM while keeping hot state local and avoiding parameter churn.


    Quick Start

    Minimal (safe defaults)

    1. DGLS Model Loader → choose your model → verbose: off (optional).

    2. Dynamic Swapping Loader

      • prefetch: 1
      • cpu_threading: off
      • cuda_streams: off
      • leave gpu_layer_indices empty (auto GPU residency)
    3. Connect to your sampler / Apply Model node and run.

    Balanced (mid‑VRAM)

    • Try prefetch: 2, enable cuda_streams if you have headroom. Keep cpu_threading off unless profiling shows benefit.
    • If you know the bottlenecks (e.g., earliest/latest blocks), set gpu_layer_indices explicitly.
    • For some model prefecth 0 might be preferable

    Manual GPU residency

    • Provide gpu_layer_indices like 0,1,2,28,29 to pin those layers on GPU; others will swap dynamically.

    Casting & Precision (optional)

    • Use cast_target to convert specific source → target dtypes at start‑up, e.g.:

      • "f32 f16" (downcast FP32 → FP16)
      • "bf16 f16, f32 bf16" (multiple rules)
    • Float8 and 4‑bit (nf4/fp4) paths are guarded: they’re only applied if kernels support them; otherwise the safer fallback (e.g., bf16) is used.

    • For global model dtype at load time, use the Model Loader’s cast_dtype (separate from cast_target).


    Tips & Tuning

    • Auto vs manual GPU layers: leaving gpu_layer_indices empty uses DGLS auto‑selection based on availability and layer sizes. Explicit indices take priority.
    • Prefetch: start with 1. Increase only if you see stalls and have spare VRAM. The ring order avoids wasteful wraps.
    • Overlap options: try CUDA streams first if you have headroom. CPU threading is conservative (1 worker) and helps mainly where trasnfer latency dominates.
    • Verbose diagnostics: enable verbose to print chosen residents, swappable indices, sizes (MB), and timing—useful for dialing in prefetch and residency.

    Troubleshooting

    • “Cannot fit layer X on GPU.” Reduce gpu_layer_indices, lower prefetch. The emergency path will attempt cleanup/re‑stage; persistent failure means lowering residency.
    • Instability with overlap. Turn off cuda_streams and cpu_threading; re‑run with verbose to inspect staging.
    • Odd dtype/buffers. Prefer conservative cast_target rules; keep batch‑norm‑like stats in fp32 (the loader handles common cases).

    File Map

    • dgls_model_loader.py — model loading, dtype at load, cache management, robust layer extraction, and tensor healing.
    • dynamic_swapping_loader.py — swapping engine, auto or explicit GPU residency, prefetch/overlap options, and the ComfyUI loader node.

    License

    (see repository license file).