FastUSP: A Multi-Level Collaborative Acceleration Framework for Distributed Diffusion Model Inference
Large-scale diffusion models such as FLUX (12B parameters) and Stable Diffusion 3 (8B parameters) require multi-GPU parallelism for efficient inference. Unified Sequence Parallelism (USP), which combines Ulysses and Ring attention mechanisms, has emerged as the state-of-the-art approach for distributed attention computation. However, existing USP implementations suffer from significant inefficiencies including excessive kernel launch overhead and suboptimal computation-communication scheduling. In this paper, we propose \textbf{FastUSP}, a multi-level optimization framework that integrates compile-level optimization (graph compilation with CUDA Graphs and computation-communication reordering), communication-level optimization (FP8 quantized collective communication), and operator-level optimization (pipelined Ring attention with double buffering). We evaluate FastUSP on FLUX (12B) and Qwen-Image models across 2, 4, and 8 NVIDIA RTX 5090 GPUs. On FLUX, FastUSP achieves consistent \textbf{1.12$\times$–1.16$\times$} end-to-end speedup over baseline USP, with compile-level optimization contributing the dominant improvement. On Qwen-Image, FastUSP achieves \textbf{1.09$\times$} speedup on 2 GPUs; on 4–8 GPUs, we identify a PyTorch Inductor compatibility limitation with Ring attention that prevents compile optimization, while baseline USP scales to 1.30$\times$–1.46$\times$ of 2-GPU performance. We further provide a detailed analysis of the performance characteristics of distributed diffusion inference, revealing that kernel launch overhead – rather than communication latency – is the primary bottleneck on modern high-bandwidth GPU interconnects.
💡 Research Summary
The paper addresses the growing need for efficient multi‑GPU inference of large diffusion models such as FLUX (12 B parameters) and Qwen‑Image (≈10 B parameters). While Unified Sequence Parallelism (USP) – a combination of Ulysses (head‑parallel All‑to‑All) and Ring Attention (sequence‑parallel point‑to‑point communication) – is the state‑of‑the‑art method for distributed attention, existing implementations suffer from three main inefficiencies: (1) a massive kernel‑launch overhead caused by the hundreds of tiny CUDA kernels executed per denoising step, (2) communication latency that, although present, is relatively small on modern NVLink interconnects, and (3) limited end‑to‑end impact of operator‑level tricks such as pipelined Ring Attention.
FastUSP is introduced as a three‑level optimization framework that tackles each of these bottlenecks orthogonally while keeping the user‑level API unchanged.
Compile‑level optimization (primary contributor). The authors employ PyTorch’s torch.compile in “reduce‑overhead” mode together with CUDA Graphs. By capturing the entire inference graph, they fuse many small kernels into a few larger ones and replay the graph with virtually no CPU‑side launch cost. This alone yields a 9‑16 % reduction in end‑to‑end latency, confirming that kernel‑launch overhead dominates on high‑bandwidth GPUs such as the RTX 5090, where individual kernel runtimes are only tens of microseconds.
Communication‑level optimization. K and V tensors, originally communicated in BF16 (2 bytes per element), are quantized to FP8 E4M3 (1 byte) before the All‑to‑All operation and de‑quantized after receipt. The quantization incurs <0.1 % relative error while halving the data volume, which is especially valuable in bandwidth‑constrained settings (e.g., cross‑node InfiniBand).
Operator‑level optimization. The classic Ring Attention pipeline is transformed into a double‑buffered, asynchronous scheme. Two CUDA streams and two buffers allow each GPU to receive the next KV chunk while simultaneously computing attention on the previously received chunk. The online softmax merge maintains numerical stability via log‑sum‑exp accumulation. Micro‑benchmarks show a 1.25‑1.27× speedup over the baseline Ring Attention, but because attention communication accounts for only 5‑10 % of total inference time on NVLink, the overall contribution to end‑to‑end performance is modest.
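The log-sum-exp merge behind the online softmax can be illustrated with a small NumPy sketch. The streams and double buffers are omitted; only the numerically stable merge of two KV-chunk partial results is shown, and the helper names (`partial`, `merge_partial_attention`) are illustrative.

```python
import numpy as np

def partial(q, k, v):
    """Partial attention over one KV chunk: per-row max m,
    softmax denominator l, and unnormalized output acc."""
    s = q @ k.T
    m = s.max(axis=1)
    p = np.exp(s - m[:, None])
    return m, p.sum(axis=1), p @ v

def merge_partial_attention(m1, l1, acc1, m2, l2, acc2):
    # Rescale both partials to a common max so exponentials never overflow.
    m = np.maximum(m1, m2)
    a1, a2 = np.exp(m1 - m), np.exp(m2 - m)
    return m, a1 * l1 + a2 * l2, a1[:, None] * acc1 + a2[:, None] * acc2

rng = rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k1, v1 = rng.standard_normal((6, 8)), rng.standard_normal((6, 4))
k2, v2 = rng.standard_normal((6, 8)), rng.standard_normal((6, 4))

# Ring-style: process chunks one at a time, merging as they arrive.
m, l, acc = merge_partial_attention(*partial(q, k1, v1), *partial(q, k2, v2))
out_ring = acc / l[:, None]

# Reference: ordinary softmax attention over the full sequence.
s = q @ np.concatenate([k1, k2]).T
p = np.exp(s - s.max(axis=1, keepdims=True))
out_full = (p / p.sum(axis=1, keepdims=True)) @ np.concatenate([v1, v2])
assert np.allclose(out_ring, out_full)
```

Because the merge is associative, each GPU can fold in KV chunks in whatever order the ring delivers them while the next chunk is still in flight, which is what makes the double-buffered overlap safe.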
The authors evaluate FastUSP on two models: FLUX‑1‑dev (12 B) and Qwen‑Image, across 2, 4, and 8 RTX 5090 GPUs (NVLink ~900 GB/s). Results for FLUX show consistent speedups of 1.12‑1.16× across all GPU counts, with the highest gain (1.16×) at 2 GPUs where kernel‑launch overhead is proportionally larger. For Qwen‑Image, FastUSP achieves a 1.09× speedup on 2 GPUs; however, on 4‑8 GPUs the Ring Attention pattern is not compatible with the current PyTorch Inductor, preventing the compile‑level optimization. In this regime, the baseline USP still scales well (1.30‑1.46× relative to 2‑GPU performance).
A detailed performance analysis confirms the authors’ initial findings: (1) kernel launch overhead is the dominant bottleneck, (2) communication latency is minor on NVLink, and (3) operator‑level improvements have limited end‑to‑end effect. The paper also discusses limitations and future directions, notably the need for better compiler support for dynamic control‑flow patterns like Ring Attention, and the potential of extending FP8 quantization to other tensors or cross‑node scenarios.
In summary, FastUSP demonstrates that a systematic, multi‑level approach—especially graph compilation with CUDA Graphs—can close the most significant performance gap in distributed diffusion inference. By eliminating kernel‑launch overhead and adding complementary communication and operator optimizations, FastUSP reduces per‑step latency by roughly 12 % on modern multi‑GPU systems, offering a practical pathway to cost‑effective, high‑throughput generation with the largest diffusion models currently available.