MSCCL++: Rethinking GPU Communication Abstractions for AI Inference

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

AI applications increasingly run on fast-evolving, heterogeneous hardware to maximize performance, but general-purpose libraries lag in supporting these features. Performance-minded programmers often build custom communication stacks that are fast but error-prone and non-portable. This paper introduces MSCCL++, a design methodology for developing high-performance, portable communication kernels. It provides (1) a low-level, performance-preserving primitive interface that exposes minimal hardware abstractions while hiding the complexities of synchronization and consistency, (2) a higher-level DSL for application developers to implement workload-specific communication algorithms, and (3) a library of efficient algorithms implementing the standard collective API, enabling adoption by users with minimal expertise. Compared to state-of-the-art baselines, MSCCL++ achieves geomean speedups of $1.7\times$ (up to $5.4\times$) for collective communication and $1.2\times$ (up to $1.38\times$) for AI inference workloads. MSCCL++ is used in production by multiple AI services provided by Microsoft Azure, and has also been adopted by RCCL, the GPU collective communication library maintained by AMD. MSCCL++ is open source and available at https://github.com/microsoft/mscclpp. Our two years of experience with MSCCL++ suggest that its abstractions are robust, enabling support for new hardware features, such as multimem, within weeks of development.


💡 Research Summary

The paper introduces MSCCL++, a three‑layer programming methodology designed to deliver high‑performance, portable GPU collective communication for AI inference workloads. Modern AI models run on rapidly evolving heterogeneous hardware, but existing general‑purpose libraries (e.g., NCCL, RCCL, MSCCL) lag behind, forcing developers to hand‑craft custom communication kernels that are fast yet error‑prone and non‑portable. MSCCL++ addresses this gap by separating concerns into three hierarchical APIs:

  1. Primitive API – a low‑level, performance‑preserving interface that exposes only the essential hardware transfer modes while encapsulating the complex synchronization, ordering, and consistency semantics required across GPUs, CPUs, and NICs. It defines three channel abstractions—PortChannel (port‑mapped I/O such as DMA or RDMA), MemoryChannel (peer‑to‑peer memory‑mapped I/O), and SwitchChannel (switch‑mapped I/O supporting multicast/aggregation). Each channel provides a minimal set of primitives (put, get, signal, wait, reduce, broadcast, etc.) that are asynchronous and one‑sided, allowing overlapped compute‑communication without busy‑wait loops.

  2. DSL (Domain‑Specific Language) API – built on top of the Primitive API, this high‑level language lets developers declaratively describe custom collective algorithms. The DSL presents a thread‑block‑centric view of the whole GPU mesh, automatically inserts intra‑block synchronizations, fuses memory accesses, and generates an execution plan interpreted by a DSL Executor. Crucially, the DSL retains the asynchronous, one‑sided semantics of the primitives, enabling sophisticated compute‑communication overlap that earlier MSCCL approaches could not express.

  3. Collective API – a drop‑in replacement for the NCCL API (including bootstrapping). Users with minimal expertise can simply link against the MSCCL++ collective library and obtain the same programming model as NCCL while benefiting from the optimized kernels underneath. For AI inference, the library ships a set of DSL‑generated collective kernels (AllReduce, AllGather, ReduceScatter, etc.) that are tuned for typical LLM inference patterns.
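The asynchronous, one-sided put/signal/wait semantics described for the Primitive API can be sketched as a host-side Python simulation. This is a toy model for illustration only: the real MSCCL++ channels are device-side CUDA/C++ abstractions, and the class and method names here merely mirror the primitive names from the paper.

```python
import threading

class MemoryChannelSim:
    """Toy model of a one-sided, asynchronous channel: the sender writes
    directly into the receiver's buffer (put), then raises a completion
    flag (signal); the receiver blocks only on wait. The receiver takes
    no action to receive data, which is what makes the semantics
    one-sided. Not the real MSCCL++ API."""

    def __init__(self, size):
        self.remote_buf = [0] * size        # stands in for peer GPU memory
        self._flag = threading.Semaphore(0)  # stands in for the signal flag

    def put(self, data, offset=0):
        # One-sided write into the peer's buffer; no receiver involvement.
        self.remote_buf[offset:offset + len(data)] = data

    def signal(self):
        # Publish completion of all preceding puts to the peer.
        self._flag.release()

    def wait(self):
        # Receiver blocks until the sender's signal arrives.
        self._flag.acquire()

def demo():
    ch = MemoryChannelSim(4)

    def sender():
        ch.put([1, 2, 3, 4])
        ch.signal()

    t = threading.Thread(target=sender)
    t.start()
    ch.wait()   # returns only after the sender has signaled
    t.join()
    return ch.remote_buf

print(demo())  # [1, 2, 3, 4]
```

Because put and signal are decoupled, a sender can batch many puts before a single signal, which is what enables the overlapped compute-communication the paper attributes to these primitives.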
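The kind of workload-specific algorithm the DSL is meant to express, such as a ring AllGather, can be modeled in plain Python. The sketch below simulates only the data movement (each rank repeatedly "puts" its most recently received chunk to its right neighbor); the real DSL would emit asynchronous device primitives with the synchronization handled by the executor.

```python
def ring_allgather(chunks):
    """Simulate a ring AllGather over len(chunks) ranks.

    Rank r starts with chunks[r]; after n-1 steps every rank holds all
    n chunks. Each step, rank r forwards the chunk it received in the
    previous step to rank (r + 1) % n, modeling one-sided puts."""
    n = len(chunks)
    bufs = [[None] * n for _ in range(n)]
    for r in range(n):
        bufs[r][r] = chunks[r]          # each rank's local chunk
    for step in range(n - 1):
        for r in range(n):
            src = (r - step) % n        # chunk rank r forwards this step
            bufs[(r + 1) % n][src] = bufs[r][src]
    return bufs

# Every rank ends up with the full, ordered set of chunks.
print(ring_allgather([10, 20, 30]))  # [[10, 20, 30], [10, 20, 30], [10, 20, 30]]
```

In the real DSL, each per-rank put would be paired with a signal/wait so a rank forwards a chunk only after it has actually arrived; the simulation's sequential loop hides that ordering concern.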

Performance Evaluation
Experiments on NVIDIA A100, H100, and AMD MI300X GPUs compare MSCCL++ against NCCL, RCCL, and the original MSCCL. Results show:

  • DSL‑generated kernels are on average 1.99× (NCCL), 2.08× (RCCL), and 1.43× (MSCCL) faster for core collectives.
  • Direct use of the Primitive API yields an additional ~3% speedup over the DSL, confirming that the low‑level primitives expose the full performance envelope.
  • In real inference workloads, swapping NCCL for MSCCL++ speeds up LLM decoding by 1.11× in vLLM and by 1.31× in SGLang.
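Geomean (geometric mean) speedups like those quoted above and in the abstract aggregate per-benchmark ratios multiplicatively, so no single outlier dominates. The computation can be sketched as follows; the speedup ratios here are hypothetical, chosen only to illustrate the formula.

```python
import math

def geomean(speedups):
    """Geometric mean of n ratios: the n-th root of their product."""
    return math.prod(speedups) ** (1.0 / len(speedups))

# Hypothetical per-benchmark speedup ratios, for illustration only.
ratios = [1.2, 2.5, 1.6, 1.9]
print(round(geomean(ratios), 2))
```

The arithmetic mean of the same ratios would be 1.80, slightly higher; the geomean is the standard choice for averaging speedup ratios precisely because it discounts large outliers.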

Rapid Adaptation to New Hardware
When NVIDIA introduced multimem instructions for switch‑mapped I/O, integrating them into MSCCL++ required only 16 person‑weeks; supporting NVIDIA’s multi‑node NVLink took 2 person‑weeks. This demonstrates that the minimal, hardware‑aware abstraction enables swift exploitation of emerging features without rewriting the entire stack.

Adoption and Impact
MSCCL++ is open‑source (GitHub) and has been in production for two years. AMD’s RCCL has adopted its APIs and library, and Microsoft Azure’s AI services already run on MSCCL++. The paper’s two‑year field experience validates the robustness of the abstractions and their suitability for large‑scale, latency‑sensitive inference.

Key Insights

  • Minimal, hardware‑close abstractions preserve raw bandwidth/latency while shielding developers from the intricate GPU‑CPU‑NIC coordination required for correctness.
  • A declarative DSL bridges the gap between expert‑level performance tuning and developer productivity, automatically handling synchronization and memory‑access optimizations.
  • Layered design offers a migration path: users can start with the NCCL‑compatible collective API, move to DSL for workload‑specific tuning, and finally drop to the Primitive API for the absolute performance ceiling.

In summary, MSCCL++ provides a compelling solution to the longstanding tension between performance, portability, and productivity in GPU collective communication. By exposing just enough hardware detail to enable aggressive optimizations while offering high‑level, easy‑to‑use abstractions, it delivers measurable speedups for both micro‑benchmarks and end‑to‑end AI inference, and it does so with a development model that can keep pace with the fast‑moving GPU ecosystem.

