The Landscape of GPU-Centric Communication
In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their massive parallelism and high memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks, especially as the number of GPUs per node and per cluster grows. Traditionally, the CPU managed multi-GPU communication, but advances in GPU-centric communication now challenge this CPU dominance by reducing its involvement, granting GPUs more autonomy in communication tasks, and addressing mismatches between multi-GPU communication and computation. This paper provides a landscape of GPU-centric communication, focusing on vendor mechanisms and user-level library support. It aims to clarify the complexities and diverse options in this field, define the terminology, and categorize existing approaches within and across nodes. The paper discusses vendor-provided mechanisms for communication and memory management in multi-GPU execution and reviews major communication libraries, their benefits, challenges, and performance insights. It then explores key research paradigms, future outlooks, and open research questions. By extensively describing GPU-centric communication techniques across the software and hardware stacks, we provide researchers, programmers, engineers, and library designers insights on how best to exploit multi-GPU systems.
💡 Research Summary
The paper surveys the emerging paradigm of GPU‑centric communication, which seeks to shift the responsibility for data movement from the CPU to the GPUs themselves in high‑performance computing (HPC) and machine‑learning (ML) workloads. As GPUs have become the dominant accelerator in modern supercomputers (nine of the top ten systems on the November 2025 Top500 list rely on GPU clusters), the inter‑GPU communication path has turned into a critical scalability bottleneck. Historically, the CPU orchestrated all intra‑node and inter‑node transfers, treating GPUs as pure compute devices. Over the past decade, a series of hardware and software innovations collectively termed "GPU‑centric communication" have reduced CPU involvement, granted GPUs autonomy to initiate and synchronize transfers, and aligned communication semantics with the massive parallelism of GPU kernels.
The authors first define GPU‑centric communication as any mechanism that minimizes CPU participation in the critical path of multi‑GPU execution. They then introduce a taxonomy that separates intra‑node from inter‑node communication and further classifies each by the location of the API call (host‑side vs. device‑side) and the data path (who moves the bytes). Intra‑node communication is divided into four types: (1) Host Native – traditional host‑side memcpy without peer‑to‑peer (P2P) support; (2) Host‑Controlled – host‑side calls that exploit P2P over PCIe, NVLink, or Infinity Fabric; (3) Device Native – kernel‑level load/store or SHMEM‑style operations that bypass the CPU entirely; and (4) Host Fallback – device‑side attempts that fall back to host memory when P2P is unavailable.
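One way to make this taxonomy concrete is a hypothetical path-selection routine that maps capabilities and the API call site to one of the four intra-node types. The capability flags, call-site strings, and labels below are illustrative assumptions for exposition, not a real vendor API:

```python
from dataclasses import dataclass

@dataclass
class NodeCaps:
    p2p_available: bool     # peer-to-peer over PCIe/NVLink/Infinity Fabric?
    device_initiated: bool  # may kernels issue load/store or SHMEM-style ops?

def select_intra_node_path(caps: NodeCaps, call_site: str) -> str:
    """Map (capabilities, API call site) to one of the four intra-node types."""
    if call_site == "host":
        # Host-side API call: with P2P, the copy engine moves bytes directly
        # between GPUs; without it, data stages through host memory.
        return ("Host-Controlled (P2P copy)" if caps.p2p_available
                else "Host Native (staged memcpy)")
    # Device-side API call issued from inside a kernel.
    if caps.p2p_available and caps.device_initiated:
        return "Device Native (kernel load/store)"
    return "Host Fallback (bounce through host memory)"
```

For example, a device-side call on a node with both P2P and device-initiated support resolves to the Device Native path, while the same call without P2P falls back to bouncing through host memory.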
Inter‑node communication is more complex because it must involve the network interface controller (NIC). The paper proposes a five‑step evolution: (1) Host Native – the classic CPU‑centric path with two staging copies (GPU→host buffer, then host buffer→NIC); (2) Pinned Host Native – a shared pinned host buffer eliminates one copy; (3) GPU RDMA – the NIC directly reads/writes GPU memory via Remote Direct Memory Access, eliminating the GPU‑host copy; (4) GPU‑Triggered – the GPU issues the NIC's send/receive command after the CPU has pre‑registered packets; (5) Device Native – the entire packet construction, registration, and trigger happen on the GPU, achieving zero‑copy, fully asynchronous transfers.
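The shrinking critical path across these five steps can be summarized in a small, admittedly schematic Python model. The dictionary entries and the cost formula are illustrative assumptions for exposition, not figures from the paper:

```python
# Schematic send-path model: each evolutionary step removes staging copies
# and/or CPU involvement. Hop counts are illustrative, not measured.
SEND_PATH = {
    "Host Native":        {"copies": 2, "cpu_triggers_nic": True},   # GPU->host, host->NIC
    "Pinned Host Native": {"copies": 1, "cpu_triggers_nic": True},   # shared pinned buffer
    "GPU RDMA":           {"copies": 0, "cpu_triggers_nic": True},   # NIC reads GPU memory
    "GPU-Triggered":      {"copies": 0, "cpu_triggers_nic": False},  # CPU only pre-registers
    "Device Native":      {"copies": 0, "cpu_triggers_nic": False},  # GPU does everything
}

def critical_path_cost(step: str, msg_bytes: int,
                       copy_bw: float, cpu_overhead: float) -> float:
    """Modeled send cost: staging-copy time plus a fixed CPU trigger overhead."""
    s = SEND_PATH[step]
    return (s["copies"] * msg_bytes / copy_bw
            + (cpu_overhead if s["cpu_triggers_nic"] else 0.0))
```

Under any fixed message size, copy bandwidth, and CPU overhead, the modeled cost is non-increasing from step (1) to step (5), which is the point of the evolution: each step strips another term from the critical path.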
The survey then details vendor‑provided mechanisms that constitute the building blocks for the higher‑level libraries. Memory‑management primitives include page‑locked (pinned) memory, Unified Virtual Addressing (UVA), Inter‑Process Communication (IPC) for same‑node multi‑process access, and Unified Virtual Memory (UVM) for automatic page‑fault‑driven migration. NVIDIA’s GPUDirect family is traced from the original GPUDirect 1.0 (P2P) through GPUDirect 2.0, GPUDirect RDMA, and the most recent GPUDirect Async, each adding tighter integration between GPU, PCIe/NVLink, and the NIC. Hardware milestones such as NVLink generations, NVSwitch, and modern InfiniBand NICs are placed on a timeline, showing how bandwidth and latency improvements enable the communication models described earlier.
Next, the paper compares the major user‑level libraries. NCCL (NVIDIA), RCCL (AMD), and oneCCL (Intel) focus on collective operations (All‑Reduce, All‑Gather, Broadcast) and expose an MPI‑like, GPU‑aware interface that can operate in both host‑controlled and GPU‑triggered modes. NVSHMEM, ROCSHMEM, and Intel SHMEM implement a Partitioned Global Address Space (PGAS) model, allowing direct remote memory accesses and supporting both RDMA‑based and device‑native pathways. GPU‑aware MPI implementations (e.g., MVAPICH‑GPU, OpenMPI‑GPU) integrate with these collectives to provide a unified programming model. Performance data from prior benchmarking studies indicate that GPU RDMA can deliver 2–3× higher bandwidth than PCIe‑based staged copies, while GPU‑triggered transfers reduce small‑message latency by 30–40%. NCCL remains the de facto standard for high‑throughput collectives, especially on NVIDIA's NVLink‑connected GPUs.
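The high-throughput collectives in these libraries are commonly built on ring algorithms: a reduce-scatter phase followed by an all-gather, for 2·(p−1) steps in total. A pure-Python simulation of ring all-reduce (the synchronous step loop and function name are a didactic simplification of the real pipelined, GPU-resident implementations):

```python
def ring_allreduce(data):
    """data: one equal-length list per rank; returns all ranks holding the sum."""
    p, n = len(data), len(data[0])
    assert n % p == 0, "vector length must divide evenly into p chunks"
    chunk = n // p
    sl = lambda i: slice(i * chunk, (i + 1) * chunk)

    # Reduce-scatter: after p-1 steps, rank r owns the fully reduced chunk (r+1) % p.
    for t in range(p - 1):
        # Capture all sends first (list slices copy), then apply them.
        sends = [(r, (r - t) % p, data[r][sl((r - t) % p)]) for r in range(p)]
        for r, ci, buf in sends:
            dst = (r + 1) % p
            data[dst][sl(ci)] = [a + b for a, b in zip(data[dst][sl(ci)], buf)]

    # All-gather: circulate the reduced chunks until every rank has all of them.
    for t in range(p - 1):
        sends = [(r, (r + 1 - t) % p, data[r][sl((r + 1 - t) % p)]) for r in range(p)]
        for r, ci, buf in sends:
            data[(r + 1) % p][sl(ci)] = list(buf)
    return data
```

For instance, `ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])` leaves every rank with `[12, 15, 18]`. Each rank sends only n/p elements per step, which is why ring collectives sustain near-peak link bandwidth for large messages.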
The authors identify three research paradigms shaping the field: (1) Integrated memory‑communication stacks that expose a single abstraction for both data placement and movement; (2) Multi‑tensor pipelines that overlap computation and communication at the kernel level; and (3) Asynchronous global synchronization mechanisms that exploit GPUDirect Async and GPU‑triggered NIC commands to avoid costly CPU barriers. They also list open challenges: lack of a vendor‑agnostic standard for heterogeneous GPU‑NIC interactions, robust fault‑tolerance and load‑balancing strategies for large‑scale clusters, and the need for higher‑level, programmer‑friendly APIs that hide the complexity of the underlying mechanisms while preserving performance.
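The payoff of paradigm (2) can be illustrated with a simple analytic pipeline model: splitting a tensor into chunks lets the communication of chunk i hide behind the computation of chunk i+1. The cost model and function name below are illustrative assumptions, not from the paper:

```python
def pipeline_time(compute_t: float, comm_t: float, chunks: int) -> float:
    """Modeled runtime when work is split into `chunks` pieces and the
    communication of chunk i overlaps the computation of chunk i+1."""
    c = compute_t / chunks  # per-chunk compute time
    m = comm_t / chunks     # per-chunk communication time
    # The first chunk's compute and the last chunk's communication cannot be
    # hidden; the middle stages advance at the pace of the slower resource.
    return c + (chunks - 1) * max(c, m) + m
```

With equal compute and communication times of 10 ms, a single chunk gives the serial 20 ms, while ten chunks bring the modeled time down to 11 ms, approaching the max(compute, comm) lower bound that perfect overlap allows.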
In conclusion, the paper argues that GPU‑centric communication is no longer a niche optimization but a prerequisite for scaling modern HPC and AI workloads. The synergy of hardware advances (NVLink, NVSwitch, RDMA‑capable NICs), runtime features (pinned memory, UVA, UVM, IPC), and sophisticated libraries (NCCL, SHMEM families, GPU‑aware MPI) yields substantial reductions in latency and increases in bandwidth. While NVIDIA currently offers the most mature ecosystem, AMD and Intel provide comparable capabilities, and future progress will hinge on cross‑vendor standardization and deeper integration of communication primitives into programming models. The survey serves as a roadmap for researchers, system architects, and library developers aiming to exploit the full potential of multi‑GPU systems.