Title: UCCL-EP: Portable Expert-Parallel Communication
ArXiv ID: 2512.19849
Date: 2025-12-22
Authors: Ziming Mao†, Yihan Zhang‡, Chihan Cui§, Zhen Huang¶, Kaichao You♣, Zhongjie Chen♡, Zhiying Xu♠, Zhenyu Gu¶, Scott Shenker†⋄, Costin Raiciu★, Yang Zhou‡, Ion Stoica† († University of California, Berkeley; ‡ University of California, Davis; § University of Wisconsin–Madison; ¶ AMD; ♣ Independent Researcher; ♡ Tsinghua University; ♠ Amazon Web Services; ⋄ International Computer Science Institute (ICSI); ★ Broadcom & University Politehnica of Bucharest)
📝 Abstract
Mixture-of-Experts (MoE) workloads rely on expert parallelism (EP) to achieve high GPU efficiency. State-of-the-art EP communication systems such as DeepEP demonstrate strong performance but exhibit poor portability across heterogeneous GPU and NIC platforms. This poor portability is rooted in their architecture: GPU-initiated token-level RDMA communication requires tight vertical integration between GPUs and NICs, e.g., GPU writes to NIC driver/MMIO interfaces.
We present UCCL-EP, a portable EP communication system that delivers DeepEP-level performance across heterogeneous GPU and NIC hardware. UCCL-EP replaces GPU-initiated RDMA with a high-throughput GPU-CPU control channel: compact token-routing commands are transferred to multithreaded CPU proxies, which then issue GPUDirect RDMA operations on behalf of GPUs. UCCL-EP further emulates various ordering semantics required by specialized EP communication modes using RDMA immediate data, enabling correctness on NICs that lack such ordering, e.g., AWS EFA. We implement UCCL-EP on NVIDIA and AMD GPUs with EFA and Broadcom NICs. On EFA, it outperforms the best existing EP solution by up to $2.1\times$ for dispatch and combine throughput. On an NVIDIA-only platform, UCCL-EP achieves performance comparable to the original DeepEP. UCCL-EP also improves token throughput on SGLang by up to 40% on the NVIDIA+EFA platform, and improves DeepSeek-V3 training throughput over the AMD Primus/Megatron-LM framework by up to 45% on a 16-node AMD+Broadcom platform.
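The control-channel design summarized above can be pictured with a small host-side mock-up. The sketch below is not UCCL-EP code: the `Command` layout, queue size, and `CpuProxy` class are hypothetical, and the actual GPUDirect RDMA write (done via ibverbs in the real system) is stubbed out; it only illustrates the control flow of GPUs enqueuing compact routing commands for CPU proxies instead of driving the NIC themselves.

```python
# Conceptual mock-up of the GPU -> CPU-proxy control channel (hypothetical names,
# no real GPU or RDMA calls): a producer enqueues compact token-routing commands,
# and a CPU proxy thread drains them and stands in for issuing GPUDirect RDMA writes.
import collections
import queue
import threading

Command = collections.namedtuple("Command", "token_id dst_rank nbytes")


class CpuProxy(threading.Thread):
    """Drains routing commands; the RDMA write is stubbed with a print."""

    def __init__(self, cmd_queue: queue.Queue):
        super().__init__(daemon=True)
        self.cmd_queue = cmd_queue

    def run(self):
        while True:
            cmd = self.cmd_queue.get()
            if cmd is None:  # shutdown sentinel
                return
            # A real proxy would post an ibverbs RDMA WRITE (with immediate data
            # when ordering must be emulated); here we only log the intent.
            print(f"RDMA_WRITE token {cmd.token_id} -> rank {cmd.dst_rank} ({cmd.nbytes} B)")


if __name__ == "__main__":
    cmds = queue.Queue(maxsize=1024)  # stands in for the GPU-CPU ring buffer
    proxy = CpuProxy(cmds)
    proxy.start()

    # "GPU side": publish compact commands instead of touching the NIC directly.
    for tok in range(4):
        cmds.put(Command(token_id=tok, dst_rank=tok % 2, nbytes=7 * 1024))
    cmds.put(None)
    proxy.join()
```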
📄 Full Content
UCCL-EP: Portable Expert-Parallel Communication
Ziming Mao†, Yihan Zhang‡, Chihan Cui§, Zhen Huang¶, Kaichao You♣, Zhongjie Chen♡, Zhiying Xu♠, Zhenyu Gu¶, Scott Shenker†⋄, Costin Raiciu⋆, Yang Zhou‡, Ion Stoica†
†UC Berkeley  ‡UC Davis  §UW–Madison  ¶AMD  ♣Independent Researcher  ♡Tsinghua University  ♠Amazon Web Services  ⋄ICSI  ⋆Broadcom & University Politehnica of Bucharest
Abstract

Mixture-of-Experts (MoE) workloads rely on expert parallelism (EP) to achieve high GPU efficiency. State-of-the-art EP communication systems such as DeepEP demonstrate strong performance but exhibit poor portability across heterogeneous GPU and NIC platforms. This poor portability is rooted in their architecture: GPU-initiated token-level RDMA communication requires tight vertical integration between GPUs and NICs, e.g., GPU writes to NIC driver/MMIO interfaces.

We present UCCL-EP, a portable EP communication system that delivers DeepEP-level performance across heterogeneous GPU and NIC hardware. UCCL-EP replaces GPU-initiated RDMA with a high-throughput GPU-CPU control channel: compact token-routing commands are transferred to multithreaded CPU proxies, which then issue GPUDirect RDMA operations on behalf of GPUs. UCCL-EP further emulates various ordering semantics required by specialized EP communication modes using RDMA immediate data, enabling correctness on NICs that lack such ordering, e.g., AWS EFA. We implement UCCL-EP on NVIDIA and AMD GPUs with EFA and Broadcom NICs. On EFA, it outperforms the best existing EP solution by up to 2.1× for dispatch and combine throughput. On an NVIDIA-only platform, UCCL-EP achieves performance comparable to the original DeepEP. UCCL-EP also improves token throughput on SGLang by up to 40% on the NVIDIA+EFA platform, and improves DeepSeek-V3 training throughput over the AMD Primus/Megatron-LM framework by up to 45% on a 16-node AMD+Broadcom platform.
1 Introduction
Figure 1: Assuming m GPU vendors and n NIC vendors, UCCL-EP enables O(m) effort, instead of IBGDA's O(m×n), to support GPU-initiated token-level communication for expert parallelism. (a) IBGDA-style: each GPU (NVIDIA, AMD, Intel, ...) pairs directly with each NIC's driver (NVIDIA, AWS EFA, Broadcom, AMD, ...). (b) UCCL-EP: GPUs reach all NICs through the CPU via Linux libibverbs (portable to all NICs).

♠ This work does not relate to the position at Amazon.

State-of-the-art large language models (LLMs), such as DeepSeek-V3 [13,32], OpenAI gpt-oss [48], Google Gemini-3 Pro [18], and Meta LLaMA 4 [2], are increasingly based on the Mixture-of-Experts (MoE) architecture. In a MoE layer, a gating network running on GPUs selects a small subset of experts for each token activation, dispatches the token activation to those experts, and then aggregates their output activations. Modern MoE models typically instantiate hundreds of experts that specialize in different input patterns, so that only a few experts are active for each token. This sparsity allows MoE models to achieve accuracy comparable to large dense models while using only a fraction of the per-token inference cost, making them the standard choice for many frontier LLMs.
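As a deliberately tiny illustration of the gate-dispatch-combine cycle described above, the numpy sketch below picks the top-k experts per token and combines their outputs weighted by the gate probabilities. All shapes, the value of k, and the toy linear "experts" are made up for illustration and are not taken from the paper.

```python
# Toy numpy sketch of one MoE layer's gate -> dispatch -> combine cycle.
# All sizes are illustrative; real models use hundreds of experts and learned weights.
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 32))           # [num_tokens, hidden]
num_experts, k = 8, 2
gate_w = rng.standard_normal((32, num_experts))  # gating network weights

logits = tokens @ gate_w
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)
topk_ids = np.argsort(-probs, axis=-1)[:, :k]          # chosen experts per token
topk_w = np.take_along_axis(probs, topk_ids, axis=-1)  # their gate weights

experts = [rng.standard_normal((32, 32)) for _ in range(num_experts)]  # toy experts
out = np.zeros_like(tokens)
for e in range(num_experts):
    tok_idx, slot = np.nonzero(topk_ids == e)   # tokens dispatched to expert e
    if tok_idx.size:
        # combine: weight each expert's output by its gate probability
        out[tok_idx] += topk_w[tok_idx, slot, None] * (tokens[tok_idx] @ experts[e])
```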
Training and serving large MoE models require expert parallelism (EP), which places different experts on different GPUs and communicates token activations among GPUs in an all-to-all manner. By sparsely sharding experts on different GPUs, EP leaves enough GPU memory for matrix multiplication on extremely large batch sizes (e.g., 4096 in DeepSeek-V3 [32]), thus enabling high GPU resource efficiency. Expert-parallel communication plays a pivotal role in EP efficiency [32,64], because token activations are small (e.g., 7KB), dispatch and combine operations are frequent (e.g., selecting 8 experts per token), and routing destinations are only determined at runtime in GPUs (§2.1).
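To make the "routing determined at runtime" point concrete, the sketch below assumes a contiguous expert-to-rank sharding (an assumed layout for illustration, not necessarily the paper's) and shows that per-rank dispatch volumes only become known after the gate runs.

```python
# Sketch: runtime gate output -> per-EP-rank dispatch counts.
# Assumes experts are sharded contiguously across ranks (illustrative layout).
import numpy as np

num_experts, ep_ranks, k = 256, 32, 8
experts_per_rank = num_experts // ep_ranks

rng = np.random.default_rng(0)
topk_ids = rng.integers(0, num_experts, size=(4096, k))  # gate output, known only at runtime

dst_ranks = topk_ids // experts_per_rank                 # [tokens, k] -> destination EP rank
send_counts = np.bincount(dst_ranks.ravel(), minlength=ep_ranks)
print(send_counts)  # (token, expert) copies each rank would receive for this layer
```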
GPU-initiated token-level (fine-grained) communication (§2.2) is an emerging and key communication pattern for efficient token dispatch and combine at runtime, where DeepEP [65] by DeepSeek is the most popular communication system implementing it. Different from CPU-initiated bulk-transfer (coarse-grained) communication in NCCL/RCCL [4,44], DeepEP leverages the advanced NVIDIA IBGDA (InfiniBand GPUDirect Async) [46] technique that enables GPUs to directly operate RDMA NICs (network interface controllers) to write out small activations. Leveraging GPU-initiated token-level communication, DeepEP implements efficient GPU-side token deduplication during dispatch (avoiding sending duplicate tokens to experts on the same node) and hierarchical reduce during combine to achieve superior performance. DeepEP has been widely adopted by various training and serving frameworks such as Megatron-LM [7], vLLM [61], and SGLang [60].
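The node-level deduplication idea can be illustrated in a few lines: when several of a token's chosen experts land on GPUs within one node, the activation crosses the network once per node and is fanned out locally. The 8-GPUs-per-node and experts-per-rank numbers below are assumptions for illustration only.

```python
# Sketch of node-level token deduplication during dispatch (illustrative layout:
# 8 GPUs per node, 8 experts per rank; not taken from the paper).
import numpy as np

gpus_per_node, experts_per_rank = 8, 8
topk_ids = np.array([3, 12, 70, 79])        # one token's chosen experts
dst_ranks = topk_ids // experts_per_rank    # -> ranks [0, 1, 8, 9]
dst_nodes = dst_ranks // gpus_per_node      # -> nodes [0, 0, 1, 1]

unique_nodes = np.unique(dst_nodes)         # send the activation once per node
print(f"{topk_ids.size} expert copies -> {unique_nodes.size} inter-node sends")
```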
Although GPU-initiated token-level communication leads to high performance, its design unfortunately results in poor portability. There are two key reasons: GPUs directly issuing