Efficient MoE Inference with Fine-Grained Scheduling of Disaggregated Expert Parallelism

Reading time: 5 minutes
...

📝 Original Info

  • Title: Efficient MoE Inference with Fine-Grained Scheduling of Disaggregated Expert Parallelism
  • ArXiv ID: 2512.21487
  • Date: 2025-12-25
  • Authors: Xinglin Pan, Shaohuai Shi, Wenxiang Lin, Yuxin Wang, Zhenheng Tang, Wei Wang, Xiaowen Chu

📝 Abstract

The mixture-of-experts (MoE) architecture scales model size with sublinear computational increase but suffers from memory-intensive inference due to KV caches and sparse expert activation. Recent disaggregated expert parallelism (DEP) distributes attention and experts to dedicated GPU groups but lacks support for shared experts and efficient task scheduling, limiting performance. We propose FinDEP, a fine-grained task scheduling algorithm for DEP that maximizes task overlap to improve MoE inference throughput. FinDEP introduces three innovations: 1) partitioning computation/communication into smaller tasks for fine-grained pipelining, 2) formulating a scheduling optimization supporting variable granularity and ordering, and 3) developing an efficient solver for this large search space. Experiments on four GPU systems with DeepSeek-V2 and Qwen3-MoE show FinDEP improves throughput by up to 1.61x over prior methods, achieving up to 1.24x speedup on a 32-GPU system.

💡 Deep Analysis

📄 Full Content

Large language models (LLMs) are scaling rapidly, and so are their computational costs. For instance, models like Falcon [1] with 180 billion parameters and Llama-3.1 [27] with 405 billion parameters exemplify this trend. Mixture-of-Experts (MoE) architectures [12,15,24] address this challenge by activating only a subset of the model’s expert components for each input. This makes it possible to build much larger models without making training or inference more expensive. Recent MoE-based LLMs, such as DeepSeek-V3 [6] and Qwen3-MoE [28], show that this design can create highly capable models that are still fast and cheap to use. As a result, MoE has become a key technique for building future LLMs in a way that balances power and efficiency.

[Figure 1 caption, fragment: MLA denotes Multi-Head Latent Attention [5], while MHA denotes Multi-Head Attention [30]; the "Shared" block indicates one or several shared experts, which may be optional depending on the MoE configuration.]

Despite the advantages, running inference on large MoE models remains challenging [2,6,13,34,37] due to its extensive memory requirement to hold all experts in the MoE layers and key-value (KV) caches in the attention layers. As a result, distributing the MoE model across multiple GPUs has been a common practice [6,15,36] for efficient inference through expert parallelism (EP), which assigns experts across GPUs.
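
To make the layout concrete, here is a minimal sketch of how plain expert parallelism assigns the routed experts of one MoE layer across GPUs. The round-robin placement and the variable names are our own illustration, not the mapping used by any particular system.

```python
# Toy illustration of expert parallelism (EP): the routed experts of one MoE
# layer are partitioned across GPUs (round-robin here, purely as an example),
# while each GPU also keeps attention layers and KV caches for its tokens.
num_gpus, num_experts = 4, 16
expert_to_gpu = {e: e % num_gpus for e in range(num_experts)}
gpu_to_experts = {g: [e for e, owner in expert_to_gpu.items() if owner == g]
                  for g in range(num_gpus)}
print(gpu_to_experts)  # {0: [0, 4, 8, 12], 1: [1, 5, 9, 13], 2: [2, 6, 10, 14], 3: [3, 7, 11, 15]}
```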

Recent research [17,29,36] suggests distributing attention layers and expert layers onto distinct GPUs through disaggregated expert parallelism (DEP), motivated by the different computational and memory-access patterns of attention layers and expert layers. This approach enables the two modules to scale independently while making better use of heterogeneous hardware capabilities. In DEP, a multi-GPU system is divided into two groups: the attention group (AG), which stores all attention layers, and the expert group (EG), which holds all non-shared experts. Note that in certain MoE models such as DeepSeek-V3 [6], the shared experts within each MoE layer are typically placed in the AG because they must process every input token. The dependency between attention and expert layers is substantial: each attention layer’s output serves as the input to the subsequent expert layer, whose output in turn feeds the next attention layer, as shown in Fig. 1. Consequently, DEP requires bidirectional communication: from AG to EG (A2E) and the reverse (E2A). These data dependencies and communication overheads easily leave GPU computational resources idle, limiting inference efficiency.
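
The per-layer dependency chain can be sketched as follows. The function names and the toy computations are placeholders of our own, only meant to show that each attention step in the AG must wait for the routed-expert result to return from the EG before the next layer can start.

```python
# Minimal sketch of the DEP per-layer data flow (placeholder math, hypothetical
# names): attention group (AG) -> A2E -> expert group (EG) -> E2A -> next layer.
import numpy as np

def attention_group_step(x, shared_expert=None):
    """Runs on AG GPUs: attention layer (stand-in) plus an optional shared expert."""
    h = 0.5 * x + 1.0                     # placeholder for attention / KV-cache work
    if shared_expert is not None:
        h = h + shared_expert(x)          # a shared expert sees every token
    return h

def expert_group_step(h, num_experts=4):
    """Runs on EG GPUs: routed (non-shared) experts over the dispatched tokens."""
    expert_id = np.argmax(h, axis=-1) % num_experts   # toy top-1 router
    return h + expert_id[:, None]                     # placeholder expert FFN

def dep_forward(x, num_layers=2):
    for _ in range(num_layers):
        h = attention_group_step(x)   # AG compute (and shared expert, if any)
        # A2E: AG -> EG communication; here just a hand-off within one process
        y = expert_group_step(h)      # EG compute
        # E2A: EG -> AG communication; the next attention layer depends on y
        x = y
    return x

print(dep_forward(np.random.randn(8, 16)).shape)  # (8, 16)
```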

Existing optimizations try to reduce the GPU idle time of DEP by 1) overlapping computation and communication tasks via the ping-pong pipeline (PP-Pipe) algorithm proposed in MegaScale-Infer [36], or 2) offloading communication tasks to CPU resources so that CPU communications overlap with GPU computations, as in StepMesh [29]. These techniques enable only coarse-grained task scheduling by dividing a mini-batch into several micro-batches. Tasks from different micro-batches can then be executed in a pipelined fashion, but this does not sufficiently hide the A2E/E2A communications, leading to suboptimal inference efficiency. Moreover, cutting-edge MoE models such as the DeepSeek series [4][5][6] introduce shared experts within the MoE layer, which must be computed for every input token (similar to the attention layer), further increasing GPU idle time.
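
A back-of-the-envelope model (our own, with made-up stage times) shows why coarse micro-batching helps but does not fully hide communication: the pipeline still advances a whole micro-batch at a time, so the slowest stage keeps the other resources waiting.

```python
# Toy timing model (invented numbers) for one MoE layer under DEP.
# Stages: AG attention, A2E communication, EG experts, E2A communication.
def layer_time_serial(t_ag, t_a2e, t_eg, t_e2a):
    return t_ag + t_a2e + t_eg + t_e2a           # no overlap: AG idles during A2E/EG/E2A

def layer_time_microbatched(m, t_ag, t_a2e, t_eg, t_e2a):
    per_mb = [t / m for t in (t_ag, t_a2e, t_eg, t_e2a)]
    # Standard 4-stage pipeline approximation: fill the pipeline once, then the
    # slowest stage dominates for the remaining m - 1 micro-batches.
    return sum(per_mb) + (m - 1) * max(per_mb)

print(layer_time_serial(4.0, 2.0, 3.0, 2.0))            # 11.0
print(layer_time_microbatched(2, 4.0, 2.0, 3.0, 2.0))   # 7.5 -- better, but AG still waits
```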

In this paper, we propose FinDEP, a fine-grained task scheduling framework for MoE inference with DEP that addresses the above two efficiency problems through three key innovations. (1) We partition time-consuming tasks, including computations in the EG, A2E and E2A communications, and computations in the AG, into smaller tasks by splitting each task’s input tensor into several segments (denoted as r). This partitioning creates r smaller tasks per original task, enabling dynamic scheduling that improves throughput for MoE models regardless of whether they have shared experts. (2) Intuitively, increasing r allows greater parallelization and thus better overlapping. However, it also increases the launch overheads of executing tasks, such as kernel dispatch on GPUs and communication startup costs, so a balance must be struck between the benefits of overlapping and the execution overheads. We therefore construct performance models for the computation tasks in the AG and EG and for their A2E/E2A communication tasks. Using these models, we formulate an optimization problem that characterizes the DEP inference time under fine-grained task scheduling, covering both task ordering and tensor partition granularity. (3) We develop an efficient algorithm that finds a near-optimal solution to the formulated optimization problem in polynomial time, avoiding a prohibitively expensive brute-force search over the huge solution space.
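
The granularity trade-off in point (2) can be illustrated with a toy cost model (our own simplification with invented constants, not FinDEP's actual performance model): splitting each task into r segments shrinks the pipeline bottleneck, but every segment pays a fixed launch cost, so some intermediate r minimizes the estimated layer time.

```python
# Toy cost model (invented constants, not FinDEP's performance model) for
# picking the partition granularity r of one MoE layer's four task types.
def pipeline_time(r, t_ag, t_a2e, t_eg, t_e2a, launch=0.05):
    """Estimated layer time when every task is split into r segments."""
    seg = [t / r + launch for t in (t_ag, t_a2e, t_eg, t_e2a)]  # per-segment cost
    # Pipeline approximation: fill the four stages once, then the slowest
    # segment type repeats for the remaining r - 1 segments.
    return sum(seg) + (r - 1) * max(seg)

costs = {r: pipeline_time(r, t_ag=4.0, t_a2e=2.0, t_eg=3.0, t_e2a=2.0)
         for r in range(1, 17)}
best_r = min(costs, key=costs.get)
print(best_r, round(costs[best_r], 3))  # an intermediate r beats both r=1 and r=16
```

The real formulation in FinDEP additionally optimizes the ordering of the fine-grained tasks and relies on measured performance models for each task type, which this toy model deliberately ignores.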

We conduct experiments on four GPU systems with DeepSeek-V2 and Qwen3-MoE models. The results show that FinDEP improves inference throughput by up to 1.61x over prior methods, achieving up to 1.24x speedup on a 32-GPU system.

