Efficient MoE Serving in the Memory-Bound Regime: Balance Activated Experts, Not Tokens

Reading time: 2 minutes

📝 Original Info

  • Title: Efficient MoE Serving in the Memory-Bound Regime: Balance Activated Experts, Not Tokens
  • ArXiv ID: 2512.09277
  • Date: 2025-12-10
  • Authors: Yanpeng Yu, Haiyue Ma, Krish Agarwal, Nicolai Oswald, Qijing Huang, Hugo Linsenmaier, Chunhui Mei, Ritchie Zhao, Ritika Borkar, Bita Darvish Rouhani, David Nellans, Ronny Krashinsky, Anurag Khandelwal

📝 Abstract

Expert Parallelism (EP) permits Mixture of Experts (MoE) models to scale beyond a single GPU. To address load imbalance across GPUs in EP, existing approaches aim to balance the number of tokens each GPU processes. Surprisingly, we find that this objective degrades performance rather than improving it when processing is memory-bound, a common occurrence in MoE serving, especially in the decode phase. Our analysis reveals that balancing the number of tokens processed per GPU increases the number of activated experts, exacerbating memory pressure in the memory-bound regime. We propose METRO, a novel token-routing algorithm for high-performance expert-parallel MoE serving in the memory-bound regime that balances the number of activated experts per GPU rather than token counts. METRO achieves near-optimal routing quality with minimal computational overhead by jointly optimizing algorithmic efficiency and leveraging the GPU's parallel processing power. To guarantee routing quality, METRO also employs a novel all-gather scheme to gather global top-k knowledge, which has min...
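
To make the objective contrast concrete, below is a toy Python sketch, not METRO's actual algorithm: it compares a greedy router that balances per-GPU token counts against one that balances per-GPU activated-expert counts. The round-robin expert placement, top-2 candidate sets, and greedy tie-breaking are illustrative assumptions only.

```python
# Toy sketch of the two routing objectives (illustrative assumptions, not METRO).
# Each token may be served by either of its two candidate experts; experts are
# placed round-robin across GPUs; the router greedily picks the candidate whose
# GPU has the smaller current load under the chosen objective.
import random
from collections import defaultdict

NUM_EXPERTS, NUM_GPUS, NUM_TOKENS = 16, 4, 64
gpu_of = {e: e % NUM_GPUS for e in range(NUM_EXPERTS)}  # static expert placement

random.seed(0)
candidates = [random.sample(range(NUM_EXPERTS), 2) for _ in range(NUM_TOKENS)]

def route(objective):
    tokens_on_gpu = defaultdict(int)    # tokens routed to each GPU
    experts_on_gpu = defaultdict(set)   # experts activated on each GPU
    for cands in candidates:
        def load(e):
            g = gpu_of[e]
            if objective == "tokens":
                return tokens_on_gpu[g]          # balance routed token counts
            # "experts": activating a new expert costs an extra weight read,
            # so balance the number of activated experts per GPU instead.
            return len(experts_on_gpu[g] | {e})
        e = min(cands, key=load)
        g = gpu_of[e]
        tokens_on_gpu[g] += 1
        experts_on_gpu[g].add(e)
    return ([tokens_on_gpu[g] for g in range(NUM_GPUS)],
            [len(experts_on_gpu[g]) for g in range(NUM_GPUS)])

for objective in ("tokens", "experts"):
    tok, act = route(objective)
    print(f"balance {objective:7s}: tokens/GPU={tok} activated experts/GPU={act}")
```

The point of the sketch is that the expert-balancing objective tends to co-route tokens onto experts that are already activated on a GPU, so each GPU reads fewer expert weight matrices per step, which is the quantity that matters in the memory-bound decode phase.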

📄 Full Content

...(The full content is omitted due to its length. Please see the full text on the site.)
