Remoe: Towards Efficient and Low-Cost MoE Inference in Serverless Computing

Reading time: 5 minutes
...

📝 Original Info

  • Title: Remoe: Towards Efficient and Low-Cost MoE Inference in Serverless Computing
  • ArXiv ID: 2512.18674
  • Date: 2025-12-21
  • Authors: Wentao Liu (Southeast University, China), Yuhao Hu (Southeast University, China), Ruiting Zhou (Southeast University, China; corresponding author), Baochun Li (University of Toronto, Canada), Ne Wang (The Hong Kong Polytechnic University, Hong Kong)

📝 Abstract

Mixture-of-Experts (MoE) has become a dominant architecture in large language models (LLMs) due to its ability to scale model capacity via sparse expert activation. Meanwhile, serverless computing, with its elasticity and pay-per-use billing, is well-suited for deploying MoEs with bursty workloads. However, the large number of experts in MoE models incurs high inference costs due to memory-intensive parameter caching. These costs are difficult to mitigate via simple model partitioning due to input-dependent expert activation. To address these issues, we propose Remoe, a heterogeneous MoE inference system tailored for serverless computing. Remoe assigns non-expert modules to GPUs and expert modules to CPUs, and further offloads infrequently activated experts to separate serverless functions to reduce memory overhead and enable parallel execution. We incorporate three key techniques: (1) a Similar Prompts Searching (SPS) algorithm to predict expert activation patterns based on semantic similarity of inputs; (2) a Main Model Pre-allocation (MMP) algorithm to ensure service-level objectives (SLOs) via worst-case memory estimation; and (3) a joint memory and replica optimization framework leveraging Lagrangian duality and the Longest Processing Time (LPT) algorithm. We implement Remoe on Kubernetes and evaluate it across multiple LLM benchmarks. Experimental results show that Remoe reduces inference cost by up to 57% and cold start latency by 47% compared to state-of-the-art baselines.
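To make the Similar Prompts Searching (SPS) idea from the abstract more concrete, here is a minimal sketch of predicting expert activation by reusing the pattern of the most semantically similar past prompt. This is only an illustration of the general idea, not the paper's actual algorithm; `embed()` is a hypothetical stand-in for any sentence-embedding model, and the history format is an assumption.

```python
import numpy as np

def embed(prompt: str) -> np.ndarray:
    """Hypothetical placeholder for a real sentence-embedding model."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

def predict_active_experts(new_prompt, history):
    """history: list of (prompt_embedding, set_of_activated_expert_ids)."""
    q = embed(new_prompt)
    # cosine similarity against every cached prompt embedding (all unit-norm)
    sims = [float(q @ e) for e, _ in history]
    best = int(np.argmax(sims))
    return history[best][1]  # reuse the activation pattern of the closest prompt

history = [
    (embed("Summarize this news article"), {0, 3, 7}),
    (embed("Write a Python quicksort"), {1, 4, 6}),
]
print(predict_active_experts("Implement merge sort in Python", history))
```

In the actual system, such a prediction would inform which experts to prefetch or keep resident before the request runs.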

💡 Deep Analysis

Figure 1

📄 Full Content

Remoe: Towards Efficient and Low-Cost MoE Inference in Serverless Computing

Wentao Liu, Yuhao Hu, Ruiting Zhou (School of Computer Science and Engineering, Southeast University, China); Baochun Li (Department of Electrical and Computer Engineering, University of Toronto, Canada); Ne Wang (Department of Computing, The Hong Kong Polytechnic University, Hong Kong)
Email: (liuwentao, yuhaohu, ruitingzhou)@seu.edu.cn, bli@ece.toronto.edu, newang@polyu.edu.hk
Corresponding author: Ruiting Zhou (email: ruitingzhou@seu.edu.cn)

Abstract—Mixture-of-Experts (MoE) has become a dominant architecture in large language models (LLMs) due to its ability to scale model capacity via sparse expert activation. Meanwhile, serverless computing, with its elasticity and pay-per-use billing, is well-suited for deploying MoEs with bursty workloads. However, the large number of experts in MoE models incurs high inference costs due to memory-intensive parameter caching. These costs are difficult to mitigate via simple model partitioning due to input-dependent expert activation. To address these issues, we propose Remoe, a heterogeneous MoE inference system tailored for serverless computing. Remoe assigns non-expert modules to GPUs and expert modules to CPUs, and further offloads infrequently activated experts to separate serverless functions to reduce memory overhead and enable parallel execution. We incorporate three key techniques: (1) a Similar Prompts Searching (SPS) algorithm to predict expert activation patterns based on semantic similarity of inputs; (2) a Main Model Pre-allocation (MMP) algorithm to ensure service-level objectives (SLOs) via worst-case memory estimation; and (3) a joint memory and replica optimization framework leveraging Lagrangian duality and the Longest Processing Time (LPT) algorithm. We implement Remoe on Kubernetes and evaluate it across multiple LLM benchmarks. Experimental results show that Remoe reduces inference cost by up to 57% and cold start latency by 47% compared to state-of-the-art baselines.

I. INTRODUCTION

The rise of large language models (LLMs) has ushered in a new era of deep learning applications, enabling capabilities such as advanced text generation and context-aware understanding [1], [2]. Among recent LLM architectures, the Mixture-of-Experts (MoE) model has emerged as a promising solution to scale model capacity without proportionally increasing inference computation. The foundational MoE architecture replaces a transformer's standard feed-forward network (FFN) with multiple expert FFNs and a gating network for token-to-expert routing [3]. This approach allows for building vastly larger and more capable models, as only a fraction of the model's total parameters (experts) are used for any given inference task. Meanwhile, serverless computing has gained traction as a cost-effective deployment paradigm for machine learning (ML) inference [4], owing to its elasticity, fine-grained billing, and simplified resource management [5]. These features make it particularly attractive for LLM inference workloads that exhibit bursty traffic [6].

However, the convergence of MoE models and serverless platforms is far from straightforward. Pricing for serverless computing is the product of the resources allocated to a function and its execution duration. While MoE's sparse expert activation is efficient, its vast number of experts introduces unique challenges under the serverless pricing model.
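The token-to-expert routing described above is the standard top-k MoE pattern. The sketch below is a toy PyTorch illustration of that pattern (a gating network scores all experts, only the top-k expert FFNs run per token); it is not Remoe's code, and the dimensions and layer choices are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Toy MoE layer: a gating network routes each token to its top-k expert FFNs."""
    def __init__(self, d_model=64, d_ff=256, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                          # x: [tokens, d_model]
        scores = self.gate(x)                      # [tokens, num_experts]
        topk_val, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_val, dim=-1)      # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = topk_idx[:, slot] == e      # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

layer = TopKMoELayer()
tokens = torch.randn(16, 64)
print(layer(tokens).shape)   # torch.Size([16, 64]); only k of the 8 experts run per token
```

Because only k experts execute per token while all experts' parameters must stay loadable, memory (not compute) dominates the serverless bill, which is the cost problem the paper targets.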
The primary challenge stems from the massive memory requirement of MoE models. Deploying the full model as a single serverless function typically requires loading all experts into memory, even if most are unused. This results in significant memory waste and high costs during inference, especially when expensive GPU memory is involved. To address the high memory occupation of MoE models, expert offloading has been widely studied [7]–[10], where most experts are cached in slower CPU memory and only the predicted active experts are dynamically transferred to the GPU for inference. Existing offloading methods such as fMoE [7] and HOBBIT [10] implement dynamic expert swapping between the CPU and GPU through expert prefetching techniques. These approaches, however, still require a large, continuously provisioned memory pool on the CPU to hold the inactive experts. This persistent memory allocation fails to eliminate cost inefficiencies, making existing solutions suboptimal for serverless MoE inference.

To mitigate the high memory costs, distributing experts across multiple serverless functions is a natural strategy. Unfortunately, this approach is complicated by the unbalanced and unpredictable nature of expert activation in MoE. In MoE inference, the activated experts depend heavily on the input prompt and vary across requests. Several studies [7], [9], [11] have shown that for a single prompt, expert activation frequencies vary significantly, and this specialized pattern is difficult to predict.
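As a rough illustration of the split discussed above (keep frequently activated experts resident, push rarely activated ones to separate serverless functions), here is a hedged sketch. The simple rank-and-threshold rule and all names here are illustrative assumptions, not Remoe's actual MMP or placement algorithm.

```python
def split_experts(activation_freq, resident_budget):
    """Illustrative frequency-based split (not the paper's algorithm).

    activation_freq: {expert_id: predicted activation count}
    resident_budget: how many experts fit in the main function's memory
    """
    ranked = sorted(activation_freq, key=activation_freq.get, reverse=True)
    resident = set(ranked[:resident_budget])    # hot experts stay in the main function
    offloaded = set(ranked[resident_budget:])   # cold experts go to remote functions
    return resident, offloaded

freq = {0: 120, 1: 5, 2: 48, 3: 2, 4: 90, 5: 1, 6: 33, 7: 0}
hot, cold = split_experts(freq, resident_budget=3)
print(hot, cold)   # the three most frequently activated experts stay resident
```

The hard part, which the paper addresses with prediction (SPS) and worst-case provisioning (MMP), is that `activation_freq` is not known in advance and shifts from request to request.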

📸 Image Gallery

LPT.png bill_run.png cold_start_deepseekv2.png cold_start_moegpt2.png cost_tokens_deepseekv2.png cost_tokens_moegpt2.png deepseek_curve.png moegpt2_curve.png overall_deepseekv2.png overall_moegpt2.png predict_deepseekv2.png predict_gpt2.png prefill_decode_compare.png prove_lmsys_gpt2.png remote_ratio_combined.png system_model.png

Reference

This content is AI-processed based on open access ArXiv data.
