Pre-Attention Expert Prediction and Prefetching for Mixture-of-Experts Large Language Models
📝 Abstract
Mixture-of-Experts (MoE) Large Language Models (LLMs) efficiently scale up models while keeping inference cost relatively low. As MoE models only activate part of the experts, related work has proposed expert prediction and caching methods to prefetch the experts for faster inference. However, existing approaches utilize the activations from the previous layer for prediction, which yields low accuracy and leaves the first layer unoptimized. Applying complex layers or even training standalone networks for better prediction introduces high computation overhead. In this paper, we propose pre-attention expert prediction to achieve accurate and lightweight expert prefetching. The key insight is that some functions in LLMs are ranking-preserving, indicating that matching the ranking of selected experts using simple linear functions is possible. Therefore, we utilize the activations before the attention block in the same layer with two linear functions and a ranking-aware loss to achieve accurate prediction, which also supports prefetching in the first layer. Our lightweight, pre-attention expert routers achieve 93.03% accuracy on DeepSeek V2 Lite, 94.69% on Qwen3-30B, and 97.62% on Phi-mini-MoE, about 15 percentage points higher in absolute accuracy than state-of-the-art methods.
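The mechanism the abstract describes can be sketched roughly as follows. This is a minimal, hypothetical NumPy illustration — the weight matrices, sizes, and the ReLU between the two linear maps are our assumptions, not the authors' implementation. It only shows the shape of the idea: score experts from the hidden state *before* the attention block using two cheap linear maps, prefetch the predicted top-k, and measure accuracy as the overlap with the true router's top-k choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not taken from any specific model).
d_model, d_hidden, n_experts, top_k = 64, 32, 16, 4

# Hypothetical pre-attention router: just two linear maps, so expert
# prefetching can start before the attention block of the same layer runs.
W1 = rng.normal(size=(d_model, d_hidden)) * 0.1
W2 = rng.normal(size=(d_hidden, n_experts)) * 0.1

def predict_experts(h_pre_attn):
    """Predict the top-k expert ids from the pre-attention hidden state."""
    scores = np.maximum(h_pre_attn @ W1, 0.0) @ W2  # two linear maps + ReLU
    return set(np.argsort(scores)[-top_k:])         # predicted top-k expert ids

def topk_overlap_accuracy(pred_ids, true_ids):
    """Fraction of the truly selected experts that were prefetched."""
    return len(pred_ids & true_ids) / len(true_ids)

h = rng.normal(size=(d_model,))
pred = predict_experts(h)
# Stand-in for the real MoE router's selection (random here, for illustration).
true = set(rng.choice(n_experts, size=top_k, replace=False))
print(topk_overlap_accuracy(pred, true))
```

In the paper's setting, the predictor would be trained with a ranking-aware loss so the predicted top-k set matches the real router's; here the untrained weights merely demonstrate the interface and the accuracy metric.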
📄 Content
PRE-ATTENTION EXPERT PREDICTION AND PREFETCHING FOR MIXTURE-OF-EXPERTS LARGE LANGUAGE MODELS

Shien Zhu 1, Samuel Bohl 1, Robin Oester 1, Gustavo Alonso 1

ABSTRACT

Mixture-of-Experts (MoE) Large Language Models (LLMs) efficiently scale up models while keeping inference cost relatively low. As MoE models only activate part of the experts, related work has proposed expert prediction and caching methods to prefetch the experts for faster inference. However, existing approaches utilize the activations from the previous layer for prediction, which yields low accuracy and leaves the first layer unoptimized. Applying complex layers or even training standalone networks for better prediction introduces high computation overhead. In this paper, we propose pre-attention expert prediction to achieve accurate and lightweight expert prefetching. The key insight is that some functions in LLMs are ranking-preserving, indicating that matching the ranking of selected experts using simple linear functions is possible. Therefore, we utilize the activations before the attention block in the same layer with two linear functions and a ranking-aware loss to achieve accurate prediction, which also supports prefetching in the first layer. Our lightweight, pre-attention expert routers achieve 93.03% accuracy on DeepSeek V2 Lite, 94.69% on Qwen3-30B, and 97.62% on Phi-mini-MoE, about 15 percentage points higher in absolute accuracy than the state-of-the-art FATE results.

1 INTRODUCTION

The Mixture-of-Experts (MoE) architecture is widely adopted by the latest Large Language Models (LLMs) such as Llama 4 (Meta, 2025), DeepSeek-V3 (DeepSeek-AI et al., 2025), and Qwen-3 (Team, 2025). MoE LLMs can achieve high cognition and generation performance while keeping inference cost relatively low. The key reason is that MoE LLMs only activate part of the experts in the Feed-Forward Networks (FFNs), which avoids the high computation cost of using all experts.
For example, DeepSeek-V3 only activates 1 shared expert and 8 of 256 routed experts (37 out of 671 billion parameters), significantly reducing runtime memory and computation overhead. As the MoE routers are placed inside the FFNs, the expert selection-loading-computation pipeline suffers from long expert loading times, especially when the MoE models are too large to fit in GPU memory.

To solve the expert-loading bottleneck in MoE LLMs, various prediction and caching methods have been proposed to efficiently prefetch and serve the experts. MoE-Infinity (Xue et al., 2025) designs a sparsity-aware expert cache that traces the activated experts to reduce the Time-Per-Output-Token (TPOT).

(1 Systems Group, D-INFK, ETH Zurich, Switzerland. Correspondence to: Shien Zhu <shien.zhu@inf.ethz.ch>, Gustavo Alonso <alonso@inf.ethz.ch>. Under review by a journal. Copyright 2025 by the author(s).)

PopFetcher (Zhang et al., 2025a) prefetches the experts of the next layer based on their popularity. HOBBIT (Tang et al., 2024) proposes a multi-dimensional cache manager and dynamic expert loader to accelerate expert loading. Pre-Gated MoE (Hwang et al., 2024) modifies the gate function to select the experts of the next layer and applies caching methods for faster inference.

However, we observe three problems in these prediction systems. First, it is very hard to achieve high prediction accuracy by predicting from the previous layer. SP-MoE (Chen et al., 2025) prefetches the experts for speculative decoding, which drafts multiple tokens per step, and achieves more than 70% prediction accuracy in most layers. DuoServe-MoE (Zhang et al., 2025b) accelerates inference with dual CUDA streams and a layer-level expert predictor with 54-67% top-2 accuracy. FATE (Fang et al., 2025) uses the activations from the previous layer for prediction, with its predictor reaching 78.8% accuracy.
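For context, the standard in-FFN top-k gating these systems work around can be sketched as below — a minimal NumPy illustration (shapes and names are ours, not from any specific model). It shows why the expert choice is only known inside the FFN, after attention has run, which is what forces either stalling on expert loads or predicting the choice earlier.

```python
import numpy as np

rng = np.random.default_rng(1)

# DeepSeek-V3-style sparsity: 8 of 256 routed experts active per token.
n_routed, top_k, d_model = 256, 8, 128

def route(hidden, gate_weight):
    """Standard top-k gating inside the FFN: the expert selection only
    becomes known here, after the attention block has produced `hidden`,
    so naive expert loading stalls the selection-loading-computation pipeline."""
    logits = hidden @ gate_weight
    top = np.argsort(logits)[-top_k:]                # selected expert ids
    w = np.exp(logits[top] - logits[top].max())      # softmax over the top-k
    return top, w / w.sum()                          # ids, normalized gate weights

gate_w = rng.normal(size=(d_model, n_routed))
experts, gates = route(rng.normal(size=(d_model,)), gate_w)
print(sorted(experts.tolist()), gates.round(3))
```

Only the `top_k` selected experts' weights then need to be resident in GPU memory for this token, which is exactly what prefetching tries to have loaded ahead of time.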
In addition, many proposals report only the performance gain, not the prediction accuracy. Second, predicting from the previous layer is inherently limited for the first layer. Although such methods still predict the experts for the first layer, the actual prediction accuracy for the first and second layers is significantly lower than that of the other layers (Fang et al., 2025). For this reason, AdapMoE (Zhong et al., 2024) tries to mitigate the prediction-accuracy gap of the first few layers by integrating prefetching and cache-management techniques. Third, increasing the complexity of the predictor can bring higher prediction accuracy, but also incurs high computational overhead. The prediction latency has to be as short as possible to leave enough time to fetch the experts; otherwise the prefetching benefits will diminish.

arXiv:2511.10676v1 [cs.CL] 10 Nov 2025

Figure 1. Token generation pipeline in typical MoE architectures (profiled on DeepSeek-V2-Lite on an Nvidia V100 GPU).

In this paper, we propose pre-attention expert prediction to achieve accurate and lightweight expert prefetching.
This content is AI-processed based on ArXiv data.