DALI: A Workload-Aware Offloading Framework for Efficient MoE Inference on Local PCs

Notice: This research summary and analysis were automatically generated using AI. For absolute accuracy, please refer to the original arXiv source.

Mixture of Experts (MoE) architectures significantly enhance the capacity of LLMs without proportional increases in computation, but at the cost of a vast parameter size. Offloading MoE expert parameters to host memory and leveraging both CPU and GPU computation has recently emerged as a promising direction for supporting such models on resource-constrained local PC platforms. While promising, we observe that existing approaches are mismatched with the dynamic nature of expert workloads, which leads to three fundamental inefficiencies: (1) static expert assignment causes severe CPU-GPU load imbalance, underutilizing CPU and GPU resources; (2) existing prefetching techniques fail to accurately predict high-workload experts, leading to costly inaccurate prefetches; and (3) GPU cache policies neglect workload dynamics, resulting in poor hit rates and limited effectiveness. To address these challenges, we propose DALI, a workloaD-Aware offLoadIng framework for efficient MoE inference on local PCs. To fully utilize hardware resources, DALI first dynamically assigns experts to the CPU or GPU by modeling assignment as a 0-1 integer optimization problem and solving it efficiently at runtime with a Greedy Assignment strategy. To improve prefetching accuracy, we develop a Residual-Based Prefetching method that leverages inter-layer residual information to accurately predict high-workload experts. Additionally, we introduce a Workload-Aware Cache Replacement policy that exploits temporal correlation in expert activations to improve GPU cache efficiency. Evaluated across various MoE models and settings, DALI achieves significant speedups in both the prefill and decoding phases over state-of-the-art offloading frameworks.


💡 Research Summary

Mixture‑of‑Experts (MoE) models dramatically increase the capacity of large language models while keeping compute modest, but their parameter count can reach hundreds of billions, far exceeding the memory of a typical consumer‑grade GPU. Recent work therefore offloads expert weights to host memory and executes parts of the computation on both CPU and GPU. Existing hybrid approaches either assign whole layers to a single device (layer‑wise) or statically split experts between CPU and GPU based on a fixed workload threshold (expert‑wise). Both suffer from three fundamental inefficiencies: (1) static expert placement causes severe CPU‑GPU load imbalance as token‑dependent expert workloads vary; (2) prefetching methods cannot reliably predict which high‑workload experts will be needed, leading to costly mis‑prefetches; (3) GPU cache replacement policies ignore the temporal dynamics of expert usage, resulting in low hit rates.
DALI addresses these issues with three coordinated techniques. First, it formulates expert assignment as a 0‑1 integer optimization problem that minimizes total inference latency, accounting for per‑expert execution times on CPU and GPU and PCIe transfer costs. Because exact solving is prohibitive, DALI introduces a greedy heuristic that iteratively assigns the most demanding experts to the device with the lower projected load, rebalancing at each MoE layer. This dynamic assignment adapts to changing batch sizes and token streams, keeping both processors well utilized.
Second, DALI proposes Residual‑Based Prefetching. By exploiting the residual (difference) between the hidden states of consecutive MoE layers, a lightweight MLP predicts which experts will have high workloads in the next layer. The top‑k predicted experts are prefetched over PCIe before they are needed, dramatically improving prefetch accuracy compared with prior statistical or feature‑based methods.
Third, DALI introduces a Workload‑Aware Cache Replacement policy. Observing strong temporal correlation in expert activations, the policy maintains a “hot set” of experts that have been frequently used in the recent N steps and evicts the least recently used members of the “cold set”. It also incorporates the workload predictions from the greedy assignment to proactively keep likely‑to‑be‑used experts in GPU memory. This yields cache hit rates of up to 68 % versus roughly 25 % in previous systems.
Extensive experiments on Mixtral‑8×7B, DeepSeek‑V2‑Lite, and Qwen‑1.5 demonstrate that DALI outperforms state‑of‑the‑art offloading frameworks (llama.cpp, KTransformers, MoE‑Lightning, HybriMoE). In the prefill phase DALI achieves average speedups of 7.62× (range 2‑8×), and in the decoding phase 3.97× (range 1.3‑4×). PCIe transfer time is reduced from up to 78 % of total latency to below 45 %, and GPU cache hit rates improve by more than 2.5×. The system runs on commodity RTX 3090/4090 PCs, showing that large MoE models can be deployed locally without expensive server hardware.
The paper concludes that jointly optimizing device assignment, prefetching, and caching—while respecting the dynamic nature of MoE workloads—enables efficient inference on resource‑constrained PCs. Future work will explore multi‑GPU scaling, support for more complex expert routing (e.g., multi‑top‑k), and integration with system‑level schedulers for edge‑cloud hybrid deployments.

