Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference

Reading time: 6 minutes

📝 Abstract

Mixture-of-Experts (MoE) has become a practical architecture for scaling LLM capacity while keeping per-token compute modest, but deploying MoE models on a single, memory-limited GPU remains difficult because expert weights dominate the HBM footprint. Existing expert offloading and prefetching systems reduce the resident set, yet they often pay expert-loading costs on the critical path when activation becomes dense. Post-training quantization (PTQ) lowers the footprint without transfers, but prevailing pipelines fix expert bit-widths offline and assume routing remains stable, even though MoE expert utilization is heavy-tailed and the hot set can shift across workloads. We present DynaExq, a runtime-aware mixed-precision serving system that treats single-GPU MoE inference under a hard HBM envelope as an online, budget-constrained precision allocation problem. The key insight is to keep the experts that dominate runtime traffic resident at higher precision, while maintaining a low-precision fallback for the remaining experts, so the system can reduce transfer volume and avoid the waiting latency that limits offloading and prefetching under dense activation. DynaExq estimates long-horizon expert hotness from router traces, selects a per-layer high-precision resident set via a budget-feasible top-$n$ rule, and applies promotions and demotions asynchronously through stable expert handles so the forward pass always executes on a fully materialized expert version. Across Qwen3-MoE-30B/80B and six benchmarks, DynaExq improves accuracy over static PTQ on Qwen3-80B (73.09% to 77.57%) under comparable device-memory budgets and achieves up to 2.73x higher throughput than offloading/prefetch baselines at batch size 32.

📄 Content

The rapid expansion of large language models (LLMs) has driven their adoption across a broad range of applications [39]. Beyond cloud deployment in centralized data centers, there is growing interest in running LLMs closer to end users to reduce interactive latency, limit exposure of sensitive data, and avoid dependence on stable network connectivity [11]. These pressures have made on-device and edge inference an increasingly relevant deployment mode, but they also expose a practical constraint: the memory footprint of modern LLMs often exceeds the capacity of a single commodity accelerator, especially on edge platforms.

MoE architectures [41] have emerged as a pragmatic path to scaling model capacity without incurring proportional per-token compute. By routing each token to a small subset of experts, MoE increases representational capacity while keeping the activated parameters per token relatively small. The reduction in activated compute, however, does not translate into a proportional reduction in storage demand at inference time. To preserve unconstrained routing, the system must keep the full set of expert weights accessible at low latency, which shifts the bottleneck from arithmetic throughput to parameter residency. As a result, MoE-based LLMs can remain memory-intensive even when only a small fraction of parameters is used per token. For example, Qwen3-Next-80B activates only about 3B parameters per token, yet storing the full model parameters requires on the order of 160 GB of memory, far beyond what typical edge-class GPUs can provide. This gap makes single-device deployment difficult and motivates systems support that treats expert parameters as the primary constrained resource.
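The storage figure above follows from a back-of-the-envelope calculation: weight footprint scales linearly with parameter count and bit-width. The helper below is purely illustrative:

```python
def moe_memory_gb(total_params_billions: float, bits_per_param: int) -> float:
    """Approximate weight storage (GB) for a model with the given
    parameter count at the given precision."""
    return total_params_billions * 1e9 * bits_per_param / 8 / 1e9

# 80B parameters at bf16 (16 bits/param) -> 160 GB, matching the figure above
print(moe_memory_gb(80, 16))  # 160.0
# The same weights at 4-bit PTQ would need only ~40 GB
print(moe_memory_gb(80, 4))   # 40.0
```

The gap between the 160 GB full-precision footprint and the ~3B activated parameters per token is exactly why parameter residency, not arithmetic throughput, becomes the binding constraint.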

Two classes of techniques are commonly used to relax this memory bottleneck. The first is expert offloading and prefetching [8, 17, 21, 27-29, 35, 36, 40], which treats GPU memory as a cache and moves experts between the GPU and slower tiers such as host memory or SSD. The second is post-training quantization (PTQ) [7, 9, 10, 12, 14, 18, 22, 38], which compresses expert weights into low bit-width representations. Both are effective in specific regimes, but both rely on assumptions that can be violated under realistic serving mixes.

Offloading and prefetching are most effective when each iteration touches a small, stable working set of experts. In practice, MoE activation can become substantially denser during prefill and at larger batch sizes, which expands the per-iteration working set and increases transfer pressure (Table 2). When the working set grows beyond what can be staged over PCIe or NVLink within the available overlap window, transfers become visible as GPU waiting time and amplify tail latency (Figure 1). In such regimes, placement policies face a structural limitation: even with accurate prediction, the system must still move a large volume of expert weights to sustain throughput.
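The structural limitation can be made concrete by comparing the time to stage the per-iteration working set against the overlap window available. The sketch below uses hypothetical numbers; the expert size and effective link bandwidth are assumptions for illustration, not measurements from the paper:

```python
def staging_time_ms(experts_touched: int, expert_size_mb: float,
                    link_gbps: float) -> float:
    """Time (ms) to move the per-iteration expert working set over a
    host-to-GPU link, ignoring launch and synchronization overheads."""
    bytes_moved = experts_touched * expert_size_mb * 1e6
    return bytes_moved / (link_gbps * 1e9) * 1e3

# Hypothetical dense-activation regime: 64 experts touched per iteration,
# 50 MB per (quantized) expert, ~25 GB/s effective PCIe bandwidth
print(staging_time_ms(64, 50, 25))  # ≈ 128 ms of staging per iteration
```

Since a decode iteration often completes in single-digit milliseconds, a staging cost on this order cannot be hidden behind compute no matter how accurate the prefetch predictor is, which is the regime where transfers surface as GPU waiting time.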

PTQ reduces expert footprint without introducing transfer dependencies on the critical path. However, most MoE PTQ pipelines assign a uniform bit-width to all experts or fix a per-expert precision map offline using limited calibration. This design implicitly assumes that the relative importance of experts remains stable at serving time. That assumption is brittle for MoE. Expert utilization is heavy-tailed over long horizons and the identity of frequently used experts can shift across workloads such as general text, math reasoning, and code generation. Under workload shift, a static precision map misallocates scarce high-precision capacity: it either preserves precision for experts that contribute little traffic in the current workload, or over-compresses experts that become hot, degrading quality precisely when the workload is harder.

To address this deployment gap, we propose DynaExq, which treats single-GPU MoE serving under a hard HBM envelope as an online, budget-constrained precision allocation problem. The key idea is to make expert precision a runtime-controlled resource, rather than a fixed offline choice. DynaExq continuously observes router outputs during serving, estimates which experts account for a disproportionate share of traffic over a stable time horizon, and allocates a limited high-precision budget to those experts while keeping the remainder in a lower-precision representation to stay within the device memory cap.
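A minimal sketch of such a runtime policy, decayed traffic counting plus a budget-feasible top-$n$ selection, might look as follows. The function names, decay factor, and per-expert cost model are illustrative assumptions, not DynaExq's actual implementation:

```python
def update_hotness(hotness: dict, routed_expert_ids: list, decay: float = 0.95) -> dict:
    """Exponentially decay per-expert traffic counts, then credit the
    experts the router activated in the current window."""
    for e in hotness:
        hotness[e] *= decay
    for e in routed_expert_ids:
        hotness[e] = hotness.get(e, 0.0) + 1.0
    return hotness

def select_resident_set(hotness: dict, hp_cost_gb: float,
                        lp_cost_gb: float, budget_gb: float) -> set:
    """Pick the hottest experts to keep at high precision under a hard
    memory budget; all remaining experts stay at low precision."""
    n_experts = len(hotness)
    base = n_experts * lp_cost_gb                # every expert at low precision
    extra_per_promotion = hp_cost_gb - lp_cost_gb
    n_promotable = max(0, int((budget_gb - base) // extra_per_promotion))
    ranked = sorted(hotness, key=hotness.get, reverse=True)
    return set(ranked[:n_promotable])

# Toy example: 4 experts, 1.0 GB at high precision, 0.25 GB at low
# precision, 2.6 GB budget -> room to promote the 2 hottest experts
hot = {0: 10.0, 1: 5.0, 2: 1.0, 3: 0.5}
print(select_resident_set(hot, hp_cost_gb=1.0, lp_cost_gb=0.25, budget_gb=2.6))
```

Because the selection subtracts the all-low-precision baseline before counting promotions, any set it returns fits the budget by construction, mirroring the feasibility property the paper describes.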

DynaExq is organized as a lightweight control loop that separates policy decisions from the token critical path. On the policy side, a scheduler aggregates routing traces and periodically computes a budget-feasible per-layer resident set for high-precision experts. This selection is constrained by a fixed HBM budget that accounts for non-expert parameters and runtime allocations such as the KV cache, so the resulting precision plan is feasible by construction. On the mechanism side, DynaExq enforces non-blocking execution by decoupling precision transitions from computation. It maintains stable expert
