EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization


Mixture-of-Experts (MoE) models enable scalable computation and performance in large-scale deep learning but face quantization challenges due to sparse expert activation and dynamic routing. Existing post-training quantization (PTQ) methods fail to address activation outliers, routing instability, and sparse expert calibration, leading to significant performance degradation. To address this, we propose EAQuant, a PTQ framework tailored for MoE architectures. Our method introduces three expert-aware innovations: (1) smoothing aggregation to suppress activation outliers, (2) routing consistency alignment to preserve expert selection post-quantization, and (3) calibration data balance to optimize sparsely activated experts. These strategies collectively enable robust, high-precision quantization of MoE models under ultra-low-bit constraints. Extensive experiments across several extreme quantization settings (e.g., W4A4/W3A4/W3A3/W2A4) demonstrate that EAQuant significantly outperforms existing methods, achieving average accuracy improvements of 1.15% to 13.81% across three diverse MoE architectures, with particularly pronounced gains in reasoning tasks and robust performance retention under aggressive quantization. By integrating these innovations, EAQuant establishes a new state-of-the-art for high-precision, efficient MoE model compression. Our code is available at https://github.com/darren-fzq1/EAQuant.


💡 Research Summary

The paper addresses the unique challenges of post‑training quantization (PTQ) for Mixture‑of‑Experts (MoE) models, which have become a cornerstone for scaling large language models while keeping inference costs sublinear. Three core problems are identified: (1) activation outliers that appear in a small subset of channels across many experts, (2) extreme sensitivity of the routing layer to quantization noise, and (3) severe calibration data imbalance because only a few “core” experts see most tokens. Existing PTQ methods either ignore these issues or treat MoE components as dense models, leading to large accuracy drops under ultra‑low‑bit settings.

EAQuant proposes three expert‑aware solutions. First, an “expert‑aware smoothing aggregation” computes per‑expert smoothing vectors (as in SmoothQuant) but then aggregates them with a max‑operator into a single global smoothing vector s. This vector can be fused into the preceding RMSNorm, eliminating any runtime overhead while guaranteeing that s dominates every per‑expert vector, thus safely scaling outlier channels regardless of which expert processes a token. Second, “routing consistency alignment” introduces a dual‑objective calibration that simultaneously minimizes mean‑squared error between full‑precision and quantized router logits and the KL‑divergence between their softmax probability distributions. By aligning both logits and probabilities, the top‑k expert selection remains stable after quantization, preventing token misrouting and load‑imbalance. Third, “calibration data balance” detects under‑utilized experts during the calibration phase and augments the calibration set (e.g., via paraphrasing or back‑translation) to raise their token count to the expected proportion. This ensures that every expert receives sufficient statistics for reliable scale estimation, without resorting to synthetic data generation that could bias comparisons.
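To make the first mechanism concrete, here is a minimal sketch of SmoothQuant-style per-channel scales aggregated across experts with an element-wise max. The function names and the `alpha` migration-strength parameter are illustrative assumptions, not the paper's actual implementation; the key property shown is that the aggregated vector dominates every per-expert smoothing vector.

```python
import numpy as np

def smooth_scale(act_absmax, weight_absmax, alpha=0.5):
    # SmoothQuant-style per-channel scale: s_j = max|X_j|^alpha / max|W_j|^(1-alpha).
    # Activations are divided by s and weights multiplied by s, shifting
    # quantization difficulty from outlier activation channels to weights.
    return act_absmax ** alpha / weight_absmax ** (1 - alpha)

def aggregate_expert_scales(per_expert_act_max, per_expert_w_max, alpha=0.5):
    # Compute one smoothing vector per expert, then take the element-wise
    # max so a single shared vector dominates every expert's own vector.
    # The shared vector can then be fused into the preceding RMSNorm,
    # adding no runtime overhead regardless of which expert a token hits.
    scales = [smooth_scale(a, w, alpha)
              for a, w in zip(per_expert_act_max, per_expert_w_max)]
    return np.maximum.reduce(scales)
```

Because the max is taken channel-wise, a channel that is an outlier for even one expert is scaled down for all experts, which is safe (if slightly conservative) for the experts where that channel is benign.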
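The routing consistency objective can be sketched as a weighted sum of an MSE term on router logits and a KL term on the softmax routing distributions. The weighting parameter `lam` and the exact reduction are assumptions for illustration; the paper's calibration procedure may differ in detail.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def routing_alignment_loss(fp_logits, q_logits, lam=1.0):
    # Dual objective: match the quantized router's logits to the
    # full-precision ones (MSE), and align the resulting expert
    # probability distributions (KL divergence), so top-k selection
    # stays stable after quantization.
    mse = np.mean((fp_logits - q_logits) ** 2)
    p = softmax(fp_logits)
    q = softmax(q_logits)
    kl = np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1))
    return mse + lam * kl
```

Matching logits alone does not guarantee stable top-k picks when two experts' logits are close; the KL term directly penalizes distribution shifts that would flip the ranking.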
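Finally, the calibration-balance step amounts to counting how many calibration tokens each expert receives and flagging experts that fall below a fraction of the uniform share, so the calibration set can be augmented for them. The 50% threshold and function names below are illustrative assumptions.

```python
import numpy as np

def expert_token_counts(topk_indices, num_experts):
    # topk_indices: integer array of shape (tokens, k) holding the
    # expert ids selected by the router for each calibration token.
    counts = np.zeros(num_experts, dtype=int)
    for idx in topk_indices.ravel():
        counts[idx] += 1
    return counts

def underutilized_experts(counts, threshold=0.5):
    # Flag experts seeing fewer tokens than `threshold` times the
    # uniform per-expert share; these get extra calibration data
    # (e.g., paraphrased or back-translated samples) until their
    # activation statistics are reliable.
    expected = counts.sum() / len(counts)
    return np.where(counts < threshold * expected)[0]
```

In practice the flagged list drives the augmentation loop: new samples are added and routed until no expert remains below the threshold.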

The method is evaluated on three prominent MoE architectures—OLMoE‑7B, DeepSeek‑MoE‑16B, and Mixtral‑8x7B—across aggressive quantization configurations (W4A4, W3A4, W3A3, W2A4). EAQuant consistently outperforms the previous state‑of‑the‑art PTQ approach DuQuant, delivering average accuracy gains ranging from 1.15% to 13.81% depending on the model and bit‑width. Notably, reasoning benchmarks such as ARC‑E see improvements of over 2 percentage points, and perplexity remains within 0.5% of the full‑precision baseline even at W2A4. The routing KL‑divergence drops dramatically, and the frequency of token misrouting is reduced by more than 90%. Calibration‑balanced experts no longer suffer accuracy degradation, confirming the effectiveness of the data‑balancing step.

In summary, EAQuant introduces a coherent framework that jointly tackles activation outliers, routing fragility, and calibration sparsity through expert‑aware mechanisms. By doing so, it enables ultra‑low‑bit quantization of MoE models without sacrificing accuracy or inference stability, paving the way for deploying massive language models on memory‑constrained edge devices.

