Uncertainty Makes It Stable: Curiosity-Driven Quantized Mixture-of-Experts
Deploying deep neural networks on resource-constrained devices faces two critical challenges: maintaining accuracy under aggressive quantization while ensuring predictable inference latency. We present a curiosity-driven quantized Mixture-of-Experts framework that addresses both through Bayesian epistemic uncertainty-based routing across heterogeneous experts (BitNet ternary, 1- to 16-bit BitLinear, post-training quantization). Evaluated on audio classification benchmarks (ESC-50, Quinn, UrbanSound8K), our 4-bit quantization maintains 99.9% of 16-bit accuracy (0.858 vs. 0.859 F1) with 4× compression and 41% energy savings versus 8-bit. Crucially, curiosity-driven routing reduces MoE latency variance by 82% (p=0.008, Levene's test), from a 230 ms to a 29 ms standard deviation, enabling stable inference on battery-constrained devices. Statistical analysis confirms that 4-bit and 8-bit models achieve practical equivalence with full precision (p>0.05), while MoE architectures introduce 11% latency overhead (p<0.001) without accuracy gains. At scale, deployment emissions dominate training by 10,000× for models serving more than 1,000 inferences, making inference efficiency critical. Our information-theoretic routing demonstrates that adaptive quantization yields accurate (0.858 F1, 1.2M params), energy-efficient (3.87 F1/mJ), and predictable edge models, with simple 4-bit quantized architectures outperforming complex MoE for most deployments.
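To make the central mechanism concrete, the sketch below illustrates one plausible form of epistemic uncertainty-based routing across experts of different precisions. This is a minimal, assumption-laden illustration rather than the paper's implementation: it estimates epistemic uncertainty with MC dropout (a BALD-style mutual-information term) and routes each input to the lowest-precision expert whose uncertainty falls below a threshold. The `Expert` class, the threshold value, and the fallback policy are all hypothetical.

```python
# Hypothetical sketch of uncertainty-based routing over quantized experts.
# Assumptions (not from the paper): epistemic uncertainty is estimated with
# MC dropout, and the router prefers the lowest-precision expert whose
# uncertainty falls below a threshold. All names are illustrative.
import torch
import torch.nn as nn


class Expert(nn.Module):
    """Stand-in for a quantized expert (e.g., ternary BitNet or k-bit BitLinear)."""

    def __init__(self, dim: int, num_classes: int, bits: int):
        super().__init__()
        self.bits = bits
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(dim, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def epistemic_uncertainty(expert: Expert, x: torch.Tensor, n_samples: int = 8) -> torch.Tensor:
    """BALD-style epistemic term: entropy of the mean minus mean entropy."""
    expert.train()  # keep dropout active for MC sampling
    with torch.no_grad():
        probs = torch.stack(
            [expert(x).softmax(-1) for _ in range(n_samples)]
        )  # (n_samples, batch, classes)
    mean = probs.mean(0)
    total = -(mean * mean.clamp_min(1e-9).log()).sum(-1)            # predictive entropy
    aleatoric = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean(0)
    return total - aleatoric  # mutual information (epistemic component)


def route(experts: list[Expert], x: torch.Tensor, threshold: float = 0.05) -> Expert:
    """Pick the cheapest (lowest-bit) expert that is confident enough."""
    for expert in sorted(experts, key=lambda e: e.bits):
        if epistemic_uncertainty(expert, x).mean() < threshold:
            return expert
    return max(experts, key=lambda e: e.bits)  # fall back to highest precision


experts = [Expert(64, 50, bits=b) for b in (2, 4, 8, 16)]
x = torch.randn(1, 64)
chosen = route(experts, x)
print(f"routed to {chosen.bits}-bit expert")
```

Thresholded routing of this kind is one natural way to trade precision for confidence, and it suggests why such routing can stabilize latency: the per-input expert choice becomes deterministic given the uncertainty estimate. The paper's actual router, uncertainty estimator, and expert definitions may differ.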