ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling
While Mixture-of-Experts (MoE) architectures substantially bolster the expressive power of large language models, their prohibitive memory footprint severely impedes practical deployment on resource-constrained edge devices, especially when model behavior must be preserved without relying on lossy quantization. In this paper, we present ZipMoE, an efficient and semantically lossless on-device MoE serving system. ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent in MoE parameters via a caching-scheduling co-design with provable performance guarantees. Fundamentally, our design shifts the paradigm of on-device MoE inference from an I/O-bound bottleneck to a compute-centric workflow that enables efficient parallelization. We implement a prototype of ZipMoE and conduct extensive experiments on representative edge computing platforms using popular open-source MoE models and real-world workloads. Our evaluation reveals that ZipMoE achieves up to $72.77\%$ inference latency reduction and up to $6.76\times$ higher throughput compared with state-of-the-art systems.
💡 Research Summary
ZipMoE addresses the pressing challenge of deploying mixture‑of‑experts (MoE) large language models on memory‑constrained edge devices without sacrificing model fidelity. Existing approaches rely on quantization or pruning, which introduce approximation errors that can alter model behavior and even open security vulnerabilities. Moreover, prior MoE serving systems assume a server‑grade architecture where CPUs and GPUs have separate memory pools and offload inactive experts to host memory or SSDs, overlapping GPU inference with PCIe I/O. This assumption breaks down on modern mobile SoCs, which use a unified memory architecture (UMA) where CPU and GPU share the same physical memory and bandwidth, and where SSD read speeds are limited to 1–5 GB/s. Consequently, I/O becomes the dominant bottleneck, especially for batch‑size‑one interactive workloads common on‑device.
The authors first perform an information‑theoretic analysis of BF16 tensors, revealing that the 8‑bit exponent field exhibits extreme skew: only about 12–15 % of possible exponent symbols appear, yielding an entropy of roughly 2.5 bits per exponent. This low entropy suggests high compressibility. Using off‑the‑shelf lossless compressors (ZSTD and LZ4HC), they achieve 68 %–74 % size reduction on MoE expert parameters, approaching the Shannon limit.
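The entropy measurement described above can be reproduced in miniature. The sketch below extracts the 8‑bit exponent field from the BF16 encoding of each weight (BF16 is simply the upper 16 bits of an IEEE‑754 float32) and computes the Shannon entropy of the resulting symbol distribution. The Gaussian "weights" here are a hypothetical stand‑in for real expert parameters; reproducing the paper's ~2.5‑bit figure would require actual model checkpoints.

```python
import math
import random
import struct
from collections import Counter

def bf16_exponent(x: float) -> int:
    """Extract the 8-bit exponent field of x's BF16 representation.

    BF16 is the top 16 bits of the IEEE-754 float32 encoding, so the
    exponent occupies bits 23..30 of the float32 bit pattern.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return (bits >> 23) & 0xFF

def exponent_entropy(weights) -> float:
    """Shannon entropy (bits per symbol) of the exponent distribution."""
    counts = Counter(bf16_exponent(w) for w in weights)
    n = len(weights)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical stand-in for trained expert weights: roughly Gaussian
# values, as typical trained parameters are.
random.seed(0)
fake_weights = [random.gauss(0.0, 0.02) for _ in range(100_000)]
h = exponent_entropy(fake_weights)
print(f"exponent entropy ~= {h:.2f} bits (field width: 8 bits)")
```

Because trained weights cluster tightly around zero, only a narrow band of exponent values ever occurs, which is exactly the skew that makes lossless compression of the exponent field effective.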
ZipMoE’s core pipeline consists of (1) offline bit‑field decomposition, separating each BF16 weight into a high‑entropy sign‑mantissa chunk (SM‑chunk) and a low‑entropy exponent chunk (E‑chunk); (2) sharding the exponent bits into K pieces, compressing each shard; and (3) serializing both compressed exponent shards and uncompressed SM‑chunks together with metadata. At runtime, a pool of CPU worker threads decompresses exponent shards in parallel. Experiments show that with three or more worker threads, decompression latency falls below SSD read latency, allowing the decompression to be fully overlapped with I/O. Memory‑controller contention is modest (≈7 % throughput reduction), confirming that parallel decompression does not starve the SSD pipeline.
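The offline part of this pipeline can be sketched as follows. The code splits a raw little‑endian BF16 tensor into a low‑entropy exponent chunk (E‑chunk) and a high‑entropy sign‑plus‑mantissa chunk (SM‑chunk), then shards and compresses the E‑chunk so that worker threads can decompress shards independently at runtime. This is a minimal illustration, not the paper's implementation: `zlib` (stdlib) stands in for the ZSTD/LZ4HC codecs the authors use, and the serialization metadata is omitted.

```python
import struct
import zlib

def decompose_bf16(raw: bytes):
    """Split a little-endian BF16 tensor into its E-chunk and SM-chunk.

    BF16 layout (MSB to LSB): [sign:1][exponent:8][mantissa:7].
    """
    e_chunk = bytearray()
    sm_chunk = bytearray()
    for i in range(0, len(raw), 2):
        (v,) = struct.unpack_from("<H", raw, i)
        e_chunk.append((v >> 7) & 0xFF)                  # 8 exponent bits
        sm_chunk.append(((v >> 8) & 0x80) | (v & 0x7F))  # sign + 7 mantissa bits
    return bytes(e_chunk), bytes(sm_chunk)

def compress_exponent_shards(e_chunk: bytes, k: int = 4):
    """Shard the E-chunk into k pieces and compress each independently,
    so a pool of CPU workers can decompress shards in parallel."""
    step = (len(e_chunk) + k - 1) // k
    return [zlib.compress(e_chunk[i:i + step], 9)
            for i in range(0, len(e_chunk), step)]

# Demo on synthetic BF16 data: 0x3F80 encodes 1.0, 0xBF80 encodes -1.0.
raw = struct.pack("<4H", 0x3F80, 0xBF80, 0x3F00, 0x3E80)
e, sm = decompose_bf16(raw)
shards = compress_exponent_shards(e, k=2)
restored = b"".join(zlib.decompress(s) for s in shards)
assert restored == e  # lossless round trip
```

Keeping the SM‑chunk uncompressed is deliberate: sign and mantissa bits are close to incompressible, so spending CPU cycles on them would add latency without saving I/O.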
The second major contribution is a compression‑aware hierarchical cache. The system maintains three cache levels: fully cached (both SM‑ and exponent bits), partially cached (only SM‑bits), and uncached (both need to be fetched). A probabilistic model of expert “skewness” predicts the likelihood that a given expert will be activated. Using dynamic programming, ZipMoE allocates the limited on‑device memory budget across the three cache levels to maximize expected cache hit rate while respecting the memory constraint. The cache‑affinity scheduler then decides, for each token, whether the required expert can be served from a full cache hit, a partial hit (requiring on‑the‑fly exponent decompression), or must be fetched entirely from storage. Because exponent decompression is cheap and can be hidden behind the I/O of the SM‑bits, a partial hit incurs essentially zero additional latency compared with a full hit. This design yields up to a 2× increase in tensor coverage for the same memory budget.
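The dynamic-programming allocation can be viewed as a multiple-choice knapsack: each expert is assigned one of three cache levels, each with a memory cost and an expected-hit value. The sketch below makes that concrete under assumed inputs: per-expert activation probabilities `prob`, SM and exponent sizes in common units, and a hypothetical benefit model in which a partial hit is worth nearly as much as a full hit (since exponent decompression overlaps with I/O). None of these constants come from the paper.

```python
def allocate_cache(experts, budget):
    """Multiple-choice knapsack over cache levels (a DP sketch).

    experts: list of (prob, sm_cost, e_cost) tuples; costs and budget
    share one integer unit (e.g. MiB).
    Levels: 0 = uncached, 1 = partial (SM only), 2 = full (SM + exponent).
    Returns (best expected hit value, per-expert level assignment).
    """
    FULL, PART = 1.0, 0.9  # hypothetical per-hit values
    dp = [0.0] * (budget + 1)  # dp[b] = best value using b units so far
    choice = [[0] * (budget + 1) for _ in experts]
    for i, (p, sm, e) in enumerate(experts):
        new = dp[:]
        for b in range(budget + 1):
            options = [(dp[b], 0)]
            if sm <= b:
                options.append((dp[b - sm] + p * PART, 1))
            if sm + e <= b:
                options.append((dp[b - sm - e] + p * FULL, 2))
            new[b], choice[i][b] = max(options)
        dp = new
    # Backtrack to recover each expert's cache level.
    levels, b = [], budget
    for i in range(len(experts) - 1, -1, -1):
        lvl = choice[i][b]
        levels.append(lvl)
        _, sm, e = experts[i]
        b -= sm if lvl == 1 else (sm + e if lvl == 2 else 0)
    return dp[budget], levels[::-1]

# Three experts, hottest first; budget of 5 units.
best, levels = allocate_cache([(0.5, 2, 1), (0.3, 2, 1), (0.2, 2, 1)], budget=5)
```

On this toy input the DP fully caches the hottest expert and partially caches the second, which mirrors the intended behavior: partial caching stretches the same memory budget over more experts at little latency cost.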
Implementation details include a zero‑copy GPU kernel that merges decompressed exponent bits with the SM‑chunks directly in unified memory, eliminating extra copies. The system was evaluated on four representative edge platforms (NVIDIA Jetson AGX Orin, a generic mobile SoC, Raspberry Pi, and an ARM‑based board) using three open‑source MoE models: DeepSeekV2‑Lite, Qwen1.5‑MoE, and Switch‑Transformers‑Large‑128. Compared with the state‑of‑the‑art offloading‑based MoE serving frameworks, ZipMoE achieves an average inference latency reduction of 72.77 % and a throughput increase of 6.76×. The gains are especially pronounced for batch‑size‑one workloads, where traditional pipelining offers little benefit and CPU/GPU resources would otherwise sit idle.
The paper also discusses limitations and future work. Compression currently targets only BF16 exponent bits; extending the approach to other formats (FP32, INT8) would broaden applicability. The cache‑affinity scheduler assumes a relatively static expert activation distribution; integrating token‑level dynamic routing could further improve hit rates. Finally, while lossless compression avoids accuracy loss, the additional CPU cycles for decompression may become a bottleneck on ultra‑low‑power devices, suggesting a need for hardware‑accelerated decompression primitives.
In summary, ZipMoE demonstrates that by exploiting statistical redundancy in MoE parameters and co‑designing lossless compression with a cache‑aware scheduling policy, it is possible to transform MoE inference on edge devices from an I/O‑bound problem into a compute‑centric workflow. This enables secure, high‑performance on‑device LLM services without resorting to lossy model modifications.