ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression



Ruibo Fan, Xiangrui Yu, Xinglin Pan, Zeyu Li, Weile Luo
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
ruibo.fan@connect.hkust-gz.edu.cn, xyu868@connect.hkust-gz.edu.cn, xpan413@connect.hkust-gz.edu.cn, zli755@connect.hkust-gz.edu.cn, wluo976@connect.hkust-gz.edu.cn

Qiang Wang
Harbin Institute of Technology, Shenzhen, Shenzhen, China
qiang.wang@hit.edu.cn

Wei Wang
The Hong Kong University of Science and Technology, Hong Kong SAR
weiwa@cse.ust.hk

Xiaowen Chu
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China, and The Hong Kong University of Science and Technology, Hong Kong SAR
xwchu@ust.hk

Abstract

Lossless model compression holds tremendous promise for alleviating the memory and bandwidth bottlenecks in bit-exact Large Language Model (LLM) serving. However, existing approaches often result in substantial inference slowdowns due to fundamental design mismatches with GPU architectures: at the kernel level, variable-length bitstreams produced by traditional entropy codecs break SIMT parallelism; at the system level, decoupled pipelines lead to redundant memory traffic. We present ZipServ, a lossless compression framework co-designed for efficient LLM inference. ZipServ introduces Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE), a novel fixed-length format that enables constant-time, parallel decoding, together with a fused decompression-GEMM (ZipGEMM) kernel that decompresses weights on-the-fly directly into Tensor Core registers.
This "load-compressed, compute-decompressed" design eliminates intermediate buffers and maximizes compute intensity. Experiments show that ZipServ reduces the model size by up to 30%, achieves up to 2.21× kernel-level speedup over NVIDIA's cuBLAS, and expedites end-to-end inference by an average of 1.22× over vLLM. ZipServ is the first lossless compression system that provides both storage savings and substantial acceleration for LLM inference on GPUs.

This work is licensed under a Creative Commons Attribution 4.0 International License. ASPLOS '26, Pittsburgh, PA, USA. © 2026 Copyright held by the owner/author(s). ACM ISBN 979-8-4007-2359-9/2026/03. https://doi.org/10.1145/3779212.3790250

CCS Concepts: • Computing methodologies → Shared memory algorithms.

Keywords: LLM Inference, Lossless Compression, GEMM, GPU, Tensor Core

ACM Reference Format: Ruibo Fan, Xiangrui Yu, Xinglin Pan, Zeyu Li, Weile Luo, Qiang Wang, Wei Wang, and Xiaowen Chu. 2026. ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '26), March 22–26, 2026, Pittsburgh, PA, USA. ACM, New York, NY, USA, 17 pages. https://doi.org/10.1145/3779212.3790250

1 Introduction

The transformative power of Large Language Models (LLMs) like GPT-4 [54], LLaMA-3 [17], and Qwen-3 [70] is rooted in their massive scale [3, 36], enabling a new paradigm of AI applications [6, 60, 74, 81]. However, this immense scale creates significant deployment challenges, making GPU memory and memory bandwidth the primary bottlenecks for LLM serving, especially in resource-constrained environments.

Model compression offers a promising solution for efficient LLM deployment.
Figure 1. Execution time of lossless compression pipelines on an NVIDIA L40S GPU with GateUp_Proj layers.

Most existing approaches are lossy, reducing size by approximating model weights via quantization (e.g., GPTQ [23], AWQ [43]) or pruning (e.g., SparseGPT [22]). However, such approximations risk accuracy loss. For instance, aggressive 4-bit quantization (e.g., MXFP4) slashes accuracy from 56.0% to 36.2% on LiveCodeBench [44], while even robust int8 quantization (GPTQ-int8) can cause up to 11.1% loss in long-context reasoning (NOCHA) [49]. These risks undermine reliability in safety-critical and user-facing settings, motivating approaches that guarantee bit-exact reproducibility and numerical integrity.

Lossless compression offers a compelling alternative by providing a bit-exact model representation without accuracy loss. To date, its benefits have largely targeted storage and training workflows. For example, LMC [71] and ZipNN [29] employ Huffman coding [31] to compress model checkpoints for efficient storage and distribution, while NeuZip [28] and DietGPU [33] mitigate memory and communication overhead during training. Although recent efforts, notably DFloat11 [85], aim to extend these gains to inference, practical efficiency remains elusive. When integrated into serving pipelines, existing lossless techniques incur significant runtime overhead. As shown in Figure 1, the decoupled decompression step alone takes 1.56–3.44× the time of the core inference computation. This overhead forces an unpleasant tradeoff between memory efficiency and runtime efficiency.

We contend that this tradeoff is not fundamental but arises from a mismatch between conventional compression algorithms and modern GPU architectures. The issue manifests at two levels.
At the kernel level, traditional entropy codecs (e.g., Huffman [31] or ANS [18]) produce variable-length bitstreams whose decoding demands serialized, data-dependent operations. These are ill-suited to the lockstep, parallel SIMT execution model of GPU warps, resulting in severe control-flow divergence and compute underutilization. At the system level, most frameworks employ a decoupled inference pipeline: weights are fully decompressed into a global-memory buffer before kernel consumption. This staged execution results in redundant, high-latency memory accesses, eroding compression-provided bandwidth savings and reducing arithmetic intensity during inference.

To rectify these fundamental algorithm-hardware mismatches, we present ZipServ¹, the first lossless compression framework co-designed for high-performance LLM inference on GPUs. Our key observation is that the exponent bits of BFloat16 weights in contemporary LLMs exhibit a highly skewed, low-entropy distribution. Exploiting this statistical redundancy, we propose Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE), a fixed-length, bitmap-based weight format tailored to GPU architectures. Unlike variable-length entropy codecs, TCA-TBE enables constant-time, parallel decoding using lightweight bitwise operations, thereby eliminating control-flow divergence and aligning with the GPU's SIMT execution model. Paired with TCA-TBE, ZipServ devises a fused decompression-GEMM kernel (ZipGEMM). Rather than decompressing weights into global memory as an intermediate step, ZipGEMM performs on-the-fly decoding, delivering compressed weights directly into the register files that feed Tensor Core matrix multiplication units. This "load-compressed, compute-decompressed" design eliminates intermediate buffers, reduces data movement, and maximizes the overlap between computation and memory access.
By jointly addressing both the kernel-level and system-level mismatches, ZipServ transforms the theoretical storage savings of lossless compression into tangible performance gains on inference-optimized GPUs.

We demonstrate ZipServ's effectiveness through comprehensive benchmarking against state-of-the-art lossless approaches, including DietGPU [33], vendor-optimized nvCOMP [53], and the Huffman-based DFloat11 [85]. Compared to these baselines, which uniformly suffer significant runtime overhead, ZipServ consistently delivers substantial accelerations at both the kernel and system level on various inference-optimized GPUs, including the RTX4090, L40S, and RTX5090. Our fused ZipGEMM achieves speedups of up to 2.21× over NVIDIA's cuBLAS, and up to 5.53× over DFloat11, the fastest lossless compression pipeline. These kernel-level improvements translate into an average 1.22× end-to-end speedup compared to leading systems like vLLM [39]. Our results demonstrate for the first time that, when co-designed with hardware, lossless compression can provide both storage savings and substantial LLM inference acceleration.

The main contributions of this paper are as follows:

• We identify the fundamental mismatch between conventional entropy-based compression and GPU architectures, revealing both kernel- and system-level bottlenecks that hinder efficient inference.

• We propose TCA-TBE, a fixed-length, bitmap-based encoding tailored to SIMT execution and Tensor Core tiling, enabling constant-time, parallel decoding.

• We design ZipGEMM, a novel kernel that performs decompression on-the-fly directly into Tensor Core registers, eliminating intermediate memory buffers and maximizing compute intensity.

¹ Publicly available at https://github.com/HPMLL/ZipServ_ASPLOS26.git
• We present and evaluate ZipServ, a losslessly compressed LLM inference framework that achieves end-to-end speedups across diverse LLMs and GPUs, constituting the first practical evidence that lossless compression can directly accelerate LLM serving.

2 Background

2.1 Transformer-Based LLMs

Transformer-based LLMs [2, 17, 70] are composed of stacked layers of multi-head attention, feed-forward networks (FFNs), and normalization layers. During inference, computation proceeds autoregressively in two phases: prefill and decode. The prefill phase parallelizes computation over the input prompt, resulting in high arithmetic intensity due to large matrix multiplications operated over multiple tokens. In contrast, the decode phase generates tokens one at a time, where matrix multiplications involve only a single token per batch element. The decode phase therefore suffers from reduced compute utilization and greater sensitivity to memory bandwidth. In both phases, the dominant operation is dense matrix multiplication: Y = WX, where W ∈ R^(M×K) is a learned weight matrix and X ∈ R^(K×N) holds activations; M is the output dimension, K is the hidden dimension, and N is the number of tokens.

2.2 BFloat16 Format

BFloat16 (BF16) [35] is a 16-bit floating-point format that has become the de facto precision standard for LLM inference, balancing memory efficiency with numerical robustness. It is natively supported by major hardware accelerators, including NVIDIA Tensor Cores [47], Google TPUs [34], and Intel AMX [37], and is widely adopted in production-scale models, including LLaMA-3 [17], Qwen [70], and Mistral [2]. A BF16 number consists of 1 sign bit, 8 exponent bits, and 7 mantissa bits. Its numerical value is computed as:

BF16(x) = (−1)^sign × 2^(exponent − 127) × (1.mantissa).

This layout preserves the full exponent range of IEEE FP32 (1-8-23) while reducing mantissa precision.
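The field layout and value formula above can be checked with a short Python sketch (the helper names `bf16_fields` and `bf16_value` are ours, not part of ZipServ; normal numbers only):

```python
def bf16_fields(bits: int):
    """Split a 16-bit BF16 pattern into (sign, exponent, mantissa)."""
    sign = (bits >> 15) & 0x1       # 1 sign bit
    exponent = (bits >> 7) & 0xFF   # 8 exponent bits (bias 127)
    mantissa = bits & 0x7F          # 7 mantissa bits
    return sign, exponent, mantissa

def bf16_value(bits: int) -> float:
    """Evaluate BF16(x) = (-1)^sign * 2^(exponent-127) * (1.mantissa)."""
    sign, exponent, mantissa = bf16_fields(bits)
    assert 0 < exponent < 255, "this sketch handles normal numbers only"
    return (-1.0) ** sign * 2.0 ** (exponent - 127) * (1.0 + mantissa / 128.0)
```

For example, the pattern 0x3F80 has sign 0, exponent 127, and a zero mantissa, i.e., the value 1.0.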
Compared to FP16 (1-5-10), BF16 offers a wider dynamic range, reducing vulnerability to overflows and underflows in large models.

2.3 GPU Architecture and Tensor Core Execution

Modern GPUs comprise multiple Streaming Multiprocessors (SMs), each with SIMT cores, Tensor Cores, registers, shared memory, and local caches. Threads are grouped into warps of 32, executing under the Single Instruction, Multiple Threads (SIMT) paradigm. Tensor Cores are specialized processors for high-throughput matrix multiplications. On recent NVIDIA architectures [50, 51], Tensor Cores support BF16 operands through the PTX-level mma.sync.m16n8k16 instruction, which performs fused matrix multiply-accumulate (FMA) operations across small matrix tiles. A typical BF16 Tensor Core operation can be expressed as: D_frag = A_frag × B_frag + C_frag, where A_frag ∈ R^(16×16), B_frag ∈ R^(16×8), and C_frag ∈ R^(16×8) is the FP32 accumulator fragment. This operation is executed at the warp level, where a group of 32 threads collaborates to compute the matrix multiplication. The input and output fragments are distributed across the entire warp. Each thread holds a specific subset of fragment elements in its registers, and the complete fragment is formed collectively.

Figure 2. Exponent bit distribution in LLM weights.

3 Gaps and Opportunities

Lossless compression enables bit-exact model representation but is rarely used for inference due to high runtime overheads stemming from a mismatch between traditional codecs and GPU architectures. This section quantifies the compressibility of LLM weights and identifies the kernel- and system-level bottlenecks that motivate our co-designed solution.

3.1 Compressibility of BF16 Weights

We analyzed the BF16 weights of leading LLMs, including Llama-3-8B-Instruct [17], Mistral-Small-24B-Instruct-2501 [2], and Qwen2.5-32B-Instruct [69], and observed remarkable redundancy in their 8-bit exponent fields.
As shown in Figure 2, the exponent distributions are highly skewed: the top-3 most frequent exponents account for more than 67% of all weights, and the top-7 exponents cover over 95% (e.g., 96.4% in Llama-3 and 97.4% in Mistral-24B). The information entropy of the exponent field is only 2.57–2.74 bits, far below its 8-bit allocation, implying a theoretical lossless compression ratio of about 1.51× (16/10.6) for BF16 values. These findings are consistent with prior works [28, 29, 71, 83, 85]. We further scrutinized this redundancy across 3,875 weight matrices from four LLM families (Gemma-3, Mistral, Qwen2.5, and LLaMA-3.1), revealing a critical structural property: exponent contiguity. In 99.6% of these matrices, the top-7 most frequent exponents form a numerically contiguous sequence (i.e., e★, ..., e★ + 6). Consequently, a simple contiguous window covers 97.1% of all weights on average, approaching the information-theoretic limit. In Appendix A, we prove that this is not coincidental but an intrinsic property of LLMs. This contiguity is the cornerstone of ZipServ. It obviates the need for complex, hardware-unfriendly variable-length codecs (e.g., Huffman) in favor of a fixed-length, base-plus-offset representation. This insight directly enables our Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE) and its implicit lookup mechanism described in §4.3.2.

Figure 3. Existing lossless compression for BF16 weights, illustrated with Huffman encoding.

3.2 Kernel-Level Architectural Mismatch

Existing methods exploit the exponent redundancy of BF16 weights by applying entropy coding to the exponent stream. For example, DFloat11 uses Huffman coding [85], while DietGPU employs Asymmetric Numeral Systems (ANS) [33].
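The sequential nature of such decoding is easy to see in a toy prefix-code decoder (an illustrative codebook, not DFloat11's actual tables): each iteration's bit pointer advances by the decoded symbol's length, which is known only after the lookup, so iterations cannot run in lockstep.

```python
def decode_prefix_stream(bits: str, codebook: dict) -> list:
    """Sequentially decode a variable-length prefix-coded bitstream.

    The bit pointer advances by the decoded symbol's length, which is
    known only after each lookup -- the data dependence that defeats
    lockstep SIMT execution."""
    out, pos = [], 0
    while pos < len(bits):
        for length in sorted({len(c) for c in codebook}):
            code = bits[pos:pos + length]
            if code in codebook:
                out.append(codebook[code])
                pos += length          # data-dependent pointer advance
                break
        else:
            raise ValueError("invalid bitstream")
    return out
```

A GPU thread running this loop cannot know where symbol i + 1 starts until symbol i is fully decoded, which is exactly the dependence the fixed-length format below removes.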
As shown in Figure 3, these approaches produce a compressed bitstream whose symbols have variable lengths depending on their statistical frequency. However, this bitstream must be decompressed sequentially to correctly recover each exponent, which fundamentally conflicts with the lockstep, massively parallel SIMT execution model of modern GPUs. To illustrate this mismatch, we examine the three-stage decompression pipeline in DFloat11 [85]. ❶ Bitstream Partitioning. The bitstream is split into chunks for parallel thread processing. However, because variable-length symbols cross chunk boundaries, threads cannot operate independently but require additional metadata to locate valid symbol start points, introducing overhead and disrupting parallel execution. ❷ Symbol Extraction. Threads use hierarchical lookup tables (LUTs) for symbol decoding, a data-dependent operation. When warp threads encounter different symbol lengths, faster threads stall for slower ones, causing divergence and underutilization of GPU resources. ❸ Pointer Advancement. After symbol decoding, each thread advances its bit pointer by the symbol's length, which is known only after the lookup completes. This inherently serializes the decoding loop and sacrifices opportunities for instruction-level parallelism. Our evaluation shows that on L40S GPUs, even highly optimized decompressors (e.g., ANS-based DietGPU and Huffman-based DFloat11) achieve only 43.7% and 76.5% of peak memory bandwidth, respectively. This inefficiency exposes a fundamental algorithm-hardware mismatch: entropy coding is inherently data-dependent, while efficient GPU execution demands regular, uniform parallelism.

Figure 4. Existing lossless-compression inference pipeline.

3.3 Inefficiency of the Decoupled Inference Pipeline

The architectural inefficiency of entropy-coded decoding appears not only at the kernel level, but also at the system pipeline level of LLM inference.
In mainstream approaches, decompression is performed as a separate, decoupled pre-processing stage (see Figure 4): the entire set of decompressed weights is first materialized in global memory and then passed to the compute kernels. This decoupled pipeline design leads to redundant data transfers, undermining the benefits of compression, particularly in bandwidth-constrained environments. We analytically quantify its inefficiency using the Roofline model, focusing on Compute Intensity (CI).

Compute Intensity. CI measures the number of floating-point operations (FLOPs) performed per byte read from global memory. For a typical BF16 GEMM operation Y_{M×N} = W_{M×K} X_{K×N}, the compute intensity is:

CI_GEMM = 2MNK / (2(MK + KN + MN)) = MNK / (MK + KN + MN).   (1)

In the decoupled pipeline scenario, assuming an average compression ratio (CR) of 1.51 (§3.1), the CI becomes:

CI_Decoupled = 2MNK / (MK(2/CR + 4) + 2(KN + MN)) ≈ MNK / (2.66·MK + KN + MN).   (2)

Roofline Model Analysis. Figure 5 illustrates the Roofline analysis on an NVIDIA RTX4090. During the decode stage, both the standard GEMM and the decoupled pipeline operate in the memory-bound regime, where performance scales linearly with CI. However, our analysis highlights a pronounced penalty for the decoupled approach: the additional memory traffic required to materialize intermediate decompressed weights significantly reduces CI. Specifically, for a weight matrix of size M = K = 4096, the decoupled pipeline exhibits a CI degradation of 62.3%, 62.2%, 62.0%, and 61.7% relative to standard GEMM for batch sizes of 8, 16, 32, and 64, respectively.

Figure 5. Roofline analysis.

ZipServ's Fused Design. The inefficiency of decoupled pipelines arises directly from staging decompressed weights in global memory.
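Equations (1)–(2) and the quoted degradation numbers can be reproduced in a few lines (a sketch, with CR = 1.51 as in §3.1):

```python
def ci_gemm(M, N, K):
    """Eq. (1): 2*M*N*K FLOPs over 2-byte traffic for W, X, and Y."""
    return 2 * M * N * K / (2 * (M * K + K * N + M * N))

def ci_decoupled(M, N, K, cr=1.51):
    """Eq. (2): per weight element, read 2/CR bytes compressed, then
    write and re-read the 2-byte decompressed copy (4 more bytes)."""
    return 2 * M * N * K / (M * K * (2 / cr + 4) + 2 * (K * N + M * N))

# CI degradation for M = K = 4096, batch size N = 8 (Figure 5 setting)
degradation = 1 - ci_decoupled(4096, 8, 4096) / ci_gemm(4096, 8, 4096)
```

Sweeping N over 8, 16, 32, and 64 reproduces the 62.3%–61.7% degradation range quoted above.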
ZipServ addresses this by introducing a fused decompression-GEMM kernel that fetches compressed weights directly from DRAM and decompresses them on-the-fly into register files, which immediately feed the Tensor Cores. This approach effectively increases CI to

CI_ZipServ = 2MNK / (MK·(2/CR) + 2(KN + MN)) ≈ MNK / (0.66·MK + KN + MN).   (3)

Revisiting the Roofline model in Figure 5, ZipServ's fused execution (CI_ZipServ) demonstrates a substantial improvement, achieving an even higher CI (by approximately 50%) than the uncompressed GEMM baseline. This benefit, most pronounced in memory-bound regimes, yields speedups that scale linearly with the compression ratio, translating information-theoretic redundancy into wall-clock acceleration.

4 Design of ZipServ

Our earlier analysis identifies both kernel-level and system-level sources of inefficiency that hinder the adoption of lossless compression in LLM inference. In this section, we present ZipServ, a lossless compression system co-designed for storage efficiency and fast, bit-exact LLM inference.

4.1 Overview and Workflow

As illustrated in Figure 6, ZipServ consists of two main components: an offline compressor, which transforms BF16 model weights into a parallelization-friendly compressed representation, and an online inference engine, responsible for efficient decoding and computation at runtime.

Offline Compressor. At the core of the offline compressor is the Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE), a fixed-length, bitmap-based compression format designed to enable parallel decoding via GPU SIMT execution and Tensor Core-accelerated GEMM operations. As outlined in Algorithm 1, given a model, the compressor first profiles the exponent distribution of each layer's weights. Instead of selecting arbitrary frequent exponents, it identifies a window of k numerically consecutive exponent values (typically k = 7) that maximizes coverage of the weight distribution.
The compressor records the value immediately preceding this range as the BaseExp (i.e., min(range) − 1). Using this range, the compressor encodes the entire weight matrix into the TCA-TBE representation. Each 8×8 tile of weights is converted into three 64-bit bitmaps and two compact value buffers: one for high-frequency values falling within the selected exponent range (storing only the sign and mantissa, with the exponent encoded relative to BaseExp), and another for outliers in full BF16 precision. The resulting compressed model is then loaded onto the GPU, ready for serving.

Online Inference Engine. The inference engine employs a stage-aware strategy that adapts the execution pipeline to the prefill and decode phases, all on the unified TCA-TBE format. During the compute-bound prefill stage, the engine performs decoupled execution: a dedicated decompression kernel first decompresses the weights into global memory, followed by the prefill computation. This approach allows high-throughput GEMM to effectively amortize the decompression overhead. In the memory-bound decode stage, the engine switches to a fused decompression-GEMM kernel (ZipGEMM). ZipGEMM enables a "load-compressed, compute-decompressed" execution model, where weights are decompressed on-the-fly directly into Tensor Core registers. This eliminates redundant data transfers and maximizes compute intensity for each token generation. These two specialized execution paths deliver near-optimal inference performance.

4.2 Tensor-Core-Aware Triple Bitmap Encoding

ZipServ is built on top of a novel Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE) scheme, designed to minimize the weight memory footprint while enabling efficient parallel decoding on GPUs. In contrast to existing variable-length, bitstream-based entropy codecs, TCA-TBE employs a fixed-length, tile-structured representation that ensures constant-time, thread-local decompression.
Its data layout is carefully aligned with Tensor Core tiling and register-level operand distribution, allowing the decompressed weights to be consumed directly by the mma.sync instruction. The core of TCA-TBE is a fixed-length 3-bit codeword assigned to each weight element, representing one of eight possible states (000–111). During offline compression, ZipServ profiles the exponent histogram of a weight matrix, identifies the top-7 most frequent exponent values, and maps them to codewords 001–111. The special codeword 000 serves as a fallback, designating weights whose exponent falls outside the top-7; these are stored in full precision.

Figure 6. Overview of ZipServ. ZipServ features an offline lossless compressor (left) and an online inference engine (right).

Algorithm 1 ZipServ Offline Compressor (TCA-TBE)
Input: Weight matrix W, tile size T = 8×8
Output: Bitmaps B1..B3, high-frequency buffer H, fallback buffer L, BaseExp e_base
1:  ⊲ Phase I: Global Exponent Analysis
2:  Hist ← ComputeExponentHistogram(W)
3:  E_top ← SelectTop7ConsecutiveExponents(Hist)
4:  e_base ← min(E_top) − 1            ⊲ Set base for implicit lookup
5:  ⊲ Phase II: Tile Encoding
6:  for each tile t ∈ W do
7:      Initialize local bitmaps b1, b2, b3 ← 0
8:      for i = 0 to 63 do
9:          w ← t[i]; e ← w.exponent
10:         if e ∈ E_top then
11:             c ← e − e_base          ⊲ Compute 3-bit code c ∈ [1, 7]
12:             b1[i] ← c0; b2[i] ← c1; b3[i] ← c2   ⊲ Set bit-planes
13:             H.Push(Pack(w.sign, w.mantissa))
14:         else
15:             L.Push(w)               ⊲ Store full-precision fallback
16:         end if
17:     end for
18:     Store b1, b2, b3 to global B1..B3
19: end for

The Choice of Codeword Length. We choose a 3-bit codeword because it achieves a near-optimal compression ratio by leveraging the highly skewed exponent distributions observed in contemporary LLMs. To quantify this design choice, we calculate the expected per-element storage cost as:

AverageBits(n) = r_n · (n + 8) + (1 − r_n) · (n + 16),

where n is the codeword length and r_n is the proportion of weights covered by the top 2^n − 1 exponent values. As shown in §3.1, r_3 ≈ 0.96, yielding an average of 11.3 bits per element, which approaches the theoretical lower bound (8 + 2.6 = 10.6 bits) and offers clear advantages over 2-bit (12.4 bits) and 4-bit (12.1 bits) codewords. Moreover, the 3-bit encoding yields a compact 7-entry codebook, enabling decoding via a simple table lookup. This requires only a handful of bitwise operations per thread, which can be performed efficiently within warp-synchronous Tensor Core pipelines.

Figure 7. Tensor-Core-Aware Triple Bitmap Encoding. The 4×4 FragTile shown is illustrative; the actual size is 8×8.

Decoupled Triple Bitmap Layout. To maximize decoding efficiency on SIMT architectures, TCA-TBE adopts a decoupled triple bitmap layout rather than packing codewords into a dense bitstream. Conventional bitstreams are inefficient on GPUs because packing non-byte-aligned codes (e.g., 3-bit) forces codewords to span memory word boundaries. This necessitates complex logic for non-aligned accesses and introduces data-dependent branching, which in turn causes thread divergence that severely degrades SIMT throughput. TCA-TBE avoids these bottlenecks by decomposing the 3-bit codewords of each 8×8 weight tile into three independent 64-bit bitmaps, with each bitmap representing a single bit-plane (Figure 7). This design brings two benefits. First, it guarantees coalesced memory accesses, as each bitmap is a contiguous 64-bit word, naturally aligned to native memory boundaries. Second, it enables branch-free decoding: all threads in a warp follow an identical execution path, aligning with the SIMT model on modern GPUs.
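Both the AverageBits tradeoff and the per-tile bit-plane encoding can be sketched in a few lines of Python. This is a simplification: field packing is symbolic, and the coverage values r_2 ≈ 0.70 and r_4 ≈ 0.99 are illustrative back-of-envelope assumptions consistent with the 12.4- and 12.1-bit figures quoted above; only r_3 ≈ 0.96 is stated in §3.1.

```python
def average_bits(n: int, r_n: float) -> float:
    """Expected bits per element with an n-bit codeword: covered weights
    cost n + 8 bits (code + sign/mantissa), outliers n + 16 bits."""
    return r_n * (n + 8) + (1 - r_n) * (n + 16)

def encode_tile(weights, e_base):
    """Algorithm 1, Phase II, for one 8x8 tile given as 64 triples
    (sign, exponent, mantissa). Returns three 64-bit bit-plane bitmaps,
    the packed high-frequency buffer H, and the fallback buffer L."""
    b1 = b2 = b3 = 0
    H, L = [], []
    for i, (sign, exp, man) in enumerate(weights):
        c = exp - e_base
        if 1 <= c <= 7:                   # exponent inside the window
            b1 |= (c & 1) << i            # bit-plane 0
            b2 |= ((c >> 1) & 1) << i     # bit-plane 1
            b3 |= ((c >> 2) & 1) << i     # bit-plane 2
            H.append((sign << 7) | man)   # 8-bit packed sign+mantissa
        else:
            L.append((sign, exp, man))    # full-precision fallback
    return b1, b2, b3, H, L
```

Note that an outlier leaves all three bit-planes zero at its position, which is exactly the fallback codeword 000.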
Figure 8. Data movement and instruction pipeline.

Hierarchical Tiling Design. TCA-TBE adopts a three-level hierarchical tiling scheme that partitions the weight matrix according to the architectural granularity of modern GPUs. ❶ FragTile (FT): The base unit is an 8×8 tile, matching the smallest operand fragment of a Tensor Core instruction. ❷ TensorCore Tile (TT): Each 16×16 tile is composed of a 2×2 grid of FragTiles. This size aligns with the operand dimensions (m=16, k=16) required by PTX-level Tensor Core mma instructions (mma.m16n8k16). ❸ BlockTile (BT): At the coarsest level, a 64×64 tile aggregates multiple TensorCore Tiles and is processed cooperatively by a thread block. The FragTiles within a TensorCore Tile are stored in column-major order, mirroring the operand register layout (e.g., Ra0–Ra3) expected by Tensor Core instructions. This design eliminates the need for runtime coordinate transformation, reducing instruction overhead.

Each 8×8 FragTile is encoded using five buffers: ❶ three 64-bit bitmaps, each representing one bit-plane of the 3-bit codewords; ❷ a PackedSignMantissa buffer, which holds the compact 8-bit representation (sign and mantissa) of weights whose exponents fall within the top-k frequent classes; and ❸ a FullValue buffer, which stores full-precision BF16 values for weights not covered by the exponent codebook. At the matrix level, TCA-TBE organizes these buffers into four contiguous global arrays, each nested according to the tiling hierarchy. In addition, an Offset array records the starting offset of each GroupTile within the PackedSignMantissa and FullValue arrays.

4.3 Fused ZipGEMM Kernel Design

TCA-TBE's SIMT-friendly design opens up new opportunities for high-throughput decoding.
To achieve this, ZipServ fuses decompression and matrix multiplication into a single kernel, ZipGEMM, which fetches weights from global memory in the compact TCA-TBE format and decompresses them just-in-time during computation. ZipGEMM enables a load-compressed, compute-decompressed execution model, substantially reducing the memory bandwidth required for each token generation in the decode stage (see Figure 5).

4.3.1 Kernel Workflow. Figure 8 illustrates the workflow of the ZipGEMM kernel. Based on a split-K tiling architecture, each thread block iteratively processes the K dimension in chunks. In each iteration, the kernel proceeds through four coordinated stages. ❶ Tile Loading. Threads cooperatively load the compressed weight tile and the corresponding activation tile from global memory into shared memory, using asynchronous and vectorized memory instructions (i.e., LDGSTS.128) to bypass the L1 cache and improve global memory bandwidth utilization. The PackedSignMantissa and FullValue arrays within each tile are padded offline to ensure 128-bit alignment. ❷ Warp-Level Decoding. Each warp independently decompresses the compressed weights from shared memory. The decompressor reconstructs the original BF16 values in a layout compatible with Tensor Core consumption, utilizing lightweight ALU operations and avoiding shared-memory round-trips. ❸ Activation Register Transfer. The activation tile is moved from shared memory into registers using the LDSM.M88 instruction, which enables a warp to load a 16×16 tile and arrange it in the layout required by Tensor Cores. ❹ Tensor Core Computation. Once both decompressed weights and activations reside in registers, the warp performs Tensor Core mma instructions. The execution path closely mirrors standard cuBLAS GEMM kernels, while operating directly on compressed representations and reducing global memory accesses.

4.3.2 Efficient Decompressor.
ZipGEMM incorporates an efficient Decompressor that enables thread-local reconstruction of compressed weights directly within the register file. The core principle of the Decompressor is that each thread independently decompresses the elements required for the proper Tensor Core fragment layout. Specifically, as shown in Figure 7, the fragment layout requires that thread i's .bf16x2 register (e.g., Ra0) holds the values at positions 2i and 2i+1 within the 8×8 tile, denoted a0 and a1 respectively. Since each element is encoded in one of two states (either as a high-frequency fixed-length code or as a fallback full-precision value), and these states are distributed in an unstructured manner, the decompressor solves a sparse, non-uniform spatial reconstruction problem. Two challenges arise in this context. First, each thread must efficiently determine the state of its assigned elements (compressed or fallback). Second, each thread should recover the original BF16 representation in a deterministic, SIMT-friendly manner. To this end, ZipServ's Decompressor is structured into three tightly integrated stages: spatial bitmap indicator, dynamic addressing, and fast exponent reassembly (see Figure 9 and Algorithm 2).

Figure 9. The Decompressor design.

Algorithm 2 ZipGEMM Thread-Local Decompression
Input: Bitmaps B1..B3, buffers H, L, BaseExp e_base, LaneID l
Output: Register pair R containing two BF16 values
1:  ⊲ Step 1: Spatial Indicator Construction
2:  M ← B1 ∨ B2 ∨ B3
3:  ⊲ Step 2: Parallel Element Decompression
4:  for k ∈ {0, 1} do
5:      p ← 2·l + k                       ⊲ Global position in 8×8 tile
6:      mask ← (1 ≪ p) − 1
7:      idx_H ← Popc(M & mask)            ⊲ Calculate index
8:      if (M ≫ p) & 1 then
9:          ⊲ Case A: High-Frequency Path
10:         val ← H[start_H + idx_H]      ⊲ Fetch sign + mantissa
11:         c ← (B3[p] ≪ 2) ∨ (B2[p] ≪ 1) ∨ B1[p]   ⊲ Reconstruct 3-bit code
12:         e ← e_base + c                ⊲ Implicit lookup
13:         w_k ← MakeBF16(val.sign, e, val.mantissa)
14:     else
15:         ⊲ Case B: Fallback Path
16:         idx_L ← p − idx_H             ⊲ Calculate index in fallback
17:         w_k ← L[start_L + idx_L]
18:     end if
19: end for
20: R ← PackRegister(w0, w1)
21: return R

Spatial Bitmap Indicator. Each thread first determines the storage mode of its assigned elements by evaluating a spatial indicator mask. During offline compression, each 8×8 weight tile is encoded using three 64-bit bitmaps, where each bitmap encodes a single bit of the 3-bit codeword. At runtime, the three bitmaps are combined with a warp-level bitwise OR to produce a single 64-bit indicator mask. Each bit in this mask specifies the storage mode of one element: 1 for compressed (high-frequency), 0 for fallback (uncompressed). Each thread determines its decoding path by inspecting the corresponding bits in this spatial indicator mask, which resides in registers. Specifically, for thread i, the bits at positions 2i (for a0) and 2i+1 (for a1) indicate the states of the two assigned elements. For instance, Thread 19 finds that bit 38 (2×19) is set, indicating that its a0 element is stored in compressed form; it fetches the packed value from the high-frequency buffer and proceeds with exponent reassembly. In contrast, Thread 6 sees that bit 12 (2×6) is unset and simply loads its a0 directly from the fallback buffer. This bitwise decision process is lightweight, fully register-resident, and completes in constant time.

Dynamic Addressing.
Once the storage mode is determined, each thread computes its read offset into the appropriate value buffer on-the-fly, without explicit per-element indices. This is achieved via a lightweight, warp-local prefix sum over the spatial indicator. For thread 𝑖, the offset is calculated by counting how many previous elements of the same storage type appear in bits [0, 2𝑖−1] of the spatial indicator. Specifically, if the element is compressed (bit = 1), the offset equals the number of 1s; if uncompressed (bit = 0), it equals the number of 0s in that range. These counts are efficiently computed using GPU-native instructions such as __popc() and __shfl_sync(). For example, Thread 6, encountering an unset bit at position 12, computes its fallback buffer offset by counting the number of 0s in bits [0, 11]. Thread 19, with bit 38 set, counts the number of 1s in bits [0, 37] to access the compressed buffer. This dynamic addressing mechanism transforms indexing into a deterministic, SIMT-friendly operation that aligns naturally with GPU execution patterns.

Fast Exponent Reassembly via Implicit Lookup. To further reduce the decoding overhead, ZipServ reconstructs exponents using an implicit lookup mechanism based on arithmetic remapping, avoiding table-based decoding. During offline compression, the top-7 most frequent exponent values are identified globally and assigned 3-bit codewords (001–111), ordered by increasing numerical value instead of frequency rank. A single global base exponent is recorded as base_exp = min(top_exponents) − 1, which is shared by all tiles. At runtime, each thread reconstructs the original exponent by adding the 3-bit codeword to the base exponent. This operation eliminates shared memory table lookups, requiring only a single integer ALU instruction. The recovered exponent is then fused with the sign and mantissa fields to assemble a valid BF16 value.
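Taken together, the three stages can be sketched in host-side Python. This is a simplified single-element model of Algorithm 2, not ZipServ's actual CUDA code; the buffer layout and sample values below are illustrative:

```python
import struct

def bf16_bits(x: float) -> int:
    # BF16 is the top 16 bits of the IEEE-754 float32 encoding (truncation).
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16

def decode_element(p, b1, b2, b3, H, L, base_exp):
    """Decode position p of an 8x8 tile into BF16 bits (Algorithm 2, one k)."""
    m = b1 | b2 | b3                            # spatial indicator: OR of bitmaps
    idx_h = bin(m & ((1 << p) - 1)).count("1")  # prefix popcount, like __popc()
    if (m >> p) & 1:                            # Case A: high-frequency path
        sign, mant = H[idx_h]                   # packed sign + 7-bit mantissa
        c = (((b3 >> p) & 1) << 2) | (((b2 >> p) & 1) << 1) | ((b1 >> p) & 1)
        e = base_exp + c                        # implicit lookup: arithmetic remap
        return (sign << 15) | (e << 7) | mant
    idx_l = p - idx_h                           # Case B: 0-count prefix = p - 1-count
    return L[idx_l]                             # fallback full BF16 value

# Thread 19's a0 (p = 38): codeword 101 -> bits set in planes B3 and B1.
b1, b2, b3 = 1 << 38, 0, 1 << 38
H = [(0, 0b1000000)]                 # the single compressed element's sign/mantissa
L = [bf16_bits(0.0)] * 64
L[12] = bf16_bits(-2.0)              # Thread 6's a0 (p = 12) takes the fallback path
hi = decode_element(38, b1, b2, b3, H, L, base_exp=115)   # E = 115 + 5 = 120
lo = decode_element(12, b1, b2, b3, H, L, base_exp=115)
assert hi == bf16_bits(1.5 * 2**-7)  # sign 0, E = 120, mantissa 1000000
assert lo == bf16_bits(-2.0)
```

Because the indicator mask and the popcount both stay in registers, the two cases differ only in a short, predictable sequence of integer operations, which is what keeps this decision branch-light on the GPU.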
For example, Thread 19 observes that bit 38 in the spatial indicator is set and reconstructs the 3-bit codeword by reading the corresponding bits from the three bitmap planes, yielding 101 (5). With a global base exponent of 115, it recovers the original exponent as 115 + 5 = 120, then combines it with the sign and mantissa to form the final BF16 value. This arithmetic decoding process is fully SIMT-compatible and exploits the GPU's integer pipelines.

Repacking into Tensor Core Fragments. Each thread repacks the two reconstructed BF16 elements into a single bfloat162 register, matching the operand layout required by Tensor Core mma.sync instructions.

4.3.3 Fine-grained Software Pipeline. ZipGEMM uses a hierarchical two-level pipeline to overlap memory transfer, decompression, and computation, effectively hiding memory and decompression latency. At the coarse level, tile-wise double buffering overlaps global-to-shared memory transfers with computation; at the fine level, slice-wise interleaving overlaps shared-to-register movement and decompression with Tensor Core operations. This is implemented via two shared memory buffers for compressed weights (triple bitmaps, packed sign-mantissa, fallback values) and activations. Within each tile, computation is sliced along the K dimension (typically 16×16 fragments) and processed using an interleaved load-decompress-compute pattern. While Tensor Cores execute matrix multiplication (mma) on slice 𝑖, ALU units concurrently load and decompress weights for slice 𝑖+1 from shared memory into registers. This ensures a steady compute flow by hiding decompression and memory latency behind computation. To coordinate the two pipeline levels, ZipGEMM uses a hierarchical barrier strategy for inter-tile and intra-warp synchronization. Inter-tile synchronization: cp.async.wait_group<0>() and __syncthreads() ensure all asynchronous transfers complete before switching buffers.
This barrier is placed after the final slice decompression but before the final slice mma, allowing computation to proceed while the next tile is being loaded and decompressed, which maximizes overlap and minimizes stalls. Intra-warp coordination: intra-warp operations are implicitly synchronized via the SIMT model, requiring no explicit barriers between load, decompress, and compute at the slice level.

Figure 10. Hierarchical software pipeline design.

4.4 Stage-Aware Inference Strategy
ZipServ uses the fused ZipGEMM kernel exclusively during the decode stage for accelerated token generation. For the compute-bound prefill stage, where large matrix dimensions (𝑁 = 𝐵𝑆 × 𝑆𝑒𝑞_𝑙𝑒𝑛) provide high arithmetic intensity, ZipServ falls back to a decoupled pipeline: an efficient decompression kernel first extracts the compressed weights to global memory, and high-throughput GEMM operations then amortize the decompression overhead (typically < 4%, as shown in § 6.4). In both prefill and decode stages, the decompression kernel and the ZipGEMM kernel share the same compressed format and per-thread decompression logic (§ 4.3.2), obviating the need for runtime format conversions.

5 Implementation
We implemented ZipServ as a high-performance, modular inference backend comprising approximately 3.5K lines of code. The core engine consists of about 2.5K lines of CUDA and C++, which implement the offline TCA-TBE compressor and the online ZipGEMM kernel. The kernel is compiled into a standalone shared library (.so) using nvcc, exposing C++ APIs for weight packing and kernel launching. The remaining 1.0K lines are Python glue code used to integrate ZipServ into vLLM [39]. We extended vLLM's model loader and linear execution modules to support the TCA-TBE format, using PyBind11 to invoke our custom CUDA kernels.
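At the integration layer, the stage-aware switch of §4.4 reduces to a small dispatch rule. The sketch below is a hypothetical Python stand-in: the entry points zip_gemm, decompress, and dense_gemm and the crossover threshold are illustrative assumptions, not ZipServ's actual API:

```python
PREFILL_THRESHOLD_N = 128  # decode-stage batches are typically N = 1..128 (assumption)

def linear_forward(w_compressed, acts, n, *, zip_gemm, decompress, dense_gemm):
    """Route a linear layer to the fused or the decoupled pipeline based on N."""
    if n <= PREFILL_THRESHOLD_N:
        # Decode stage: memory-bound, so decompress inside the GEMM (ZipGEMM).
        return zip_gemm(w_compressed, acts)
    # Prefill stage: compute-bound, so expand the weights once, then run a
    # cuBLAS-class GEMM; the one-off decompression cost is amortized.
    return dense_gemm(decompress(w_compressed), acts)

# Trace the routing decisions with stub kernels.
trace = []
stubs = dict(zip_gemm=lambda w, a: trace.append("fused"),
             decompress=lambda w: trace.append("expand") or w,
             dense_gemm=lambda w, a: trace.append("dense"))
linear_forward("W", "A", 32, **stubs)      # decode-sized N -> fused path
linear_forward("W", "A", 8192, **stubs)    # prefill-sized N -> decoupled path
assert trace == ["fused", "expand", "dense"]
```

Because both paths consume the same TCA-TBE buffers, the dispatcher never needs to convert formats at runtime; only the kernel choice changes.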
6 Evaluation
We evaluate the performance of ZipServ at two levels: the kernel level, covering the fused ZipGEMM and the standalone decompression kernel (ZipServ-Decomp), and the end-to-end inference framework level. All experiments are conducted on two platforms. ❶ A consumer-grade server equipped with 4× NVIDIA RTX4090 GPUs (Ada Lovelace, 24GB memory, Compute Capability 8.9), paired with an Intel Xeon Platinum 8352V CPU (144 cores, 512GB DDR4). ❷ A datacenter platform with 4× NVIDIA L40S GPUs (Ada Lovelace, 48GB), paired with an Intel Xeon Gold 6230R CPU (104 cores, 512GB DDR4). We also evaluate ZipGEMM on the latest ❸ RTX5090 GPU (Blackwell, 32GB, Compute Capability 12.0) to demonstrate forward compatibility. All code is compiled using GCC 11.3 and NVCC 12.4 (with NVCC 12.8 specifically for the RTX5090). For kernel-level evaluation, we perform 100 warm-up iterations followed by 1,000 timed executions. For end-to-end evaluation, each configuration is run 10 times.

Figure 11. Kernel performance comparison on NVIDIA RTX4090 and L40S GPUs.

6.1 ZipGEMM Kernel Performance
Datasets. We benchmark the kernel-level performance on representative linear layers from state-of-the-art LLMs. The input shapes for kernel benchmarking are directly extracted from the real weight matrices of prominent LLM families, including LLaMA3.1 [17] (8B, 70B, and 405B), Qwen2.5 [69] (7B, 14B, 32B, and 72B), Gemma3 [68] (12B and 27B), and Mistral [2] (24B and 123B), covering a broad range of model scales and hidden dimensions.

Baselines.
We compare ZipGEMM against four representative baselines: ❶ cuBLAS_TC v12.4.5 [52], NVIDIA's official BF16 Tensor Core GEMM kernel; ❷ DietGPU [33], a popular open-source, GPU-native rANS codec for lossless decompression of floating-point weights; ❸ nvCOMP (rANS) [53], NVIDIA's general-purpose decompression library based on asymmetric numeral systems; and ❹ DFloat11 [85], a state-of-the-art Huffman-coded GPU decompression framework for LLM inference. Since nvCOMP lacks native BF16 support, we compress the exponent bits as a bitstream via rANS and reconstruct BF16 values with a custom high-performance kernel. For DFloat11, whose compression code is unavailable, we benchmark full Transformer block decompression latency and linearly scale the estimates for other matrix shapes.

Workloads. We profile all linear layers within a Transformer block, including the merged QKV projection (QKV_proj), attention output projection (O_proj), merged FFN gate and up projection (GateUp_proj), and down projection (Down_proj), along with the model's LM head layer. Benchmarks are conducted at batch sizes of 8, 16, and 32.

Results. We begin by evaluating the performance of our fused ZipGEMM kernel. Figure 11 shows the normalized speedup relative to cuBLAS_TC across all evaluated models and workloads. ZipGEMM consistently outperforms all baseline methods on both hardware platforms. On the RTX4090, ZipGEMM achieves an average speedup of 1.31× over cuBLAS_TC, with a peak speedup of 1.71×. The advantage is even greater on the L40S, with an average speedup of 1.36× and a maximum of 2.21×. In contrast, the decoupled decompression methods introduce substantial overhead, resulting in significant slowdowns. Specifically, DietGPU, nvCOMP, and DFloat11 achieve average speedups of only 0.17×/0.20×, 0.19×/0.23×, and 0.28×/0.34× on the RTX4090 and L40S, respectively.
This indicates that the decoupled decompression processes incur overheads that exceed the computation time of the baseline GEMM. ZipGEMM stands out as the only implementation that can significantly surpass the efficient Tensor Core GEMM. These results highlight the effectiveness of ZipGEMM's fused decompression-computation approach, which efficiently transforms storage savings into tangible execution speedup.

We further conducted a layer-wise analysis (Figure 11(c)). ZipGEMM exhibits significant acceleration on most of the computationally intensive layers within a Transformer block. For instance, within the LLaMA3.1 model family on the L40S, ZipGEMM achieves average speedups of 1.39× and 1.64× on the GateUp_proj and Down_proj layers, respectively. However, ZipGEMM may experience a slowdown when processing certain layers with small shapes; for example, on the L40S, its performance on the O_proj layer of LLaMA3.1-8B is reduced to 0.79×. This is primarily because small layers require fine-grained parameter tuning (e.g., split-K configurations and precise tiling) to fully utilize the hardware, which is beyond the scope of this work. Nevertheless, such layers account for only a small fraction of the total FLOPs within a Transformer block. ZipGEMM delivers robust block-level speedups of 1.35× for LLaMA3.1-8B and 1.48× for LLaMA3.1-405B on the L40S.

Figure 12. Micro-level kernel performance analysis.

Figure 13. Standalone decompression kernel comparison.

Micro-level Analysis. We profiled ZipGEMM with Nsight Compute (NCU) on an RTX4090 to identify the source of its speedup (𝑀 = 28672, 𝐾 = 4096, and 𝑁 = 32). As shown in Figure 12, the performance gain stems from a deliberate architectural trade-off: introducing a predictable ALU workload for on-the-fly decoding in exchange for a reduction in memory traffic. Figure 12(a) quantifies this trade-off.
The high volume of integer and logical instructions (LOP3, IADD, and POPC) reflects the computational cost of our core decoding steps. This workload is the price for a 29.3% drop in DRAM reads, a direct validation of the TCA-TBE format's efficiency. Crucially, the two-level software pipeline effectively hides the decoding latency by overlapping it with compute and memory operations. As a result, even with ALU utilization soaring to 66.0%, Tensor Core utilization is maintained at a remarkable 71.6% of the cuBLAS baseline, demonstrating that compute throughput is preserved (Figure 12(b)). This high pipeline efficiency is enabled by our data layout. As seen in Figure 12(c), shared memory bank conflicts are virtually eliminated (∼4.7K) compared to the millions incurred by methods like DietGPU. This conflict-free access is a prerequisite for our fine-grained pipeline, ensuring smooth data flow and maximizing SIMT throughput.

Figure 14. Cross-generation performance comparison.

Figure 15. ZipServ performance under different 𝑁 settings.

6.2 Decompression Kernel Performance
To further dissect decompression efficiency, we benchmark our standalone ZipServ-Decomp kernel. Figure 13 presents the total decompression time for all weights in a full Transformer block of LLaMA3.1-8B and Mistral-24B. ZipServ-Decomp achieves average speedups of 2.14×, 1.83×, and 1.10× over DietGPU, nvCOMP, and DFloat11, respectively. Although the TCA-TBE format was co-designed to support fused execution with matrix multiplication, its structure proves highly efficient for standalone decompression as well. This efficiency stems from its fixed-length, warp-aligned design, which eliminates control divergence and enables warp-synchronous per-thread decoding. In contrast, although existing baselines are explicitly optimized for decompression, they often rely on variable-length, entropy-coded formats.
These lead to thread divergence, serialized bit parsing, and irregular memory accesses that degrade GPU efficiency.

6.3 Performance Across GPU Generations and Tiers
To establish forward compatibility, we benchmark ZipGEMM on the latest NVIDIA RTX5090 and compare it against the top-tier datacenter A100 and H800 using the LLaMA3.1-8B and Mistral-24B GateUp_proj layers at batch size 32. We first directly port ZipGEMM to the Blackwell-based RTX5090 without exploiting new features (e.g., Tensor Memory and asynchronous WMMA execution [32]). As shown in Figure 14, ZipGEMM delivers substantial speedups over cuBLAS_TC on the RTX5090—1.34× for LLaMA3.1-8B and 1.87× for Mistral-24B—confirming the design to be forward-compatible. ZipGEMM also narrows the consumer–datacenter divide: on an RTX4090, ZipGEMM outperforms the standard cuBLAS_TC on an A100 with LLaMA3.1-8B (0.195 ms vs. 0.215 ms, 9.3% faster) and is only 2.7% slower on Mistral-24B (0.530 ms vs. 0.516 ms), effectively placing it in the same performance class. This trend intensifies on newer hardware. While a standard RTX5090 trails the H800 by 53.3% (LLaMA3.1-8B) and 125.7% (Mistral-24B), ZipGEMM reduces these deficits to 14.1% and 20.8%, respectively (Figure 14(b)), approaching datacenter-level performance on consumer GPUs.

6.4 Overhead Analysis
We analyze the system overhead from two perspectives: runtime inference overhead and offline preparation cost. ❶ Runtime Overhead. Figure 15 quantifies the overhead of ZipServ during inference across different 𝑁 settings (𝑁 = 𝐵𝑆 × 𝑆𝑒𝑞_𝑙𝑒𝑛). In the decode stage (small 𝑁, typically 1–128), the fused ZipGEMM kernel incurs no overhead. Instead, it consistently outperforms the cuBLAS_TC baseline in these memory-bound regimes, with on-the-fly decompression fully hidden within the kernel execution.
For the compute-bound prefill stage (large 𝑁, e.g., 8192), where ZipGEMM's on-the-fly decompression overhead outweighs its benefits from reduced memory access, ZipServ switches to a decoupled pipeline. The efficient decompression kernel first expands the compressed weights, followed by cuBLAS_TC GEMMs. This incurs a limited overhead of only ∼4%/2% of the GEMM time at 𝑁 = 8192/16384. ❷ Offline Compression Cost. Beyond runtime performance, we also evaluate the one-time cost of preparing the model. Compressing the LLaMA3.1-8B model takes approximately 2.5 minutes on a 16-core Intel Xeon 8352V CPU. Given that this is an offline operation performed only once prior to deployment, it does not impact the critical path of online serving and is negligible when amortized over the model's lifecycle.

6.5 End-to-end Inference Performance
Setup. We evaluate the end-to-end inference performance of ZipServ on a range of representative models and hardware configurations: LLaMA3.1-8B on one RTX4090 GPU, Mistral-24B on two L40S GPUs, and LLaMA3.1-70B on four L40S GPUs with tensor parallelism. We benchmark using batch sizes of 8 and 32, with varied output sequence lengths of 128, 256, 512, 1024, and 2048 tokens to simulate different serving scenarios. We compare ZipServ against three leading baseline systems: ❶ vLLM [39], a state-of-the-art LLM inference and serving framework; ❷ Transformers [75], a widely adopted standard library; and ❸ DFloat11 [85], representing state-of-the-art performance among lossless compression-based inference frameworks. We measure two key metrics: end-to-end request latency (total time to generate the full output sequence) and throughput (output tokens per second). As shown in Figure 16, ZipServ consistently demonstrates superior performance across all tested configurations.

Results.
Figure 16. End-to-end performance comparison (latency at batch sizes 8 and 32 and throughput versus output length for Llama-3.1-8B, Mistral-24B, and Llama-3.1-70B; systems compared: ZipServ, vLLM, Transformers, DFloat11).

For latency, on average across all models and batch sizes, ZipServ reduces latency by 17.60%, 60.79%, and 82.13% compared to vLLM, Transformers, and DFloat11, respectively. For throughput, ZipServ provides average speedups of 1.22× over vLLM, 3.18× over Transformers, and 8.52× over DFloat11. The performance gains are most pronounced for long-context generation, where the memory-bandwidth savings and computational efficiency of the fused ZipGEMM kernel in the decode phase become dominant. For instance, when generating 2048 output tokens at a batch size of 32 using LLaMA3.1-8B, ZipServ achieves a throughput of 1105 tokens/sec, a 1.66× speedup over vLLM. We also analyzed the memory consumption during inference. For LLaMA3.1-8B, Mistral-24B, and LLaMA3.1-70B, ZipServ reduces the weight footprint from 14.96/43.92/131.56 GB to 10.83 (72.4%)/31.30 (71.3%)/93.52 (71.1%) GB, respectively. The reduction in weight storage further enhances serving efficiency in two key ways. First, it enables the deployment of larger models on resource-constrained hardware.
Second, the freed memory can be allocated to the KV cache, allowing memory managers like vLLM's PagedAttention [39] to support larger batch sizes and longer contexts, thereby converting static weight savings into dynamic throughput gains.

Breakdown Analysis. We further dissect the performance gains by analyzing the latency and memory composition of LLaMA3.1-8B on an RTX4090, as detailed in Figure 17. In the baseline vLLM system (at sequence length 1024), GEMM operations dominate the runtime, consuming 24.99 ms (83.6% of total latency). ZipServ effectively alleviates this bottleneck: the fused ZipGEMM kernel, combined with the residual dense GEMMs, reduces the total linear-layer latency to 14.76 ms, a 1.69× improvement. Since attention (3.02 ms) and other overheads (1.88 ms) remain constant, these kernel-level gains directly drive the end-to-end speedup. On the memory front, ZipServ compresses the static weights from 14.96 GB to 11.18 GB. This 3.78 GB saving is automatically repurposed by the memory manager to expand the KV cache capacity from 5.07 GB to 8.60 GB (a 1.70× increase), thereby enabling the higher throughput and longer context support observed in our end-to-end benchmarks.

Figure 17. Breakdown of end-to-end inference time and memory consumption.

Figure 18. Performance on training-oriented GPUs.

7 Limitation and Discussion
ZipServ is designed for the increasingly important deployment scenario of resource-constrained consumer-grade and inference-optimized GPUs, where limited memory bandwidth makes lossless compression a powerful lever for efficiency. On such platforms, ZipServ consistently delivers substantial acceleration and memory savings. To stress-test performance under more bandwidth-relaxed conditions, we also benchmarked on training-oriented datacenter GPUs (A100, H800), where ZipGEMM may not always match the highly optimized cuBLAS baseline (Figure 18).
This reects a hardware–softwar e mismatch rather than an algorithmic limitation: abundant HBM (HBM2e/HBM3) alleviates the memory b ottlenecks ZipServ is designed to mitigate, while lower core frequencies (e.g., 1410 MHz on A100 vs. 2520 MHz on RTX4090) make the intensive ALU workload harder to hide within the software pipeline. Nevertheless, ZipServ still provides best-in-class support for compressed inference. Our standalone decompression kernel outperforms state-of- the-art by up to 2.64 × , and ZipGEMM remains the fastest fused GEMM kernel. A s shown in § 6.3 , ZipServ also enables consumer-grade GP Us to close much of the gap with elite dat- acenter accelerators, oering a comp elling cost-p erformance proposition for deployment on accessible hardware . While ZipServ targets bit-exact inference, a comparison with lossy techniques is instructive. ZipGEMM was b ench- marked against the Marlin W8A16 FP8 kernel on an RTX4090 GP U, using a representativ e weight shap e ( 28672 × 4096 ) at batch size 32. Although ZipGEMM trails Marlin- W8A16 in latency (0.194 ms vs. 0.143 ms), the resulting 1 . 36 × gap aligns closely with the ratio of ee ctive bit-widths ( ∼ 11 bits vs. FP8). This indicates that our design reduces and hides the overhead of complex lossless decompression within the memory access latency . Furthermore, ZipServ is orthogonal to lossy methods and can b e applied atop quantize d weights to exploit residual r edundancy , combining aggr essive com- pression with enhanced performance [ 26 ]. Three key directions are envisioned for extending ZipServ . First, the TCA - TBE format can b e adapted for lossless K V Cache compression, addressing the dominant memory bot- tleneck in long-context serving [ 45 ]. Second, although cur- rently optimized for NVIDIA architectures, ZipGEMM can be adapted to other matrix accelerators, including Intel AMX [ 37 ] and AMD Matrix Cores [ 61 ]. 
This extensibility is supported by the hardware-agnostic nature of the core design, as the integer arithmetic and population count instructions required for decompression are widely supported across modern instruction sets. Finally, ZipServ is applicable to broader system-level challenges, including efficient model checkpointing [65, 71] and communication compression in distributed training [73, 84].

8 Related Work
Lossy Model Compression. Lossy methods dominate LLM acceleration, mainly via post-training quantization (PTQ) [4, 12, 15, 19, 23, 43, 46, 78, 88] and pruning [11, 14, 22, 67, 80, 86], supported by efficient kernels [21, 24, 55–57, 72, 77]. These approaches risk accuracy degradation [16, 41]. ZipServ instead provides bit-exact, lossless, and efficient inference.

Lossless Model Compression. A large body of work has investigated memory compression to reduce bandwidth or expand capacity via lightweight hardware schemes [7, 8, 20, 38, 58, 59, 87], but these techniques are not tailored to model compression. Efforts such as LMC [71] and ZipNN [29] apply Huffman coding [31] to compress checkpoints for efficient storage and distribution, but offer no runtime benefits. Recent systems, including NeuZip [28], DietGPU [33], nvCOMP [53], and DFloat11 [85], support lossless GPU codecs at runtime to reduce inference memory usage, but incur significant overhead (§ 3). Huff-LLM [83] achieves higher efficiency but targets FPGA-like architectures and does not generalize to GPUs. Ecco [5] designs specialized Huffman codec hardware, but targets lossy compression. Our ZipServ fuses decompression and GEMM computation, turning lossless compression into practical GPU inference acceleration.

Kernel Fusion. Kernel fusion reduces memory traffic by combining operators, as in FlashAttention [9, 10, 62] or graph-level frameworks [48, 76, 79].
ZipServ draws insights from these works and, to our knowledge, is the first to fuse decompression with GEMM, avoiding full-weight materialization.

System-Level Optimizations for LLM Inference. Modern LLM serving is powered by sophisticated inference engines [1, 25, 27, 30, 39, 42, 64, 66, 82, 89, 90], which focus on high-level scheduling strategies and memory orchestration. ZipServ is orthogonal and complementary, and can be integrated as a high-performance backend. This allows these engines to benefit from both a reduced memory footprint and accelerated computation without altering their core logic.

9 Conclusion
We presented ZipServ, a lossless compression framework that, for the first time, delivers significant inference acceleration for Large Language Models. By co-designing a hardware-aware compression format, TCA-TBE, with a fused decompression-GEMM kernel, ZipServ overcomes the architectural bottlenecks that have historically plagued lossless methods on GPUs. Our evaluation demonstrates substantial speedups over highly optimized baselines like cuBLAS, particularly on consumer-grade hardware, where ZipServ narrows the performance gap to expensive datacenter GPUs, establishing a compelling cost-performance proposition. Ultimately, ZipServ reframes lossless compression from a mere storage-saving utility into a practical and powerful tool for high-performance, bit-exact LLM inference.

Acknowledgments
We extend our thanks to the anonymous ASPLOS reviewers and our shepherd, Bo Wu, for their valuable feedback and support. This work was partially supported by the National Natural Science Foundation of China under Grant No. 62272122, the Guangzhou Municipal Joint Funding Project with Universities and Enterprises under Grant No. 2024A03J0616, the Guangzhou Municipality Big Data Intelligence Key Lab (2023A03J0012), Hong Kong CRF grants under Grant No.
C7004-22G and C6015-23G, the NSFC/RGC Collaborative Research Scheme under contract CRS_HKUST601/24, and the National Natural Science Foundation of China under Grant No. 62302126. Wei Wang and Xiaowen Chu are the corresponding authors.

A Theoretical Analysis: Compressibility of LLM BF16 Weights
We present the theoretical foundation showing why exponent distributions in LLM weights are highly skewed and exhibit top-K contiguity. Following recent studies [13, 40, 63], we assume that the weights 𝑤 ∈ ℝ^𝐷 in a single layer (vectorized for analysis) follow a zero-mean normal distribution:

    𝑤 ∼ 𝒩(0, 𝜎²𝐼)

A non-zero, normal BF16 number 𝑣 is represented as

    𝑣 = (−1)^𝑆 × 2^(𝐸−127) × (1.𝑚₁…𝑚₇)₂,

where 𝑆 is the sign bit, 𝐸 is the 8-bit unsigned integer value of the exponent field, and (1.𝑚₁…𝑚₇)₂ is the 7-bit mantissa with an implicit leading 1. The bias for the BF16 exponent is 127. Let 𝑥 = 𝐸 − 127 be the actual exponent value. Any number using this specific exponent 𝐸 has a magnitude in the range [2^𝑥, 2^(𝑥+1)). Our analysis focuses on the probability distribution of this exponent value 𝑥 (or equivalently, 𝐸), given that the weights 𝑤 are drawn from 𝒩(0, 𝜎²). Redundancy arises if this distribution 𝑃(𝑋 = 𝑥) is highly skewed, meaning some exponent values are far more common than others. The probability of a single weight 𝑤ᵢ falling into the magnitude range corresponding to a specific exponent 𝑥 is:

    𝑃(𝑋 = 𝑥) = 𝑃(2^𝑥 ≤ |𝑤ᵢ| < 2^(𝑥+1))

Note that this calculation is an approximation. We are calculating the probability of a value falling into the exponent's ideal magnitude range [2^𝑥, 2^(𝑥+1)), which simplifies the BF16 quantization process by ignoring rounding effects caused by the 7-bit mantissa. However, this serves as a robust approximation for analyzing the overall exponent distribution.
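This bucket probability has a closed form in the Gaussian error function, 𝑃_𝜎(𝑋 = 𝑥) = erf(2^(𝑥+1)/(𝜎√2)) − erf(2^𝑥/(𝜎√2)). A short Python sketch (𝜎 = 0.02 is an arbitrary illustrative scale, not a value measured in the paper) tabulates the distribution and numerically checks the skew, unimodality, and top-7 contiguity that this analysis claims:

```python
import math

def p_exponent(x: int, sigma: float) -> float:
    """P(2^x <= |w| < 2^(x+1)) for w ~ N(0, sigma^2), via the error function."""
    s = sigma * math.sqrt(2.0)
    return math.erf(2.0 ** (x + 1) / s) - math.erf(2.0 ** x / s)

sigma = 0.02                            # illustrative weight scale (assumption)
xs = list(range(-30, 3))                # actual exponent values x = E - 127
ps = [p_exponent(x, sigma) for x in xs]

assert abs(sum(ps) - 1.0) < 1e-6        # these buckets cover nearly all the mass

# Unimodal: probabilities rise to a single peak, then fall (Theorem A.1).
peak = ps.index(max(ps))
assert all(ps[i] <= ps[i + 1] for i in range(peak))
assert all(ps[i] >= ps[i + 1] for i in range(peak, len(ps) - 1))

# The top-7 exponents form a contiguous range (Theorem A.2), which is what
# makes the base_exp + 3-bit-codeword remapping of Section 4 possible.
top7 = sorted(sorted(range(len(xs)), key=ps.__getitem__, reverse=True)[:7])
assert top7 == list(range(top7[0], top7[0] + 7))
```

Because the check only depends on the shape of the Gaussian, the same two properties hold for any 𝜎, which is why a single global base exponent suffices across layers of similar scale.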
Given that 𝑤ᵢ ∼ 𝒩(0, 𝜎²), its probability density function (PDF) is 𝑓(𝑤ᵢ) = (1/√(2𝜋𝜎²)) e^(−𝑤ᵢ²/(2𝜎²)). The probability is the integral of this PDF over the positive and negative ranges:

    𝑃_𝜎(𝑋 = 𝑥) = 2 ∫_{2^𝑥}^{2^(𝑥+1)} (1/√(2𝜋𝜎²)) e^(−𝑡²/(2𝜎²)) d𝑡

This integral can be expressed using the error function, defined as erf(𝑧) = (2/√𝜋) ∫₀^𝑧 e^(−𝑢²) d𝑢:

    𝑃_𝜎(𝑋 = 𝑥) = erf(2^(𝑥+1)/(𝜎√2)) − erf(2^𝑥/(𝜎√2))

Theorem A.1. The function 𝑃(𝑋 = 𝑥) = erf(2^(𝑥+1)/(𝜎√2)) − erf(2^𝑥/(𝜎√2)) is unimodal for 𝑥 ∈ ℤ.

Proof. To prove unimodality, we consider the continuous extension 𝑓(𝑥) = erf(2^(𝑥+1)/(𝜎√2)) − erf(2^𝑥/(𝜎√2)) for 𝑥 ∈ ℝ. If 𝑓(𝑥) is unimodal, then the discrete function 𝑃(𝑋 = 𝑥), which is the evaluation of 𝑓(𝑥) at integer points, is also unimodal. Let 𝑢 = 2^𝑥/(𝜎√2), so that 𝑓(𝑥) = erf(2𝑢) − erf(𝑢). The derivative of the error function is (d/d𝑧) erf(𝑧) = (2/√𝜋) e^(−𝑧²). Thus, the derivative of 𝑓 with respect to 𝑥 is:

    d𝑓/d𝑥 = (2/√𝜋) 𝑢 ln 2 · e^(−𝑢²) (2e^(−3𝑢²) − 1)

Let ℎ(𝑢) = 2e^(−3𝑢²) − 1. Since 2/√𝜋, 𝑢, ln 2, and e^(−𝑢²) are all positive for 𝑢 > 0 (as 2^𝑥 > 0), the sign of d𝑓/d𝑥 is determined solely by ℎ(𝑢). Setting ℎ(𝑢) = 0 gives:

    2e^(−3𝑢²) = 1 ⟹ e^(−3𝑢²) = 1/2 ⟹ −3𝑢² = −ln 2 ⟹ 𝑢² = (ln 2)/3

Thus, the unique critical point is 𝑢₀ = √((ln 2)/3). For 𝑢 < 𝑢₀ we have 3𝑢² < ln 2, so e^(−3𝑢²) > 1/2, meaning ℎ(𝑢) > 0 and d𝑓/d𝑥 > 0, so 𝑓(𝑥) is increasing. For 𝑢 > 𝑢₀ we have 3𝑢² > ln 2, so e^(−3𝑢²) < 1/2, meaning ℎ(𝑢) < 0 and d𝑓/d𝑥 < 0, so 𝑓(𝑥) is decreasing. Therefore, 𝑓(𝑥) has a single maximum at 𝑢₀, proving that it is unimodal. Since 𝑃(𝑋 = 𝑥) is the discrete sampling of 𝑓(𝑥) at integer values, it follows that 𝑃(𝑋 = 𝑥) is also unimodal. □

Theorem A.2 (Contiguity of the Top-K Set in Unimodal Distributions). For a unimodal distribution 𝑃, the set 𝒳_𝐾 of the 𝐾 most probable values forms a contiguous range.
By contradiction: suppose that the set $\mathcal{X}_K$ of the top-$K$ most probable values is not contiguous. Then there exist three integers $x_a < x_c < x_b$ such that $x_a, x_b \in \mathcal{X}_K$ but $x_c \notin \mathcal{X}_K$. By the unimodal property, the probability function $P(x)$ first increases and then decreases, so for any $x_c$ between $x_a$ and $x_b$ we have $P(x_c) \ge \min(P(x_a), P(x_b))$. Since $x_a$ and $x_b$ are in $\mathcal{X}_K$, they are among the $K$ largest probabilities, so $\min(P(x_a), P(x_b))$ is at least as large as the $K$-th largest probability. Therefore, $P(x_c)$ must also be at least as large as the $K$-th largest probability, meaning $x_c$ should be in $\mathcal{X}_K$. This contradicts the assumption that $x_c \notin \mathcal{X}_K$. Hence, the top-$K$ set must be contiguous. □

References
[1] Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (Santa Clara, CA, USA) (OSDI '24). USENIX Association, USA, Article 7, 18 pages.
[2] Mistral AI. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023).
[3] Zeyuan Allen-Zhu and Yuanzhi Li. 2025. Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws. In ICLR. OpenReview.net.
[4] Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. 2024. QuaRot: Outlier-free 4-bit inference in rotated LLMs. arXiv preprint arXiv:2404.00456 (2024).
[5] Feng Cheng, Cong Guo, Chiyue Wei, Junyao Zhang, Changchun Zhou, Edward Hanson, Jiaqi Zhang, Xiaoxiao Liu, Hai Li, and Yiran Chen. 2025. Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression. In Proceedings of the 52nd Annual International Symposium on Computer Architecture. 793–807.
[6] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. 2024. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. In ICML. OpenReview.net.
[7] Esha Choukse, Mattan Erez, and Alaa R. Alameldeen. 2018. Compresso: Pragmatic main memory compression. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 546–558.
[8] Esha Choukse, Michael B. Sullivan, Mike O'Connor, Mattan Erez, Jeff Pool, David Nellans, and Stephen W. Keckler. 2020. Buddy compression: Enabling larger memory for deep learning and HPC workloads on GPUs. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 926–939.
[9] Tri Dao. 2024. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In International Conference on Learning Representations (ICLR).
[10] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems (NeurIPS).
[11] Rocktim Jyoti Das, Liqun Ma, and Zhiqiang Shen. 2023. Beyond size: How gradients shape pruning decisions in large language models. arXiv preprint arXiv:2311.04902 (2023).
[12] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. LLM.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems 35 (2022), 30318–30332.
[13] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. In NeurIPS.
[14] Peijie Dong, Lujun Li, Zhenheng Tang, Xiang Liu, Xinglin Pan, Qiang Wang, and Xiaowen Chu. 2024. Pruner-Zero: Evolving Symbolic Pruning Metric from Scratch for Large Language Models.
In Proceedings of the 41st International Conference on Machine Learning. PMLR. https://arxiv.org/abs/2406.02924 [arXiv:2406.02924].
[15] Peijie Dong, Lujun Li, Yuedong Zhong, Dayou Du, Ruibo Fan, Yuhan Chen, Zhenheng Tang, Qiang Wang, Wei Xue, Yike Guo, et al. 2024. STBLLM: Breaking the 1-bit barrier with structured binary LLMs. arXiv preprint arXiv:2408.01803 (2024).
[16] Peijie Dong, Zhenheng Tang, Xiang Liu, Lujun Li, Xiaowen Chu, and Bo Li. 2025. Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression. arXiv preprint arXiv:2505.19433 (2025).
[17] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024).
[18] Jarek Duda, Khalid Tahboub, Neeraj J. Gadgil, and Edward J. Delp. 2015. The use of asymmetric numeral systems as an accurate replacement for Huffman coding. In 2015 Picture Coding Symposium (PCS). IEEE, 65–69.
[19] Ali Edalati, Alireza Ghaffari, Mahsa Ghazvini Nejad, Lu Hou, Boxing Chen, Masoud Asgharian, and Vahid Partovi Nia. 2025. OAC: Output-adaptive Calibration for Accurate Post-training Quantization. In AAAI. AAAI Press, 16453–16461.
[20] Magnus Ekman and Per Stenstrom. 2005. A robust main-memory compression scheme. In 32nd International Symposium on Computer Architecture (ISCA '05). IEEE, 74–85.
[21] Ruibo Fan, Xiangrui Yu, Peijie Dong, Zeyu Li, Gu Gong, Qiang Wang, Wei Wang, and Xiaowen Chu. 2025. SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs. In EuroSys. ACM, 243–260.
[22] Elias Frantar and Dan Alistarh. 2023. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. In ICML.
[23] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. GPTQ: Accurate post-training quantization for generative pre-trained transformers.
arXiv preprint arXiv:2210.17323 (2022).
[24] Elias Frantar, Roberto L. Castro, Jiale Chen, Torsten Hoefler, and Dan Alistarh. 2025. MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models. In PPoPP. ACM, 239–251.
[25] Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. ServerlessLLM: Low-Latency Serverless Inference for Large Language Models. In OSDI. USENIX Association, 135–153.
[26] Gerasimos Gerogiannis, Stijn Eyerman, Evangelos Georganas, Wim Heirman, and Josep Torrellas. 2025. DECA: A Near-Core LLM Decompression Accelerator Grounded on a 3D Roofline Model. In Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture. 184–200.
[27] Ruihao Gong, Shihao Bai, Siyu Wu, Yunqian Fan, Zaijun Wang, Xiuhong Li, Hailong Yang, and Xianglong Liu. 2025. Past-Future Scheduler for LLM Serving under SLA Guarantees. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. 798–813.
[28] Yongchang Hao, Yanshuai Cao, and Lili Mou. 2024. NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks. CoRR abs/2410.20650 (2024).
[29] Moshik Hershcovitch, Andrew Wood, Leshem Choshen, Guy Girmonsky, Roy Leibovitz, Ilias Ennmouri, Michal Malka, Peter Chin, Swaminathan Sundararaman, and Danny Harnik. 2024. ZipNN: Lossless Compression for AI Models. CoRR abs/2411.05239 (2024).
[30] Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, and Yuxiong He. 2024. DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference. arXiv:2401.08671 [cs.PF] https://arxiv.org/abs/2401.08671
[31] David A. Huffman. 1952.
A method for the construction of minimum-redundancy codes. Proceedings of the IRE 40, 9 (1952), 1098–1101.
[32] Aaron Jarmusch, Nathan Graddon, and Sunita Chandrasekaran. 2025. Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks. arXiv preprint arXiv:2507.10789 (2025).
[33] Jeff Johnson. 2024. DIET-GPU: Efficient Model Inference on GPUs. https://github.com/facebookresearch/dietgpu.
[34] Norm Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, et al. 2023. TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. In Proceedings of the 50th Annual International Symposium on Computer Architecture. 1–14.
[35] Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, et al. 2019. A study of BFLOAT16 for deep learning training. arXiv preprint (2019).
[36] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. CoRR abs/2001.08361 (2020).
[37] Hyungyo Kim, Gaohan Ye, Nachuan Wang, Amir Yazdanbakhsh, and Nam Sung Kim. 2024. Exploiting Intel Advanced Matrix Extensions (AMX) for large language model inference. IEEE Computer Architecture Letters 23, 1 (2024), 117–120.
[38] Jungrae Kim, Michael Sullivan, Esha Choukse, and Mattan Erez. 2016. Bit-plane compression: Transforming data for better compression in many-core architectures. ACM SIGARCH Computer Architecture News 44, 3 (2016), 329–340.
[39] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In SOSP. ACM, 611–626.
[40] Hoil Lee, Fadhel Ayed, Paul Jung, Juho Lee, Hongseok Yang, and Francois Caron. 2023. Deep Neural Networks with Dependent Weights: Gaussian Process Mixture Limit, Heavy Tails, Sparsity and Compressibility. J. Mach. Learn. Res. 24 (2023), 289:1–289:78.
[41] Zhen Li, Yupeng Su, Runming Yang, Zhongwei Xie, Ngai Wong, and Hongxia Yang. 2025. Quantization Meets Reasoning: Exploring LLM Low-Bit Quantization Degradation for Mathematical Reasoning. CoRR abs/2501.03035 (2025).
[42] Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. In OSDI. USENIX Association, 663–679.
[43] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. Proceedings of Machine Learning and Systems 6 (2024), 87–100.
[44] Ruikang Liu, Yuxuan Sun, Manyi Zhang, Haoli Bai, Xianzhi Yu, Tiezheng Yu, Chun Yuan, and Lu Hou. 2025. Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models. CoRR abs/2504.04823 (2025).
[45] Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, et al. 2024. CacheGen: KV cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference. 38–56.
[46] Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. 2024. SpinQuant: LLM quantization with learned rotations. arXiv preprint arXiv:2405.16406 (2024).
[47] Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Qiang Wang, and Xiaowen Chu. 2024.
Benchmarking and dissecting the NVIDIA Hopper GPU architecture. In 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 656–667.
[48] Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. 2020. Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 881–897. https://www.usenix.org/conference/osdi20/presentation/ma
[49] Anmol Mekala, Anirudh Atmakuru, Yixiao Song, Marzena Karpinska, and Mohit Iyyer. 2025. Does quantization affect models' performance on long-context tasks? arXiv preprint arXiv:2505.20276 (2025).
[50] NVIDIA. 2020. NVIDIA Ampere GA102 GPU Architecture Whitepaper. https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf.
[51] NVIDIA. 2023. NVIDIA Ada GPU Architecture Whitepaper. https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf.
[52] NVIDIA. 2024. cuBLAS Docs. https://docs.nvidia.com/cuda/cublas/index.html.
[53] NVIDIA. 2025. nvcomp: Repository for nvCOMP docs and examples. https://github.com/NVIDIA/nvcomp. Accessed: 2025-08-18.
[54] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[55] Gunho Park, Baeseong Park, Minsub Kim, Sungjae Lee, Jeonghoon Kim, Beomseok Kwon, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, and Dongsoo Lee. 2024. LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models. In ICLR. OpenReview.net.
[56] Gunho Park, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, and Dongsoo Lee. 2022. nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models. CoRR abs/2206.09557 (2022).
[57] Tommaso Pegolotti, Elias Frantar, Dan Alistarh, and Markus Püschel. 2023.
QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models. CoRR abs/2307.03738 (2023).
[58] Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry. 2013. Linearly compressed pages: A low-complexity, low-latency main memory compression framework. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. 172–184.
[59] Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry. 2012. Base-delta-immediate compression: Practical data compression for on-chip caches. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques. 377–388.
[60] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. In NeurIPS.
[61] Gabin Schieffer, Daniel Araújo De Medeiros, Jennifer Faj, Aniruddha Marathe, and Ivy Peng. 2024. On the rise of AMD matrix cores: Performance, power efficiency, and programmability. In 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 132–143.
[62] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. 2024. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. In NeurIPS.
[63] Chongjie Si, Jingjing Jiang, and Wei Shen. 2025. Unveiling the Mystery of Weight in Large Foundation Models: Gaussian Distribution Never Fades. CoRR abs/2501.10661 (2025).
[64] Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. 2024. PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles (Austin, TX, USA) (SOSP '24).
Association for Computing Machinery, New York, NY, USA, 590–606. doi:10.1145/3694715.3695964
[65] Foteini Strati, Michal Friedman, and Ana Klimovic. 2025. PCcheck: Persistent Concurrent Checkpointing for ML. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. 811–827.
[66] Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: Dynamic Scheduling for Large Language Model Serving. In OSDI. USENIX Association, 173–191.
[67] Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. 2024. A Simple and Effective Pruning Approach for Large Language Models. In ICLR.
[68] Gemma Team. 2025. Gemma 3 technical report. arXiv preprint arXiv:2503.19786 (2025).
[69] Qwen Team. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115 (2024).
[70] Qwen Team. 2025. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388 (2025).
[71] Daniel Waddington and Cornel Constantinescu. 2025. Lossless Compression for LLM Tensor Incremental Snapshots. arXiv preprint arXiv:2505.09810 (2025).
[72] Lei Wang, Lingxiao Ma, Shijie Cao, Quanlu Zhang, Jilong Xue, Yining Shi, Ningxin Zheng, Ziming Miao, Fan Yang, Ting Cao, et al. 2024. Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 307–323.
[73] Zhuang Wang, Zhaozhuo Xu, Jingyi Xi, Yuke Wang, Anshumali Shrivastava, and T. S. Eugene Ng. 2025. ZEN: Empowering Distributed Training with Sparsity-driven Data Synchronization. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25). 537–556.
[74] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In NeurIPS.
[75] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 38–45.
[76] Mengdi Wu, Xinhao Cheng, Shengyu Liu, Chunan Shi, Jianan Ji, Kit Ao, Praveen Velliengiri, Xupeng Miao, Oded Padon, and Zhihao Jia. 2025. Mirage: A Multi-Level Superoptimizer for Tensor Programs. In 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25). USENIX Association. https://www.usenix.org/conference/osdi25/presentation/wu-mengdi
[77] Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, and Shuaiwen Leon Song. 2023. Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity. Proc. VLDB Endow. 17, 2 (Oct. 2023), 211–224. doi:10.14778/3626292.3626303
[78] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning. PMLR, 38087–38099.
[79] Jiarong Xing, Leyuan Wang, Shang Zhang, Jack Chen, Ang Chen, and Yibo Zhu. 2022. Bolt: Bridging the gap between auto-tuners and hardware-native performance. Proceedings of Machine Learning and Systems 4 (2022), 204–216.
[80] Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, and Ping Luo. 2024. BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation. In ICLR.
[81] Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu. 2025. Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems. In ICLR. OpenReview.net.
[82] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 521–538. https://www.usenix.org/conference/osdi22/presentation/yu
[83] Patrick Yubeaton, Tareq Mahmoud, Shehab Naga, Pooria Taheri, Tianhua Xia, Arun George, Yasmein Khalil, Sai Qian Zhang, Siddharth Joshi, Chinmay Hegde, and Siddharth Garg. 2025. Huff-LLM: End-to-End Lossless Compression for Efficient LLM Inference. arXiv:2502.00922 [cs.LG] https://arxiv.org/abs/2502.00922
[84] Lin Zhang, Longteng Zhang, Shaohuai Shi, Xiaowen Chu, and Bo Li. 2023. Evaluation and optimization of gradient compression for distributed deep learning. In 2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS). IEEE, 361–371.
[85] Tianyi Zhang, Yang Sui, Shaochen Zhong, Vipin Chaudhary, Xia Hu, and Anshumali Shrivastava. 2025. 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float. arXiv preprint arXiv:2504.11651 (2025).
[86] Yingtao Zhang, Haoli Bai, Haokun Lin, Jialin Zhao, Lu Hou, and Carlo Vittorio Cannistraci. 2024. Plug-and-play: An efficient post-training pruning method for large language models. In The Twelfth International Conference on Learning Representations.
[87] Jishen Zhao, Sheng Li, Jichuan Chang, John L. Byrne, Laura L. Ramirez, Kevin Lim, Yuan Xie, and Paolo Faraboschi. 2015. Buri: Scaling big-memory computing with hardware-based memory expansion. ACM Transactions on Architecture and Code Optimization (TACO) 12, 3 (2015), 1–24.
[88] Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. 2024. Atom: Low-bit quantization for efficient and accurate LLM serving. Proceedings of Machine Learning and Systems 6 (2024), 196–209.
[89] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2025. SGLang: Efficient execution of structured language model programs. In Proceedings of the 38th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada) (NIPS '24). Curran Associates Inc., Red Hook, NY, USA, Article 2000, 27 pages.
[90] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 193–210. https://www.usenix.org/conference/osdi24/presentation/zhong-yinmin
