Opt4GPTQ: Co-Optimizing Memory and Computation for 4-bit GPTQ Quantized LLM Inference on Heterogeneous Platforms
The increasing adoption of large language models (LLMs) on heterogeneous computing platforms poses significant challenges to achieving high inference efficiency. To address these efficiency bottlenecks across diverse platforms, this paper proposes Opt4GPTQ, a practical optimization method designed for 4-bit GPTQ quantized LLM inference on heterogeneous AI accelerators. Built upon the vLLM serving system, Opt4GPTQ integrates three platform-level optimization strategies: Shared Memory Buffering Optimization (SMB-Opt), which caches frequently accessed data in shared memory and employs single-threaded writes; Vectorized Memory Loading Optimization (VML-Opt), which utilizes vectorized memory operations for efficient data loading; and Inline Assembly Optimization (ILA-Opt), which directly leverages hardware-native vector half-precision addition and fused multiply-accumulate instructions. Experimental results show that Opt4GPTQ effectively improves performance across various models while maintaining original model accuracy, achieving throughput gains of up to 84.42%. This work highlights the critical role of platform-level engineering in enabling efficient LLM inference on emerging architectures and provides valuable methodologies for future heterogeneous platform adaptation.
💡 Research Summary
The paper introduces Opt4GPTQ, a set of platform‑level optimizations designed to accelerate 4‑bit GPTQ‑quantized large language model (LLM) inference on heterogeneous AI accelerators, with a focus on the HYGON DCU Z100. Built on top of the vLLM serving system, Opt4GPTQ adds three complementary techniques: (1) Shared Memory Buffering Optimization (SMB‑Opt), which replaces thousands of global‑memory atomic adds with a two‑phase reduction that first accumulates partial sums in fast shared memory and then lets a single thread commit the result atomically to global memory; (2) Vectorized Memory Loading Optimization (VML‑Opt), which reinterprets half‑precision data as half2 (32‑bit) vectors, allowing two half elements to be fetched in a single memory transaction and thereby improving load coalescing and reducing instruction count; and (3) Inline Assembly Optimization (ILA‑Opt), which bypasses high‑level compiler code and directly emits the DCU’s native vector half‑precision fused‑multiply‑add (v_mad_f16) and vector addition (v_add_f16) instructions via inline assembly, achieving single‑instruction multiple‑operation (SIMO) execution on half2 data.
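The interplay of the three techniques can be illustrated with a minimal dot-product-style kernel. This is a hedged sketch, not the paper's actual implementation: the kernel name, block size, and data layout are assumptions, and the inline-assembly line is shown only as a comment because the exact operand constraints for the DCU's `v_mad_f16` depend on its ROCm-compatible toolchain.

```cuda
#include <cuda_fp16.h>

// Illustrative kernel (name and shapes are hypothetical, not from the paper).
// Computes a dot product of two half-precision vectors into a float result.
__global__ void opt4gptq_dot_sketch(const half* __restrict__ a,
                                    const half* __restrict__ b,
                                    float* __restrict__ out, int n) {
    __shared__ float smem[256];                     // SMB-Opt: per-block scratch
    int tid = threadIdx.x;
    int i   = (blockIdx.x * blockDim.x + tid) * 2;  // each thread handles 2 halves

    float acc = 0.0f;
    if (i + 1 < n) {
        // VML-Opt: reinterpret half data as half2 so one 32-bit memory
        // transaction fetches two elements, improving load coalescing.
        half2 va = *reinterpret_cast<const half2*>(a + i);
        half2 vb = *reinterpret_cast<const half2*>(b + i);

        // ILA-Opt would replace the intrinsic below with inline assembly
        // emitting the DCU's native v_mad_f16 / v_add_f16 directly, e.g.:
        //   asm volatile("v_mad_f16 %0, %1, %2, %0" : "+v"(acc2) : "v"(va), "v"(vb));
        // (sketch only; real constraints are toolchain-specific)
        half2 prod = __hmul2(va, vb);
        acc = __half2float(__low2half(prod)) + __half2float(__high2half(prod));
    }
    smem[tid] = acc;
    __syncthreads();

    // SMB-Opt phase 1: tree-reduce partial sums entirely in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) smem[tid] += smem[tid + s];
        __syncthreads();
    }
    // SMB-Opt phase 2: a single thread commits one atomic add to global
    // memory, replacing what would otherwise be one atomicAdd per thread.
    if (tid == 0) atomicAdd(out, smem[0]);
}
```

The key design point is that global-memory atomics drop from `blockDim.x` per block to exactly one, while the half2 path halves both the load-instruction count and the number of memory transactions relative to scalar half loads.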
The authors evaluate six GPTQ‑quantized models (Meta‑Llama‑3‑8B, Llama‑2‑7B, CodeLlama‑7B, LLaMa‑13B, Qwen1.5‑4B‑Chat, Qwen1.5‑1.8B‑Chat) using a batch size of 32 prompts on the DCU platform. Through 15 repeated runs per configuration, they report average throughput gains of 5–17 % for each individual optimization and up to 84.42 % when all three are combined (Opt4GPTQ). The largest gains appear on the biggest models (LLaMa‑13B, CodeLlama‑7B) where memory bandwidth and compute intensity are highest; smaller models see more modest improvements.
Accuracy is measured on the ARC benchmark (Challenge and Easy sets). Across all models and optimization variants, the change in accuracy stays within ±1 percentage point, with most fluctuations below 0.7 pp. In several cases, SMB‑Opt or VML‑Opt even yields a slight increase, while ILA‑Opt shows essentially no impact. This demonstrates that the low‑level transformations preserve the numerical stability of the 4‑bit quantized inference pipeline.
The paper situates its contributions within related work on atomic operation synchronization, memory‑access vectorization, and kernel‑level instruction tuning. It argues that while prior research has identified the theoretical benefits of near‑memory atomics, structure‑of‑arrays layouts, and domain‑specific languages (e.g., Triton), practical deployment on non‑CUDA accelerators still suffers from compiler‑generated inefficiencies. Opt4GPTQ bridges this gap by explicitly managing shared‑memory reductions, handcrafted vector loads, and hand‑written assembly that matches the target ISA.
Limitations include the current reliance on CUDA‑compatible code paths (the DCU claims ROCm compatibility but the implementation uses CUDA‑style intrinsics) and the focus on 4‑bit quantization; extending the approach to more aggressive bit‑widths (2‑bit, 3‑bit) or to other heterogeneous back‑ends (NPUs, FPGAs) will require additional engineering.
In conclusion, Opt4GPTQ provides a concrete, reproducible methodology for extracting high performance from heterogeneous accelerators when running 4‑bit GPTQ‑quantized LLMs. By jointly addressing memory‑bandwidth bottlenecks and compute‑pipeline inefficiencies, it achieves up to an 84 % throughput boost without sacrificing model accuracy, offering a valuable blueprint for future AI‑serving systems on emerging hardware.