Understanding Efficiency: Quantization, Batching, and Serving Strategies in LLM Energy Use
Large Language Models (LLMs) are increasingly deployed in production, shifting the bulk of computational and energy demands from training to inference. While prior work has examined the energy cost of inference per prompt or per token, we highlight how system-level design choices, such as numerical precision, batching strategy, and request scheduling, can lead to orders-of-magnitude differences in energy consumption for the same model. We perform a detailed empirical study of LLM inference energy and latency on NVIDIA H100 GPUs, analyzing the impact of quantization, batch size, and serving configuration (e.g., with Hugging Face’s Text Generation Inference server). Our results reveal that lower-precision formats yield energy gains only in compute-bound regimes; that batching improves energy efficiency, especially in memory-bound phases like decoding; and that structured request timing (arrival shaping) can reduce per-request energy by up to 100 times. We argue that sustainable LLM deployment depends not only on model internals but also on the orchestration of the serving stack. Our findings motivate phase-aware energy profiling and system-level optimizations for greener AI services.
💡 Research Summary
This paper presents a comprehensive, phase‑aware investigation of energy consumption and latency for large language model (LLM) inference on NVIDIA H100 GPUs. The authors focus on three system‑level levers that are often overlooked in prior work: numerical precision (float32, bfloat16, float16, int8, int4), batch size, and request‑arrival shaping when using Hugging Face’s Text Generation Inference (TGI) server.
Experimental setup
The study evaluates a representative set of open‑source instruction‑tuned models (Qwen 2.5 0.5B‑14B, Mistral‑7B‑Instruct‑v0.3, LLaMA 3.1‑8B‑Instruct) across five data types. All experiments run on a dedicated H100 SXM (80 GB) with eight AMD EPYC 7R13 cores, and energy is measured with CodeCarbon (NVML for the GPU, pyRAPL for the CPU, a heuristic for RAM). Each request is warmed up and then repeated ten times, and the authors split inference into a “pre‑fill” phase (a single forward pass over the full prompt) and a “decode” phase (autoregressive generation of the output tokens). This decomposition isolates the compute‑bound pre‑fill regime from the memory‑bound decode regime.
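The warm-up / repeat / phase-split protocol can be sketched as a minimal harness. Everything below is illustrative, not the authors' code: `run_prefill` and `run_decode` are stand-ins for real model calls, and energy is derived from an assumed constant average power rather than CodeCarbon's NVML readings.

```python
import time
from statistics import mean

def measure_phase(fn, *, warmup=1, repeats=10, avg_power_w=350.0):
    """Time one inference phase, discarding warm-up runs, and convert
    the mean latency to energy via an assumed average GPU power draw
    (joules = watts * seconds). A real harness would read power via NVML."""
    for _ in range(warmup):          # warm-up: populate caches, compile kernels
        fn()
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    latency = mean(times)
    return latency, latency * avg_power_w  # (seconds, joules)

# Placeholders for the two phases (sleeps stand in for GPU work).
def run_prefill():
    time.sleep(0.002)   # pretend: one forward pass over the full prompt

def run_decode():
    time.sleep(0.010)   # pretend: autoregressive generation of N tokens

prefill_s, prefill_j = measure_phase(run_prefill)
decode_s, decode_j = measure_phase(run_decode)
print(f"pre-fill: {prefill_s*1e3:.1f} ms, {prefill_j:.2f} J")
print(f"decode:   {decode_s*1e3:.1f} ms, {decode_j:.2f} J")
```

Separating the two timers is the whole point: averaging over a full request would blend a compute-bound and a memory-bound regime into one misleading number.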
Precision effects
In the pre‑fill phase, large models (e.g., LLaMA 8B, Qwen 14B) become compute‑bound for typical prompt lengths (~1.2 k tokens). Switching from float32 to lower‑precision formats activates the H100 Tensor Cores, yielding up to a 4× reduction in GPU energy. Small models (≤1.5 B) remain memory‑bound; for them, float16/bfloat16 sometimes increase energy because the specialized kernels add overhead without enough FLOPs to amortize it. Quantized int8/int4 formats introduce on‑the‑fly dequantization kernels. While they can cut memory traffic, the extra kernel launches and irregular memory patterns often offset any theoretical gains, especially when the workload is memory‑bound.
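A back-of-envelope roofline check makes the pre-fill/decode split concrete. The hardware figures below are approximate public H100 SXM numbers (an assumption, not from the paper), and the weights-only traffic model ignores activations and KV-cache reads, so it explains the phase split but not the per-kernel overheads that keep small models memory-bound.

```python
# Rough roofline check: is a phase compute-bound or memory-bound?
# Approximate H100 SXM peaks (assumed): dense FLOP/s per dtype and HBM bandwidth.
PEAK = {"float32": 67e12, "bfloat16": 990e12}   # FLOP/s
HBM_BW = 3.35e12                                 # bytes/s

def phase_intensity(tokens: int, bytes_per_param: float) -> float:
    """Arithmetic intensity in FLOPs per byte of weight traffic, using the
    common ~2 * params * tokens FLOP estimate and assuming the weights are
    streamed from HBM once per forward pass."""
    flops_per_param = 2.0 * tokens
    return flops_per_param / bytes_per_param

def bound(dtype: str, tokens: int, bytes_per_param: float) -> str:
    ridge = PEAK[dtype] / HBM_BW  # intensity where compute and memory balance
    return "compute-bound" if phase_intensity(tokens, bytes_per_param) > ridge else "memory-bound"

print(bound("bfloat16", tokens=1200, bytes_per_param=2))  # pre-fill over ~1.2k prompt
print(bound("bfloat16", tokens=1, bytes_per_param=2))     # decode: one token at a time
```

Pre-fill amortizes each streamed weight over ~1,200 tokens, landing far above the bf16 ridge point; decode touches every weight for a single token and cannot escape the memory wall at any precision.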
In the decode phase, the workload is uniformly memory‑bound regardless of model size. Each generated token re‑uses cached KV states, leading to small, fragmented memory accesses. Consequently, energy per token is almost identical across all precisions, and int8 even consumes 2–3× more energy than float32 because of the extra dequantization steps. The expected 2× (float16) or 4× (int8) speed‑up from reduced word size does not materialize; GPU idle power (~120 W) dominates when kernels are short, so faster compute does not translate into lower total energy.
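The idle-power argument can be written down as a tiny cost model. The 120 W idle figure comes from the summary; the kernel times and dynamic power below are purely illustrative, and the int8 step is modeled as a faster matmul plus extra dequantization kernel time, as the summary describes.

```python
def decode_energy_per_token(kernel_time_s: float,
                            dynamic_power_w: float,
                            idle_power_w: float = 120.0,
                            gap_time_s: float = 0.0) -> float:
    """Energy for one decode step: the GPU draws dynamic + idle power while
    kernels run, and idle power alone during gaps between short kernels."""
    return (dynamic_power_w + idle_power_w) * kernel_time_s + idle_power_w * gap_time_s

# float32 step vs. an int8 step whose matmul is faster but which pays
# extra dequantization kernel time and more launch gaps (made-up numbers).
fp32 = decode_energy_per_token(kernel_time_s=2e-3, dynamic_power_w=200.0,
                               gap_time_s=0.5e-3)
int8 = decode_energy_per_token(kernel_time_s=3.5e-3, dynamic_power_w=200.0,
                               gap_time_s=1e-3)
print(f"fp32: {fp32:.2f} J/token  int8: {int8:.2f} J/token  ratio: {int8/fp32:.2f}x")
```

Because energy is power integrated over wall time, shrinking the word size only helps if it shrinks total wall time; when dequantization kernels and launch gaps stretch the step, the "cheaper" format ends up more expensive.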
Batching effects
Batching improves throughput and energy efficiency by amortizing kernel launch overhead and memory transfers. The authors analyze LLaMA 3.1‑8B (float32) with static batching. When normalizing energy by effective input tokens (excluding padding), pre‑fill energy rises with batch size because padding adds useless work. Normalizing by computed tokens (including padding) shows a flat line for pre‑fill, confirming its compute‑bound nature. In the decode phase, energy per computed token drops sharply up to batch = 4, after which gains plateau; a U‑shaped curve appears when normalizing by effective tokens due to the trade‑off between padding waste and memory‑reuse benefits. Output‑token normalization (where completed sequences are dropped automatically) yields a clear logarithmic decline in energy per token with batch size, highlighting that larger batches produce longer kernels, reduce idle periods, and thus lower average power.
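The difference between the normalizations is easy to miss in prose, so here is a minimal sketch of the first two (the output-token normalization, which also accounts for sequences finishing early, is omitted). The batch energy and sequence lengths are illustrative values, not measurements from the paper.

```python
def energy_per_token(batch_energy_j: float, seq_lens: list[int],
                     normalization: str = "effective") -> float:
    """Normalize a batch's measured energy by a token count. With static
    batching every sequence is padded to the longest one, so 'computed'
    tokens = batch_size * max_len, while 'effective' excludes the padding."""
    if normalization == "effective":
        tokens = sum(seq_lens)                    # real prompt tokens only
    elif normalization == "computed":
        tokens = len(seq_lens) * max(seq_lens)    # padding counted as work done
    else:
        raise ValueError(f"unknown normalization: {normalization}")
    return batch_energy_j / tokens

# A ragged batch: one long prompt forces heavy padding on the other three.
lens = [100, 400, 1200, 300]
e = 50.0  # joules for the whole pre-fill batch (illustrative)
print(energy_per_token(e, lens, "effective"))  # higher: padding is wasted work
print(energy_per_token(e, lens, "computed"))   # lower: padding counted as useful
```

The gap between the two numbers is exactly the padding waste; on a perfectly rectangular batch they coincide, which is why the U-shaped curve only appears under the effective-token view.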
Arrival shaping
The most striking result comes from simulating production traffic with controlled request arrival patterns. By deliberately shaping arrivals—collecting requests over short windows before feeding them to the TGI server—the authors achieve up to a 100× reduction in per‑request energy. This demonstrates that, when latency budgets permit modest queuing, batch quality can be dramatically improved, leading to far lower energy per query without any model or hardware changes.
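Why holding requests briefly helps can be seen with a toy simulation. The cost model (a fixed per-batch energy for kernel launches and weight streaming, plus a small marginal cost per request) and all numbers are assumptions for illustration; the paper's TGI experiments are far more detailed.

```python
import random
from collections import Counter

def per_request_energy(arrivals, window_s, fixed_j=50.0, marginal_j=0.5):
    """Group request arrival times into windows of `window_s` seconds, run
    each window's requests as one batch, and return mean energy per request
    under a toy cost model: fixed energy per batch + marginal per request."""
    batches = Counter(int(t // window_s) for t in arrivals)
    total = sum(fixed_j + marginal_j * n for n in batches.values())
    return total / len(arrivals)

random.seed(0)
arrivals = sorted(random.uniform(0, 60) for _ in range(600))  # ~10 req/s for 1 min

for w in (0.01, 0.1, 1.0, 10.0):
    print(f"window {w:>5}s -> {per_request_energy(arrivals, w):6.2f} J/request")
```

With tiny windows almost every request pays the full fixed cost alone; widening the window amortizes it across the batch, so per-request energy falls by more than an order of magnitude in exchange for bounded queuing delay.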
Implications and recommendations
The paper argues that sustainable LLM deployment cannot rely solely on model‑centric optimizations (e.g., quantization). System‑level decisions—choosing the right precision for each inference phase, selecting an optimal batch size that balances padding overhead against kernel saturation, and implementing intelligent request scheduling—can yield order‑of‑magnitude energy savings. Practical guidance includes:
- Use lower precision (float16/bfloat16/int8) for the pre‑fill of large models; keep decode in float32 or bfloat16.
- Target batch sizes of 2–4 for typical token lengths to maximize decode efficiency while limiting padding waste.
- Deploy arrival‑shaping or micro‑batching in the serving layer when service‑level agreements allow modest latency.
- Continuously profile phase‑wise energy with tools like CodeCarbon to detect idle‑power hotspots and guide kernel‑fusion or kernel‑batching efforts.
Conclusion
Through meticulous measurement and analysis, the authors demonstrate that LLM inference energy is a function of the entire serving stack, not just the neural network. By aligning numerical precision, batching strategy, and request scheduling with the underlying compute or memory characteristics of each inference phase, practitioners can achieve substantial sustainability gains—potentially reducing per‑query energy by two orders of magnitude—without altering the underlying model. Future work should extend these findings to multi‑GPU pipelines, alternative accelerator architectures, and real‑world latency‑energy trade‑offs in production AI services.