Characterizing LLM Inference Energy-Performance Tradeoffs across Workloads and GPU Scaling

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

LLM inference exhibits substantial variability across queries and execution phases, yet inference configurations are often applied uniformly. We present a measurement-driven characterization of workload heterogeneity and energy-performance behavior of LLM inference under GPU dynamic voltage and frequency scaling (DVFS). We evaluate five decoder-only LLMs (1B-32B parameters) across four NLP benchmarks using a controlled offline setup. We show that lightweight semantic features predict inference difficulty better than input length, with 44.5% of queries achieving comparable quality across model sizes. At the hardware level, the decode phase dominates inference time (77-91%) and is largely insensitive to GPU frequency. Consequently, reducing GPU frequency from 2842 MHz to 180 MHz achieves an average of 42% energy savings with only a 1-6% latency increase. We further provide a use case with an upper-bound analysis of the potential benefits of combining workload-aware model selection with phase-aware DVFS, motivating future energy-efficient LLM inference systems.


💡 Research Summary

This paper presents a comprehensive, measurement‑driven study of the energy‑performance trade‑offs inherent in large language model (LLM) inference when the GPU’s dynamic voltage and frequency scaling (DVFS) is varied. The authors evaluate five decoder‑only LLMs ranging from 1 B to 32 B parameters across four widely used NLP benchmarks—BoolQ, HellaSwag, TruthfulQA (generative version), and NarrativeQA—using a controlled offline replay setup on a single NVIDIA RTX PRO 6000 (Blackwell) GPU.

Workload Characterization
Instead of relying on the common proxy of input token length, the authors extract lightweight linguistic and semantic features from each query before inference. Five features are retained after correlation analysis: (1) a composite “Complexity Score” (combining token entropy, unique token ratio, entity density, and average sentence length), (2) “Reasoning Complexity” (density of causal/comparative markers), (3) “Entity Density” (named‑entity‑to‑token ratio), (4) Token Entropy, and (5) the proportion of causal questions. These features explain variance in output quality (accuracy for classification, ROUGE‑L for generation) far better than raw input length. Notably, 44.5% of the 3,817 queries achieve comparable quality across all model sizes, indicating that many queries do not require the largest model.
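To make the feature extraction concrete, here is a minimal sketch of three of the five features (token entropy, unique token ratio, and average sentence length). The tokenizer, the regex-based sentence splitting, and the exact feature definitions are assumptions for illustration; the paper's precise feature pipeline and composite weighting are not reproduced here.

```python
import math
import re


def token_entropy(tokens):
    """Shannon entropy (bits) of the token frequency distribution."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def complexity_features(text):
    """Lightweight pre-inference features computed from the raw query text.

    Simple whitespace/regex tokenization is used here; the paper's exact
    tokenizer and normalization are assumptions.
    """
    tokens = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "token_entropy": token_entropy(tokens) if tokens else 0.0,
        "unique_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,
        "avg_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
    }
```

The appeal of such features is that they are computable in microseconds on the CPU, before any GPU work is scheduled, which is what makes pre-inference model routing feasible.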

Hardware Measurement Methodology
The GPU’s streaming‑multiprocessor (SM) clock is fixed at seven discrete frequencies (180, 487, 960, 1500, 2000, 2505, 2842 MHz) while the memory clock remains at its default. Power is sampled via NVIDIA Management Library (NVML) at 10 ms intervals and integrated to obtain per‑request energy (Joules). Latency is measured with torch.cuda.synchronize() to ensure accurate timing. Experiments are repeated three times for batch sizes of 1, 4, and 8, and mean values are reported.
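The per-request energy computation described above amounts to numerically integrating a stream of (timestamp, power) samples. A minimal sketch of that integration step, using the trapezoidal rule, is shown below; the sample source (NVML polled at ~10 ms) is taken from the paper's setup, but this helper itself is illustrative rather than the authors' code.

```python
def energy_joules(samples):
    """Integrate (timestamp_s, power_w) samples into energy (Joules).

    `samples` is a time-ordered list of (seconds, watts) pairs, e.g. produced
    by polling NVML's power reading at ~10 ms intervals. Trapezoidal
    integration approximates the energy between consecutive samples.
    """
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy
```

Latency measurement needs the same care: because CUDA kernel launches are asynchronous, a `torch.cuda.synchronize()` call before reading the wall clock (as the paper does) ensures the measured interval covers all queued GPU work.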

Phase‑Aware Findings
LLM inference consists of two distinct phases: (i) Prefill, where the entire input is processed in parallel, and (ii) Decode, where tokens are generated autoregressively, repeatedly accessing model weights and an ever‑growing KV‑cache. The authors instrument the inference pipeline to separate these phases. Results show that the decode phase dominates overall execution time (77%–91% depending on model and workload) and consumes the bulk of GPU power. Crucially, decode time is largely insensitive to SM frequency: lowering the SM clock from the maximum 2842 MHz to the minimum 180 MHz increases total latency by only 1%–6% across all configurations. In contrast, the prefill phase scales more predictably with frequency, as expected for a compute‑bound kernel.

Because energy consumption is dominated by the decode phase, the frequency reduction yields substantial energy savings: on average 42 % less energy per request, with the greatest savings observed at batch size 1. The Energy‑Delay Product (EDP) also improves for most configurations, confirming that the modest latency penalty is outweighed by the energy benefit.
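The EDP claim can be sanity-checked with simple arithmetic using the paper's headline numbers: a 42% energy reduction still improves EDP even at the worst-case 6% latency penalty, since 0.58 × 1.06 ≈ 0.61 < 1. The absolute energy and latency values below are hypothetical placeholders; only the percentages come from the paper.

```python
def edp(energy_j, latency_s):
    """Energy-Delay Product: lower is better."""
    return energy_j * latency_s


# Hypothetical request at the maximum 2842 MHz clock.
baseline = edp(1000.0, 10.0)

# Same request at 180 MHz, applying the paper's headline numbers:
# 42% less energy, worst-case 6% more latency.
scaled = edp(1000.0 * (1 - 0.42), 10.0 * (1 + 0.06))

assert scaled < baseline  # EDP improves despite the latency penalty
```

The ratio scaled/baseline ≈ 0.61, i.e. a roughly 39% EDP improvement even in the worst latency case, which is consistent with the paper's observation that EDP improves for most configurations.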

Implications and Upper‑Bound Case Study
Building on these observations, the authors propose a two‑pronged, workload‑aware optimization: (1) Model selection based on the semantic difficulty features (e.g., use a 1B model for low‑complexity queries and a 32B model for high‑complexity ones), and (2) Phase‑aware DVFS, keeping a high SM frequency during prefill and dropping to a low frequency during decode. An upper‑bound analysis shows that, while preserving the same quality metrics, this combined strategy could reduce total inference energy by more than 55% compared with a naïve configuration that uses the largest model and a fixed high frequency for all queries.
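The two-pronged policy reduces to two small decision functions: one routing queries by a difficulty score, one picking the SM clock by phase. The sketch below captures that control logic only; the threshold, the model names, and the assumption that difficulty is a scalar in [0, 1] are all hypothetical, and actually applying the chosen frequency would require a separate clock-locking mechanism (e.g., via NVML), which is not shown.

```python
def choose_model(difficulty, threshold=0.5):
    """Route a query to a model size based on its pre-computed difficulty score.

    `difficulty` is assumed to be a scalar in [0, 1] derived from the semantic
    features; the 0.5 threshold and model names are illustrative placeholders.
    """
    return "llm-1b" if difficulty < threshold else "llm-32b"


def sm_frequency_mhz(phase):
    """Phase-aware DVFS policy from the paper's upper-bound analysis:
    high clock for the compute-bound prefill, minimum clock for the
    frequency-insensitive decode."""
    return 2842 if phase == "prefill" else 180
```

This is an upper bound precisely because it assumes a perfect difficulty oracle and instantaneous, free frequency switching between phases; a real system would pay switching latency and misrouting costs.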

Contributions and Limitations
The paper’s contributions are threefold: (i) demonstrating that lightweight semantic features are better predictors of inference difficulty than raw token counts, (ii) providing the first systematic, phase‑level DVFS characterization for LLM inference across multiple model sizes and workloads, and (iii) quantifying the potential gains of coupling workload‑aware model routing with phase‑aware frequency scaling. Limitations include the single‑GPU, offline experimental setup, the absence of multi‑GPU or distributed serving considerations, and the use of simple linear weighting for difficulty prediction rather than a learned model. Future work could extend the methodology to multi‑node clusters, integrate real‑time batch scheduling, and explore learned difficulty estimators.

In summary, this work convincingly argues that energy‑efficient LLM serving must account for both query‑level heterogeneity and the distinct hardware sensitivities of the prefilling and decoding phases. By doing so, substantial energy reductions can be achieved without sacrificing latency or output quality, laying a solid empirical foundation for next‑generation, green AI inference systems.

