Towards Green AI: Decoding the Energy of LLM Inference in Software Development

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

**Context:** AI-assisted tools are increasingly integrated into software development workflows, but their reliance on large language models (LLMs) introduces substantial computational and energy costs. Understanding and reducing the energy footprint of LLM inference is therefore essential for sustainable software development.

**Objective:** In this study, we conduct a phase-level analysis of LLM inference energy consumption, distinguishing between (1) prefill, where the model processes the input and builds internal representations, and (2) decoding, where output tokens are generated using the stored state.

**Method:** We investigate six 6B-7B and four 3B-4B transformer-based models, evaluating them on two code-centric benchmarks: HumanEval for code generation and LongBench for code understanding.

**Results:** Within both parameter groups, models exhibit distinct energy patterns across phases. Furthermore, increases in prefill cost amplify the energy cost per token during decoding, with amplification ranging from 1.3% to 51.8% depending on the model. Lastly, three of the ten models demonstrate babbling behavior, appending excessive content to the output that unnecessarily inflates energy consumption. We implemented babbling suppression for code generation, achieving energy savings of 44% to 89% without affecting generation accuracy.

**Conclusion:** These findings show that prefill costs influence decoding, which dominates energy consumption, and that babbling suppression can yield up to 89% energy savings. Reducing inference energy therefore requires both mitigating babbling behavior and limiting the impact of prefill on decoding.


💡 Research Summary

This paper investigates the energy consumption of large language model (LLM) inference in software‑development contexts, focusing on the two distinct phases that compose a typical request: (1) the prefill phase, during which the model processes the entire input prompt and builds key‑value caches for attention, and (2) the decoding phase, where output tokens are generated autoregressively using the cached representations. By separating these phases, the authors reveal energy‑efficiency patterns that are invisible when inference is treated as a monolithic operation.

The experimental study evaluates ten decoder‑only transformer models—six in the 6‑7 B parameter range and four in the 3‑4 B range—covering four families (Llama, Phi, Gemma, Qwen). All models are run locally on identical GPU hardware (NVIDIA A100) and measured with a high‑precision power meter. Two software‑development benchmarks are used: HumanEval for function‑level code generation and LongBench for code‑repository understanding. The authors record total GPU energy (J), prefill energy (J), energy per generated token (J/token) for the whole inference, and energy per token specifically during decoding (excluding the prefill cost). Accuracy (HumanEval pass@1) is also reported to ensure that energy‑saving interventions do not degrade functional performance.
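The phase-level metrics described above can be derived from raw GPU power readings by integrating power over time and splitting at the prefill/decoding boundary. The sketch below is illustrative only: the `(timestamp, power)` sampling format, the function names, and the assumption of a known phase boundary are ours, not the paper's actual instrumentation.

```python
# Sketch of phase-level energy accounting from (timestamp_s, power_W) samples.
# The sampling format and phase-boundary input are illustrative assumptions.

def integrate_energy(samples):
    """Trapezoidal integration of (t, P) samples -> energy in joules."""
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy

def phase_metrics(samples, prefill_end_t, n_output_tokens):
    """Split one inference run into prefill and decoding energy metrics."""
    # The boundary sample appears in both phases so the integrals cover the
    # full run without a gap.
    prefill = [(t, p) for (t, p) in samples if t <= prefill_end_t]
    decode = [(t, p) for (t, p) in samples if t >= prefill_end_t]
    e_prefill = integrate_energy(prefill)
    e_decode = integrate_energy(decode)
    e_total = e_prefill + e_decode
    return {
        "total_J": e_total,
        "prefill_J": e_prefill,
        "J_per_token_total": e_total / n_output_tokens,     # whole inference
        "J_per_token_decode": e_decode / n_output_tokens,   # decoding only
    }
```

Separating `J_per_token_total` from `J_per_token_decode` is what makes the prefill carry-over effect visible: the former folds prefill cost into every token, while the latter isolates the marginal cost of generation.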

Key findings are as follows:

  1. Phase‑level energy heterogeneity – Even models with comparable parameter counts exhibit markedly different prefill‑to‑decoding energy ratios. This suggests that low‑level implementation details (operator scheduling, memory layout, kernel fusion, precision choices) have a substantial impact on energy efficiency.

  2. Prefill‑induced amplification – Models that incur higher prefill energy also show a larger increase in per‑token decoding energy. The amplification factor ranges from 1.3 % to 51.8 % across the ten models. The authors interpret this as a “carry‑over” effect: a more expensive cache construction leads to higher memory‑access pressure during the subsequent sequential token generation, inflating the marginal cost of each token.

  3. Workload dependence – Input length primarily drives prefill energy, while output length dominates decoding energy. For code‑generation tasks (short inputs, potentially long outputs), the decoding phase accounts for the majority of total energy, making it the prime target for optimization. Conversely, code‑understanding tasks (long inputs, short outputs) are more sensitive to prefill efficiency.

  4. Babbling behavior – Three models generate excessive, unnecessary tokens after the logical end of a solution, a phenomenon the authors label “babbling.” This behavior inflates the number of decoded tokens and thus the total energy consumption. By applying a simple babbling‑suppression pipeline—enforcing a maximum token limit, strengthening early‑stop detection, and pruning repeated or irrelevant code—the authors achieve energy reductions of 44 % to 89 % on the affected models without any measurable drop in generation accuracy.

  5. Generalizability across families – The observed phase‑level patterns hold across the four model families, indicating that the findings are not confined to a single architecture but reflect broader implementation and workload characteristics.
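The babbling-suppression pipeline is described only at a high level (a token limit, stronger early-stop detection, and pruning of repeated or irrelevant code). The sketch below shows just the early-stop component for Python code generation; the specific heuristic, truncating after the first complete top-level `def` block, is our illustrative assumption, not the paper's exact implementation.

```python
# Minimal sketch of one babbling-suppression heuristic for generated Python:
# keep only the first complete top-level function and drop trailing content.
# This heuristic is an illustrative assumption, not the paper's pipeline.

def truncate_at_function_end(code: str) -> str:
    """Keep lines up to the end of the first top-level `def` block."""
    kept, in_func = [], False
    for line in code.splitlines():
        if not in_func:
            kept.append(line)
            if line.strip().startswith("def "):
                in_func = True
        else:
            # A non-empty line at column 0 marks the end of the function body;
            # everything after it is treated as babble and discarded.
            if line and not line[0].isspace():
                break
            kept.append(line)
    return "\n".join(kept).rstrip()
```

Because every discarded token is a token the model no longer needs to decode, suppression of this kind directly reduces decoding energy, which the paper identifies as the dominant phase for code-generation workloads.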

The paper concludes that sustainable AI for software development must address both phases of inference. Strategies to lower prefill cost (e.g., input compression, KV‑cache reuse, multi‑query caching) can indirectly reduce decoding energy, while direct decoding optimizations (memory‑bandwidth‑aware kernels, mixed‑precision arithmetic, babbling suppression) yield the most immediate savings for code‑generation workloads. The authors release their replication package, encouraging further exploration of larger models, alternative hardware (e.g., CPUs, ASICs), and automated phase‑aware optimization pipelines.
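The prefill carry-over effect motivating these strategies can be quantified with a simple ratio. The formula below is our reading of the reported 1.3%-51.8% amplification figures, assuming per-token decoding energy is compared between a light-prefill baseline run and a heavy-prefill run of the same model.

```python
# Sketch: percent increase in per-token decoding energy attributable to a
# heavier prefill. The comparison setup is our illustrative assumption.

def amplification_pct(decode_jpt_baseline: float, decode_jpt_heavy: float) -> float:
    """Relative growth of decoding J/token when prefill cost increases."""
    return 100.0 * (decode_jpt_heavy / decode_jpt_baseline - 1.0)
```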

Overall, this work advances the “Green AI” agenda by providing a fine‑grained, empirically validated view of where energy is spent during LLM inference in real‑world software‑engineering tasks, and by demonstrating practical, low‑overhead techniques that can cut that energy consumption by up to nearly 90 % without sacrificing functional quality.

