This Is Taking Too Long -- Investigating Time as a Proxy for Energy Consumption of LLMs

Lars Krupp (1), Daniel Geißler (1), Francisco M. Calatrava-Nicolas (2), Vishal Banwari (1), Paul Lukowicz (1), Jakob Karolus (1)
(1) Embedded Intelligence, DFKI and RPTU, Kaiserslautern, Germany (lars.krupp@dfki.de, daniel.geissler@dfki.de, vishal.banwari@dfki.de, paul.lukowicz@dfki.de, jakob.karolus@dfki.de)
(2) Centre for Applied Autonomous Sensor Systems (AASS), Örebro University, Örebro, Sweden (francisco.calatrava-nicolas@oru.se)

Abstract—The energy consumption of Large Language Models (LLMs) is raising growing concerns due to their adverse effects on environmental stability and resource use. Yet, these energy costs remain largely opaque to users, especially when models are accessed through an API, a black box in which all information depends on what providers choose to disclose. In this work, we investigate inference time measurements as a proxy to approximate the associated energy costs of API-based LLMs. We ground our approach by comparing our estimations with actual energy measurements from locally hosted equivalents. Our results show that time measurements allow us to infer GPU models for API-based LLMs, grounding our energy cost estimations. Our work aims to create means for understanding the associated energy costs of API-based LLMs, especially for end users.

Index Terms—Large language models, energy consumption, energy estimation, sustainability

I. INTRODUCTION

Recent progress in the field of deep learning, and especially Large Language Models (LLMs), has transformed what artificial intelligence is capable of across a wide range of domains, from natural language understanding to fields such as healthcare and robotics [1], [2].
Society is experiencing significant changes that move LLMs well beyond research settings into everyday use [3]. Large, nationally representative surveys show sharp growth in consumer adoption of generative AI tools for routine tasks, information seeking, and work/study support [4]–[6]. In recent years, the number of proposed models has increased significantly [7], further accelerating the adoption of generative AI across society.

This technological development has been fueled by increasingly large-scale architectures, big corpora of data, and high-performance computing (HPC) infrastructure. As a result, many companies and research institutions have entered a competitive race to develop more formidable models, aiming to push the boundaries of the current state of the art. However, this growth comes at significant computational cost, arising from both the training and inference stages of these models. The resulting energy demands, both from GPU-level power draw and system-level electricity consumption [8], [9], have reached new levels. Some companies, such as Google, are even constructing dedicated nuclear reactors to sustain their AI infrastructure [10]. Beyond technical and economic challenges, the environmental implications of this escalating energy consumption are becoming a great concern.

Regrettably, company policies regarding the disclosure of energy consumption are often opaque, hiding true costs behind inconspicuous web user interfaces, cf. ChatGPT [11]. Closed-source models and sparse details on data center infrastructure make robust energy estimation impossible [7], [8]. The lack of direct access to weights and to the model's inference configuration further prevents researchers from conducting transparent measurements of the LLM's energy footprint and, ultimately, from assessing the efficiency of API-based LLM interactions. Yet, there remains one metric that can be directly measured: task completion time.
That is, the time it takes for a model to process a single token. Consequently, computation time can prove valuable when estimating the gross energy consumption of API-based LLMs. However, factors such as the employed GPUs, model size and architecture, load factor, and many others still play a pivotal role and can drastically change computation times. Grounding results via locally deployed models with predefined options is therefore crucial to consolidating time-based estimations.

In this work, we investigate the potential of inference completion time as a robust proxy for estimating energy consumption in large language models (LLMs). To this end, we present a benchmark designed to analyze the relationship between inference time and energy usage under controlled experimental conditions for both locally hosted and API-based models. We focus on Mistral models [12], as they are available in both open-weight and API-accessible versions. The goal of this work is to evaluate whether inference time can serve as an effective proxy for energy estimation in API-based Mistral models and to consolidate our time-based estimates via locally measured computation times under representative conditions. Our code and evaluation results are available on GitHub¹. While our work is not the first to propose API-based energy estimation, cf. [13], our approach grounds API-based energy estimation through local measurements using the same model run on representative computing infrastructure. Our results show that inference time measurements on API-based LLMs, grounded by equivalent measurements on local models, allow us to infer the GPU configurations used on external servers. Our work provides evidence of a direct correspondence between API-based large language models and their local counterparts in terms of computation time and, ultimately, energy consumption.
II. RELATED WORK

Starting with the inception of the attention mechanism [14], large language models, such as ChatGPT, have been gaining traction, leading to fierce competition between different model families like OpenAI [11], Llama [15], DeepSeek [16], Qwen [17], and Mistral [18]. However, beyond competition on different benchmarks, there is an ongoing discussion about open-source versus proprietary models. While proprietary models often show superior performance on benchmarks [19], their inner workings frequently remain opaque, in contrast to open-source LLMs. Especially regarding the energy consumption of proprietary models, information is sparse and often unreliable, if available at all.

The energy footprint of LLMs spans both training and inference phases and constitutes one of the most pressing sustainability challenges in AI. Recent large-scale systems such as LLaMA-65B are estimated to consume around 5 × 10^5 kWh during training [9], [20]. Despite significant gains from modern datacenter efficiency and AI-optimized hardware, such runs can still emit tens to hundreds of tons of CO2 when powered by fossil-based energy [20], [21]. Advances in model sparsity (e.g., mixture-of-experts), efficient accelerator architectures, and low-carbon datacenters can reduce energy intensity by one to three orders of magnitude [20], [22].

For inference, even though a single LLM query may consume only 0.3 to 1 Wh, the cumulative demand of millions of daily requests may scale to annual petawatt-hour levels by 2026 [23], [24]. Inefficiencies from idle GPU power draw, throughput limits, and non-proportional energy scaling further amplify emissions, whereas techniques such as quantization, batching, and dynamic scheduling can mitigate these effects [23], [25]. Accurately assessing the energy footprint of large language models remains challenging.
Recent tools such as CarbonTracker [26] and CodeCarbon [27] aim to provide real-time monitoring of power consumption and CO2 emissions by integrating hardware-level energy data with regional grid carbon intensity.

¹ https://github.com/DFKIEI/LLM-Energy

Recent studies have begun to quantify the energy and environmental costs associated with both the training and evaluation stages of deep learning systems. On the training side, Geißler et al. [28] analyzed how hyperparameters can influence the energy demand of neural networks, showing that suboptimal hyperparameter choices can increase energy consumption even when achieving similar accuracy. As model scales have grown, these concerns have extended from conventional deep networks to large language models (LLMs). Fernandez et al. [29] presented a comprehensive study of the energy implications of LLM inference, examining how model architecture, decoding strategies, and software frameworks affect power use. Jegham et al. [13] further introduced time-based estimation methods to assess the energy consumption of API-served models, yet lacking grounding through actual measurements. Krupp et al. [30] depict an approach for benchmarking the energy consumption of locally run web agents and expose the complexity of estimating the energy cost for web agents that use API-based LLMs.

III. METHODOLOGY

Building upon these works, we propose a methodology that leverages locally executed models to inform the energy estimation of API-based systems, using inference computation time as a common factor for energy consumption across both setups. Our core assumption is that when computational hardware operates at or near optimal utilization, its power draw remains approximately constant, allowing total energy consumption to be estimated as a function of execution time.
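Under this constant-power assumption, energy is simply sustained power times wall-clock time. A minimal sketch (illustrative, not the authors' released code) is:

```python
def energy_wh(power_watts: float, seconds: float) -> float:
    """Energy in watt-hours under a constant-power assumption: E = P * t."""
    return power_watts * seconds / 3600.0

# Worked example using numbers from the paper: an H100 sustained at
# 0.9 x TDP (TDP = 700 W, cf. Table I) over a 1662 s benchmark run
# (cf. Table II) yields roughly 0.9 * 700 * 1662 / 3600 ≈ 291 Wh,
# matching the TDP-based total reported for the H100 in Table III.
print(round(energy_wh(0.9 * 700, 1662)))  # → 291
```

The same function, fed with measured average power instead of a TDP-derived figure, reproduces the measurement-grounded estimates used later.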
However, since the exact hardware specifications for API-based models are undisclosed, we additionally benchmark models locally across different GPU architectures to establish a plausible mapping of energy estimates to GPU fleets. We then compare these results with the inference completion times measured from API-based models.

Our benchmark consists of synthetic prompts (generated with llama3-70b) from four input–output configurations (covering technical, creative, educational, and business-oriented tasks with injected topics) that differ in sequence length (short = 2,048 tokens, long = 8,192 tokens). An example educational prompt may look like this: "Provide a thorough explanation of advanced mathematics".

We selected the LLMs based on their availability in both open-weight and API-based variants, as well as their model size. These criteria enabled us to assess the accuracy of our time-based proxy energy estimation approach more reliably by testing both models locally and through their API, while ensuring that the models remain lightweight enough to fit on a single GPU when tested locally. Specifically, we used Mistral-7B-Instruct-v0.3 (local), equivalent to Open-Mistral-7B (API), and Mistral-NeMo-Instruct-2407 (local), equivalent to Open-Mistral-NeMo (API).

A. Experiment Protocol

To ensure reproducibility and consistency, all tests are conducted following a structured experimental protocol. This applies both to locally executed runs and to runs initiated on API-based models. Each run processes a fixed sequence of 100 synthetic prompts from our benchmark. This deterministic ordering eliminates sampling variance, ensuring that all models and hardware platforms receive identical workloads. Execution proceeds in structured passes, where each configuration is executed repeatedly to ensure statistical reliability.
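A fixed workload of this shape can be sketched as a deterministic round-robin over the four task categories and two sequence lengths. The category names and prompt text below are illustrative placeholders, not the paper's exact prompt set:

```python
from itertools import cycle, islice

CATEGORIES = ["technical", "creative", "educational", "business"]  # four task types
LENGTHS = {"short": 2048, "long": 8192}                            # sequence lengths (tokens)

def build_benchmark(n_prompts: int = 100) -> list[dict]:
    """Build a fixed, ordered list of prompt configurations so that every
    model and GPU receives the identical workload (no sampling variance)."""
    combos = [(c, name, tok) for c in CATEGORIES for name, tok in LENGTHS.items()]
    schedule = islice(cycle(combos), n_prompts)  # deterministic round-robin
    return [
        {"id": i, "category": c, "length": name, "max_tokens": tok,
         "prompt": f"Provide a thorough explanation of {c} topic {i}"}  # placeholder
        for i, (c, name, tok) in enumerate(schedule)
    ]

bench = build_benchmark()
assert len(bench) == 100 and bench == build_benchmark()  # identical across calls
```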
For every configuration, the full benchmark is run independently ten times, allowing mean and standard deviation values to be computed for all runtime and energy metrics. In particular, for API-based models, we schedule runs throughout the day to capture variances due to times of high load. For each locally executed run, all prompts are processed in parallel batches of eight to ensure deployment with sufficient GPU utilization. Idle cycles between queries are minimized to ensure consistent energy-per-token comparability.

B. Local Hardware Architecture Selection

To assess the architectural impact on runtime and energy efficiency, local experiments are performed across six NVIDIA GPUs representing two generations (Ampere, Hopper) and form factors (SXM, PCIe): A100-40GB, A100-80GB, A100-PCI, H100, H100-PCI, and H200. These devices differ markedly in thermal design power (TDP) and optimal sustained load range, as shown in Table I.

The comparison focuses on the divergence between server-grade SXM modules and PCIe variants (A100-PCI, H100-PCI). Server versions benefit from higher TDP ceilings, external cooling, and high-bandwidth NVLink interconnects, whereas PCIe versions have less power and much larger auxiliary losses, such as power conversion and cooling, which are addressed on-chip.

Public sources do not disclose the exact GPU fleet serving Mistral APIs. However, Mistral reports training on H100 GPUs [31], and MLPerf v4.1 shows that H100/H200-class accelerators dominate datacenter LLM inference [32]. Since Mistral models are offered via major clouds [33], H100/H200-class GPUs are most likely used for API-hosted deployments.

C. Local LLM Initialization

The evaluated models, Mistral-7B-v0.3 and Mistral-NeMo-Instruct-2407 (12B), are executed in FP16 precision with TF32 acceleration on Ampere and Hopper GPUs, using device_map="auto" for balanced memory allocation.
KV-caching and left-padding are applied to optimize batch consistency and reduce redundant computation. To ensure comparability, the temperature is fixed at 0.7 across all runs, for both local and API-based models, mimicking real use cases. Random seeds (42) are synchronized across Python, NumPy, PyTorch, and CUDA. Prompts are passed directly to the model without system instructions, ensuring that runtime and energy variations result solely from model architecture, precision, and decoding configuration.

D. Local Energy Tracking

Energy consumption is recorded with CarbonTracker [26], which samples GPU power via the NVIDIA System Management Interface (SMI) and integrates the full board-level energy. In contrast, TDP reflects only the chip's power envelope, serving as a theoretical lower bound. The difference between the two indicates the additional power drawn by auxiliary components. This gap is most evident for PCIe GPUs, where less efficient power delivery and higher conversion losses cause CarbonTracker values to exceed TDP estimates, while server-grade GPUs show closer alignment. Comparing both metrics thus reveals how chip-level specifications underestimate real-world energy use, especially in PCIe configurations.

E. API-Based Energy Estimation

As outlined in our experiment protocol, we performed N = 10 independent executions on both a local open-weights deployment and an API-based deployment. For each local run i ∈ {1, 2, ..., N}, we tracked energy consumption during code execution. The resulting energy consumption is denoted as E_i^loc, and the wall-clock duration of each run as T_i^loc. Consequently, the average power draw for a local run can be calculated as

    P_i^loc = E_i^loc / T_i^loc,

and averaging over all runs yields P̄_loc.
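The estimation procedure of this section can be sketched in a few lines; the run values below are hypothetical placeholders, not measurements from the paper:

```python
from statistics import mean

def avg_local_power(energies_wh: list[float], durations_s: list[float]) -> float:
    """Mean power draw over N local runs: P_i = E_i / T_i, averaged."""
    watts = [e * 3600.0 / t for e, t in zip(energies_wh, durations_s)]  # Wh -> J, / s
    return mean(watts)

def estimate_api_energy_wh(p_local_w: float, api_durations_s: list[float]) -> float:
    """Predicted API energy in Wh: average local power times mean API runtime."""
    return p_local_w * mean(api_durations_s) / 3600.0

def per_token(value: float, n_tokens: int) -> float:
    """Normalize an energy or time value by a run's token count."""
    return value / n_tokens

# Hypothetical numbers: three local runs at roughly 500 W, two API timings.
p_loc = avg_local_power([0.25, 0.26, 0.24], [1.8, 1.9, 1.7])
e_api = estimate_api_energy_wh(p_loc, [720.0, 730.0])
```

The per-token normalization is needed because, as noted in the results, output token counts differ across deployments, making raw totals misleading.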
Based on this local power calculation, we can predict the energy consumption of the equivalent API-based model, if run on a comparable computing setup:

    Ê_api = P̄_loc · T̄_api,    (1)

where T̄_api denotes the mean API execution time over all API runs. To allow comparability between runs and models², we normalized our results by token count. Consequently, for each run, we normalized both energy E_i and time T_i by the token count of this run, N_tokens^(i). This yields the mean energy per token Ē_token and the mean inference time per token T̄_token.

IV. RESULTS

This section presents the evaluation of energy consumption for large language models (LLMs) from the Mistral family, executed both locally and through API-based services. We begin by analyzing the computational time required to complete the benchmark under both scenarios (local vs. API), as it forms the basis for the subsequent energy estimation. Next, we present the energy consumption analysis, reporting both the total energy and the energy per token (see Section III-E). Finally, we validate the energy predictions for the local setup by comparing them with estimates derived from the thermal design power (TDP) parameter of each GPU.

A. Benchmarking Computation Time

Table II shows the runtime on our proposed benchmark over ten runs for both scenarios (local and API). The comparison between local and API-based executions highlights a substantial difference in computational efficiency. Both the free and the paid API run the benchmark faster than any local GPU configuration.

² Output token count was not consistent across models.
TABLE I
TDP AND OPTIMAL WORKLOAD RANGE

GPU        VRAM    TDP [W]  Optimal TDP Load Range [W]
A100-40GB  40 GB   400      340–380
A100-80GB  80 GB   400      340–380
A100-PCI   40 GB   250      213–238
H100       80 GB   700      595–665
H100-PCI   94 GB   400      340–380
H200       140 GB  700      595–665

TABLE II
MEAN AND STANDARD DEVIATION OF TIME (S) TO COMPUTE THE BENCHMARK. GPUS ARE ORDERED CHRONOLOGICALLY BY RELEASE DATE.

Model         Type      GPU        mean ± sd
Mistral-7B    Local     A100-40GB  2447 ± 8.85
                        A100-80GB  2308 ± 8.36
                        A100-PCI   2525 ± 14.7
                        H100       1662 ± 22.5
                        H100-PCI   1820 ± 9.24
                        H200       1559 ± 4.72
              Free-API  -          718 ± 43.8
              Paid-API  -          727 ± 28.8
Mistral-NeMo  Local     A100-40GB  3865 ± 8.13
                        A100-80GB  3612 ± 14.0
                        A100-PCI   4003 ± 16.4
                        H100       2140 ± 23.1
                        H100-PCI   2353 ± 11.9
                        H200       2018 ± 8.17
              Free-API  -          1260 ± 161
              Paid-API  -          1159 ± 31.2

Compared to the best-performing local setup, the API execution achieves a reduction in computation time of approximately 54% for Mistral-7B and approximately 43% for Mistral-NeMo, indicating differences in the setup, such as system prompts or parallelization. The marginal difference between the free and paid APIs (on the order of a few seconds, within one standard deviation) suggests that both rely on similar backend resources, with variations likely due to server load or queueing effects. In contrast, local executions display longer but more predictable runtimes (smaller standard deviations), where performance depends on the underlying hardware. Within the local setup, a clear hierarchy emerges: newer GPUs (H100 and H200) consistently complete the benchmark faster than the older Ampere-series models.

In addition, we offer a fine-grained comparison using the average time per token. As seen in Figure 1, the T̄_token values of the different configurations are clustered by GPU generation for local runs. For both LLMs, we observe two local clusters (see Figure 1), the A-Cluster and the H-Cluster.
The API-based T̄_token for Mistral-NeMo is close to the H-Cluster; however, this is not the case for Mistral-7B, where runtimes are slightly higher than for the H-Cluster GPUs. Section IV-C puts these results into perspective regarding energy consumption.

Fig. 1. Boxplot showing the computation time per token T̄_token in seconds for running the same benchmark on both models across different local GPUs and API executions. Cluster A includes A100 GPUs, while Cluster H groups H100 and H200 GPUs.

B. Translating Time to Energy

In this section, we present the results related to the use of time in energy estimation for the API models (cf. Section III-E). In addition, we compare this value with the energy consumption based on the thermal design power (TDP) of each GPU. We use an average sustained power equal to 0.9 × TDP as a conservative estimate in which cooling cost and energy losses through heat radiation are not included. Empirically grounded estimates of actual energy use during high-throughput inference are reported between 85% and 95%, cf. [8], [31], [34]. Based on these findings, Table I reports the optimal TDP load values adopted in our analysis. Applying this information to the calculated runtimes, we can calculate a lower bound for the energy consumption. As shown in Table III, the differences in energy consumption between TDP estimates and local measurements can vary widely, depending on the GPU architecture (SXM or PCIe), as outlined in Section III-B.

C. Estimating the Energy Consumption for API-Based LLMs

By contrasting the measured time per token T̄_token of our GPU setups with the results from the API-based models, we can pinpoint likely GPU setups for the given LLMs.
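This matching step amounts to a nearest-neighbor comparison over per-token times. The sketch below uses rough per-token figures for Mistral-NeMo derived from Tables II–IV (total benchmark time divided by total tokens), not exact published values:

```python
def closest_setups(local_t_token: dict[str, float], api_t_token: float, k: int = 2):
    """Rank local GPU setups by how closely their mean time per token
    matches the API measurement (smaller gap = more plausible fleet)."""
    ranked = sorted(local_t_token.items(), key=lambda kv: abs(kv[1] - api_t_token))
    return ranked[:k]

# Approximate time per token (s) for Mistral-NeMo, derived from Tables II-III.
local = {"A100-40GB": 0.0117, "A100-80GB": 0.0110, "A100-PCI": 0.0122,
         "H100": 0.0064, "H100-PCI": 0.0071, "H200": 0.0061}
# Free-API runtime / token count (Tables II and IV): ~1260 s / ~173743 tokens.
print(closest_setups(local, api_t_token=0.0072))  # H-Cluster GPUs rank first
```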
We do note that, while T̄_token might be comparable, we cannot make any evidence-based statements on the actual GPU setups behind the API-based models. As such, this comparison is based on reports about compute clusters and our confirming findings through inference time measurements regarding those GPU setups. Likewise, Figure 1 reveals that the H100-PCI is a good representative setup for both Mistral models.

TABLE III
AVERAGE SUSTAINED POWER AND ENERGY USE ESTIMATED VIA TDP VS. LOCALLY MEASURED (CARBONTRACKER), PER TOKEN AND TOTAL.

Model         GPU        Total tokens  TDP per token (mWh)  TDP total (Wh)  CarbonTracker per token (mWh)  CarbonTracker total (Wh)  Difference [% of local]
Mistral-7B    A100-40GB  286219        0.86 ± 0.003         245 ± 0.885     1.06 ± 0.023                   303 ± 6.54                19
              A100-80GB  286219        0.81 ± 0.003         231 ± 0.836     1.10 ± 0.024                   314 ± 6.75                26
              A100-PCI   286219        0.55 ± 0.003         158 ± 0.921     0.949 ± 0.007                  272 ± 1.95                42
              H100       327674        0.88 ± 0.012         291 ± 3.94      1.07 ± 0.016                   350 ± 5.37                16
              H100-PCI   327674        0.56 ± 0.003         182 ± 0.924     0.904 ± 0.009                  296 ± 2.89                39
              H200       327674        0.83 ± 0.003         273 ± 0.827     1.04 ± 0.011                   340 ± 3.54                20
Mistral-NeMo  A100-40GB  329345        1.15 ± 0.002         387 ± 0.813     1.47 ± 0.027                   485 ± 8.75                20
              A100-80GB  329345        1.10 ± 0.004         361 ± 1.40      1.54 ± 0.015                   507 ± 4.83                29
              A100-PCI   329345        0.75 ± 0.003         250 ± 1.02      1.32 ± 0.007                   435 ± 2.23                43
              H100       332860        1.13 ± 0.012         375 ± 4.04      1.37 ± 0.019                   456 ± 6.28                18
              H100-PCI   332860        0.71 ± 0.004         235 ± 1.19      1.17 ± 0.013                   388 ± 4.25                39
              H200       332860        1.06 ± 0.004         353 ± 1.43      1.33 ± 0.021                   444 ± 7.00                20

TABLE IV
COMPARISON BETWEEN MEASURED AND CALCULATED TOTAL ENERGY COSTS FOR THE API VERSION (FREE AND PAID) OF MISTRAL-7B AND MISTRAL-NEMO.

Model         API cost  M/C  Number of tokens  H100-PCI total energy (Wh)
Mistral-7B    free      M    111272 ± 6351     100.60 ± 6.00
Mistral-7B    free      C    111272 ± 6351     62.31 ± 3.56
Mistral-7B    paid      M    110882 ± 5922     100.23 ± 5.60
Mistral-7B    paid      C    110882 ± 5922     62.09 ± 3.32
Mistral-NeMo  free      M    173743 ± 4967     203.28 ± 5.81
Mistral-NeMo  free      C    173743 ± 4967     123.36 ± 3.53
Mistral-NeMo  paid      M    174496 ± 5815     204.16 ± 6.80
Mistral-NeMo  paid      C    174496 ± 5815     123.89 ± 4.13
M = measured; C = calculated.

Finally, we calculated the estimate of the mean energy consumption for the API-based models, Ê_api, as given in Equation (1), using the average number of tokens computed by the paid and free API and the energy-per-token results for both the TDP calculation and the local measurements. For Mistral-7B and Mistral-NeMo, the results are shown in Table IV, showcasing differing energy consumptions. Note that the TDP calculations are likely underestimates, cf. Section III-D. We will discuss additional factors for these results in the next section.

V. DISCUSSION

Our results confirmed the necessity of a strict experiment protocol. Deviations in deployment-specific factors, such as variations in system prompts and minor model version differences between API and open-source releases, may easily lead to skewed time measurements, as depicted in the comparison between Table III and Table IV. Here, the API and local executions produce different numbers of tokens, making raw total energy values misleading across settings. Consequently, normalizing energy values by the number of processed tokens is essential for accurate results. Figure 1 shows these normalized values for API-based and local models. In particular, it highlights a close alignment between the API-based deployment and Cluster H (local deployment equipped with NVIDIA Hopper-series GPUs) for Mistral-NeMo. However, we note that this does not imply that both deployments necessarily consume the same energy per token.
Server-side optimization techniques, such as tensor or pipeline parallelism, may be applied to reduce latency at the expense of higher power consumption. For instance, hosting Mistral-7B on multiple GPUs shortens the runtime but proportionally increases the total energy consumption per second. Nonetheless, we know that most data centers use a mixture of NVIDIA Ampere- and NVIDIA Hopper-series GPUs. Further, our local time-per-token results are within one standard deviation of the API-based results. As we select models that fit into a single GPU, it is reasonable to conclude that the two measurements directly correspond, ultimately confirming our hypothesis that inference time per token is a reasonable proxy for energy estimation in API-based LLMs.

A. Limitations

This work relies on the assumption that the time we measure for the API-based models is dominated by GPU computation time, with other contributions being negligible. This assumption allowed us to reduce the complexity of our evaluation, for example, to a single GPU and a single batch size. In practice, however, real-world deployments may introduce additional sources of latency and energy consumption that are not directly observable in our proposed experimental setup. These include multi-GPU execution, as well as tensor, pipeline, and expert parallelism strategies, which can change the runtime–energy trade-off (e.g., reducing latency at the cost of increased power draw). While our approach therefore provides a reasonable gross estimate of the energy consumption of LLMs accessed through an API, a more fine-grained characterization of these effects could improve accuracy. As future work, systematic ablation studies, including parallelism strategies, hardware configurations, and batch sizes, are necessary to quantify the impact of these factors on both runtime and energy use.
VI. CONCLUSION

In this work, we investigated the potential of inference time as a proxy for estimating the energy consumption of API-based language models. Our results show that time per token can be used to infer the type of GPU configuration used behind API-based models. Likewise, gross energy estimations for these particular models are possible, providing researchers with a means to judge the energy efficiency of API-based models.

VII. ACKNOWLEDGMENTS

This work is funded and supported by the EU's Horizon Europe research and innovation program through the "SustainML" project (101070408), the Carl Zeiss Foundation through the project "Sustainable Embedded AI", the German BMFTR through the "Cross-Act" project (01IW25001), and the Wallenberg AI, Autonomous Systems and Software Program (WASP) under the Knut and Alice Wallenberg Foundation.

REFERENCES

[1] J. Wang, E. Shi, H. Hu, C. Ma, Y. Liu, X. Wang, Y. Yao, X. Liu, B. Ge, and S. Zhang, "Large language models for robotics: Opportunities, challenges, and perspectives," Journal of Automation and Intelligence, vol. 4, no. 1, pp. 52–64, 2025.
[2] J. Haltaufderheide and R. Ranisch, "The ethics of ChatGPT in medicine and healthcare: a systematic review on large language models (LLMs)," NPJ Digital Medicine, vol. 7, no. 1, p. 183, 2024.
[3] J. Howarth, "Most visited websites in the world (August 2025)," webpage, 2025. [Online]. Available: https://explodingtopics.com/blog/most-visited-websites
[4] F. Simon, R. K. Nielsen, and R. Fletcher, "Generative AI and news report 2025: how people think about AI's role in journalism and society," 2025.
[5] O. McClain, "34% of US adults have used ChatGPT, about double the share in 2023," Pew Research Center, 2025.
[6] Ofcom, "Online nation 2024 report," Ofcom, London, UK, Tech. Rep., Nov. 2024, accessed: October 23, 2025. [Online].
Available: https://www.ofcom.org.uk/siteassets/resources/documents/research-and-data/online-research/online-nation/2024/online-nation-2024-report.pdf?v=386238
[7] N. Maslej, L. Fattorini, R. Perrault, Y. Gil, V. Parli, N. Kariuki, E. Capstick, A. Reuel, E. Brynjolfsson, J. Etchemendy et al., "Artificial intelligence index report 2025," arXiv preprint, 2025.
[8] S. Samsi, D. Zhao, J. McDonald, B. Li, A. Michaleas, M. Jones, W. Bergeron, J. Kepner, D. Tiwari, and V. Gadepally, "From words to watts: Benchmarking the energy costs of large language model inference," in 2023 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2023, pp. 1–9.
[9] E. Strubell, A. Ganesh, and A. McCallum, "Energy and policy considerations for modern deep learning research," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 09, 2020, pp. 13693–13696.
[10] A. Hunt, "Google to fund development of three nuclear power sites," 2025. [Online]. Available: https://www.world-nuclear-news.org/articles/google-to-fund-elementl-to-prepare-three-nuclear-power-sites
[11] OpenAI et al., "GPT-4 technical report," 2024. [Online]. Available: https://arxiv.org/abs/2303.08774
[12] Mistral AI, "Mistral AI," webpage, 2025. [Online]. Available: https://mistral.ai/
[13] N. Jegham, M. Abdelatti, L. Elmoubarki, and A. Hendawi, "How hungry is AI? Benchmarking energy, water, and carbon footprint of LLM inference," arXiv preprint arXiv:2505.09598, 2025.
[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[15] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., "The Llama 3 herd of models," arXiv e-prints, pp. arXiv–2407, 2024.
[16] A. Liu, B. Feng, B. Xue, B.
Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., "DeepSeek-V3 technical report," arXiv preprint arXiv:2412.19437, 2024.
[17] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., "Qwen3 technical report," arXiv preprint arXiv:2505.09388, 2025.
[18] Mistral, "Mistral NeMo," 2024. [Online]. Available: https://mistral.ai/news/mistral-nemo
[19] C. White, S. Dooley, M. Roberts, A. Pal, B. Feuer, S. Jain, R. Shwartz-Ziv, N. Jain, K. Saifullah, S. Dey et al., "LiveBench: A challenging, contamination-limited LLM benchmark," arXiv preprint arXiv:2406.19314, 2024.
[20] D. Patterson, J. Gonzalez, Q. Le, C. Liang, L.-M. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean, "Carbon emissions and large neural network training," arXiv preprint, 2021.
[21] P. Henderson, J. Hu, J. Romoff, E. Brunskill, D. Jurafsky, and J. Pineau, "Towards the systematic reporting of the energy and carbon footprints of machine learning," Journal of Machine Learning Research, vol. 21, no. 248, pp. 1–43, 2020.
[22] D. Zhao, N. C. Frey, J. McDonald, M. Hubbell, D. Bestor, M. Jones, A. Prout, V. Gadepally, and S. Samsi, "A green(er) world for AI," in 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2022, pp. 742–750.
[23] M. Özcan, P. Wiesner, P. Weiß, and O. Kao, "Quantifying the energy consumption and carbon emissions of LLM inference via simulations," arXiv preprint arXiv:2507.11417, 2025.
[24] R. Desislavov, F. Martínez-Plumed, and J. Hernández-Orallo, "Trends in AI inference energy consumption: Beyond the performance-vs-parameter laws of deep learning," Sustainable Computing: Informatics and Systems, vol. 38, p. 100857, 2023.
[25] C.-J. Wu, R. Raghavendra, U. Gupta, B. Acun, N. Ardalani, K. Maeng, G. Chang, F. Aga, J. Huang, C. Bai et al.
, "Sustainable AI: Environmental implications, challenges and opportunities," Proceedings of Machine Learning and Systems, vol. 4, pp. 795–813, 2022.
[26] L. F. W. Anthony, B. Kanding, and R. Selvan, "Carbontracker: Tracking and predicting the carbon footprint of training deep learning models," arXiv preprint arXiv:2007.03051, 2020.
[27] B. Courty, V. Schmidt, S. Luccioni, Goyal-Kamal, MarionCoutarel, B. Feld, J. Lecourt, LiamConnell, A. Saboni, Inimaz, supatomic, M. Léval, L. Blanche, A. Cruveiller, ouminasara, F. Zhao, A. Joshi, A. Bogroff, H. de Lavoreille, N. Laskaris, E. Abati, D. Blank, Z. Wang, A. Catovic, M. Alencon, M. Stechly, C. Bauer, L. O. N. de Araújo, JPW, and MinervaBooks, "mlco2/codecarbon: v2.4.1," May 2024. [Online]. Available: https://doi.org/10.5281/zenodo.11171501
[28] D. Geißler, B. Zhou, M. Liu, S. Suh, and P. Lukowicz, "The power of training: How different neural network setups influence the energy demand," in International Conference on Architecture of Computing Systems. Springer, 2024, pp. 33–47.
[29] J. Fernandez, C. Na, V. Tiwari, Y. Bisk, S. Luccioni, and E. Strubell, "Energy considerations of large language model inference and efficiency optimizations," arXiv preprint arXiv:2504.17674, 2025.
[30] L. Krupp, D. Geißler, V. Banwari, P. Lukowicz, and J. Karolus, "Promoting sustainable web agents: Benchmarking and estimating energy consumption through empirical and theoretical analysis," arXiv preprint arXiv:2511.04481, 2025.
[31] NVIDIA, "Mistral AI and NVIDIA unveil Mistral NeMo 12B," https://blogs.nvidia.com/blog/mistral-nvidia-ai-model/, 2024, accessed: 2025-10-29.
[32] ——, "NVIDIA Blackwell sets new standard for gen AI in MLPerf Inference v4.1," https://blogs.nvidia.com/blog/mlperf-inference-benchmark-blackwell/, 2024, accessed: 2025-10-29.
[33] Amazon Web Services, "Mistral AI models now available on Amazon Bedrock," https://aws.amazon.com/blogs/aws/mistral-ai-models-now-available-on-amazon-bedrock/, 2024, accessed: 2025-10-29.
[34] NVIDIA, "NVIDIA A100 Tensor Core GPU architecture," NVIDIA Corporation, Tech. Rep., 2020. [Online]. Available: https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf
