Can AI Make Energy Retrofit Decisions? An Evaluation of Large Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Conventional approaches to building energy retrofit decision-making suffer from limited generalizability and low interpretability, hindering adoption in diverse residential contexts. With the growth of Smart and Connected Communities, generative AI, especially large language models (LLMs), may help by processing contextual information and producing practitioner-readable recommendations. We evaluate seven LLMs (ChatGPT o1 and o3, DeepSeek, Gemini, Grok, Llama, and Claude) on residential retrofit decisions under two objectives: maximizing CO₂ reduction (technical) and minimizing payback period (sociotechnical). Performance is assessed on four dimensions: accuracy, consistency, sensitivity, and reasoning, using a dataset of 400 homes across 49 US states. LLMs generate effective recommendations in many cases, reaching up to 54.5 percent top-1 match and 92.8 percent within top-5 without fine-tuning. Performance is stronger for the technical objective, while sociotechnical decisions are limited by economic trade-offs and local context. Agreement across models is low, and higher-performing models tend to diverge from others. LLMs are sensitive to location and building geometry but less sensitive to technology and occupant behavior. Most models show step-by-step, engineering-style reasoning, but it is often simplified and lacks deeper contextual awareness. Overall, LLMs are promising assistants for energy retrofit decision-making, but improvements in accuracy, consistency, and context handling are needed for reliable practice.


💡 Research Summary

This paper investigates whether large language models (LLMs) can serve as reliable assistants for residential energy‑retrofit decision‑making. Conventional physics‑based simulations and data‑driven methods suffer from heavy data requirements, limited scalability, poor sensitivity to micro‑climatic variations, static occupant‑behavior assumptions, and limited interpretability. The authors argue that the emerging Smart and Connected Communities (S&CC) ecosystem—combining digital twins, Internet‑of‑Things sensors, and multi‑agent systems—provides rich, real‑time contextual data that LLMs could synthesize into human‑readable retrofit recommendations.

To test this hypothesis, seven state‑of‑the‑art LLMs (ChatGPT o1 and o3, DeepSeek R1, Gemini 2.0, Grok 3, Llama 3.2, and Claude 3.7) were evaluated on a curated subset of the NREL ResStock 2024.2 dataset. Four hundred homes from 49 U.S. states were selected, each described by 389 parameters (building envelope, HVAC equipment, occupant demographics, location, etc.). Sixteen retrofit packages—varying heat‑pump efficiency, envelope sealing, insulation, and appliance electrification—were defined, and EnergyPlus simulations supplied baseline and retrofit energy, CO₂, and cost outcomes. Missing cost data were filled using the National Residential Efficiency Measures database.
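The two optimization objectives can be made concrete with a small sketch. All package names, costs, and savings below are invented for illustration (the paper's actual values come from EnergyPlus simulations and the cost databases); selecting the benchmark-optimal package then reduces to a max over CO₂ savings and a min over simple payback:

```python
from dataclasses import dataclass

@dataclass
class PackageOutcome:
    """Simulated outcome for one retrofit package on one home (hypothetical fields)."""
    name: str
    co2_reduction_kg: float    # annual CO2 saved vs. baseline
    upfront_cost_usd: float    # installed cost of the package
    annual_savings_usd: float  # annual energy-bill savings vs. baseline

def payback_years(p: PackageOutcome) -> float:
    """Simple payback period: upfront cost divided by annual savings."""
    return p.upfront_cost_usd / p.annual_savings_usd

# Toy outcomes for three of the sixteen packages (values are made up)
outcomes = [
    PackageOutcome("heat_pump_only", 1800.0, 9000.0, 600.0),
    PackageOutcome("envelope_plus_hp", 2600.0, 15000.0, 900.0),
    PackageOutcome("full_electrification", 3100.0, 24000.0, 1100.0),
]

best_co2 = max(outcomes, key=lambda p: p.co2_reduction_kg)  # technical objective
best_payback = min(outcomes, key=payback_years)             # sociotechnical objective
```

Note that the two objectives need not agree: in this toy data the deepest retrofit wins on CO₂ while the cheapest one wins on payback, which mirrors the trade-off the paper probes.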

Each LLM received an identical three‑part prompt: (1) a concise description of the 16 packages, (2) a role assignment (e.g., “energy‑efficiency consultant”), and (3) the specific house’s data. The models were asked to select the optimal package under two distinct objectives: (a) maximize CO₂ reduction (technical goal) and (b) minimize payback period (sociotechnical goal). Performance was measured along four dimensions.
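A minimal sketch of how such a three-part prompt might be assembled. The package text, role wording, and house fields here are placeholders, not the authors' actual prompt:

```python
import json

# (1) Concise description of the 16 retrofit packages (abbreviated placeholder text)
PACKAGE_DESCRIPTIONS = (
    "Package 1: air-source heat pump upgrade.\n"
    "Package 2: envelope sealing plus insulation.\n"
    "... (14 more packages) ..."
)

def build_prompt(house: dict, objective: str) -> str:
    """Assemble the three-part prompt: packages, role, house data, then the task."""
    role = "You are an energy-efficiency consultant advising on residential retrofits."
    task = (
        f"Select the optimal retrofit package to {objective}. "
        "Rank your top five choices and explain your reasoning."
    )
    return "\n\n".join([
        PACKAGE_DESCRIPTIONS,                            # (1) the packages
        role,                                            # (2) role assignment
        "House data:\n" + json.dumps(house, indent=2),   # (3) the house's parameters
        task,
    ])

prompt = build_prompt(
    {"state": "CO", "floor_area_sqft": 1800, "heating_fuel": "natural gas"},
    "maximize annual CO2 reduction",
)
```

In the study each house was described by 389 parameters; the three-field dictionary above just stands in for that record.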

Accuracy was quantified by the proportion of model suggestions that matched the benchmark optimal package in the top‑1 and top‑5 rankings. The best‑performing model achieved 54.5 % top‑1 accuracy and 92.8 % top‑5 accuracy, with higher scores for the CO₂‑reduction objective. Consistency—the agreement among different LLMs—was low (Cohen’s κ ≈ 0.12), and higher‑performing models tended to diverge more from the rest. Sensitivity analyses showed strong dependence on geographic location and building geometry, but relatively weak responses to variations in heat‑pump efficiency, insulation levels, or occupant behavior, indicating that LLMs capture broad statistical patterns but lack fine‑grained technical nuance. Reasoning quality was assessed qualitatively; most models produced step‑by‑step, engineering‑style explanations (e.g., “first improve envelope sealing, then upgrade the heat pump”), yet these narratives often omitted deeper physical interactions and micro‑climate effects, rendering the reasoning superficial.
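The accuracy and consistency metrics are straightforward to reimplement. This sketch computes top-k match rates against the benchmark-optimal package and pairwise Cohen's κ between two models' top-1 picks; the data shapes are assumptions, not the paper's evaluation code:

```python
def top_k_accuracy(preds, truths, k):
    """preds: per-home ranked package lists; truths: benchmark-optimal package per home."""
    hits = sum(1 for ranked, best in zip(preds, truths) if best in ranked[:k])
    return hits / len(truths)

def cohens_kappa(a, b, labels):
    """Chance-corrected agreement between two models' top-1 picks."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n                      # observed agreement
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)   # chance agreement
    return (po - pe) / (1 - pe)
```

With 400 homes, `preds` would hold one ranked list per home per model; κ values near the paper's ≈ 0.12 indicate agreement only slightly above chance.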

Model‑specific observations: ChatGPT o1/o3 displayed the most coherent multi‑step reasoning and stable context handling; DeepSeek R1 excelled at numerical cost calculations but offered terse explanations; Gemini 2.0’s multimodal capabilities were not leveraged in the text‑only test; Grok 3 advertised real‑time data integration but did not demonstrate it; Llama 3.2 was fast but prone to logical errors in complex optimization; Claude 3.7 offered a hybrid of quick and deep reasoning with moderate accuracy.

The study highlights that LLM outputs are highly sensitive to prompt design, and without domain‑specific fine‑tuning the models still achieve surprisingly high top‑5 coverage, but fall short of the reliability required for professional engineering decisions (e.g., >95 % top‑1 accuracy). Consequently, LLMs are best positioned as front‑end tools for idea generation, option comparison, and stakeholder communication, while final design and implementation should still rely on physics‑based simulation and expert validation.

Future work suggested includes domain‑specific fine‑tuning to improve sensitivity to technical parameters, structured output formats (e.g., JSON) for downstream automation, tighter integration with live sensor and weather feeds to enhance context awareness, and ensemble or voting mechanisms to boost inter‑model consistency. If these advances materialize, LLMs could become a core AI engine within S&C C platforms, accelerating carbon‑reduction and cost‑effective retrofits across diverse residential stocks.
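One of the suggested directions, an ensemble or voting mechanism across models, can be sketched as a majority vote over each model's top-1 pick. This is a toy illustration of the idea, not a scheme the paper specifies:

```python
from collections import Counter

def majority_vote(picks):
    """Combine several models' top-1 picks; ties go to the earliest model's choice."""
    counts = Counter(picks)
    top = max(counts.values())
    for p in picks:            # iterate in model order so ties break deterministically
        if counts[p] == top:
            return p
```

Given the low inter-model agreement reported (κ ≈ 0.12), such a vote would often fall back to the tie-break rule, so weighting models by validated accuracy may be a more useful variant.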

