AI Research
Can AI optimize building retrofits? Research shows promise in CO₂ reduction but gaps in economic reasoning

Researchers from Michigan State University have conducted one of the first systematic evaluations of large language models (LLMs) in the domain of building energy retrofits, where decisions on upgrades such as insulation, heat pumps, and electrification can directly impact energy savings and carbon reduction.
The study, titled “Can AI Make Energy Retrofit Decisions? An Evaluation of Large Language Models,” published on arXiv, examines whether LLMs can reliably guide retrofit decision-making across diverse U.S. housing stock. It addresses the limitations of conventional methods, which are often too technical, data-heavy, or opaque for practical adoption, particularly at large scale.
How accurate are AI models in selecting retrofit measures?
The researchers tested seven widely used LLMs (ChatGPT o1, ChatGPT o3, DeepSeek R1, Grok 3, Gemini 2.0, Llama 3.2, and Claude 3.7) on a dataset of 400 homes drawn from 49 states. Each home profile included details such as construction vintage, floor area, insulation levels, heating and cooling systems, and occupant patterns. The models were asked to recommend retrofit measures under two separate objectives: maximizing carbon dioxide reduction (technical context) and minimizing payback period (sociotechnical context).
The analysis found that LLMs delivered effective results in technical optimization tasks. Top-1 accuracy (the single best measure) reached 54.5 percent, and Top-5 accuracy (the correct measure appearing among the top five recommendations) reached as high as 92.8 percent, even without fine-tuning. This reflects the models’ ability to align with physics-based benchmarks in scenarios where clear engineering goals, such as cutting carbon emissions, are prioritized.
On the other hand, when the focus shifted to minimizing payback period, results weakened substantially. Top-1 accuracy fell as low as 6.5 percent in some models, with only Gemini 2.0 surpassing 50 percent at the broader Top-5 threshold. The study concludes that economic trade-offs, which require balancing upfront investment against long-term savings, remain difficult for LLMs to interpret accurately.
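To make the Top-1 and Top-5 figures concrete, here is a minimal sketch, not drawn from the paper, of how such accuracy could be computed. The data structures are illustrative assumptions: each home pairs a physics-based benchmark-optimal measure with an LLM's ranked recommendation list.

```python
# Minimal sketch (not the study's code) of Top-k accuracy for retrofit
# recommendations. "benchmark_best" and "llm_ranking" are hypothetical fields.

def top_k_accuracy(homes, k):
    """Fraction of homes where the benchmark-optimal retrofit measure
    appears among the LLM's top-k recommendations."""
    hits = sum(home["benchmark_best"] in home["llm_ranking"][:k] for home in homes)
    return hits / len(homes)

homes = [
    {"benchmark_best": "heat_pump",
     "llm_ranking": ["heat_pump", "attic_insulation", "windows"]},
    {"benchmark_best": "attic_insulation",
     "llm_ranking": ["windows", "attic_insulation", "heat_pump"]},
]

print(f"Top-1: {top_k_accuracy(homes, 1):.1%}")  # 50.0% on this toy data
print(f"Top-5: {top_k_accuracy(homes, 5):.1%}")  # 100.0% on this toy data
```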
How consistent and reliable are AI-generated decisions?
The study also examined whether different LLMs converged on the same recommendations. Here, performance was less encouraging. Consistency between models was low, and in some cases their agreement was worse than chance. Interestingly, the models that performed best in terms of accuracy, such as ChatGPT o3 and Gemini 2.0, were also the ones most likely to diverge from other systems. This indicates that while some models may excel, they do not necessarily produce results that align with peers, creating challenges for standardization in real-world applications.
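The article does not name the agreement statistic, but Cohen's kappa is a standard way to express "worse than chance" agreement: it is positive when two models agree more often than their label frequencies would predict, and negative when they agree less. A sketch with invented recommendations, purely to illustrate the metric:

```python
# Illustrative only: pairwise inter-model agreement via Cohen's kappa.
# The model outputs below are invented; the study's statistic is unspecified.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement implied by each model's label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical top recommendations from two models on five homes.
model_x = ["heat_pump", "windows", "heat_pump", "insulation", "windows"]
model_y = ["insulation", "heat_pump", "windows", "heat_pump", "insulation"]

print(f"kappa = {cohen_kappa(model_x, model_y):.2f}")  # -0.47: worse than chance
```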
The findings underscore the difficulty of relying on AI for high-stakes energy decisions when consensus is lacking. In practice, building owners, policymakers, and utility companies require not just accurate but also consistent recommendations. Low inter-model reliability highlights the importance of developing frameworks that validate and harmonize AI outputs before they can be integrated into large-scale retrofit programs.
What shapes AI reasoning in retrofit decisions?
The researchers also explored how LLMs arrive at their decisions. Sensitivity analysis showed that most models, like physics-based baselines, prioritized location and building geometry. Variables such as county, state, and floor space were consistently weighted as the most influential factors. However, the models paid less attention to occupant behaviors and technology choices, even though these can be critical in shaping real-world outcomes.
The reasoning patterns offered further insight. Among the tested systems, ChatGPT o3 and DeepSeek R1 provided the most structured, step-by-step explanations. Their workflows followed an engineering-like logic, beginning with baseline energy assumptions, adjusting for envelope improvements, calculating system efficiency, incorporating appliance impacts, and finally comparing outcomes. Yet, while the logic mirrored engineering principles, it was often simplified, overlooking nuanced contextual dependencies such as occupant usage levels or detailed climate variations.
The authors also noted that prompt design played a key role in outcomes. Slight adjustments in how questions were phrased could significantly shift model reasoning. For example, if not explicitly instructed to consider both upfront cost and energy savings, some models defaulted to choosing the lowest-cost option when evaluating payback. This sensitivity suggests that successful deployment of AI in retrofit contexts will depend heavily on careful prompt engineering and domain-specific adaptation.
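To illustrate that sensitivity, the sketch below contrasts two hypothetical prompt phrasings for the payback objective. The study's actual prompts are not reproduced in the article, so the home profile and both variants are assumptions.

```python
# Hypothetical prompt variants; neither is taken from the paper.
home_profile = "1978 single-family home, 1,600 sq ft, gas furnace, R-11 walls"

# Underspecified: per the article, some models default to the cheapest measure.
prompt_v1 = (
    f"Given this home: {home_profile}. "
    "Recommend the retrofit measure with the shortest payback period."
)

# Explicit: spells out that payback balances upfront cost against savings.
prompt_v2 = (
    f"Given this home: {home_profile}. "
    "Recommend the retrofit measure with the shortest payback period, "
    "computed as upfront installation cost divided by annual energy cost "
    "savings. Weigh both cost and savings, not cost alone."
)
```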
A cautious but forward-looking conclusion
The evaluation highlights both the promise and the limitations of current LLMs in building energy retrofits. On one hand, the ability to achieve near 93 percent alignment with top retrofit measures in technical contexts shows significant potential for AI to streamline decision-making and improve energy efficiency strategies. On the other, weak performance in sociotechnical trade-offs, low inter-model consistency, and simplified reasoning demonstrate that these tools are not yet ready to replace domain expertise.
The authors conclude that LLMs can complement, but not substitute for, traditional methods and expert judgment in retrofit planning. They recommend further development of domain-specific models, fine-tuning with validated datasets, and hybrid approaches that integrate AI with physics-based simulations to ensure accuracy and traceability.
For policymakers and practitioners, the study provides an important benchmark: AI can indeed assist in advancing retrofit strategies, especially for carbon reduction, but its current shortcomings demand careful oversight. As cities and communities push toward energy transition goals, ensuring that AI systems are transparent, consistent, and context-aware will be essential before they can be deployed at scale.
AI Research
Brown awarded $20 million to lead artificial intelligence research institute aimed at mental health support

A $20 million grant from the National Science Foundation will support the new AI Research Institute on Interaction for AI Assistants (ARIA), based at Brown, which will study human-AI interaction and mental health. The initiative, announced in July, aims to help develop AI support for mental and behavioral health.
“The reason we’re focusing on mental health is because we think this represents a lot of the really big, really hard problems that current AI can’t handle,” said Associate Professor of Computer Science and Cognitive and Psychological Sciences Ellie Pavlick, who will lead ARIA. After viewing news stories about AI chatbots’ damage to users’ mental health, Pavlick sees renewed urgency in asking, “What do we actually want from AI?”
The initiative is part of a broader NSF investment to support the goals of the White House’s AI Action Plan, according to an NSF press release. This “public-private investment,” the press release says, will “sustain and enhance America’s global AI dominance.”
According to Pavlick, she and her fellow researchers submitted the proposal for ARIA “years ago, long before the administration change,” but the response was “very delayed” due to “a lot of uncertainty at (the) NSF.”
One of these collaborators was Michael Frank, the director of the Center for Computational Brain Science at the Carney Institute and a professor of psychology.
Frank, who was already working with Pavlick on projects related to AI and human learning, said that the goal is to tie together collaborations of members from different fields “more systematically and more broadly.”
According to Roman Feiman, an assistant professor of cognitive and psychological sciences and linguistics and another member of the ARIA team, the goal of the initiative is to “develop better virtual assistants.” But that goal includes various obstacles to ensure the machines “treat humans well,” behave ethically and remain controllable.
Within the study, some “people work (on) basic cognitive neuroscience, other people work more on human machine interaction (and) other people work more on policy and society,” Pavlick explained.
Although the ARIA team consists of many faculty and students at Brown, other institutions, including Carnegie Mellon University, the University of New Mexico and Dartmouth, are also involved, according to Pavlick. On top of “basic science” research, ARIA also examines the best practices for patient safety and the legal implications of AI.
“As everybody currently knows, people are relying on (large language models) a lot, and I think many people who rely on them don’t really know how best to use them, and don’t entirely understand their limitations,” Feiman said.
According to Frank, the goal is not to “replace human therapists,” but rather to assist them.
Assistant Professor of the Practice of Computer Science and Philosophy Julia Netter, who studies the ethics of technology and responsible computing and is not involved in ARIA, said that ARIA has “the right approach.”
Netter said ARIA’s approach differs from previous research “in that it really tried to bring in experts from other areas, people who know about mental health” and others, rather than those who focus solely on computer science.
But the ethics of using AI in a mental health context is a “tricky question,” she added.
“This is an area that touches people at a point in time when they are very, very vulnerable,” Netter said, adding that any interventions that arise from this research should be “well-tested.”
“You’re touching an area of a person’s life that really has the potential of making a huge difference, positive or negative,” she added.
Because AI is “not going anywhere,” Frank said he is excited to “understand and control it in ways that are used for good.”
“My hope is that there will be a shift from just trying stuff and seeing what gets a better product,” Feiman said. “I think there’s real potential for scientific enterprise — not just a profit-making enterprise — of figuring out what is actually the best way to use these things to improve people’s lives.”
AI Research
BITSoM launches AI research and innovation lab to shape future leaders

Mumbai: The BITS School of Management (BITSoM), under the aegis of BITS Pilani, a leading private university, will inaugurate its new BITSoM Research in AI and Innovation (BRAIN) Lab on its Kalyan campus on Friday. The lab is designed to prepare future leaders for workplaces transformed by artificial intelligence.
While explaining the concept of the laboratory, professor Saravanan Kesavan, dean of BITSoM, said that the BRAIN Lab had three core pillars: teaching, research, and outreach. Kesavan said, “It provides MBA (Master of Business Administration) students a dedicated space equipped with high-performance AI computers capable of handling tasks such as computer vision and large-scale data analysis. Students will not only learn about AI concepts in theory but also experiment with real-world applications.” Kesavan added that each graduating student would be expected to develop an AI product as part of their coursework, giving them first-hand experience in innovation and problem-solving.
The BRAIN Lab is also designed to be a hub of collaboration where researchers can conduct projects in partnership with companies across industries, creating a repository of practical AI tools. Kesavan said, “The initial focus areas (of the lab) include manufacturing, healthcare, banking and financial services, and Global Capability Centres (subsidiaries of multinational corporations that perform specialised functions).” He added that the case studies and research from the lab will be made freely available to schools, colleges, researchers, and corporate partners, ensuring that the benefits of the lab reach beyond the BITSoM campus.
BITSoM also plans to use the BRAIN Lab as a launchpad for startups. An AI programme will support entrepreneurs in developing solutions tailored to their needs while connecting them to venture capital networks in India and Silicon Valley. This will give young companies the chance to refine their ideas with guidance from both academics and industry leaders.
The centre’s physical setup resembles a modern computer lab, with dedicated workspaces, collaborative meeting rooms, and brainstorming zones. It has been designed to encourage creativity, allowing students to visualise how AI works, customise tools for different industries, and translate technical capabilities into business impact.
In the context of a global workplace that is embracing AI, Kesavan said, “Future leaders need to understand not just how to manage people but also how to manage a workforce that combines humans and AI agents. Our goal is to ensure every student graduating from BITSoM is equipped with the skills to build AI products and apply them effectively in business.”
Kesavan said that advisors from reputed institutions such as Harvard, Johns Hopkins, the University of Chicago, and industry professionals from global companies will provide guidance to students at the lab. Alongside student training, BITSoM also plans to run reskilling programmes for working professionals, extending its impact beyond the campus.
AI Research
AI grading issue affects hundreds of MCAS essays in Mass.

The use of artificial intelligence to score statewide standardized tests resulted in errors that affected hundreds of exams, the NBC10 Investigators have learned.
The issue with the Massachusetts Comprehensive Assessment System (MCAS) surfaced over the summer, when preliminary results for the exams were distributed to districts.
The state’s testing contractor, Cognia, found roughly 1,400 essays did not receive the correct scores, according to a spokesperson with the Department of Elementary and Secondary Education.
DESE told NBC10 Boston all the essays were rescored, affected districts received notification, and all their data was corrected in August.
So how did humans detect the problem?
We found one example in Lowell. It turns out an alert teacher at Reilly Elementary School was reading through her third-grade students’ essays over the summer. When the teacher looked up the scores some of the students received, something did not add up.
The teacher notified the school principal, who then flagged the issue with district leaders.
“We were on alert that there could be a learning curve with AI,” said Wendy Crocker-Roberge, an assistant superintendent in the Lowell school district.
AI essay scoring works by using human-scored exemplars of what essays at each score point look like, according to DESE.
The AI tool uses that information to score the essays. In addition, humans give 10% of the AI-scored essays a second read and compare their scores with the AI scores to make sure there aren’t discrepancies. AI scoring was used for the same number of essays in 2025 as in 2024, DESE said.
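As a rough sketch of what that 10% second-read audit could look like in practice: the sampling rate comes from DESE's description, but the function names, score scale, and zero-tolerance comparison below are hypothetical.

```python
# Hypothetical audit sketch: sample 10% of AI-scored essays for a human
# second read, then flag score disagreements. Not Cognia's actual process.
import random

def sample_for_second_read(essay_ids, fraction=0.10, seed=42):
    """Randomly pick a fraction of AI-scored essays for human rescoring."""
    rng = random.Random(seed)
    k = max(1, round(len(essay_ids) * fraction))
    return rng.sample(essay_ids, k)

def flag_discrepancies(ai_scores, human_scores, tolerance=0):
    """Return essay IDs where human and AI scores differ beyond tolerance."""
    return [eid for eid, human in human_scores.items()
            if abs(ai_scores[eid] - human) > tolerance]

ai_scores = {"essay_1": 6, "essay_2": 0, "essay_3": 5, "essay_4": 7}
sampled = sample_for_second_read(list(ai_scores), fraction=0.5)
human_scores = {eid: 6 for eid in sampled}  # stand-in human reads
print(flag_discrepancies(ai_scores, human_scores))
```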
Crocker-Roberge said she decided to read about 1,000 essays in Lowell, but it was tough to pinpoint the exact reason some students did not receive proper credit.
However, it was clear the AI technology was deducting points without justification. For instance, Crocker-Roberge said she noticed that some essays lost a point for not using quotation marks when referencing a passage from the reading excerpt.
“We could not understand why an individual score was scored a zero when it should have gotten six out of seven points,” Crocker-Roberge said. “There just wasn’t any rhyme or reason to that.”
District leaders notified DESE about the problem, which resulted in approximately 1,400 essays being rescored. The state agency says the scoring problem was the result of a “temporary technical issue in the process.”
According to DESE, 145 districts were notified that they had at least one student essay that was not scored correctly.
“As one way of checking that MCAS scores are accurate, DESE releases preliminary MCAS results to districts and gives them time to report any issues during a discrepancy period each year,” a DESE spokesperson wrote in a statement.
Mary Tamer, the executive director of MassPotential, an organization that advocates for educational improvement, said there are many positives to using AI, including returning scores to school districts faster so appropriate action can be taken. For instance, test results can help identify a child in need of intervention or flag a lesson plan that did not seem to resonate with students.
“I think there’s a lot of benefits that outweigh the risks,” said Tamer. “But again, no system is perfect and that’s true for AI. The work always has to be double-checked.”
DESE pointed out the affected exams represent a small percentage of the roughly 750,000 MCAS essays statewide.
However, in districts like Lowell, there are certain schools tracked by DESE to ensure progress is being made and performance standards are met.
That’s why Crocker-Roberge said every score counts.
With MCAS results expected to be released to parents in the coming weeks, the assistant superintendent is encouraging other districts to do a deep dive on their student essays to make sure they don’t notice any scoring discrepancies.
“I think we have to always proceed with caution when we’re introducing new tools and techniques,” Crocker-Roberge said. “Artificial intelligence is just a really new learning curve for everyone, so proceed with caution.”