AI Research
Can AI optimize building retrofits? Research shows promise in CO₂ reduction but gaps in economic reasoning

Researchers from Michigan State University have conducted one of the first systematic evaluations of large language models (LLMs) in the domain of building energy retrofits, where decisions on upgrades such as insulation, heat pumps, and electrification can directly impact energy savings and carbon reduction.
The study, titled “Can AI Make Energy Retrofit Decisions? An Evaluation of Large Language Models,” published on arXiv, examines whether LLMs can reliably guide retrofit decision-making across diverse U.S. housing stock. It addresses the limitations of conventional methods, which are often too technical, data-heavy, or opaque for practical adoption, particularly at large scale.
How accurate are AI models in selecting retrofit measures?
The researchers tested seven widely used LLMs (ChatGPT o1, ChatGPT o3, DeepSeek R1, Grok 3, Gemini 2.0, Llama 3.2, and Claude 3.7) on a dataset of 400 homes drawn from 49 states. Each home profile included details such as construction vintage, floor area, insulation levels, heating and cooling systems, and occupant patterns. The models were asked to recommend retrofit measures under two separate objectives: maximizing carbon dioxide reduction (technical context) and minimizing payback period (sociotechnical context).
The analysis found that LLMs delivered effective results in technical optimization tasks. Top-1 accuracy (the single best measure) reached 54.5 percent, and Top-5 accuracy (the correct measure appearing among the top five recommendations) reached as high as 92.8 percent, even without fine-tuning. This reflects the models’ ability to align with physics-based benchmarks in scenarios where clear engineering goals, such as cutting carbon emissions, are prioritized.
On the other hand, when the focus shifted to minimizing payback period, results weakened substantially. Top-1 accuracy fell as low as 6.5 percent in some models, with only Gemini 2.0 surpassing 50 percent at the broader Top-5 threshold. The study concludes that economic trade-offs, which require balancing upfront investment against long-term savings, remain difficult for LLMs to interpret accurately.
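To make the Top-1 and Top-5 figures concrete, here is a minimal sketch, not drawn from the paper, of how such accuracy could be computed. The data structures are illustrative assumptions: each home pairs a physics-based benchmark-optimal measure with an LLM's ranked recommendation list.

```python
# Minimal sketch (not the study's code) of Top-k accuracy for retrofit
# recommendations. "benchmark_best" and "llm_ranking" are hypothetical fields.

def top_k_accuracy(homes, k):
    """Fraction of homes where the benchmark-optimal retrofit measure
    appears among the LLM's top-k recommendations."""
    hits = sum(home["benchmark_best"] in home["llm_ranking"][:k] for home in homes)
    return hits / len(homes)

homes = [
    {"benchmark_best": "heat_pump",
     "llm_ranking": ["heat_pump", "attic_insulation", "windows"]},
    {"benchmark_best": "attic_insulation",
     "llm_ranking": ["windows", "attic_insulation", "heat_pump"]},
]

print(f"Top-1: {top_k_accuracy(homes, 1):.1%}")  # 50.0% on this toy data
print(f"Top-5: {top_k_accuracy(homes, 5):.1%}")  # 100.0% on this toy data
```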
How consistent and reliable are AI-generated decisions?
The study also examined whether different LLMs converged on the same recommendations. Here, performance was less encouraging. Consistency between models was low, and in some cases their agreement was worse than chance. Interestingly, the models that performed best in terms of accuracy, such as ChatGPT o3 and Gemini 2.0, were also the ones most likely to diverge from other systems. This indicates that while some models may excel, they do not necessarily produce results that align with peers, creating challenges for standardization in real-world applications.
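The article does not name the agreement statistic, but Cohen's kappa is a standard way to express "worse than chance" agreement: it is positive when two models agree more often than their label frequencies would predict, and negative when they agree less. A sketch with invented recommendations, purely to illustrate the metric:

```python
# Illustrative only: pairwise inter-model agreement via Cohen's kappa.
# The model outputs below are invented; the study's statistic is unspecified.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement implied by each model's label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical top recommendations from two models on five homes.
model_x = ["heat_pump", "windows", "heat_pump", "insulation", "windows"]
model_y = ["insulation", "heat_pump", "windows", "heat_pump", "insulation"]

print(f"kappa = {cohen_kappa(model_x, model_y):.2f}")  # -0.47: worse than chance
```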
The findings underscore the difficulty of relying on AI for high-stakes energy decisions when consensus is lacking. In practice, building owners, policymakers, and utility companies require not just accurate but also consistent recommendations. Low inter-model reliability highlights the importance of developing frameworks that validate and harmonize AI outputs before they can be integrated into large-scale retrofit programs.
What shapes AI reasoning in retrofit decisions?
The researchers also explored how LLMs arrive at their decisions. Sensitivity analysis showed that most models, like physics-based baselines, prioritized location and building geometry. Variables such as county, state, and floor space were consistently weighted as the most influential factors. However, the models paid less attention to occupant behaviors and technology choices, even though these can be critical in shaping real-world outcomes.
The reasoning patterns offered further insight. Among the tested systems, ChatGPT o3 and DeepSeek R1 provided the most structured, step-by-step explanations. Their workflows followed an engineering-like logic, beginning with baseline energy assumptions, adjusting for envelope improvements, calculating system efficiency, incorporating appliance impacts, and finally comparing outcomes. Yet, while the logic mirrored engineering principles, it was often simplified, overlooking nuanced contextual dependencies such as occupant usage levels or detailed climate variations.
The authors also noted that prompt design played a key role in outcomes. Slight adjustments in how questions were phrased could significantly shift model reasoning. For example, if not explicitly instructed to consider both upfront cost and energy savings, some models defaulted to choosing the lowest-cost option when evaluating payback. This sensitivity suggests that successful deployment of AI in retrofit contexts will depend heavily on careful prompt engineering and domain-specific adaptation.
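To illustrate that sensitivity, the sketch below contrasts two hypothetical prompt phrasings for the payback objective. The study's actual prompts are not reproduced in the article, so the home profile and both variants are assumptions.

```python
# Hypothetical prompt variants; neither is taken from the paper.
home_profile = "1978 single-family home, 1,600 sq ft, gas furnace, R-11 walls"

# Underspecified: per the article, some models default to the cheapest measure.
prompt_v1 = (
    f"Given this home: {home_profile}. "
    "Recommend the retrofit measure with the shortest payback period."
)

# Explicit: spells out that payback balances upfront cost against savings.
prompt_v2 = (
    f"Given this home: {home_profile}. "
    "Recommend the retrofit measure with the shortest payback period, "
    "computed as upfront installation cost divided by annual energy cost "
    "savings. Weigh both cost and savings, not cost alone."
)
```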
A cautious but forward-looking conclusion
The evaluation highlights both the promise and the limitations of current LLMs in building energy retrofits. On one hand, the ability to achieve near 93 percent alignment with top retrofit measures in technical contexts shows significant potential for AI to streamline decision-making and improve energy efficiency strategies. On the other, weak performance in sociotechnical trade-offs, low inter-model consistency, and simplified reasoning demonstrate that these tools are not yet ready to replace domain expertise.
The authors conclude that LLMs can complement, but not substitute for, traditional methods and expert judgment in retrofit planning. They recommend further development of domain-specific models, fine-tuning with validated datasets, and hybrid approaches that integrate AI with physics-based simulations to ensure accuracy and traceability.
For policymakers and practitioners, the study provides an important benchmark: AI can indeed assist in advancing retrofit strategies, especially for carbon reduction, but its current shortcomings demand careful oversight. As cities and communities push toward energy transition goals, ensuring that AI systems are transparent, consistent, and context-aware will be essential before they can be deployed at scale.
AI Research
Brown awarded $20 million to lead artificial intelligence research institute aimed at mental health support

A $20 million grant from the National Science Foundation will support the new AI Research Institute on Interaction for AI Assistants (ARIA), based at Brown, which will study human-AI interaction and mental health. The initiative, announced in July, aims to help develop AI support for mental and behavioral health.
“The reason we’re focusing on mental health is because we think this represents a lot of the really big, really hard problems that current AI can’t handle,” said Associate Professor of Computer Science and Cognitive and Psychological Sciences Ellie Pavlick, who will lead ARIA. After viewing news stories about AI chatbots’ damage to users’ mental health, Pavlick sees renewed urgency in asking, “What do we actually want from AI?”
The initiative is part of a broader NSF investment to support the goals of the White House’s AI Action Plan, according to an NSF press release. This “public-private investment,” the press release says, will “sustain and enhance America’s global AI dominance.”
According to Pavlick, she and her fellow researchers submitted the proposal for ARIA “years ago, long before the administration change,” but the response was “very delayed” due to “a lot of uncertainty at (the) NSF.”
One of these collaborators was Michael Frank, the director of the Center for Computational Brain Science at the Carney Institute and a professor of psychology.
Frank, who was already working with Pavlick on projects related to AI and human learning, said that the goal is to tie together collaborations of members from different fields “more systematically and more broadly.”
According to Roman Feiman, an assistant professor of cognitive and psychological sciences and linguistics and another member of the ARIA team, the goal of the initiative is to “develop better virtual assistants.” But that goal includes various obstacles to ensure the machines “treat humans well,” behave ethically and remain controllable.
Within the study, some “people work (on) basic cognitive neuroscience, other people work more on human machine interaction (and) other people work more on policy and society,” Pavlick explained.
Although the ARIA team consists of many faculty and students at Brown, other institutions, including Carnegie Mellon University, the University of New Mexico and Dartmouth, are also involved, according to Pavlick. On top of “basic science” research, ARIA also examines the best practices for patient safety and the legal implications of AI.
“As everybody currently knows, people are relying on (large language models) a lot, and I think many people who rely on them don’t really know how best to use them, and don’t entirely understand their limitations,” Feiman said.
According to Frank, the goal is not to “replace human therapists,” but rather to assist them.
Assistant Professor of the Practice of Computer Science and Philosophy Julia Netter, who studies the ethics of technology and responsible computing and is not involved in ARIA, said that ARIA has “the right approach.”
Netter said ARIA’s approach differs from previous research “in that it really tried to bring in experts from other areas, people who know about mental health” and others, rather than those who focus solely on computer science.
But the ethics of using AI in a mental health context is a “tricky question,” she added.
“This is an area that touches people at a point in time when they are very, very vulnerable,” Netter said, adding that any interventions that arise from this research should be “well-tested.”
“You’re touching an area of a person’s life that really has the potential of making a huge difference, positive or negative,” she added.
Because AI is “not going anywhere,” Frank said he is excited to “understand and control it in ways that are used for good.”
“My hope is that there will be a shift from just trying stuff and seeing what gets a better product,” Feiman said. “I think there’s real potential for scientific enterprise — not just a profit-making enterprise — of figuring out what is actually the best way to use these things to improve people’s lives.”
AI Research
BITSoM launches AI research and innovation lab to shape future leaders

Mumbai: The BITS School of Management (BITSoM), under the aegis of BITS Pilani, a leading private university, will inaugurate its new BITSoM Research in AI and Innovation (BRAIN) Lab on its Kalyan campus on Friday. The lab is designed to prepare future leaders for workplaces transformed by artificial intelligence.
While explaining the concept of the laboratory, professor Saravanan Kesavan, dean of BITSoM, said that the BRAIN Lab had three core pillars: teaching, research, and outreach. Kesavan said, “It provides MBA (Master of Business Administration) students a dedicated space equipped with high-performance AI computers capable of handling tasks such as computer vision and large-scale data analysis. Students will not only learn about AI concepts in theory but also experiment with real-world applications.” Kesavan added that each graduating student would be expected to develop an AI product as part of their coursework, giving them first-hand experience in innovation and problem-solving.
The BRAIN Lab is also designed to be a hub of collaboration where researchers can conduct projects in partnership with companies across industries, creating a repository of practical AI tools. Kesavan said, “The initial focus areas (of the lab) include manufacturing, healthcare, banking and financial services, and Global Capability Centres (subsidiaries of multinational corporations that perform specialised functions).” He added that the case studies and research from the lab will be made freely available to schools, colleges, researchers, and corporate partners, ensuring that the benefits of the lab reach beyond the BITSoM campus.
BITSoM also plans to use the BRAIN Lab as a launchpad for startups. An AI programme will support entrepreneurs in developing solutions tailored to their needs while connecting them to venture capital networks in India and Silicon Valley. This will give young companies the chance to refine their ideas with guidance from both academics and industry leaders.
The centre’s physical setup resembles a modern computer lab, with dedicated workspaces, collaborative meeting rooms, and brainstorming zones. It has been designed to encourage creativity, allowing students to visualise how AI works, customise tools for different industries, and translate technical capabilities into business impact.
In the context of a global workplace that is embracing AI, Kesavan said, “Future leaders need to understand not just how to manage people but also how to manage a workforce that combines humans and AI agents. Our goal is to ensure every student graduating from BITSoM is equipped with the skills to build AI products and apply them effectively in business.”
Kesavan said that advisors from reputed institutions such as Harvard, Johns Hopkins, the University of Chicago, and industry professionals from global companies will provide guidance to students at the lab. Alongside student training, BITSoM also plans to run reskilling programmes for working professionals, extending its impact beyond the campus.
AI Research
AI grading issue affects hundreds of MCAS essays in Mass.

The use of artificial intelligence to score statewide standardized tests resulted in errors that affected hundreds of exams, the NBC10 Investigators have learned.
The issue with the Massachusetts Comprehensive Assessment System (MCAS) surfaced over the summer, when preliminary results for the exams were distributed to districts.
The state’s testing contractor, Cognia, found roughly 1,400 essays did not receive the correct scores, according to a spokesperson with the Department of Elementary and Secondary Education.
DESE told NBC10 Boston all the essays were rescored, affected districts received notification, and all their data was corrected in August.
So how did humans detect the problem?
We found one example in Lowell. It turns out an alert teacher at Reilly Elementary School was reading through her third-grade students’ essays over the summer. When the teacher looked up the scores some of the students received, something did not add up.
The teacher notified the school principal, who then flagged the issue with district leaders.
“We were on alert that there could be a learning curve with AI,” said Wendy Crocker-Roberge, an assistant superintendent in the Lowell school district.
AI essay scoring works by using human-scored exemplars of what essays at each score point look like, according to DESE.
The AI tool uses that information to score the essays. In addition, humans give 10% of the AI-scored essays a second read and compare their scores with the AI scores to make sure there aren’t discrepancies. AI scoring was used for the same number of essays in 2025 as in 2024, DESE said.
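As a rough sketch of what that 10% second-read audit could look like in practice: the sampling rate comes from DESE's description, but the function names, score scale, and zero-tolerance comparison below are hypothetical.

```python
# Hypothetical audit sketch: sample 10% of AI-scored essays for a human
# second read, then flag score disagreements. Not Cognia's actual process.
import random

def sample_for_second_read(essay_ids, fraction=0.10, seed=42):
    """Randomly pick a fraction of AI-scored essays for human rescoring."""
    rng = random.Random(seed)
    k = max(1, round(len(essay_ids) * fraction))
    return rng.sample(essay_ids, k)

def flag_discrepancies(ai_scores, human_scores, tolerance=0):
    """Return essay IDs where human and AI scores differ beyond tolerance."""
    return [eid for eid, human in human_scores.items()
            if abs(ai_scores[eid] - human) > tolerance]

ai_scores = {"essay_1": 6, "essay_2": 0, "essay_3": 5, "essay_4": 7}
sampled = sample_for_second_read(list(ai_scores), fraction=0.5)
human_scores = {eid: 6 for eid in sampled}  # stand-in human reads
print(flag_discrepancies(ai_scores, human_scores))
```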
Crocker-Roberge said she decided to read about 1,000 essays in Lowell, but it was tough to pinpoint the exact reason some students did not receive proper credit.
However, it was clear the AI technology was deducting points without justification. For instance, Crocker-Roberge said she noticed that some essays lost a point for not using quotation marks when referencing a passage from the reading excerpt.
“We could not understand why an individual score was scored a zero when it should have gotten six out of seven points,” Crocker-Roberge said. “There just wasn’t any rhyme or reason to that.”
District leaders notified DESE about the problem, which resulted in approximately 1,400 essays being rescored. The state agency says the scoring problem was the result of a “temporary technical issue in the process.”
According to DESE, 145 districts were notified that they had at least one student essay that was not scored correctly.
“As one way of checking that MCAS scores are accurate, DESE releases preliminary MCAS results to districts and gives them time to report any issues during a discrepancy period each year,” a DESE spokesperson wrote in a statement.
Mary Tamer, the executive director of MassPotential, an organization that advocates for educational improvement, said there are many positives to using AI, including returning scores to school districts faster so appropriate action can be taken. For instance, test results can help identify a child in need of intervention or flag a lesson plan that did not seem to resonate with students.
“I think there’s a lot of benefits that outweigh the risks,” said Tamer. “But again, no system is perfect and that’s true for AI. The work always has to be double-checked.”
DESE pointed out the affected exams represent a small percentage of the roughly 750,000 MCAS essays statewide.
However, in districts like Lowell, there are certain schools tracked by DESE to ensure progress is being made and performance standards are met.
That’s why Crocker-Roberge said every score counts.
With MCAS results expected to be released to parents in the coming weeks, the assistant superintendent is encouraging other districts to do a deep dive on their student essays to make sure they don’t notice any scoring discrepancies.
“I think we have to always proceed with caution when we’re introducing new tools and techniques,” Crocker-Roberge said. “Artificial intelligence is just a really new learning curve for everyone, so proceed with caution.”