AI Research

Popular AI model performance benchmark may be flawed, Meta researchers warn

Published

September 9, 2025

A popular benchmark for measuring the performance of artificial intelligence models could be flawed, a group of Meta Platforms researchers warned, raising fresh questions about the veracity of evaluations made of major AI systems.

“We’ve identified multiple loopholes with SWE-bench Verified,” wrote Jacob Kahn, manager at Meta AI research lab Fair, in a post last week on the developer platform GitHub.

The post from Fair, which stands for Fundamental AI Research, found several prominent AI models – including Anthropic’s Claude and Alibaba Cloud’s Qwen – had “cheated” on SWE-bench Verified. Alibaba Cloud is the AI and cloud computing services unit of Alibaba Group Holding, owner of the South China Morning Post.

OpenAI-backed SWE-bench Verified, a human-validated subset of the large language model benchmark SWE-bench, evaluates AI models based on how these systems fix hundreds of real-world software issues collected from GitHub, a Microsoft subsidiary.

Fair’s post, however, claimed that models evaluated using SWE-bench Verified directly searched for known solutions shared elsewhere on the GitHub platform and passed them off as their own, instead of using their built-in coding capabilities to fix the issues.

The AI models found to have shown such behaviour included Anthropic’s Claude 4 Sonnet, Z.ai’s GLM-4.5 and Alibaba Cloud’s Qwen3-Coder-30B-A3B – with official scores of 70.4 per cent, 64.2 per cent and 51.6 per cent, respectively, on SWE-bench Verified.

“We’re still assessing [the] broader impact on evaluations and understanding trajectories for sources of leakage,” Kahn wrote.
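As a rough illustration of what auditing trajectories for leakage might involve, the sketch below scans the shell commands an agent issued during a benchmark run and flags those that look like lookups of an existing solution. The trajectory format, the pattern list and the `flag_commands` helper are all assumptions made for illustration; none of them come from Fair's post or from the SWE-bench tooling.

```python
import re
from dataclasses import dataclass

# Hypothetical heuristics, chosen for illustration only.
SUSPICIOUS_PATTERNS = {
    # Searching the repository's commit history for a matching fix.
    "history_search": re.compile(r"\bgit\s+log\b.*(--all|--grep)"),
    # Fetching content directly from GitHub during the run.
    "remote_lookup": re.compile(r"github\.com/"),
}


@dataclass
class Flag:
    step: int      # index of the command within the trajectory
    reason: str    # which heuristic fired
    command: str   # the command text itself


def flag_commands(commands: list[str], issue_id: str) -> list[Flag]:
    """Return a flag for every command that matches a heuristic."""
    flags = []
    for step, command in enumerate(commands):
        for reason, pattern in SUSPICIOUS_PATTERNS.items():
            if pattern.search(command):
                flags.append(Flag(step, reason, command))
        # A command that names the issue under test may be looking up its fix.
        if issue_id in command:
            flags.append(Flag(step, "mentions_issue_id", command))
    return flags


# Made-up trajectory for demonstration; not real benchmark data.
example_commands = [
    "ls",
    "git log --all --grep='fix'",
    "python -m pytest",
    "curl https://github.com/org/repo/pull/123",
    "grep -r 12345 .",
]
example_flags = flag_commands(example_commands, issue_id="12345")
```

A flagged command is only a prompt for manual review: searching commit history can be legitimate debugging, so a heuristic like this would over-report rather than prove that a model copied a solution.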


Vincent Chow


AI Research

A Unified Model for Robot Interaction, Reasoning and Planning

Published

September 12, 2025

Huang Fang, Mengxi Zhang, Heng Dong, Wei Li, Zixuan Wang, Qifeng Zhang, Xueyun Tian, Yucheng Hu, Hang Li

[Submitted on 1 Sep 2025 (v1), last revised 11 Sep 2025 (this version, v2)]

Full paper title: Robix: A Unified Model for Robot Interaction, Reasoning and Planning

Abstract: We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Acting as the high-level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally with humans within an end-to-end framework. Robix further introduces novel capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning during task execution. At its core, Robix leverages chain-of-thought reasoning and adopts a three-stage training strategy: (1) continued pretraining to enhance foundational embodied reasoning abilities including 3D spatial understanding, visual grounding, and task-centric reasoning; (2) supervised finetuning to model human-robot interaction and task planning as a unified reasoning-action sequence; and (3) reinforcement learning to improve reasoning-action consistency and long-horizon task coherence. Extensive experiments demonstrate that Robix outperforms both open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in interactive task execution, demonstrating strong generalization across diverse instruction types (e.g., open-ended, multi-stage, constrained, invalid, and interrupted) and various user-involved tasks such as table bussing, grocery shopping, and dietary filtering.
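The abstract's description of a high-level layer that emits both atomic commands and verbal responses can be pictured with a minimal sketch. Everything here is an assumption for illustration: the `StepOutput` fields, the `run_step` helper and the callback signatures are hypothetical and are not taken from the Robix paper or its code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StepOutput:
    """One hypothetical output of the high-level model for a single step."""
    thought: str                # chain-of-thought text; never executed
    command: Optional[str]      # atomic command for the low-level controller
    response: Optional[str]     # verbal reply addressed to the human


def run_step(
    output: StepOutput,
    execute: Callable[[str], None],
    speak: Callable[[str], None],
) -> None:
    """Route one model output to the controller and/or the speech channel."""
    if output.command is not None:
        execute(output.command)
    if output.response is not None:
        speak(output.response)


# Stand-in callbacks that just record what they were asked to do.
executed: list[str] = []
spoken: list[str] = []

run_step(
    StepOutput("The cup looks used.", "pick up cup", "Clearing the cup now."),
    execute=executed.append,
    speak=spoken.append,
)
run_step(
    StepOutput("The request is ambiguous.", None, "Which plate do you mean?"),
    execute=executed.append,
    speak=spoken.append,
)
```

Making both fields optional lets a single step act without speaking, speak without acting (as in a clarifying question), or do both, which is one simple way to represent interaction and planning in a shared output format.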

Submission history

From: Wei Li
[v1] Mon, 1 Sep 2025 03:53:47 UTC (29,592 KB)
[v2] Thu, 11 Sep 2025 12:40:54 UTC (29,592 KB)


AI Research

Brown awarded $20 million to lead artificial intelligence research institute aimed at mental health support

Published

September 12, 2025

The Editors

A $20 million grant from the National Science Foundation will support the new AI Research Institute on Interaction for AI Assistants, called ARIA, based at Brown to study human-artificial intelligence interactions and mental health. The initiative, announced in July, aims to help develop AI support for mental and behavioral health.

“The reason we’re focusing on mental health is because we think this represents a lot of the really big, really hard problems that current AI can’t handle,” said Associate Professor of Computer Science and Cognitive and Psychological Sciences Ellie Pavlick, who will lead ARIA. After viewing news stories about AI chatbots’ damage to users’ mental health, Pavlick sees renewed urgency in asking, “What do we actually want from AI?”

The initiative is part of a bigger investment from the NSF to support the goals of the White House’s AI Action Plan, according to an NSF press release. This “public-private investment,” the press release says, will “sustain and enhance America’s global AI dominance.”

According to Pavlick, she and her fellow researchers submitted the proposal for ARIA “years ago, long before the administration change,” but the response was “very delayed” due to “a lot of uncertainty at (the) NSF.”

One of these collaborators was Michael Frank, the director of the Center for Computational Brain Science at the Carney Institute and a professor of psychology.

Frank, who was already working with Pavlick on projects related to AI and human learning, said that the goal is to tie together collaborations of members from different fields “more systematically and more broadly.”

According to Roman Feiman, an assistant professor of cognitive and psychological sciences and linguistics and another member of the ARIA team, the goal of the initiative is to “develop better virtual assistants.” But reaching that goal means overcoming various obstacles to ensure the machines “treat humans well,” behave ethically and remain controllable.

Within the study, some “people work basic cognitive neuroscience, other people work more on human machine interaction (and) other people work more on policy and society,” Pavlick explained.

Although the ARIA team consists of many faculty and students at Brown, other institutions like Carnegie Mellon University, University of New Mexico and Dartmouth are also involved, according to Pavlick. On top of “basic science” research, ARIA also examines the best practices for patient safety and the legal implications of AI.

“As everybody currently knows, people are relying on (large language models) a lot, and I think many people who rely on them don’t really know how best to use them, and don’t entirely understand their limitations,” Feiman said.

According to Frank, the goal is not to “replace human therapists,” but rather to assist them.

Assistant Professor of the Practice of Computer Science and Philosophy Julia Netter, who studies the ethics of technology and responsible computing and is not involved in ARIA, said that ARIA has “the right approach.”

Netter said ARIA’s approach differs from previous research “in that it really tried to bring in experts from other areas, people who know about mental health” and others, rather than those who focus solely on computer science.

But the ethics of using AI in a mental health context is a “tricky question,” she added.

“This is an area that touches people at a point in time when they are very, very vulnerable,” Netter said, adding that any interventions that arise from this research should be “well-tested.”

“You’re touching an area of a person’s life that really has the potential of making a huge difference, positive or negative,” she added.

Because AI is “not going anywhere,” Frank said he is excited to “understand and control it in ways that are used for good.”

“My hope is that there will be a shift from just trying stuff and seeing what gets a better product,” Feiman said. “I think there’s real potential for scientific enterprise — not just a profit-making enterprise — of figuring out what is actually the best way to use these things to improve people’s lives.”
