Connect with us

AI Research

Popular AI model performance benchmark may be flawed, Meta researchers warn

Published

on


A popular benchmark for measuring the performance of artificial intelligence models could be flawed, a group of Meta Platforms researchers warned, raising fresh questions on the veracity of evaluations that have been made on major AI systems.
“We’ve identified multiple loopholes with SWE-bench Verified,” wrote Jacob Kahn, manager at Meta AI research lab Fair, in a post last week on the developer platform GitHub.
The post from Fair, which stands for Fundamental AI Research, found several prominent AI models – including Anthropic’s Claude and Alibaba Cloud’s Qwen – had “cheated” on SWE-bench Verified. Alibaba Cloud is the AI and cloud computing services unit of Alibaba Group Holding, owner of the South China Morning Post.
OpenAI-backed SWE-bench Verified, a human-validated subset of the large language model benchmark SWE-bench, evaluates AI models based on how these systems fix hundreds of real-world software issues collected from GitHub, a Microsoft subsidiary.

Fair’s post, however, claimed that models evaluated using SWE-bench Verified directly searched for known solutions shared elsewhere on the GitHub platform and passed them off as their own, instead of using their built-in coding capabilities to fix the issues.

The AI models found to have shown such behaviour included Anthropic’s Claude 4 Sonnet, Z.ai’s GLM-4.5 and Alibaba Cloud’s Qwen3-Coder-30B-A3B – with official scores of 70.4 per cent, 64.2 per cent and 51.6 per cent, respectively, on SWE-bench Verified.

“We’re still assessing [the] broader impact on evaluations and understanding trajectories for sources of leakage,” Kahn wrote.



Source link

AI Research

A Unified Model for Robot Interaction, Reasoning and Planning


View a PDF of the paper titled Robix: A Unified Model for Robot Interaction, Reasoning and Planning, by Huang Fang and 8 other authors

View PDF

Abstract:We introduce Robix, a unified model that integrates robot reasoning, task planning, and natural language interaction within a single vision-language architecture. Acting as the high-level cognitive layer in a hierarchical robot system, Robix dynamically generates atomic commands for the low-level controller and verbal responses for human interaction, enabling robots to follow complex instructions, plan long-horizon tasks, and interact naturally with human within an end-to-end framework. Robix further introduces novel capabilities such as proactive dialogue, real-time interruption handling, and context-aware commonsense reasoning during task execution. At its core, Robix leverages chain-of-thought reasoning and adopts a three-stage training strategy: (1) continued pretraining to enhance foundational embodied reasoning abilities including 3D spatial understanding, visual grounding, and task-centric reasoning; (2) supervised finetuning to model human-robot interaction and task planning as a unified reasoning-action sequence; and (3) reinforcement learning to improve reasoning-action consistency and long-horizon task coherence. Extensive experiments demonstrate that Robix outperforms both open-source and commercial baselines (e.g., GPT-4o and Gemini 2.5 Pro) in interactive task execution, demonstrating strong generalization across diverse instruction types (e.g., open-ended, multi-stage, constrained, invalid, and interrupted) and various user-involved tasks such as table bussing, grocery shopping, and dietary filtering.

Submission history

From: Wei Li [view email]
[v1]
Mon, 1 Sep 2025 03:53:47 UTC (29,592 KB)
[v2]
Thu, 11 Sep 2025 12:40:54 UTC (29,592 KB)



Source link

Continue Reading

AI Research

Brown awarded $20 million to lead artificial intelligence research institute aimed at mental health support

Published

on


A $20 million grant from the National Science Foundation will support the new AI Research Institute on Interaction for AI Assistants, called ARIA, based at Brown to study human-artificial intelligence interactions and mental health. The initiative, announced in July, aims to help develop AI support for mental and behavioral health. 

“The reason we’re focusing on mental health is because we think this represents a lot of the really big, really hard problems that current AI can’t handle,” said Associate Professor of Computer Science and Cognitive and Psychological Sciences Ellie Pavlick, who will lead ARIA. After viewing news stories about AI chatbots’ damage to users’ mental health, Pavlick sees renewed urgency in asking, “What do we actually want from AI?”

The initiative is part of a bigger investment from the NSF to support the goals of the White House’s AI Action Plan, according to a NSF press release. This “public-private investment,” the press release says, will “sustain and enhance America’s global AI dominance.”

According to Pavlick, she and her fellow researchers submitted the proposal for ARIA “years ago, long before the administration change,” but the response was “very delayed” due to “a lot of uncertainty at (the) NSF.” 

One of these collaborators was Michael Frank, the director of the Center for Computational Brain Science at the Carney Institute and a professor of psychology. 

Frank, who was already working with Pavlick on projects related to AI and human learning, said that the goal is to tie together collaborations of members from different fields “more systematically and more broadly.”

According to Roman Feiman, an assistant professor of cognitive and psychological sciences and linguistics and another member of the ARIA team, the goal of the initiative is to “develop better virtual assistants.” But that goal includes various obstacles to ensure the machines “treat humans well,” behave ethically and remain controllable. 

Within the study, some “people work basic cognitive neuroscience, other people work more on human machine interaction (and) other people work more on policy and society,” Pavlick explained. 

Although the ARIA team consists of many faculty and students at Brown, according to Pavlick, other institutions like Carnegie Mellon University, University of New Mexico and Dartmouth are also involved. On top of “basic science” research, ARIA’s research also examines the best practices for patient safety and the legal implications of AI. 

“As everybody currently knows, people are relying on (large language models) a lot, and I think many people who rely on them don’t really know how best to use them, and don’t entirely understand their limitations,” Feiman said.

According to Frank, the goal is not to “replace human therapists,” but rather to assist them.

Assistant Professor of the Practice of Computer Science and Philosophy Julia Netter, who studies the ethics of technology and responsible computing and is not involved in ARIA, said that ARIA has “the right approach.” 

Netter said ARIA approach differs from previous research “in that it really tried to bring in experts from other areas, people who know about mental health” and others, rather than those who focus solely on computer science.

But the ethics of using AI in a mental health context is a “tricky question,” she added.

“This is an area that touches people at a point in time when they are very, very vulnerable,” Netter said, adding that any interventions that arise from this research should be “well-tested.” 

“You’re touching an area of a person’s life that really has the potential of making a huge difference, positive or negative,” she added.

Because AI is “not going anywhere,” Frank said he is excited to “understand and control it in ways that are used for good.”

“My hope is that there will be a shift from just trying stuff and seeing what gets a better product,” Feiman said. “I think there’s real potential for scientific enterprise — not just a profit-making enterprise — of figuring out what is actually the best way to use these things to improve people’s lives.”

Get The Herald delivered to your inbox daily.



Source link

Continue Reading

AI Research

BITSoM launches AI research and innovation lab to shape future leaders

Published

on


Mumbai: The BITS School of Management (BITSoM), under the aegis of BITS Pilani, a leading private university, will inaugurate its new BITSoM Research in AI and Innovation (BRAIN) Lab in its Kalyan Campus on Friday. The lab is designed to prepare future leaders for workplaces transformed by artificial intelligence, on Friday on its Kalyan campus.

BITSoM launches AI research and innovation lab to shape future leaders

While explaining the concept of the laboratory, professor Saravanan Kesavan, dean of BITSoM, said that the BRAIN Lab had three core pillars–teaching, research, and outreach. Kesavan said, “It provides MBA (masters in business administration) students a dedicated space equipped with high-performance AI computers capable of handling tasks such as computer vision and large-scale data analysis. Students will not only learn about AI concepts in theory but also experiment with real-world applications.” Kesavan added that each graduating student would be expected to develop an AI product as part of their coursework, giving them first-hand experience in innovation and problem-solving.

The BRAIN lab is also designed to be a hub of collaboration where researchers can conduct projects in partnership with various companies and industries, creating a repository of practical AI tools to use. Kesavan said, “The initial focus areas (of the lab) include manufacturing, healthcare, banking and financial services, and Global Capability Centres (subsidiaries of multinational corporations that perform specialised functions).” He added that the case studies and research from the lab will be made freely available to schools, colleges, researchers, and corporate partners, ensuring that the benefits of the lab reach beyond the BITSoM campus.

BITSoM also plans to use the BRAIN Lab as a launchpad for startups. An AI programme will support entrepreneurs in developing solutions as per their needs while connecting them to venture capital networks in India and Silicon Valley. This will give young companies the chance to refine their ideas with guidance from both academics and industry leaders.

The centre’s physical setup resembles a modern computer lab, with dedicated workspaces, collaborative meeting rooms, and brainstorming zones. It has been designed to encourage creativity, allowing students to visualise how AI works, customise tools for different industries, and allow their technical capabilities to translate into business impacts.

In the context of a global workplace that is embracing AI, Kesavan said, “Future leaders need to understand not just how to manage people but also how to manage a workforce that combines humans and AI agents. Our goal is to ensure every student graduating from BITSoM is equipped with the skills to build AI products and apply them effectively in business.”

Kesavan said that advisors from reputed institutions such as Harvard, Johns Hopkins, the University of Chicago, and industry professionals from global companies will provide guidance to students at the lab. Alongside student training, BITSoM also plans to run reskilling programmes for working professionals, extending its impact beyond the campus.



Source link

Continue Reading

Trending