
AI Research

Teaching AI to admit uncertainty


In high-stakes situations like health care—or weeknight Jeopardy!—it can be safer to say “I don’t know” than to answer incorrectly. Doctors, game show contestants, and standardized test-takers understand this, but most artificial intelligence applications still prefer to give a potentially wrong answer rather than admit uncertainty.

Johns Hopkins computer scientists think they have a solution: a new method that allows AI models to spend more time thinking through problems and uses a confidence score to determine when the AI should say “I don’t know” rather than risking a wrong answer—crucial for high-stakes domains like medicine, law, or engineering.

The research team will present its findings at the 63rd Annual Meeting of the Association for Computational Linguistics, to be held July 27 through Aug. 1 in Vienna, Austria.


“It all started when we saw that cutting-edge large language models spend more time thinking to solve harder problems. So we wondered—can this additional thinking time also help these models determine whether or not a problem has been solved correctly so they can report that back to the user?” says first author William Jurayj, a PhD student studying computer science who is affiliated with the Whiting School of Engineering’s Center for Language and Speech Processing.

To investigate, the team had large language models generate reasoning chains of different lengths as they answered difficult math problems and then measured how the chain length affected both the model’s final answer and its confidence in it. The researchers had the models answer only when their confidence exceeded a given threshold—meaning “I don’t know” was an acceptable response.
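
In rough outline, that decision rule can be sketched as follows. This is a minimal illustration rather than the authors' code; the `ask_model` callable is a hypothetical stand-in for a reasoning model that returns an answer along with a self-reported confidence between 0 and 1.

```python
from typing import Callable, Tuple

def answer_or_abstain(
    question: str,
    ask_model: Callable[[str, int], Tuple[str, float]],  # hypothetical model call
    threshold: float,
    max_thinking_tokens: int,
) -> str:
    """Return the model's answer only if its confidence clears the threshold."""
    # The model spends up to `max_thinking_tokens` on a reasoning chain and
    # reports an answer together with a confidence score in [0, 1].
    answer, confidence = ask_model(question, max_thinking_tokens)
    return answer if confidence >= threshold else "I don't know"
```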

They found that more thinking generally improves models’ accuracy and confidence. But even with plenty of time to consider, models can still make wild guesses or give wrong answers, especially without penalties for incorrect responses. In fact, the researchers found that when they set a high bar for confidence and let models think for even longer, the models’ overall performance actually decreased.

“This happens because answer accuracy is only part of a system’s performance,” Jurayj explains. “When you demand high confidence, letting the system think longer means it will provide more correct answers and more incorrect answers. In some settings, the extra correct answers are worth the risk. But in other, high-stakes environments, this might not be the case.”

Motivated by this finding, the team suggested three “odds” settings that penalize wrong answers to different degrees: exam odds, where there’s no penalty for an incorrect answer; Jeopardy! odds, where correct answers are rewarded at the same rate that incorrect ones are penalized; and high-stakes odds, where an incorrect answer is penalized far more heavily than a correct answer is rewarded.
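
Written out as simple scoring rules, the three settings look roughly like the sketch below. The exact penalty sizes, such as the tenfold penalty used here for high-stakes odds, are illustrative assumptions rather than values reported by the researchers.

```python
def score(answered: bool, correct: bool, odds: str) -> float:
    """Score a single response under the three odds settings described above."""
    if not answered:           # "I don't know" earns zero in every setting
        return 0.0
    if odds == "exam":         # no penalty for an incorrect answer
        return 1.0 if correct else 0.0
    if odds == "jeopardy":     # wrong answers cost as much as right ones earn
        return 1.0 if correct else -1.0
    if odds == "high_stakes":  # wrong answers hurt far more than right ones help
        return 1.0 if correct else -10.0  # the tenfold penalty is illustrative only
    raise ValueError(f"unknown odds setting: {odds!r}")
```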

They found that under stricter odds, a model should decline to answer a question if it isn’t confident enough in its answer after expending its compute budget. And at higher confidence thresholds, this will mean that more questions go unanswered—but that isn’t necessarily a bad thing.
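
The break-even confidence follows directly from the odds: answering beats abstaining only when the expected gain from a correct answer outweighs the expected loss from a wrong one, which works out to a minimum confidence of the penalty divided by the reward plus the penalty. A quick check of that arithmetic, again as an illustrative sketch with an assumed high-stakes penalty:

```python
def min_confidence_to_answer(reward: float, penalty: float) -> float:
    """Answering beats abstaining (score 0) when p * reward - (1 - p) * penalty > 0,
    which rearranges to p > penalty / (reward + penalty)."""
    return penalty / (reward + penalty)

print(min_confidence_to_answer(1, 0))   # exam odds: 0.0, so guessing never hurts
print(min_confidence_to_answer(1, 1))   # Jeopardy! odds: 0.5
print(min_confidence_to_answer(1, 10))  # high-stakes odds (illustrative): ~0.91
```

Under Jeopardy! odds the break-even point is 50 percent confidence; the harsher the penalty, the closer the threshold creeps toward certainty.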

“A student might be mildly annoyed to wait 10 minutes only to find out that she needs to solve a math problem herself because the AI model is unsure,” Jurayj says. “But in high-stakes environments, this is infinitely preferable to waiting five minutes for an answer that looks correct but is not.”

Now, the team is encouraging the broader AI research community to report their models’ question-answering performance under exam and Jeopardy! odds so that everyone can benefit from AI with better-calibrated confidence.

“We hope the research community will accept our invitation to report performance in settings with non-zero costs for incorrect answers, as this will naturally motivate the development of better methods for uncertainty quantification,” says Jurayj.

Additional authors of this work include graduate student Jeffrey Cheng and Benjamin Van Durme, an associate professor of computer science affiliated with CLSP and the Human Language Technology Center of Excellence.




AI Research

Cognigy Leads in Opus Research’s 2025 Conversational AI Intelliview



Distinguished for Innovation, Enterprise Readiness, and Visionary Approach to Agentic AI

Cognigy, a global leader in AI-powered customer service solutions, has been recognized as the leader in the newly released 2025 Conversational AI Intelliview from Opus Research. The report, titled “Decision-Maker’s Guide to Self-Service & Enterprise Intelligent Assistants,” positions Cognigy as the leading platform across critical evaluation areas, including product capability, enterprise fit, GenAI maturity, and deployment performance.

This recognition underscores Cognigy’s commitment to empowering enterprises with production-ready, scalable AI solutions that go far beyond chatbot basics. The report cites Cognigy’s strengths in visual AI agent orchestration, tool and function calling, AI Ops and observability, and a deep commitment to enterprise-grade control—all delivered through a platform built to scale real-time customer interactions across voice and digital channels.

“Cognigy exemplifies the next stage of conversational AI maturity,” said Ian Jacobs, VP & Lead Analyst at Opus Research. “Their agentic approach—combining real-time reasoning, orchestration, and observability—demonstrates how GenAI can move beyond experimentation into meaningful, measurable transformation in the contact center.”

Cognigy was one of the few vendors identified in the report as a “True Believer” in the evolution of GenAI-driven self-service, with tools designed to simplify deployment while giving enterprises full control. The platform’s AI Agent Manager enables businesses to create, configure, and continuously improve intelligent agents—defining persona, memory scope, and access to tools and knowledge—all through a flexible, low-code interface. Cognigy uniquely blends deterministic logic with generative capabilities, ensuring both speed and reliability in automation.

“This recognition from Opus Research is more than a milestone—it’s validation that our strategy is working,” said Alan Ranger, Vice President at Cognigy. “We’re delivering real-world, enterprise-grade automation that’s transforming contact centers. From financial services to healthcare to global retail, our customers are scaling faster, resolving issues in real time, and delivering truly modern service experiences.”

With global Fortune 500 customers and partnerships across the CCaaS and AI ecosystem, Cognigy continues to lead the way in delivering enterprise-ready AI that combines usability, speed, and impact. This latest industry acknowledgment further solidifies its position as the go-to platform for intelligent self-service.

To download a copy of the report, visit https://www.cognigy.com/opus-research-2025-conversational-ai-intelliview.




AI Research

MIT researchers say using ChatGPT can rot your brain, truth is little more complicated – The Economic Times


AI Research

Frontiers broadens AI‑driven integrity checks with dual integration




Frontiers has announced that external fraud‑screening tools – Cactus Communications’ Paperpal Preflight and Clear Skies’ Papermill Alarm and Oversight – have been integrated into its own Artificial Intelligence Review Assistant (AIRA) submission-screening system.

The expansion delivers what the companies describe as “an unprecedented, multilayered defence against organised research fraud, strengthening the reliability and integrity of every manuscript submitted to Frontiers”.

AIRA was launched in 2018, making Frontiers one of the early adopters of AI in submission checking. In 2022, Frontiers added its own papermill check to its comprehensive catalogue of AIRA checks, with the aim of tackling the industry-wide problem of manufactured manuscripts. The latest version, released in 2025, draws on more than 15 data points and signals to flag potentially manufactured manuscripts, which are then investigated and validated by a human expert.

Dr Elena Vicario, Head of Research Integrity at Frontiers, said: “Maintaining trust in the scholarly record demands constant innovation. By combining the unique strengths of Clear Skies and Cactus with our own AI capabilities, we are raising the bar for integrity screening and giving editors and reviewers the confidence that every submission has been rigorously vetted.”

Commenting on the importance of the partnership, Nikesh Gosalia, President, Global Academic and Publisher Relations at Cactus Communications, said: “This partnership with Frontiers reflects the confidence leading publishers have in our AI-driven solutions. Paperpal Preflight is a vital tool that supports editorial teams and existing homegrown solutions in identifying and addressing potential issues early in the publishing workflow.

“As one of the world’s largest and most impactful research publishers, Frontiers is taking an important step in strengthening research integrity, and we are proud to collaborate with them in this mission of safeguarding research.”

Adam Day, Founder and CEO of Clear Skies, added: “Clear Skies is thrilled to be working with the innovative team at Frontiers to integrate AIRA with Oversight. This integration makes our multi-award-winning services, including the Papermill Alarm, available across the Frontiers portfolio.

“Oversight is the first index of research integrity and recipient of the inaugural EPIC Award for integrity tools from the Society for Scholarly Publishing (SSP). As well as providing strategic Oversight to publishers, our detailed article reports support human Oversight of research integrity investigations on publications as well as journal submissions.”



