New Method for Evaluating AI Can Cut Costs, Improve Fairness

Insider Brief

  • Stanford University researchers have introduced a new evaluation method for large language models (LLMs) that reduces costs and enhances fairness by assigning difficulty scores to benchmark questions, according to a paper presented at the International Conference on Machine Learning.
  • Funded by the MacArthur Foundation, Stanford HAI, and Google Inc., the method uses Item Response Theory, a concept from standardized testing, to select adaptive question subsets that deliver more accurate comparisons across AI models while cutting evaluation costs by up to 80%.
  • Applied across 22 datasets and 172 models spanning medicine, mathematics, and law, the system improves the integrity of AI assessments by identifying and removing previously seen questions and allows for tracking model safety metrics over time, supporting more transparent and trustworthy AI development.

A new method for evaluating artificial intelligence models promises to cut costs and improve fairness, according to Stanford University researchers who developed the approach with funding from the MacArthur Foundation, Stanford HAI, and Google Inc.

Detailed in a paper presented at the International Conference on Machine Learning, the method introduces an adaptive question selection system that assesses the difficulty of benchmark questions to more accurately compare language model performance.

As AI developers release increasingly advanced language models, they often claim improved performance based on benchmark testing. These evaluations, which use large banks of test questions, typically require extensive human review, making them both time-consuming and expensive. According to the Stanford researchers, evaluation can cost as much as, or more than, training the model itself. Moreover, practical limitations force developers to use only subsets of questions, which can skew results if easier questions are overrepresented.

Led by computer science assistant professor Sanmi Koyejo, the Stanford team developed a system that assigns difficulty scores to benchmark questions using Item Response Theory, a concept long employed in standardized testing. This enables evaluators to account for question difficulty when comparing model results, leveling the playing field between models and reducing the chance of misleading outcomes due to easier test sets.
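For readers unfamiliar with Item Response Theory, the sketch below shows the two-parameter logistic (2PL) model at its core, in which each question carries a difficulty and a discrimination parameter and each model a latent ability. The function names, the toy gradient-ascent fit, and all values here are illustrative assumptions, not the paper’s actual implementation.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL IRT model: probability that a model with latent ability
    `theta` answers a question with discrimination `a` and
    difficulty `b` correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_ability(responses, a, b, iters=2000, lr=0.01):
    """Fit a model's ability by gradient ascent on the log-likelihood
    of its observed right/wrong answers (a toy optimizer; real IRT
    fits use dedicated MLE/EM routines)."""
    theta = 0.0
    for _ in range(iters):
        p = p_correct(theta, a, b)
        theta += lr * np.sum(a * (responses - p))  # d log-likelihood / d theta
    return theta

# Toy usage: recover a synthetic model's ability from 100 graded answers.
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 2.0, size=100)  # per-question discrimination
b = rng.normal(0.0, 1.0, size=100)   # per-question difficulty
responses = (rng.random(100) < p_correct(1.2, a, b)).astype(float)
print(estimate_ability(responses, a, b))  # roughly 1.2
```

Because difficulty enters the comparison explicitly, two models evaluated on question subsets of different difficulty can still be placed on the same ability scale.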

“The key observation we make is that you must also account for how hard the questions are,” said Koyejo, the study’s lead researcher. “Some models may do better or worse just by luck of the draw. We’re trying to anticipate that and adjust for it to make fairer comparisons.”

The researchers report applying their approach across 22 datasets and 172 language models, demonstrating its adaptability across domains as varied as medicine, mathematics, and law. The system uses AI-generated questions, calibrated to specific difficulty levels, which both lowers costs and automates the replenishment of question banks. The method also allows previously seen, or “contaminated,” questions to be identified and removed from the datasets, improving the integrity of evaluations.
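To give a sense of what “adaptive” means here, the sketch below (reusing `p_correct` and `numpy` from the sketch above) implements one standard adaptive-testing rule: pick the unasked question with maximum Fisher information at the current ability estimate. This is a common textbook heuristic, not necessarily the selection procedure the Stanford paper uses.

```python
def fisher_information(theta, a, b):
    """Fisher information of a 2PL question at ability `theta`:
    how much a right/wrong answer would reveal about `theta`."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def select_next_question(theta_hat, a, b, asked):
    """Adaptive step: among questions not yet asked, choose the one
    most informative about the current ability estimate."""
    info = fisher_information(theta_hat, a, b)
    info[list(asked)] = -np.inf  # never re-ask a question
    return int(np.argmax(info))
```

Intuitively, the rule keeps asking questions pitched near the model’s current ability level, which is why far fewer questions are needed for the same measurement precision.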

Co-author Sang Truong, a doctoral candidate at the Stanford Artificial Intelligence Lab, emphasized that this adaptive method reduces evaluation costs by up to 80% in some cases while delivering more consistent comparisons. Additionally, the system was able to detect nuanced changes in the safety metrics of versions of GPT-3.5, highlighting its capacity for tracking performance shifts over time. Safety in this context refers to the robustness of models against manipulation, exploitation, and other vulnerabilities, researchers noted.

The Stanford researchers argue that better evaluation tools will benefit both AI developers and end-users by improving diagnostics and providing more transparent assessments of AI models. By reducing costs and enhancing fairness in model evaluation, the system could help accelerate AI development while increasing trust in the technology.

In addition to Stanford, contributors to the research include collaborators from the University of California, Berkeley, and the University of Illinois Urbana-Champaign. Koyejo and co-author Bo Li are affiliated with Virtue AI, which also supported the project.

“And, for everyone else,” Koyejo said, “it will mean more rapid progress and greater trust in the quickly evolving tools of artificial intelligence.”




Google and California Community Colleges launch largest higher education AI partnership in the US, equipping millions of students with access to free training

In the largest higher education deal of its kind in the US, Google is investing in workforce development for the future, putting California’s community college students at the forefront of the AI-driven economy.

“This collaboration with Google is a monumental step forward for the California Community Colleges,” explains Don Daves-Rougeaux, Senior Advisor to the Chancellor of the California Community Colleges on Workforce Development, Strategic Partnerships, and GenAI. 

“Providing our students with access to world-class AI training and professional certificates ensures they have the skills necessary to thrive in high-growth industries and contribute to California’s economic prosperity. This partnership directly supports our Vision 2030 commitment to student success and workforce readiness. Additionally, offering access to AI tools with data protections and advanced functionality for free ensures that all learners have equitable access to the tools they need to leverage the skills they’re learning, and saves California’s community colleges millions of dollars in potential tool costs.”

All students, faculty, staff, and classified professionals at the colleges will be able to access Gemini, Google’s generative AI tool, with data protections so they can use AI tools safely.

All students and faculty will also receive free access to Google Career Certificates, Google AI Essentials, and Prompting Essentials, providing practical training for in-demand jobs.

“Technology skills, especially in areas like artificial intelligence, are critical for the future workforce,” adds Bryan Lee, Vice President of Google for Education Go-to-Market. “We are thrilled to partner with the California Community Colleges, the nation’s largest higher education system, to bring valuable training and tools like Google Career Certificates, AI Essentials, and Gemini to millions of students. This collaboration underscores our commitment to creating economic opportunity for everyone.”

The ETIH Innovation Awards 2026

The EdTech Innovation Hub Awards celebrate excellence in global education technology, with a particular focus on workforce development, AI integration, and innovative learning solutions across all stages of education.

Now open for entries, the ETIH Innovation Awards 2026 recognize the companies, platforms, and individuals driving transformation in the sector, from AI-driven assessment tools and personalized learning systems, to upskilling solutions and digital platforms that connect learners with real-world outcomes.

Submissions are open to organizations across the UK, the Americas, and internationally. Entries should highlight measurable impact, whether in K–12 classrooms, higher education institutions, or lifelong learning settings.

Winners will be announced on 14 January 2026 as part of an online showcase featuring expert commentary on emerging trends and standout innovation. All winners and finalists will also be featured in our first print magazine, to be distributed at BETT 2026.




Why artificial intelligence is increasing both cybersecurity defences and dangers, and why human skills are needed

An undercurrent of the Financial Review Cyber Summit was that the best firewall is only as strong as the human behind it. It’s a particularly potent message as we enter a new frontier of cyber risks that are constant and evolving.

As Home Affairs and Cyber Security Minister Tony Burke told the audience of corporate executives and cyber professionals, “it doesn’t matter how good your electronic systems are if you haven’t trained your people to be part of the human firewall”.


CISOs grapple with the realities of applying AI to security functions

Turbo boost telemetry

Security AI and automation are beginning to demonstrate significant value, especially in minimizing dwell time and accelerating triage and containment processes, says Myke Lyons, CISO at telemetry and observability pipeline software vendor Cribl.

Their success, however, depends heavily on the prioritization and accuracy of the underlying telemetry, Lyons cautions.

“Within my team, we follow a structured approach to data management: High-priority, time-sensitive telemetry — such as identity, authentication, and key application logs — is directed to high-assurance systems for real-time detection,” Lyons explains. “Meanwhile, less critical data is stored in data lakes to optimize costs while retaining forensic value.”
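As a rough illustration of the split Lyons describes, the hypothetical Python sketch below routes events by source priority. The tier names and source categories are invented for illustration; a real deployment would express this in the pipeline vendor’s own configuration language, not application code.

```python
# Hypothetical priority tiers, invented for this sketch.
HIGH_PRIORITY_SOURCES = {"identity", "authentication", "app_critical"}

def route_event(event: dict) -> str:
    """Route a telemetry event: time-sensitive sources go to the
    real-time detection tier, everything else to a low-cost data
    lake that retains forensic value."""
    if event.get("source") in HIGH_PRIORITY_SOURCES:
        return "realtime_detection"
    return "data_lake"

# Example: an authentication log lands in the high-assurance tier.
print(route_event({"source": "authentication", "msg": "login failed"}))
```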



