AI Insights

How to build a better AI benchmark

Published May 8, 2025

The limits of traditional testing

If AI companies have been slow to respond to the growing failure of benchmarks, it’s partially because the test-scoring approach has been so effective for so long.

One of the biggest early successes of contemporary AI was the ImageNet challenge, a kind of antecedent to contemporary benchmarks. Released in 2010 as an open challenge to researchers, the database held more than 3 million images for AI systems to categorize into 1,000 different classes.

Crucially, the test was completely agnostic to methods, and any successful algorithm quickly gained credibility regardless of how it worked. When an algorithm called AlexNet broke through in 2012, with a then-unconventional form of GPU training, it became one of the foundational results of modern AI. Few would have guessed in advance that AlexNet’s convolutional neural nets would be the secret to unlocking image recognition, but after it scored well, no one dared dispute it. (One of AlexNet’s developers, Ilya Sutskever, would go on to cofound OpenAI.)
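To make "agnostic to methods" concrete, here is a minimal Python sketch of a leaderboard-style scorer. It is an illustration only, not ImageNet's actual evaluation code: the function names, the toy test set, and the two toy classifiers are all hypothetical. The point is that the scorer sees nothing but predictions and labels, so any classifier is judged the same way regardless of how it works inside.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for a labelled test set: (input id, correct class).
# A real image benchmark would hold images, not string ids.
TEST_SET = [
    ("img_001", "cat"),
    ("img_002", "dog"),
    ("img_003", "cat"),
    ("img_004", "car"),
]


def score_submission(
    classify: Callable[[str], str],
    test_set: Sequence[tuple[str, str]],
) -> float:
    """Return the fraction of items the classifier labels correctly.

    The scorer never inspects how `classify` works; it only compares
    its outputs with the reference labels.
    """
    if not test_set:
        raise ValueError("test set must not be empty")
    correct = sum(1 for item, label in test_set if classify(item) == label)
    return correct / len(test_set)


def always_cat(item: str) -> str:
    # Toy method 1 (hypothetical): ignores the input entirely.
    return "cat"


def lookup_table(item: str) -> str:
    # Toy method 2 (hypothetical): a hand-built table with one wrong entry.
    table = {"img_001": "cat", "img_002": "dog", "img_003": "cat", "img_004": "bus"}
    return table[item]


# Both methods are ranked purely by score, whatever they do internally.
leaderboard = {
    "always_cat": score_submission(always_cat, TEST_SET),
    "lookup_table": score_submission(lookup_table, TEST_SET),
}
```

In this toy setup the higher-scoring entry wins no matter how it reached its answers, which is the property that let an unexpected approach earn credibility on a benchmark of this kind.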

A large part of what made this challenge so effective was that there was little practical difference between ImageNet’s object classification challenge and the actual process of asking a computer to recognize an image. Even if there were disputes about methods, no one doubted that the highest-scoring model would have an advantage when deployed in an actual image recognition system.

But in the 12 years since, AI researchers have applied that same method-agnostic approach to increasingly general tasks. SWE-Bench is commonly used as a proxy for broader coding ability, while other exam-style benchmarks often stand in for reasoning ability. That broad scope makes it difficult to be rigorous about what a specific benchmark measures—which, in turn, makes it hard to use the findings responsibly.

Where things break down

Anka Reuel, a PhD student who has been focusing on the benchmark problem as part of her research at Stanford, has become convinced the evaluation problem is the result of this push toward generality. “We’ve moved from task-specific models to general-purpose models,” Reuel says. “It’s not about a single task anymore but a whole bunch of tasks, so evaluation becomes harder.”

Like the University of Michigan’s Jacobs, Reuel thinks “the main issue with benchmarks is validity, even more than the practical implementation,” noting: “That’s where a lot of things break down.” For a task as complicated as coding, for instance, it’s nearly impossible to incorporate every possible scenario into your problem set. As a result, it’s hard to gauge whether a model is scoring better because it’s more skilled at coding or because it has more effectively manipulated the problem set. And with so much pressure on developers to achieve record scores, shortcuts are hard to resist.
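One common way to probe the question raised above is to compare a model's pass rate on the public problem set with its pass rate on fresh problems it could not have been tuned against. The Python sketch below assumes such a held-out set exists; the function names and the pass/fail outcomes are made up for illustration and are not results from any real benchmark.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of problems solved, given one pass/fail outcome per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


def generalization_gap(public: list[bool], held_out: list[bool]) -> float:
    """Public pass rate minus held-out pass rate.

    A large positive gap hints that the score reflects fitting to the
    public problem set rather than the underlying skill. It is a warning
    sign, not proof.
    """
    return pass_rate(public) - pass_rate(held_out)


# Made-up outcomes for a hypothetical model (illustrative only).
public_results = [True] * 9 + [False] * 1    # 9 of 10 public problems solved
held_out_results = [True] * 6 + [False] * 4  # 6 of 10 fresh problems solved

gap = generalization_gap(public_results, held_out_results)
```

A gap near zero would suggest the public score carries over to unseen problems. A check like this speaks only to overfitting, not to whether the benchmark measures the right skill in the first place.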

For developers, the hope is that success on lots of specific benchmarks will add up to a generally capable model. But the techniques of agentic AI mean a single AI system can encompass a complex array of different models, making it hard to evaluate whether improvement on a specific task will lead to generalization. “There’s just many more knobs you can turn,” says Sayash Kapoor, a computer scientist at Princeton and a prominent critic of sloppy practices in the AI industry. “When it comes to agents, they have sort of given up on the best practices for evaluation.”

The Editors

AI Insights

Russia allegedly field-testing deadly next-gen AI drone powered by Nvidia Jetson Orin — Ukrainian military official says Shahed MS001 is a ‘digital predator’ that identifies targets on its own

Published July 7, 2025

The Editors

Ukrainian Major General Vladyslav Klochkov (Владислав Клочков) says Russia is field-testing a deadly new drone that can use AI and thermal vision to think on its own, identifying targets without coordinates and bypassing most air defense systems. According to the senior military figure, the drone carries an Nvidia Jetson Orin module, which has enabled the MS001 to become “an autonomous combat platform that sees, analyzes, decides, and strikes without external commands.”

Digital predator dynamically weighs targets

With the Jetson Orin as its brain, the upgraded MS001 drone doesn’t just follow prescribed coordinates, like some hyper-accurate doodle bug. It actually thinks. “It identifies targets, selects the highest-value one, adjusts its trajectory, and adapts to changes — even in the face of GPS jamming or target maneuvers,” says Klochkov. “This is not a loitering munition. It is a digital predator.”

Even worse, the MS001 is allegedly operating in coordinated drone groups, pressing its attacks despite the best efforts of Ukraine’s electronic warfare and other anti-drone systems.

Frustrated with warfare tech development speeds

Klochkov signs off his post by telling his LinkedIn followers that “We are not only fighting Russia. We are fighting inertia.” What he appears to wish for is an acceleration of Ukraine’s own assault drone capabilities. The Major General seems particularly disappointed in the Ukrainian system of procurement rounds, which slows the field-testing and deployment of improved responses to new Shahed drone generations.

Shahed drones were originally an Iranian design but have gained great notoriety through their sustained use by the Russian army to attack Ukrainian targets. The MS001 is substantially upgraded in the ‘smarts’ department thanks to technologies from Western and allied countries.

Klochkov says the MS001 is powered by the following key technologies:

Nvidia Jetson Orin — machine learning, video processing, object recognition
Thermal imager — operates at night and in low visibility
Nasir GPS with CRPA antenna — spoof-resistant navigation
FPGA chips — onboard adaptive logic
Radio modem — for telemetry and swarm communication

Cute AI dev board with deadly potential (Image credit: Nvidia)

Western tech sanctions are supposed to neuter this kind of military threat from nations like Russia and Iran. This news indicates that such trade barriers are leaky, at best, and probably not taken seriously enough.

Not the first Russia-deployed drone discovered using Nvidia AI

This isn’t the first Russian drone system that is thought to have adopted Nvidia’s Jetson Orin as a key component.

A month ago, Ukraine’s Defense Express site said that a new “smart suicide attack unmanned aerial vehicle with artificial intelligence,” dubbed the V2U, was powered by Nvidia’s little AI computer.

While the Shahed MS001s use an Iranian design, the V2U looks like it is more reliant on Chinese tech, including the Chinese-made Leetop A603 carrier board.


AI Insights

Artificial Intelligence Predicts the Packers’ 2025 Season!!!

Published July 7, 2025

The Editors

On today’s show, Andy simulates the Packers 2025 season utilizing artificial intelligence. Find out the results on today’s all-new Pack-A-Day Podcast! #Packers #GreenBayPackers #ai

To become a member of the Pack-A-Day Podcast, click here: https://www.youtube.com/channel/UCSGx5Pq0zA_7O726M3JEptA/join

Don’t forget to subscribe!!!

Twitter/BlueSky: @andyhermannfl

If you’d like to support my channel, please donate to:
PayPal: https://paypal.me/andyhermannfl
Venmo: @Andrew_Herman

Email: [email protected]
Discord: https://t.co/iVVltoB2Hg

AI Insights

WHO Director-General’s remarks at the XVII BRICS Leaders’ Summit, session on Strengthening Multilateralism, Economic-Financial Affairs, and Artificial Intelligence – 6 July 2025

Published July 7, 2025

The Editors

Your Excellency President Lula da Silva,

Excellencies, Heads of State, Heads of Government,

Heads of delegation,

Dear colleagues and friends,

Thank you, President Lula, and Brazil’s BRICS Presidency for your commitment to equity, solidarity, and multilateralism.

My intervention will focus on three key issues: challenges to multilateralism, cuts to Official Development Assistance, and the role of AI and other digital tools.

First, we are facing significant challenges to multilateralism.

However, there was good news at the World Health Assembly in May.

WHO’s Member States demonstrated their commitment to international solidarity through the adoption of the Pandemic Agreement. South Africa co-chaired the negotiations, and I would like to thank South Africa.

It is time to finalize the next steps.

We ask the BRICS to complete the annex on Pathogen Access and Benefit Sharing so that the Agreement is ready for ratification at next year’s World Health Assembly. Brazil is co-chairing the committee, and I thank Brazil for their leadership.

Second, cuts to Official Development Assistance.

Compounding the chronic domestic underinvestment and aid dependency in developing countries, drastic cuts to foreign aid have disrupted health services, costing lives and pushing millions into poverty.

The recent Financing for Development conference in Sevilla made progress in key areas, particularly in addressing the debt trap that prevents vital investments in health and education.

Going forward, it is critical for countries to mobilize domestic resources and foster self-reliance to support primary healthcare as the foundation of universal health coverage.

Because health is not a cost to contain, it’s an investment in people and prosperity.

Third, AI and other digital tools.

Planning for the future of health requires us to embrace a digital future, including the use of artificial intelligence. The future of health is digital.

AI has the potential to predict disease outbreaks, improve diagnosis, expand access, and enable local production.

AI can serve as a powerful tool for equity.

However, it is crucial to ensure that AI is used safely, ethically, and equitably.

We encourage governments, especially BRICS, to invest in AI and digital health, including governance and national digital public infrastructure, to modernize health systems while addressing ethical, safety, and equity issues.

WHO will be by your side every step of the way, providing guidance, norms, and standards.

Excellencies, only by working together through multilateralism can we build a healthier, safer, and fairer world for all.

Thank you. Obrigado.
