AI Research
Stanford Develops Real-World Benchmarks for Healthcare AI Agents

Beyond the hype and hope surrounding the use of artificial intelligence in medicine lies the real-world need to ensure that, at the very least, AI in a healthcare setting can carry out the tasks a doctor would perform in electronic health records.
Creating benchmark standards to measure that ability is what drives the work of a team of Stanford researchers. While the researchers note the enormous potential of this new technology to transform medicine, the tech ethos of moving fast and breaking things doesn’t work in healthcare. It is vital to first verify that these tools can actually perform such tasks; only then can they be used to augment the care clinicians provide every day.
“Working on this project convinced me that AI won’t replace doctors anytime soon,” said Kameron Black, co-author on the new benchmark paper and a Clinical Informatics Fellow at Stanford Health Care. “It’s more likely to augment our clinical workforce.”
MedAgentBench: Testing AI Agents in Real-World Clinical Systems
Black is one of a multidisciplinary team of physicians, computer scientists, and researchers from across Stanford University who worked on the new study, MedAgentBench: A Virtual EHR Environment to Benchmark Medical LLM Agents, published in the New England Journal of Medicine AI.
Although large language models (LLMs) have performed well on the United States Medical Licensing Examination (USMLE) and at answering medical questions in studies, until now there has been no benchmark testing how well LLMs can function as agents by performing tasks that a doctor would normally do, such as ordering medications, inside a real-world clinical system where data input can be messy.
Unlike chatbots or LLMs, AI agents can work autonomously, performing complex, multistep tasks with minimal supervision. AI agents integrate multimodal data inputs, process information, and then utilize external tools to accomplish tasks, Black explained.
Overall Success Rate (SR) Comparison of State-of-the-Art LLMs on MedAgentBench

| Model | Overall SR |
|---|---|
| Claude 3.5 Sonnet v2 | 69.67% |
| GPT-4o | 64.00% |
| DeepSeek-V3 (685B, open) | 62.67% |
| Gemini-1.5 Pro | 62.00% |
| GPT-4o-mini | 56.33% |
| o3-mini | 51.67% |
| Qwen2.5 (72B, open) | 51.33% |
| Llama 3.3 (70B, open) | 46.33% |
| Gemini 2.0 Flash | 38.33% |
| Gemma2 (27B, open) | 19.33% |
| Gemini 2.0 Pro | 18.00% |
| Mistral v0.3 (7B, open) | 4.00% |

While previous tests only assessed AI’s medical knowledge through curated clinical vignettes, this research evaluates how well AI agents can perform actual clinical tasks such as retrieving patient data, ordering tests, and prescribing medications.
“Chatbots say things. AI agents can do things,” said Jonathan Chen, associate professor of medicine and biomedical data science and the paper’s senior author. “This means they could theoretically directly retrieve patient information from the electronic medical record, reason about that information, and take action by directly entering in orders for tests and medications. This is a much higher bar for autonomy in the high-stakes world of medical care. We need a benchmark to establish the current state of AI capability on reproducible tasks that we can optimize toward.”
The study tested this by evaluating whether AI agents could utilize FHIR (Fast Healthcare Interoperability Resources) API endpoints to navigate electronic health records.
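To make the idea of navigating records through FHIR endpoints concrete, here is a minimal sketch of how a FHIR RESTful search URL is put together. It is not the benchmark's code: the base URL, the helper name `build_fhir_search_url`, and the patient identifier are all made up for illustration. Only the general `[base]/[resource type]?parameter=value` search pattern comes from the FHIR standard.

```python
from urllib.parse import urlencode

# Hypothetical base URL for a local FHIR server (an assumption, not from the study).
FHIR_BASE = "http://localhost:8080/fhir"


def build_fhir_search_url(resource_type, params, base=FHIR_BASE):
    """Build a FHIR search URL of the form [base]/[resource type]?name=value&..."""
    query = urlencode(params)  # percent-encodes values and joins them with '&'
    return f"{base.rstrip('/')}/{resource_type}?{query}"


# Example: search laboratory observations for one made-up patient ID.
url = build_fhir_search_url(
    "Observation",
    {"patient": "example-patient-1", "category": "laboratory"},
)
print(url)
```

An agent would issue a GET request against a URL like this to retrieve data; placing an order would instead involve sending a resource such as a `MedicationRequest` to the server, which this sketch does not attempt.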
The team created a virtual electronic health record environment containing 100 realistic patient profiles (comprising 785,000 records, including labs, vitals, medications, diagnoses, and procedures) and used it to test about a dozen large language models on 300 clinical tasks developed by physicians. In initial testing, the best model, Claude 3.5 Sonnet v2, achieved a success rate of 69.67%, or roughly 70%.
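As a rough illustration of what these percentages mean in task counts, the sketch below assumes that the overall success rate is simply the number of tasks completed correctly divided by the 300 tasks, with every task weighted equally. That scoring rule is an assumption made for this example, and the function names are hypothetical.

```python
TOTAL_TASKS = 300  # number of clinical tasks stated in the article


def success_rate(successes, total):
    """Success rate as a percentage, assuming equally weighted tasks."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successes / total


def implied_successes(rate_percent, total):
    """Nearest whole number of successful tasks consistent with a reported rate."""
    return round(rate_percent * total / 100)


# Reported overall success rates for three of the models in the table.
reported = {
    "Claude 3.5 Sonnet v2": 69.67,
    "GPT-4o": 64.00,
    "Mistral v0.3 (7B, open)": 4.00,
}

for model, rate in reported.items():
    n = implied_successes(rate, TOTAL_TASKS)
    print(f"{model}: {n}/{TOTAL_TASKS} tasks = {success_rate(n, TOTAL_TASKS):.2f}%")
```

Under that assumption, the top score corresponds to 209 of 300 tasks, which is where the rounded 70% figure comes from.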
“We hope this benchmark can help model developers track progress and further advance agent capabilities,” said Yixing Jiang, a Stanford PhD student and co-author of the paper.
Many of the models struggled with scenarios that required nuanced reasoning, involved complex workflows, or necessitated interoperability between different healthcare systems, all issues a clinician might face regularly.
“Before these agents are used, we need to know how often and what type of errors are made so we can account for these things and help prevent them in real-world deployments,” Black said.
What does this mean for clinical care? Co-author James Zou and Dr. Eric Topol argue that AI is shifting from a tool to a teammate in care delivery. With MedAgentBench, the Stanford team has shown that this is a much more near-term reality, demonstrating that several frontier LLMs can carry out many of the day-to-day clinical tasks a physician would perform.
Already the team has noticed improvements in the performance of the newest model versions. With this in mind, Black believes that AI agents might be ready to handle basic clinical “housekeeping” tasks in a clinical setting sooner than previously expected.
“In our follow-up studies, we’ve shown a surprising amount of improvement in the success rate of task execution by newer LLMs, especially when accounting for specific error patterns we observed in the initial study,” Black said. “With deliberate design, safety, structure, and consent, it will be feasible to start moving these tools from research prototypes into real-world pilots.”
The Road Ahead
Black says benchmarks like these are necessary as more hospitals and healthcare systems are incorporating AI into tasks including note-writing and chart summarization.
Accurate and trustworthy AI could also help alleviate a looming crisis, he adds. Pressed by patient needs, compliance demands, and staff burnout, healthcare providers are seeing a worsening global staffing shortage, estimated to exceed 10 million workers by 2030.
Instead of replacing doctors and nurses, Black hopes that AI can be a powerful tool for clinicians, lessening the burden of some of their workload and bringing them back to the patient bedside.
“I’m passionate about finding solutions to clinician burnout,” Black said. “I hope that by working on agentic AI applications in healthcare that augment our workforce, we can help offload burden from clinicians and divert this impending crisis.”
Paper authors: Yixing Jiang, Kameron C. Black, Gloria Geng, Danny Park, James Zou, Andrew Y. Ng, and Jonathan H. Chen
Read the piece in the New England Journal of Medicine AI.
Arista touts liquid cooling, optical tech to reduce power consumption for AI networking

Co-packaged optics (CPO) and linear pluggable optics (LPO) will both likely find a role in future AI and optical networks, experts say, as both promise to reduce power consumption and support improved bandwidth density. Each has advantages and disadvantages as well: CPOs are more complex to deploy given the amount of technology included in a CPO package, whereas LPOs promise more simplicity.
Arista co-founder Andy Bechtolsheim said that LPO can provide an additional 20% power savings over other optical forms. Early tests show good receiver performance even under degraded conditions, though transmit paths remain sensitive to reflections and crosstalk at the connector level, Bechtolsheim added.
At the recent Hot Interconnects conference, he said: “The path to energy-efficient optics is constrained by high-volume manufacturing,” stressing that advanced optics packaging remains difficult and risky without proven production scale.
“We are nonreligious about CPO, LPO, whatever it is. But we are religious about one thing, which is the ability to ship very high volumes in a very predictable fashion,” Bechtolsheim said at the investor event. “So, to put this in quantity numbers here, the industry expects to ship something like 50 million OSFP modules next calendar year. The current shipment rate of CPO is zero, okay? So going from zero to 50 million is just not possible. The supply chain doesn’t exist. So, even if the technology works and can be demonstrated in a lab, to get to the volume required to meet the needs of the industry is just an incredible effort.”
“We’re all in on liquid cooling to reduce power, eliminating fan power, supporting the linear pluggable optics to reduce power and cost, increasing rack density, which reduces data center footprint and related costs, and most importantly, optimizing these fabrics for the AI data center use case,” Bechtolsheim added.
“So what we call the ‘purpose-built AI data center fabric’ around Ethernet technology is to really optimize AI application performance, which is the ultimate measure for the customer in both the scale-up and the scale-out domains. Some of this includes full switch customization for customers. Other cases, it includes the power and cost optimization. But we have a large part of our hardware engineering department working on these things,” he said.
Learning by Doing: AI, Knowledge Transfer, and the Future of Skills | American Enterprise Institute

In a recent blog post, I discussed Stanford University economist Erik Brynjolfsson’s new study showing that young college graduates are struggling to gain a foothold in a job market shaped by artificial intelligence (AI). His analysis found that, since 2022, early-career workers in AI-exposed roles have seen employment growth lag 13 percent behind peers in less-exposed fields. At the same time, experienced workers in the same jobs have held steady or even gained ground. The conclusion: AI isn’t eliminating work outright, but it is affecting the entry-level rungs that young workers depend on as they begin climbing career ladders.
The potential consequences of these findings, assuming they bear out, become clearer when read alongside Enrique Ide’s recent paper, Automation, AI, and the Intergenerational Transmission of Knowledge. Ide argues that when firms automate entry-level tasks, new workers lose the opportunity to gain tacit knowledge (the kind of workplace norms and rhythms of team-based work that aren’t necessarily written down), so that knowledge isn’t passed on. Thus, productivity gains accrue to seasoned workers while would-be novices lose the hands-on training they need to build the foundation for career progress.
This short-circuiting of early-career experiences, Ide says, has macroeconomic consequences. He estimates that automating even five percent of entry-level tasks reduces long-run US output growth by 0.05 percentage points per year; at 30 percent automation, growth slows by more than 0.3 points. Over a hundred-year timeline, this would reduce total output by 20 percent relative to a world without AI automation. In other words: automating the bottom rungs might lift firms’ quarterly performance, but at the cost of generational growth.
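To see how a small annual growth drag compounds over a century, consider the back-of-the-envelope sketch below. It is not Ide's model: it simply compares two constant growth paths, and the 2 percent baseline growth rate and the function name are assumptions chosen for illustration. Its results should therefore not be expected to reproduce the paper's estimates, which come from a structural model.

```python
def cumulative_output_gap(drag_pp, years, baseline_growth_pct=2.0):
    """Fraction by which output falls short of the baseline path after `years`.

    drag_pp: reduction in annual growth, in percentage points.
    baseline_growth_pct: assumed annual growth without the drag (illustrative).
    """
    baseline_path = (1 + baseline_growth_pct / 100) ** years
    slowed_path = (1 + (baseline_growth_pct - drag_pp) / 100) ** years
    return 1 - slowed_path / baseline_path


# Growth drags mentioned in the text: 0.05 and 0.3 percentage points per year.
for drag in (0.05, 0.3):
    gap = cumulative_output_gap(drag, years=100)
    print(f"{drag} pp/year drag: output {gap:.1%} below baseline after 100 years")
```

The point of the exercise is only that tenths of a percentage point, sustained for a hundred years, add up to a double-digit shortfall in the level of output.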
This is where we need to pause and take a breath. While Ide’s results sound dramatic, it is critical to remember that the dynamics and consequences of AI adoption are unpredictable, and that a century is a very long time. For instance, who would have said in 2022 that one of the first effects of AI automation would be to benefit less tech-savvy boomer and Gen-X managers and harm freshly minted Gen-Z coders?
Given the history of positive, automation-induced wealth and employment effects, why would this time be different?
Finally, it’s important to remember that in a dynamic market-driven economy, skill requirements are always changing and firms are always searching for ways to improve their efficiency relative to competitors. This is doubly true as we enter the era of cognitive, as opposed to physical, automation. AI-driven automation is part of the pathway to a more prosperous economy and society for ourselves and for future generations. As my AEI colleague Jim Pethokoukis recently said, “A supposedly powerful general-purpose technology that left every firm’s labor demand utterly unchanged wouldn’t be much of a GPT.” Said another way, unless AI disrupts our economy and lives, it cannot deliver its promised benefits.
What then should we do? I believe the most important step we can take right now is to begin “stress-testing” our current workforce development policies and programs and building scenarios for how industry and government will respond should significant AI-related job disruptions occur. Such scenario planning could be shaped into a flexible “playbook” of options, geared to the types and numbers of affected workers, to guide policymakers. Such planning didn’t occur prior to the automation and trade shocks of the 1990s and 2000s, with lasting consequences for factory workers and American society. We should try to make sure this doesn’t happen again with AI.
Pessimism is easy and cheap. We should resist the lure of social media-monetized AI doomerism and focus on building the future we want to see by preparing for and embracing change.
SBU Researchers Use AI to Advance Alzheimer’s Detection

Alzheimer’s disease is one of the most urgent public health challenges for aging Americans. Nearly seven million Americans over the age of 65 are currently living with the disease, and that number is projected to nearly double by 2060, according to the Alzheimer’s Association.
Early diagnosis and continuous monitoring are crucial to improving care and extending independence, but there isn’t enough high-quality, Alzheimer’s-specific data to train artificial intelligence systems that could help detect and track the disease.
Shan Lin, associate professor of Electrical and Computer Engineering at Stony Brook University, and PhD candidate Heming Fu are working with Guoliang Xing from The Chinese University of Hong Kong to create a network of data based on Alzheimer’s patients. Together they developed SHADE-AD (Synthesizing Human Activity Datasets Embedded with AD features), a generative AI framework designed to create synthetic, realistic data that reflects the motor behaviors of Alzheimer’s patients.

Movements like stooped posture, reliance on armrests when standing from sitting, or slowed gait may appear subtle, but can be early indicators of the disease. By identifying and replicating these patterns, SHADE-AD provides researchers and physicians with the data required to improve monitoring and diagnosis.
Unlike existing generative models, which often rely on and output generic datasets drawn from healthy individuals, SHADE-AD was trained to embed Alzheimer’s-specific traits. The system generates three-dimensional “skeleton videos,” simplified figures that preserve details of joint motion. These 3D skeleton datasets were validated against real-world patient data, with the model proving capable of reproducing the subtle changes in speed, angle, and range of motion that distinguish Alzheimer’s behaviors from those of healthy older adults.
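To give a sense of how quantities such as joint angle and range of motion can be read off a 3D skeleton sequence, here is a minimal sketch. It is not SHADE-AD's method: the function names, the choice of hip, knee, and ankle as joints, and the coordinates are all made up for illustration.

```python
import math


def joint_angle_deg(a, b, c):
    """Angle in degrees at joint b, between the segments b->a and b->c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    if n1 == 0 or n2 == 0:
        raise ValueError("joint positions must be distinct")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos_theta))


def range_of_motion_deg(frames):
    """Difference between the largest and smallest joint angle across frames."""
    angles = [joint_angle_deg(a, b, c) for a, b, c in frames]
    return max(angles) - min(angles)


# Made-up (hip, knee, ankle) positions in metres for two frames.
frames = [
    ((0.0, 1.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.0, 0.0)),  # leg fully extended
    ((0.0, 1.0, 0.0), (0.0, 0.5, 0.0), (0.5, 0.5, 0.0)),  # knee bent at a right angle
]
print(range_of_motion_deg(frames))
```

Movement speed could be estimated in a similar way, by dividing the distance a joint travels between consecutive frames by the time between them.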
The results and findings, published and presented at the 23rd ACM Conference on Embedded Networked Sensor Systems (SenSys 2025), have been significant. Activity recognition systems trained with SHADE-AD’s data achieved higher accuracy across all major tasks compared with systems trained on traditional data augmentation or general open datasets. In particular, SHADE-AD excelled at recognizing actions like walking and standing up, which often reveal the earliest signs of decline for Alzheimer’s patients.

Lin believes this work could have a significant impact on the daily lives of older adults and their families. Technologies built on SHADE-AD could one day allow doctors to detect Alzheimer’s sooner, track disease progression more accurately, and intervene earlier with treatments and support. “If we can provide tools that spot these changes before they become severe, patients will have more options, and families will have more time to plan,” he said.
With September recognized nationally as Healthy Aging Month, Lin sees this research as part of an effort to use technology to support older adults in living longer, healthier, and more independent lives. “Healthy aging isn’t only about treating illness, but also about creating systems that allow people to thrive as they grow older,” he said. “AI can be a powerful ally in that mission.”
— Beth Squire