
AI Research

Thinking Machines Lab Reveals Research On Eliminating Randomness In AI Model Responses


Thinking Machines Lab, backed by $2 billion in seed funding and staffed with former OpenAI researchers, has shared its first detailed research insights.

The lab released a blog post Wednesday examining how to create AI models that produce more consistent and reproducible responses, addressing a fundamental challenge in artificial intelligence development.

AI model consistency research targets nondeterminism in large language models

The blog post, titled “Defeating Nondeterminism in LLM Inference,” investigates why AI models often generate varied answers to identical questions. While this variability has been accepted as an inherent characteristic of large language models, Thinking Machines Lab views this nondeterminism as a solvable problem rather than an unavoidable limitation.

GPU kernel orchestration causes response randomness

Researcher Horace He authored the post, arguing that randomness in AI model outputs stems from how GPU kernels are orchestrated during inference processing. Inference processing refers to the computational steps that run after a user submits a query, for example by pressing enter in ChatGPT.

GPU kernels are specialized programs running on Nvidia computer chips. He believes careful management of this orchestration layer can enable AI models to generate more predictable and consistent outputs.
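The post's full argument goes beyond this summary, but the core numerical fact is well established: floating-point addition is not associative, so the order in which GPU kernels combine partial results can shift the final answer slightly. The sketch below is purely illustrative and is not code from the Thinking Machines post; the array size and chunking are arbitrary choices.

```python
import numpy as np

# Illustrative only: summing the same float32 values in two different
# orders (mimicking different kernel work splits) can yield results
# that disagree in the last bits.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)

full_sum = x.sum(dtype=np.float32)  # one reduction order

# A different order: 100 partial sums combined sequentially.
chunked_sum = np.float32(0.0)
for chunk in np.split(x, 100):
    chunked_sum += chunk.sum(dtype=np.float32)

print(full_sum, chunked_sum, full_sum == chunked_sum)  # often not equal
```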

Consistent responses improve reinforcement learning training

Beyond enhancing reliability for enterprise and scientific applications, He suggests reproducible responses can streamline reinforcement learning (RL) training. Reinforcement learning rewards AI models for correct answers, but inconsistent responses introduce noise into training data.

More consistent responses could improve the RL process, which aligns with The Information’s previous reporting that Thinking Machines Lab plans to use RL for tailoring AI models to specific business needs.

First product launch planned for coming months

Former OpenAI Chief Technology Officer Mira Murati announced in July that Thinking Machines Lab will release its first product soon. She indicated the product will be “useful for researchers and startups developing custom models,” though specific details and whether it incorporates the reproducibility techniques remain undisclosed.

Open research commitment mirrors early OpenAI approach

Thinking Machines Lab announced plans to regularly publish blog posts, code, and research outputs to “benefit the public, but also improve our own research culture.” The recent post launches a new series called “Connectionism,” reflecting this transparency commitment.

This approach mirrors OpenAI’s early open research pledge, though OpenAI became less transparent as it grew. The research blog provides rare insight into Thinking Machines Lab’s operations and indicates the company is tackling significant AI research challenges while working toward products that justify its $12 billion valuation.




AI Research

OpenAI makes $300 billion gamble on Oracle computing power to expand artificial intelligence capacity




  • OpenAI signs $300 billion Oracle contract starting in 2027 to expand AI capacity
  • Oracle shares jump over 40 percent after reporting $317 billion in future revenue
  • Deal raises risks as OpenAI loses money and Oracle takes on heavy debt

OpenAI has signed a contract with Oracle to buy $300 billion worth of computing power over the next five years, according to the Wall Street Journal.

This makes it one of the largest cloud deals ever struck.




AI Research

Artificial Intelligence Technology Solutions Inc. Announces Commercial Availability of Radcam Enterprise



Artificial Intelligence Technology Solutions Inc., along with its subsidiary Robotic Assistance Devices Inc. (RAD-I), announced the commercial availability of RADCam Enterprise, a proactive video security platform now compatible with the industry’s leading Video Management Systems (VMS). The intelligent talking camera can be integrated quickly and seamlessly into virtually any professional-grade video system.

The Company first introduced the RADCam Enterprise initiative on May 5, 2025, highlighting its expansion beyond residential applications into the small and medium-sized business (SMB) and enterprise markets. With today’s availability, RAD-I will deliver the solution through an untapped niche in the security industry, specifically security system integrators and distributors. RADCam Enterprise brings an intelligent “operator in the box” capability, enabling immediate talk-down to potential threats before human intervention is required.

The device integrates a speaker, microphone, and high-intensity lighting, allowing it not only to record but also to actively engage. At the same time, the solution is expected to deliver gross margins consistent with the Company’s established benchmarks. RADCam Enterprise distinguishes itself from the original residential version of RADCam by integrating RAD’s agentic AI platform, SARA (Speaking Autonomous Responsive Agent), and by being compatible with RADSoC and industry-leading Video Management Systems. RADCam Enterprise is available immediately through RAD-I’s network of channel partners and distributors.

Pre-orders are open, giving clients the opportunity to be among the first to deploy the solution. Designed for broad use across industries including logistics, retail, education, and commercial real estate, RADCam Enterprise provides clients and integrators with new ways to modernize security operations using proven AI-driven tools. RAD delivers cost savings via a suite of stationary and mobile robotic solutions that complement and, at times, directly replace the need for human personnel in environments better suited for machines.

All RAD technologies, AI-based analytics, and software platforms are developed in-house. The Company’s operations and internal controls have been validated through successful completion of its SOC 2 Type 2 audit, a formal, independent audit that evaluates whether a service organization’s controls for handling customer data are both properly designed and operating effectively. Each Fortune 500 client has the potential to place numerous orders over time.

AITX is an innovator in the delivery of artificial intelligence-based solutions that empower organizations to gain new insight, solve complex challenges and fuel new business ideas. Through its next-generation robotic product offerings, AITX’s RAD, RAD-R, RAD-M and RAD-G companies help organizations streamline operations, increase ROI, and strengthen business.





AI Research

Stanford Develops Real-World Benchmarks for Healthcare AI Agents



Beyond the hype and hope surrounding the use of artificial intelligence in medicine lies the real-world need to ensure that, at the very least, AI in a healthcare setting can carry out the tasks a doctor would perform in electronic health records.

Creating benchmark standards to measure that is what drives the work of a team of Stanford researchers. While the researchers note the enormous potential of this new technology to transform medicine, they caution that the tech ethos of moving fast and breaking things doesn’t work in healthcare. Ensuring that these tools can reliably perform such tasks is vital before they can be used to augment the care clinicians provide every day.

“Working on this project convinced me that AI won’t replace doctors anytime soon,” said Kameron Black, co-author on the new benchmark paper and a Clinical Informatics Fellow at Stanford Health Care. “It’s more likely to augment our clinical workforce.”

MedAgentBench: Testing AI Agents in Real-World Clinical Systems

Black is one of a multidisciplinary team of physicians, computer scientists, and researchers from across Stanford University who worked on the new study, MedAgentBench: A Virtual EHR Environment to Benchmark Medical LLM Agents, published in the New England Journal of Medicine AI.

Although large language models (LLMs) have performed well on the United States Medical Licensing Examination (USMLE) and at answering medical-related questions in studies, there is currently no benchmark testing how well LLMs can function as agents by performing tasks that a doctor would normally do, such as ordering medications, inside a real-world clinical system where data input can be messy. 

Unlike chatbots or LLMs, AI agents can work autonomously, performing complex, multistep tasks with minimal supervision. AI agents integrate multimodal data inputs, process information, and then utilize external tools to accomplish tasks, Black explained. 
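A rough way to picture that loop, purely as an illustration and not the MedAgentBench harness, is an agent that repeatedly asks the model for its next action and executes any tool call it requests; the function names and message shapes below are hypothetical.

```python
# Hypothetical sketch of an agent loop: the model either returns a final
# answer or names a tool to call (e.g., an EHR query), and the tool's
# output is appended to the context for the next step.
def run_agent(llm, tools, task, max_steps=5):
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = llm(context)                 # model proposes the next action
        if action["type"] == "final_answer":
            return action["content"]          # task is complete
        result = tools[action["tool"]](**action.get("args", {}))
        context.append({"role": "tool", "content": str(result)})
    return None                               # stop if no answer within budget
```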

Overall Success Rate (SR) Comparison of State-of-the-Art LLMs on MedAgentBench

Model                        Overall SR
Claude 3.5 Sonnet v2         69.67%
GPT-4o                       64.00%
DeepSeek-V3 (685B, open)     62.67%
Gemini-1.5 Pro               62.00%
GPT-4o-mini                  56.33%
o3-mini                      51.67%
Qwen2.5 (72B, open)          51.33%
Llama 3.3 (70B, open)        46.33%
Gemini 2.0 Flash             38.33%
Gemma2 (27B, open)           19.33%
Gemini 2.0 Pro               18.00%
Mistral v0.3 (7B, open)       4.00%

While previous tests only assessed AI’s medical knowledge through curated clinical vignettes, this research evaluates how well AI agents can perform actual clinical tasks such as retrieving patient data, ordering tests, and prescribing medications. 

“Chatbots say things. AI agents can do things,” said Jonathan Chen, associate professor of medicine and biomedical data science and the paper’s senior author. “This means they could theoretically directly retrieve patient information from the electronic medical record, reason about that information, and take action by directly entering in orders for tests and medications. This is a much higher bar for autonomy in the high-stakes world of medical care. We need a benchmark to establish the current state of AI capability on reproducible tasks that we can optimize toward.”

The study tested this by evaluating whether AI agents could utilize FHIR (Fast Healthcare Interoperability Resources) API endpoints to navigate electronic health records.
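For context, FHIR exposes clinical data through RESTful resources such as Patient and Observation. The snippet below is a generic illustration of that API style rather than the benchmark's actual test harness; the base URL and patient identifier are placeholders.

```python
import requests

FHIR_BASE = "https://example-ehr.org/fhir"   # placeholder sandbox server
PATIENT_ID = "example-patient-1"             # placeholder identifier

# Read a single Patient resource.
patient = requests.get(f"{FHIR_BASE}/Patient/{PATIENT_ID}").json()

# Search that patient's laboratory Observations using standard FHIR search parameters.
labs = requests.get(
    f"{FHIR_BASE}/Observation",
    params={"patient": PATIENT_ID, "category": "laboratory", "_count": 10},
).json()

print(patient.get("name"), len(labs.get("entry", [])))
```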

The team created a virtual electronic health record environment containing 100 realistic patient profiles (785,000 records, including labs, vitals, medications, diagnoses, and procedures) to test about a dozen large language models on 300 clinical tasks developed by physicians. In initial testing, the best-performing model, Claude 3.5 Sonnet v2, achieved a success rate of roughly 70%.

“We hope this benchmark can help model developers track progress and further advance agent capabilities,” said Yixing Jiang, a Stanford PhD student and co-author of the paper.

Many of the models struggled with scenarios that required nuanced reasoning, involved complex workflows, or necessitated interoperability between different healthcare systems, all issues a clinician might face regularly. 

“Before these agents are used, we need to know how often and what type of errors are made so we can account for these things and help prevent them in real-world deployments,” Black said.

What does this mean for clinical care? Co-author James Zou and Dr. Eric Topol have argued that AI is shifting from a tool to a teammate in care delivery. With MedAgentBench, the Stanford team has shown that this is a much more near-term reality, demonstrating the ability of several frontier LLMs to carry out many of the day-to-day clinical tasks a physician would perform.

The team has already noticed performance improvements in the newest versions of these models. With this in mind, Black believes that AI agents might be ready to handle basic “housekeeping” tasks in a clinical setting sooner than previously expected.

“In our follow-up studies, we’ve shown a surprising amount of improvement in the success rate of task execution by newer LLMs, especially when accounting for specific error patterns we observed in the initial study,” Black said. “With deliberate design, safety, structure, and consent, it will be feasible to start moving these tools from research prototypes into real-world pilots.”

The Road Ahead

Black says benchmarks like these are necessary as more hospitals and healthcare systems are incorporating AI into tasks including note-writing and chart summarization.

Accurate and trustworthy AI could also help alleviate a looming crisis, he adds. Pressed by patient needs, compliance demands, and staff burnout, healthcare providers face a worsening global staffing shortage, estimated to exceed 10 million workers by 2030.

Instead of replacing doctors and nurses, Black hopes that AI can be a powerful tool for clinicians, lessening the burden of some of their workload and bringing them back to the patient bedside. 

“I’m passionate about finding solutions to clinician burnout,” Black said. “I hope that by working on agentic AI applications in healthcare that augment our workforce, we can help offload burden from clinicians and divert this impending crisis.”

Paper authors: Yixing Jiang, Kameron C. Black, Gloria Geng, Danny Park, James Zou, Andrew Y. Ng, and Jonathan H. Chen

Read the piece in the New England Journal of Medicine AI.



