AI Research
OpenAI Model Earns Gold-Medal Score at International Math Olympiad and Advances Path to Artificial General Intelligence

A few months before the 2025 International Mathematical Olympiad (IMO) in July, a three-person team at OpenAI made a long bet that they could use the competition’s brutally tough problems to train an artificial intelligence model to think on its own for hours so that it was capable of writing math proofs. Their goal wasn’t simply to create an AI that could do complex math but one that could evaluate ambiguity and nuance—skills AIs will need if they are to someday take on many challenging real-world tasks. In fact, these are precisely the skills required to create artificial general intelligence, or AGI: human-level understanding and reasoning.
The IMO, held this year on Australia’s Sunshine Coast, is the world’s premier math competition for high schoolers, bringing together top contenders from more than 100 countries. All are given the same six problems—three per day, each worth seven points—to solve over two days. But these problems are nothing like what you probably remember from high school. Rather than a brief numeric answer, each demands sustained reasoning and creativity in the form of a pages-long written proof. These logical, step-by-step arguments have to span many fields of mathematics—exactly the sort of problems that, until just this year, AI systems failed at spectacularly.
The OpenAI team of researchers and engineers—Alex Wei, Sheryl Hsu and Noam Brown—used a general-purpose reasoning model: an AI designed to “think” through challenging problems by breaking them into steps, checking its own work and adapting its approach as it goes. Though AI systems couldn’t officially compete as participants, the notoriously tough test served as a demonstration of what they can do: the AIs tackled this year’s questions in the same format and under the same constraints as human participants. Upon receiving the questions, the team’s experimental system worked through two 4.5‑hour sessions, just as the student contestants did, with no external assistance from tools such as search engines or software designed for math. The proofs it produced were graded by three former IMO medalists and posted online. The AI solved five of the six problems correctly, earning 35 of 42 points—the minimum required for an IMO gold medal. (Google DeepMind’s AI system also achieved that score this year.) Of the 630 student competitors, only 26, or about 4 percent, outperformed the AI; five achieved perfect 42s. Given that language-based AI systems like OpenAI’s struggled with elementary math just a year ago, the results represent a dramatic leap in performance.
In the following conversation, Scientific American spoke with two members of the OpenAI team, Alex Wei and Sheryl Hsu, to discuss how they conducted their work, why the model’s lack of response to the sixth question was actually a major step toward addressing AI’s “hallucination” problem and how developing a system capable of writing complex proofs could help lead to artificial general intelligence.
[An edited transcript of the interview follows.]
What led you to suddenly begin preparing an AI model for the IMO just a few months before the competition? What was the spark?
WEI: I had been thinking about math proofs for quite a while. I’m on a team at OpenAI called MathGen. We had just seen the results progress a lot. We felt like we had a shot to get a model that could do really well at the IMO, and we wanted to make a mad dash to get there.
HSU: I used to do math competitions. [Wei] used to do math competitions—he was a lot better than me. The IMO is definitely well known within the [AI research] community, including among researchers at OpenAI. So it was really inspiring to push specifically for that.
Can you talk about your decision to work with a general‑purpose AI system rather than a system that was specifically designed to answer math problems?
WEI: The philosophy is that we want to build general‑purpose AI and develop methods that don’t just work for math. Math is a very good proving ground for AI because it’s fairly objective: if you have a proof, it’s easier to get consensus on whether it’s correct. That’s harder for, say, poetry—you’ll have more disagreement among readers. And IMO problems are very hard, so we wanted to tackle hard problems with general‑purpose methods in the hope that they’ll also apply to domains beyond math.
HSU: I’d also say the goal at OpenAI is to build AGI—it’s not necessarily to write papers or win competitions. It was important that everything we did for this project also be useful for the bigger goal of building AGI and better models that users can actually use.
In what ways could a reasoning model winning a gold in the IMO help lead to AGI?
WEI: One perspective is to think in terms of how long tasks take. A year ago, ChatGPT could only do very basic math problems. Two years ago—and even a year and a half ago—we were often thinking about grade‑school math problems you’d find on fifth‑grade homework. For someone really good at math, those take a second or two to read and solve. Then we started evaluating using AIME [the American Invitational Mathematics Examination, a 15-question high school math contest]. That takes around 10 minutes per problem, with about three hours for 15 problems. The IMO is four and a half hours for just three problems—that’s 90 minutes per problem. ChatGPT started off being good for quick questions. Now it’s better at longer‑running tasks, such as “Can you edit this paragraph for me?” As AI improves, you can expand the time horizon of tasks, and you can see that progression clearly in math.
HSU: Another aspect is that reasoning models were previously very good at tasks that are easy to verify. If you’re solving a non‑proof‑based math problem, there’s one numerically correct answer. It’s easy to check. But in the real world—and in the tasks people actually want help with—it’s more complex. There’s nuance: maybe it’s mostly correct but has some errors; maybe it’s correct but could be stylized better. Proof‑based math isn’t trivial to evaluate. If we think about AGI, those tasks won’t be easy to judge as correct or not; they’ll be more loosely specified and harder overall.
What was the process for training the model?
WEI: In general, reinforcement learning trains a model by rewarding good behavior and penalizing bad behavior. If you repeatedly reinforce good behavior and discourage bad behavior, the model becomes more likely to exhibit the good behavior.
HSU: Toward the end, we also scaled up test‑time compute [how long the AI model is able to “think” before answering]. Previously the model would think about a problem for a few minutes; now we were scaling to hours. That extra thinking time yielded surprising gains. There was a moment when we ran evaluations on our internal test set, which took a long time because of the increased test‑time compute. When we finally looked at the results—and Alex graded them—seeing the progress made me think gold might be within reach. That was pretty exciting.
On the IMO test, the model you developed got five out of six answers correct. But with the sixth question, the model didn’t try to provide an answer. Can you tell me more about the significance of this response?
WEI: The model knowing what it doesn’t know was one of the early signs of [progress] we saw. Today if you use ChatGPT, you’ll sometimes see “hallucinations”—models don’t reliably know when they don’t know. That capability isn’t specific to math. I’d love it if, for everyday questions, the model could honestly say when it doesn’t know instead of giving an answer I’d then have to verify independently.
What kind of impact could your work on this model have on future models?
HSU: Everything we did for this project is fairly general‑purpose—being able to grade outputs that aren’t single answers and to work on hard problems for a long time while making steady progress. Those capabilities contributed a lot to the success here, and now we and others at OpenAI are applying them beyond math. They’re not in GPT‑5, but we’re excited to integrate them into future models.
WEI: If you look at the solutions we publicly posted for the IMO problems, some are very long—five to 10 pages. This model can generate long outputs that are consistent and coherent, without mistakes. Many current state‑of‑the‑art models can’t produce a totally coherent five‑page report. I’m excited that this care and precision will help in many other domains.
AI Research
Chair File: Using Innovation and AI to Advance Health

With all of the challenges facing health care — a shrinking workforce, reduced funding, and the rapid pace of new technologies and pharmaceuticals — change is no longer an option but an imperative. To keep caring for our communities well into the future, we need to transform how we provide care. Technology, artificial intelligence and digital transformation can not only help us mitigate these pressures but also let us truly innovate and find new ways to make health better.
There are many exciting capabilities already making their way into our field. Ambient listening technology for providers, along with other automation and AI, reduces administrative burden and frees up people and resources to improve front-line care. Within the next five years, we expect hospital “smart rooms” to be the norm: rooms that leverage cameras and AI-assisted alerting to improve safety, enable virtual care models across our footprint, and boost efficiency while improving quality and outcomes.
It’s easy to get caught up in shiny new tools or cutting-edge treatments, but often the most impactful innovations are smaller — adapting or designing our systems and processes to empower our teams to do what they do best.
That’s exactly what a new collaboration with the AHA and Epic is aiming to do. A set of point-of-care tools in the electronic health record is helping providers prevent, detect and treat postpartum hemorrhage (PPH), which is responsible for 11% of maternal deaths in the U.S. Early detection and treatment of PPH are key to a full recovery. One small innovation — incorporating these tools into EHR and labor and delivery workflows — is having a big impact: enhancing providers’ ability to effectively diagnose and treat PPH.
It’s critical to leverage technology advancements like this to navigate today’s challenging environment and advance health care into the future. However, at the same time, we also need to focus on how these opportunities can deliver measurable value to our patients, members and the communities we serve.
I will be speaking with Jackie Gerhart, M.D., chief medical officer at Epic, later this month for a Leadership Dialogue conversation. Listen in to learn more about how AI and other technological innovations can better serve patients and make actions more efficient for care providers.
AI Research
Artificial Intelligence Stocks To Add to Your Watchlist – September 14th – MarketBeat
AI Research
AI-Augmented Cybersecurity: A Human-Centered Approach

The integration of artificial intelligence (AI) is fundamentally transforming the cybersecurity landscape. While AI brings unparalleled speed and scale to threat detection, the most effective strategy may lie not in full AI automation but in cultivating collaboration between people with specialized knowledge and AI systems. This article explores AI’s evolving role in cybersecurity, the importance of blending human oversight with technological capabilities, and frameworks to consider.
AI & Human Roles
The role of AI has expanded far beyond simple task automation. It now serves as a powerful tool for augmenting human-led analysis and decision making, helping organizations quickly process vast volumes of security logs and data. This capability can significantly enhance early threat detection and accelerate incident response. With AI-augmented cybersecurity, organizations can identify and address potential threats with unprecedented speed and precision.
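To make that pattern concrete, here is a minimal Python sketch of frequency-based log triage: it scores each event by how rare its combination of fields is and surfaces the rarest for human review. The LogEvent fields and sample data are hypothetical stand-ins; a production platform would use far richer statistical or machine-learning models.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical log schema; field names are illustrative, not from any vendor.
@dataclass(frozen=True)
class LogEvent:
    user: str
    source_ip: str
    action: str

EVENTS = [
    LogEvent("alice", "10.0.0.4", "login_success"),
    LogEvent("alice", "10.0.0.4", "login_success"),
    LogEvent("bob", "10.0.0.7", "login_success"),
    LogEvent("bob", "203.0.113.9", "login_failure"),
    LogEvent("bob", "203.0.113.9", "login_failure"),
    LogEvent("mallory", "198.51.100.2", "password_reset"),
]

def rarity_scores(events):
    """Score each event by how rare its (user, ip, action) combination is.

    Rare combinations score near 1.0 and common ones near 0.0: a crude
    stand-in for the anomaly models a real platform would run.
    """
    counts = Counter((e.user, e.source_ip, e.action) for e in events)
    total = len(events)
    return [(e, 1.0 - counts[(e.user, e.source_ip, e.action)] / total) for e in events]

# Surface the most anomalous events for an analyst rather than auto-acting on them.
for event, score in sorted(rarity_scores(EVENTS), key=lambda p: p[1], reverse=True)[:3]:
    print(f"score={score:.2f}  user={event.user}  ip={event.source_ip}  action={event.action}")
```

The design point is the final loop: the system ranks and presents candidates, while the response decision stays with a person.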
Despite these advancements, the vision of a fully autonomous security operations center (SOC) currently remains more aspirational than practical. AI-powered systems often lack the nuanced contextual understanding and intuitive judgment essential for handling novel or complex attack scenarios. This is where human oversight becomes indispensable. Skilled analysts play an essential role in interpreting AI findings, making strategic decisions, and bringing automated actions in line with the organization’s particular context and policies.
As the cybersecurity industry shifts toward augmentation, a best-fit model is one that utilizes AI to handle repetitive, high-volume tasks while simultaneously preserving human control over critical decisions and direction. This balanced approach combines the speed and efficiency of automation with the insight and experience of human reasoning, creating a scalable, resilient security posture.
Robust Industry Frameworks for AI Integration
The transition toward AI-augmented, human-centered cybersecurity is well represented by frameworks from leading industry platforms. These models provide a road map for organizations to incrementally integrate AI while maintaining the much-needed role of human oversight.
SentinelOne’s Autonomous SOC Maturity Model provides a framework to support organizations on their journey to an autonomous SOC. It emphasizes the strategic use of AI and automation to strengthen human security teams, outlining the progression from manual, reactive security practices to advanced, automated, and proactive approaches in which AI handles repetitive tasks and frees human analysts for strategic work.
SentinelOne has defined its Autonomous SOC Maturity Model as consisting of the following five levels:
- Level 0 (Manual Operations): Security teams rely entirely on manual processes for threat detection, investigation, and response.
- Level 1 (Basic Automation): Introduction of rule-based alerts and simple automated responses for known threat patterns.
- Level 2 (Enhanced Detection): AI-assisted threat detection that flags anomalies while analysts maintain investigation control.
- Level 3 (Orchestrated Response): Automated workflows handle routine incidents while complex cases require human intervention.
- Level 4 (Autonomous Operations): Advanced AI manages most security operations with strategic human oversight and exception handling.
This progression demonstrates that achieving sophisticated security automation requires gradual capability building rather than a full-scale overhaul of systems and processes. At each level, humans remain essential for strategic decision making, policy alignment, and handling cases that fall outside automated parameters. Even at Level 4, the highest maturity level, human oversight remains essential for effective, accurate operations.
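As an illustration only, the escalation logic implied by these levels can be sketched in Python. The SOCMaturity names and the requires_human rules below are this article’s paraphrase of the model, not SentinelOne code.

```python
from enum import IntEnum

class SOCMaturity(IntEnum):
    MANUAL = 0        # Level 0: fully manual detection, investigation, response
    BASIC = 1         # Level 1: rule-based alerts, simple auto-responses
    ENHANCED = 2      # Level 2: AI-assisted detection, analyst-led investigation
    ORCHESTRATED = 3  # Level 3: automated workflows for routine incidents
    AUTONOMOUS = 4    # Level 4: AI-managed operations, human exception handling

def requires_human(maturity: SOCMaturity, routine: bool, known_pattern: bool) -> bool:
    """Return True if an incident should be escalated to a human analyst.

    The rules encode the point above: even at high maturity, anything
    outside well-understood, automated parameters goes to a person.
    """
    if maturity == SOCMaturity.MANUAL:
        return True                             # everything is handled manually
    if maturity == SOCMaturity.BASIC:
        return not known_pattern                # only known patterns are automated
    if maturity in (SOCMaturity.ENHANCED, SOCMaturity.ORCHESTRATED):
        return not (routine and known_pattern)  # routine, known cases run alone
    return not known_pattern                    # AUTONOMOUS: humans take exceptions

print(requires_human(SOCMaturity.ORCHESTRATED, routine=True, known_pattern=True))   # False
print(requires_human(SOCMaturity.AUTONOMOUS, routine=False, known_pattern=False))   # True
```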
Elastic, another leading platform, centers on supporting security analysts with AI-driven insights rather than replacing human judgment. Its approach integrates machine learning algorithms to automatically detect anomalies, correlate events, and uncover subtle threats within large data sets. For example, when unusual network patterns emerge, the system doesn’t automatically initiate response actions; instead it presents analysts with enriched data, relevant context, and suggested investigation paths.
A key strength of Elastic’s model is its emphasis on analyst empowerment. Rather than automating decisions, the platform provides security professionals with enhanced visibility and context. This approach recognizes that cybersecurity fundamentally remains a strategic challenge requiring human insight, creativity, and contextual understanding. AI serves as a force multiplier, helping analysts process information efficiently so they can focus their time on high-value activities.
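A minimal sketch of that enrich-and-suggest pattern follows, assuming hypothetical lookup tables (ASSET_DB and THREAT_INTEL are illustrative stand-ins, not Elastic APIs): the system attaches context and proposed next steps, and the analyst decides what happens.

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    host: str
    description: str
    context: dict = field(default_factory=dict)
    suggested_steps: list = field(default_factory=list)

# Illustrative enrichment sources; a real deployment would query asset
# inventories, threat-intelligence feeds, and historical incident data.
ASSET_DB = {"web-01": {"owner": "platform-team", "criticality": "high"}}
THREAT_INTEL = {"203.0.113.9": "known scanner (community feed)"}

def enrich(alert: Alert, source_ip: str) -> Alert:
    """Attach context and suggested next steps instead of auto-responding."""
    alert.context["asset"] = ASSET_DB.get(alert.host, {"criticality": "unknown"})
    alert.context["intel"] = THREAT_INTEL.get(source_ip, "no matches")
    alert.suggested_steps = [
        f"Review recent logins on {alert.host}",
        f"Check outbound connections to {source_ip}",
        "Compare against the last 30 days of baseline traffic",
    ]
    return alert  # the analyst, not the system, chooses the response

alert = enrich(Alert("web-01", "Unusual outbound traffic"), "203.0.113.9")
print(alert.context)
print(alert.suggested_steps)
```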
The Modern SOC
While AI in cybersecurity is sometimes framed as a path toward full automation, security operations can instead be structured around human-AI collaboration that doesn’t replace humans but amplifies their capabilities. This view recognizes that security remains a human-versus-human challenge. Harvard Business School professor Karim Lakhani puts it this way: “AI won’t replace humans, but humans with AI will replace humans without AI.”
Applying this principle to security operations, the question becomes: who will win in cyberspace? Likely the team that responsibly adapts its operational processes to incorporate AI’s advantages; that team will be well positioned to defend against quickly evolving threat tactics, techniques, and procedures. A fully autonomous, human-free SOC is not a current reality, but the SOC that embraces AI as a complement to people, not a replacement for them, may well be the one that creates a competitive advantage in cyber defense.
In practice, this approach can simplify traditional tiered SOC structures, helping analysts handle incidents end-to-end while leveraging AI for speed, context, and insight. This can help organizations improve efficiency, accountability, and resilience against evolving threats.
Best Practices for AI-Augmented Security
Building effective, AI-augmented security operations requires intentional design principles that prioritize human capabilities alongside technological advancements.
Successful implementations often focus AI automation on high-volume, routine activities that consume analyst time and don’t require complex reasoning. These activities include the following (a simple scoring sketch follows the list):
- Initial alert triage: AI systems can categorize and prioritize incoming security alerts based on severity, asset importance, and historical patterns.
- Data enrichment: Automating the gathering of relevant contextual information from multiple sources can support analyst investigations.
- Standard response actions: Predetermined responses can be triggered for well-understood threats, e.g., isolating compromised endpoints or blocking known malicious IP addresses.
- Report generation: Investigation findings and incident summaries can be compiled for stakeholder communication.
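As a rough illustration of the first item, triage can be reduced to a weighted score that orders the analyst queue; the weights and sample alerts below are hypothetical and would need tuning to a real environment.

```python
# Hypothetical weights; tune to your own severity taxonomy and asset inventory.
SEVERITY_WEIGHT = {"low": 1, "medium": 3, "high": 7, "critical": 10}
ASSET_WEIGHT = {"workstation": 1, "server": 3, "domain-controller": 8}

def triage_score(severity: str, asset_type: str, past_true_positives: int) -> int:
    """Combine severity, asset importance, and alert history into a priority."""
    return (SEVERITY_WEIGHT.get(severity, 1)
            * ASSET_WEIGHT.get(asset_type, 1)
            + past_true_positives)

alerts = [
    ("failed-login burst", "medium", "domain-controller", 4),
    ("AV signature hit", "low", "workstation", 0),
    ("suspected C2 beacon", "critical", "server", 2),
]

# Highest-scoring alerts reach analysts first; low scores can wait or auto-close.
for name, sev, asset, hist in sorted(alerts, key=lambda a: -triage_score(a[1], a[2], a[3])):
    print(f"{triage_score(sev, asset, hist):>3}  {name}")
```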
By handling these routine tasks, AI can give analysts time to focus on activities that require advanced reasoning and skill, such as threat hunting, strategic planning, policy development, and navigating attack scenarios.
In addition, traditional SOC structures often fragment incident handling across multiple tiers, which can lead to communication gaps and delayed responses. Human-centered security operations may benefit from giving individual analysts end-to-end case ownership, supported by AI tools that streamline the steps needed for investigation and response.
By allowing more extensive case ownership, security teams can reduce handoff delays and scale incident management. AI-embedded tools can support security teams with enhanced reporting, investigation assistance, and intelligent recommendations throughout the incident lifecycle.
Practical Recommendations
Implementing AI-augmented cybersecurity requires systematic planning and deployment. Security leaders can follow these practical steps to build human-centered security operations. To begin, review your organization’s current SOC maturity across key dimensions, including:
Automation Readiness
- What percentage of security alerts currently receive manual review?
- Which routine tasks take the most analyst time?
- How standardized are your operations playbooks and/or incident response procedures?
Data Foundation
- Do you have a complete and verified asset inventory with network visibility?
- Are security logs centralized and easily searchable?
- Can you correlate events across disparate data sources and security tools?
Team Capabilities
- What is your analyst retention rate and average tenure?
- How quickly can new team members get up to speed?
- What skills gaps exist in your current team?
Tool Selection Considerations
Effective AI-augmented security requires tools that can support human-AI collaboration rather than promising unrealistic automation. Review potential solutions based on:
Integration Capabilities
- How well do tools integrate with your existing security infrastructure?
- Can the platform adapt to your organization’s specific policies and procedures?
- Does the vendor provide application programming interface (API) integrations?
Transparency & Explainable AI
- Can analysts understand how AI systems reach their conclusions?
- Are there clear mechanisms for providing feedback to improve AI accuracy?
- Can you audit and validate automated decisions?
Scalability & Flexibility
- Can the platform grow with your organization’s needs?
- How easily can you modify automated workflows as threats evolve?
- What support is available for ongoing use?
Measuring Outcomes
Tool selection is only part of the equation; measuring outcomes is just as important. To help align your AI-augmented security strategy with your organization’s goals, consider tracking metrics that demonstrate both operational efficiency and enhanced analyst effectiveness, such as the following (a brief calculation sketch follows the operational list):
Operational Metrics
- Mean time to detect
- Mean time to respond
- Mean time to investigate
- Mean time to close
- Percentage of alerts that can be automatically triaged and prioritized
- Analyst productivity measured by high-value activities rather than ticket volume
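For instance, the mean-time metrics can be computed directly from incident timestamps. This sketch assumes a simple record of when each incident occurred, was detected, was first responded to, and was closed; the sample data are invented.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (occurred, detected, responded, closed).
INCIDENTS = [
    (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 9, 20),
     datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 15, 0)),
    (datetime(2024, 5, 3, 14, 0), datetime(2024, 5, 3, 14, 5),
     datetime(2024, 5, 3, 14, 30), datetime(2024, 5, 3, 18, 0)),
]

def mean_minutes(pairs):
    """Average gap between paired timestamps, in minutes."""
    return mean((later - earlier).total_seconds() / 60 for earlier, later in pairs)

mttd = mean_minutes((occ, det) for occ, det, _, _ in INCIDENTS)        # time to detect
mttr = mean_minutes((det, resp) for _, det, resp, _ in INCIDENTS)      # time to respond
mttc = mean_minutes((occ, closed) for occ, _, _, closed in INCIDENTS)  # time to close
print(f"MTTD={mttd:.0f} min  MTTR={mttr:.0f} min  MTTC={mttc:.0f} min")
```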
Strategic Metrics
- Analyst job satisfaction and retention rates
- Time invested in proactive threat hunting versus reactive incident response
- Organizational resilience measured through red/blue/purple team exercises and simulations
How Forvis Mazars Can Help
The future of proactive cybersecurity isn’t about choosing between human skill and AI, but rather lies in thoughtfully combining their complementary strengths. AI excels at processing massive amounts of data, identifying patterns, and executing consistent responses to known threats. Humans excel at providing contextual understanding, creative problem-solving, and strategic judgment, which are essential skills for addressing novel and complex security challenges.
Organizations that embrace this collaborative approach can position themselves to build more resilient, scalable, and effective security operations. Rather than pursuing the lofty and perhaps unrealistic goal of full automation, consider focusing on creating systems where AI bolsters human capabilities and helps security professionals deliver their best work.
The journey toward AI-augmented cybersecurity necessitates careful planning, gradual implementation, and continual refinement. By following the frameworks and best practices outlined in this article, security leaders can build operations that leverage both human intelligence and artificial intelligence to protect their organizations in an increasingly complex threat landscape.
Ready to explore how AI-augmented cybersecurity can strengthen your organization’s security posture? The Managed Services team at Forvis Mazars has certified partnerships with SentinelOne and Elastic. Contact us to discuss tailored solutions.