AI Research
OpenAI Model Earns Gold-Medal Score at International Math Olympiad and Advances Path to Artificial General Intelligence

A few months before the 2025 International Mathematical Olympiad (IMO) in July, a three-person team at OpenAI made a long-shot bet: that they could use the competition’s brutally tough problems to train an artificial intelligence model to think on its own for hours and write full math proofs. Their goal wasn’t simply to create an AI that could do complex math but one that could weigh ambiguity and nuance—skills AIs will need if they are to someday take on challenging real-world tasks. These are precisely the skills required for artificial general intelligence, or AGI: human-level understanding and reasoning.
The IMO, held this year on Australia’s Sunshine Coast, is the world’s premier math competition for high schoolers, bringing together top contenders from more than 100 countries. All are given the same six problems—three per day, each worth seven points—to solve over two days. But these problems are nothing like what you probably remember from high school. Rather than a brief numeric answer, each demands sustained reasoning and creativity in the form of a pages-long written proof. These logical, step-by-step arguments have to span many fields of mathematics—exactly the sort of problems that, until just this year, AI systems failed at spectacularly.
The OpenAI team of researchers and engineers—Alex Wei, Sheryl Hsu and Noam Brown—used a general-purpose reasoning model: an AI designed to “think” through challenging problems by breaking them into steps, checking its own work and adapting its approach as it goes. Though AI systems couldn’t officially compete as participants, the notoriously tough test served as a demonstration of what they can do: the AIs tackled this year’s questions in the same format and under the same constraints as the human contestants. Upon receiving the questions, the team’s experimental system worked through two 4.5-hour sessions, just as the student contestants did, with no external assistance from tools such as search engines or math software. The proofs it produced were graded by three former IMO medalists and posted online. The AI solved five of the six problems correctly, earning 35 out of 42 points—the minimum score for an IMO gold medal. (Google DeepMind’s AI system also achieved that score this year.) Of the 630 student competitors, only 26, or about 4 percent, outperformed the AI; five students achieved perfect 42s. Given that a year ago language-based AI systems such as OpenAI’s struggled with elementary math, the results represent a dramatic leap in performance.
In the following conversation, Scientific American spoke with two members of the OpenAI team, Alex Wei and Sheryl Hsu, to discuss how they conducted their work, why the model’s lack of response to the sixth question was actually a major step toward addressing AI’s “hallucination” problem and how developing a system capable of writing complex proofs could help lead to artificial general intelligence.
[An edited transcript of the interview follows.]
What led you to suddenly begin preparing an AI model for the IMO just a few months before the competition? What was the spark?
WEI: I had been thinking about math proofs for quite a while. I’m on a team at OpenAI called MathGen. We had just seen the results progress a lot. We felt like we had a shot to get a model that could do really well at the IMO, and we wanted to make a mad dash to get there.
HSU: I used to do math competitions. [Wei] used to do math competitions—he was a lot better than me. The IMO is definitely well known within the [AI research] community, including among researchers at OpenAI. So it was really inspiring to push specifically for that.
Can you talk about your decision to work with a general‑purpose AI system rather than a system that was specifically designed to answer math problems?
WEI: The philosophy is that we want to build general‑purpose AI and develop methods that don’t just work for math. Math is a very good proving ground for AI because it’s fairly objective: if you have a proof, it’s easier to get consensus on whether it’s correct. That’s harder for, say, poetry—you’ll have more disagreement among readers. And IMO problems are very hard, so we wanted to tackle hard problems with general‑purpose methods in the hope that they’ll also apply to domains beyond math.
HSU: I’d also say the goal at OpenAI is to build AGI—it’s not necessarily to write papers or win competitions. It was important that everything we did for this project also be useful for the bigger goal of building AGI and better models that users can actually use.
In what ways could a reasoning model winning a gold in the IMO help lead to AGI?
WEI: One perspective is to think in terms of how long tasks take. A year ago, ChatGPT could only do very basic math problems. Two years ago—and even a year and a half ago—we were often thinking about grade‑school math problems you’d find on fifth‑grade homework. For someone really good at math, those take a second or two to read and solve. Then we started evaluating using AIME [the American Invitational Mathematics Examination, a 15-question high school math contest]. That takes around 10 minutes per problem, with about three hours for 15 problems. The IMO is four and a half hours for just three problems—that’s 90 minutes per problem. ChatGPT started off being good for quick questions. Now it’s better at longer‑running tasks, such as “Can you edit this paragraph for me?” As AI improves, you can expand the time horizon of tasks, and you can see that progression clearly in math.
HSU: Another aspect is that reasoning models were previously very good at tasks that are easy to verify. If you’re solving a non‑proof‑based math problem, there’s one numerically correct answer. It’s easy to check. But in the real world—and in the tasks people actually want help with—it’s more complex. There’s nuance: maybe it’s mostly correct but has some errors; maybe it’s correct but could be stylized better. Proof‑based math isn’t trivial to evaluate. If we think about AGI, those tasks won’t be easy to judge as correct or not; they’ll be more loosely specified and harder overall.
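To make that verifiability gap concrete, here is a minimal sketch (not from the interview; the function names are hypothetical): checking a single numeric answer is a one-line comparison, while a free-form proof has no exact-match oracle and needs an expert grader or a trained reward model.

```python
def check_final_answer(submitted: str, expected: str) -> bool:
    # Easy to verify: one canonical numeric answer, exact comparison suffices.
    return submitted.strip() == expected.strip()

def grade_proof(proof_text: str) -> float:
    # Hard to verify: a multi-page argument has no exact-match oracle.
    # Judging soundness, gaps and partial credit requires an expert
    # grader (human or model) -- left as a placeholder here.
    raise NotImplementedError("needs an expert grader or a trained reward model")

assert check_final_answer(" 42 ", "42")  # instant and unambiguous
```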
What was the process for training the model?
WEI: In general, reinforcement learning trains a model by rewarding good behavior and penalizing bad behavior. If you repeatedly reinforce good behavior and discourage bad behavior, the model becomes more likely to exhibit the good behavior.
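As a minimal illustration of that reward loop—a toy two-armed bandit with a REINFORCE-style update, not OpenAI’s actual training setup—rewarded behavior becomes steadily more probable under the policy:

```python
import math
import random

# Toy policy over two actions, parameterized by a single logit.
logit = 0.0

def prob_action1():
    return 1.0 / (1.0 + math.exp(-logit))

for step in range(2000):
    p = prob_action1()
    action = 1 if random.random() < p else 0
    reward = 1.0 if action == 1 else -1.0  # "good" behavior: action 1
    # REINFORCE update: nudge the policy toward rewarded actions.
    grad = action - p                      # d(log pi) / d(logit)
    logit += 0.1 * reward * grad

print(f"P(good action) after training: {prob_action1():.3f}")  # approaches 1
```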
HSU: Toward the end, we also scaled up test-time compute [how long the AI model is able to “think” before answering]. Previously the model might think for a few minutes; now we were scaling to hours. That extra thinking time gave surprising gains. There was a moment when we ran evaluations on our internal test set that took a long time because of the increased test-time compute. When we finally looked at the results—and Alex graded them—seeing the progress made me think gold might be within reach. That was pretty exciting.
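One generic way to spend extra test-time compute is best-of-n sampling: draw more candidate solutions and keep the best-scoring one. The sketch below assumes hypothetical `generate_candidate` and `score` callables standing in for a model sampler and a grader; it is an illustration of the general idea, not necessarily the team’s method.

```python
def solve_with_budget(problem, n_samples, generate_candidate, score):
    """Spend more compute by sampling more attempts and keeping the best."""
    best, best_score = None, float("-inf")
    for _ in range(n_samples):
        candidate = generate_candidate(problem)      # one full reasoning attempt
        candidate_score = score(problem, candidate)  # grader / verifier signal
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best

# Scaling from minutes to hours of "thinking" amounts to raising n_samples
# (or the length of each attempt) and paying the extra compute.
```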
On the IMO test, the model you developed got five out of six answers correct. But with the sixth question, the model didn’t try to provide an answer. Can you tell me more about the significance of this response?
WEI: The model knowing what it doesn’t know was one of the early signs of [progress] we saw. Today if you use ChatGPT, you’ll sometimes see “hallucinations”—models don’t reliably know when they don’t know. That capability isn’t specific to math. I’d love it if, for everyday questions, the model could honestly say when it doesn’t know instead of giving an answer I must verify independently.
What kind of impact could your work on this model have on future models?
HSU: Everything we did for this project is fairly general‑purpose—being able to grade outputs that aren’t single answers and to work on hard problems for a long time while making steady progress. Those contributed a lot to the success here, and now we and others at OpenAI are applying them beyond math. These capabilities aren’t in GPT‑5, but we’re excited to integrate them into future models.
WEI: If you look at the solutions we publicly posted for the IMO problems, some are very long—five to 10 pages. This model can generate long outputs that are consistent and coherent, without mistakes. Many current state‑of‑the‑art models can’t produce a totally coherent five‑page report. I’m excited that this care and precision will help in many other domains.
AI Research
Researchers ‘polarised’ over use of AI in peer review

Researchers appear to be growing more divided over whether generative artificial intelligence should be used in peer review, with a survey showing entrenched views on both sides.
A poll by IOP Publishing found a sharp increase in the proportion of scholars who are positive about the potential impact of the technology on the process, which is often criticised for being slow and overly burdensome for those involved.
A total of 41 per cent of respondents now see the benefits of AI, up from 12 per cent in a similar survey carried out last year. But this is almost equal to the proportion with negative opinions, which stands at 37 per cent after a two-percentage-point year-on-year increase.
This leaves only 22 per cent of researchers neutral or unsure about the issue, down from 36 per cent, which IOP said indicates a “growing polarisation in views” as AI use becomes more commonplace.
Women tended to have more negative views about the impact of AI than men, while junior researchers tended to be more positive than their more senior colleagues.
Nearly a third (32 per cent) of those surveyed say they have already used AI tools to support their peer reviews in some form.
Half of these say they apply the technology in more than one way, with the most common use being help with editing grammar and improving the flow of text.
A minority used it in more questionable ways, such as the 13 per cent who asked the AI to summarise an article they were reviewing – despite confidentiality and data privacy concerns – and the 2 per cent who admitted to uploading an entire manuscript into a chatbot so it could generate a review on their behalf.
IOP – which currently does not allow AI use in peer reviews – said the survey showed a growing recognition that the technology has the potential to “support, rather than replace, the peer review process”.
But publishers must find ways to “reconcile” the two opposing viewpoints, it added.
A solution could be to develop tools that operate within peer review software, it said, supporting reviewers without posing security or integrity risks.
Publishers should also be more explicit and transparent about why chatbots “are not suitable tools for fully authoring peer review reports”, IOP said.
“These findings highlight the need for clearer community standards and transparency around the use of generative AI in scholarly publishing. As the technology continues to evolve, so too must the frameworks that support ethical and trustworthy peer review,” Laura Feetham-Walker, reviewer engagement manager at IOP and lead author of the study, said.
AI Research
Amazon Employing AI to Help Shoppers Comb Reviews

Amazon earlier this year began rolling out artificial intelligence-voiced product descriptions for select customers and products.
AI Research
Nubank To Continue Leveraging AI To Enhance Digital Financial Services In Latin America

Nubank (NYSE: NU) reportedly serves millions of customers across Latin America. Recently, the company’s Chief Technology Officer, Eric Young, shared his vision for leveraging artificial intelligence to fuel Nubank’s global expansion and improve financial services.
During a recent discussion, Young outlined how AI is not just a tool but a cornerstone for operational efficiency, customer-centric growth, and democratizing access to personalized finance.
With a career that includes work at Amazon in the early 2000s, Young brings a philosophy of prioritizing customer experience.
At Amazon, he witnessed firsthand how technology could transform user experiences, a mindset he now applies to Nubank’s mission. “If not us, then who?” Young asked rhetorically during the videocast, underscoring Nubank’s unique position to disrupt traditional banking.
Founded in Brazil in 2013, Nubank has positively impacted the financial sector by prioritizing financial inclusion and superior customer service, challenging legacy banks with its digital-first approach.
Under Young’s leadership, Nubank’s priorities are clear: enhance agility, expand internationally, and harness AI to serve customers better.
He emphasized the need for cross-functional collaboration, particularly with the product and design teams.
This includes partnering with Nubank’s recently appointed Chief Design Officer (CDO), Ethan Eismann, to iterate quickly on new features.
By fostering a culture of testing and learning, Young aims to deliver products that not only meet but exceed user expectations, ultimately capturing a larger market share.
This involves deepening engagement with existing users, attracting new ones, and venturing into underserved markets where financial services remain inaccessible.
Central to Young’s strategy is AI’s transformative potential.
Nubank’s 2024 acquisition of Hyperplane, an AI-focused startup, marks a pivotal step in this direction.
Young highlighted how advanced language models—such as those powering ChatGPT and Google Gemini—can bridge the gap between everyday users and elite financial advisory services.
These models excel at processing vast amounts of data, including transaction histories, to offer hyper-personalized recommendations.
Imagine an AI that automates budgeting, predicts spending patterns, and suggests investment opportunities tailored to an individual’s financial profile, all without the hefty fees of traditional private banking.
Young drew a parallel to the exclusivity of high-end services.
Historically, that level of personalized private banking was reserved for the ultra-wealthy, but Nubank’s vision is to make it ubiquitous.
“We’re democratizing access to hyper-personalized financial experiences,” he said.
By analyzing user data ethically and securely, AI can empower customers from all segments—whether a small business owner in Mexico or a young professional in Colombia—to manage their finances with the precision once afforded only to elites.
This aligns with Nubank’s core ethos of inclusion, ensuring that technology serves as an equalizer rather than a divider.
Looking ahead, Young sees AI as the engine for Nubank’s platformization efforts, enabling scalable solutions that support international growth.
As Nubank eyes further expansion beyond Brazil, Mexico, and Colombia, AI will streamline operations, from fraud detection to customer support chatbots, reducing costs while enhancing reliability.
Yet, Young cautioned that success hinges on responsible implementation—prioritizing privacy, transparency, and human oversight to build trust.
In an era where fintechs aggressively compete for market share, Eric Young’s insights position Nubank not just as a bank, but as a key player in AI-powered financial services.
By blending technological prowess with a focus on the customer, Nubank is set to transform money management, making various services more accessible to consumers.
As Young suggested, the question isn’t whether AI will change finance but how Nubank can make a positive impact as it does.