‘The illusion of thinking’: Apple research finds AI models collapse and give up with hard puzzles

New artificial intelligence research from Apple shows AI reasoning models may not be “thinking” so well after all.

According to a paper published just days before Apple’s WWDC event, large reasoning models (LRMs) — like OpenAI o1 and o3, DeepSeek R1, Claude 3.7 Sonnet Thinking, and Google Gemini Flash Thinking — completely collapse when they’re faced with increasingly complex problems. The paper comes from the same researchers who found other reasoning flaws in LLMs last year.

The news was a bucket of cold water for artificial general intelligence (AGI) optimists (and welcome news for AI and AGI skeptics), as Apple’s research seemed to show damning evidence about the limitations of reasoning model intelligence. While the much-hyped LRMs performed better than standard LLMs on medium-difficulty puzzles, they performed worse on simple ones. And according to Apple’s research, when they faced hard puzzles, they collapsed completely, giving up on the problem prematurely.

Or, as the Apple researchers put it, while AI models perform extremely well at math and coding, when it comes to more complex problems, they only provide “The Illusion of Thinking.”

Apple was slow to develop large language models and implement AI in its devices, largely staying out of the conversation. The company has added Apple Intelligence AI features, though they have generally been considered underwhelming. In fact, after WWDC 2025, it’s clear that Apple is going in a different direction with AI than the rest of the industry. With that in mind, this research might explain some of Apple’s reticence to go all-in, unlike Google and Samsung, which have frontloaded their devices with AI capabilities.

How Apple researchers tested reasoning skills

The problems researchers used to evaluate the reasoning models are classic logic puzzles like the Tower of Hanoi. That puzzle consists of discs stacked largest to smallest on one of three pegs, and the goal is to move the whole stack to the third peg, moving one disc at a time and never placing a larger disc on top of a smaller one. Other puzzles included jumping checker pieces into empty spaces, the river-crossing problem (the one usually involving a fox, a chicken, and a bag of grain), and stacking blocks in a specific configuration.
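As a rough illustration (not code from the Apple paper), here is what the fixed procedure the models are asked to execute looks like: a minimal recursive Tower of Hanoi solver in Python. The function name and peg labels are arbitrary.

```python
def hanoi(n, source, target, spare, moves=None):
    """Return the full move list for n discs from `source` to `target`."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)   # clear the way to the largest disc
    moves.append((source, target))               # move the largest disc directly
    hanoi(n - 1, spare, target, source, moves)   # restack the smaller discs on top
    return moves

# Difficulty grows exponentially: 2**n - 1 moves are required.
print(len(hanoi(3, "A", "C", "B")))  # 7
print(len(hanoi(8, "A", "C", "B")))  # 255
```

The optimal solution takes 2^n - 1 moves, so each extra disc roughly doubles the length of the move sequence a model must produce without a single misstep, which is why the researchers could dial up difficulty simply by adding discs.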


You probably recognize these logic puzzles from math class or online games, since they’re a simple way of testing humans’ ability to reason and problem-solve. Once you figure out the trick, solving a puzzle is just a matter of following the same logic as the complexity increases, which in this case means more discs, checkers, animals, or blocks. However, researchers found that LRMs start to fail after a certain point.

“Results show that all reasoning models exhibit a similar pattern with respect to complexity: accuracy progressively declines as problem complexity increases until reaching complete collapse (zero accuracy) beyond a model-specific complexity threshold,” researchers wrote. In the results shown, Claude 3.7 Sonnet + thinking and DeepSeek R1 start to fail when a fifth disc is added to the Tower of Hanoi problem. Even when more computing power is applied to the LRMs, they still fail at the more complex puzzles.

What’s more, researchers found that reasoning models initially apply more thinking tokens as complexity increases, but they actually give up at a certain point. “Upon approaching a critical threshold — which closely corresponds to their accuracy collapse point — models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty,” the paper read. So when the problems get harder, the models spend fewer tokens, or “think” less.

But what about when the LRMs are given the answers? Nope, accuracy still doesn’t improve. Even when researchers included the solution algorithm in the prompt, so that all the models had to do was follow the steps, they continued to fail.

But before you fire up the grill because LLM reasoning is so cooked, season these findings with a grain of salt. The research doesn’t mean LRMs don’t reason at all; it just means they may not currently be much smarter than humans. As AI expert Gary Marcus pointed out on his blog, “(ordinary) humans actually have a bunch of (well-known) limits that parallel what the Apple team discovered. Many (not all) humans screw up on versions of the Tower of Hanoi with 8 discs.” As others have pointed out online, the research does not compare these results against human attempts at the same puzzles.

Essentially, LLMs have their uses for tasks like coding and writing, but they also have weaknesses. “What the Apple paper shows, most fundamentally, regardless of how you define AGI, is that LLMs are no substitute for good well-specified conventional algorithms,” wrote Marcus, who has been very vocal about the reasoning limitations of AI models.

That’s to say, take the findings from Apple researchers for what they are: important data to be considered within the context of other LLM research. It’s tempting to categorize AI’s overall advancements as overhyped when new research like this comes out. Or, on the flip side, for AGI boosters to claim victory when research has discovered new advancements. But the reality is usually somewhere in the boring middle.







Radiomics-Based Artificial Intelligence and Machine Learning Approach for the Diagnosis and Prognosis of Idiopathic Pulmonary Fibrosis: A Systematic Review – Cureus



A Real-Time Look at How AI Is Reshaping Work : Information Sciences Institute

Artificial intelligence may take over some tasks and transform others, but one thing is certain: it’s reshaping the job market. Researchers at USC’s Information Sciences Institute (ISI) analyzed LinkedIn job postings and AI-related patent filings to measure which jobs are most exposed, and where those changes are happening first. 

The project was led by ISI research assistant Eun Cheol Choi, working with students in a graduate-level USC Annenberg data science course taught by USC Viterbi Research Assistant Professor Luca Luceri. The team developed an “AI exposure” score to measure how closely each role is tied to current AI technologies. A high score suggests the job may be affected by automation, new tools, or shifts in how the work is done. 
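The article doesn’t include the team’s scoring code, but the general idea of an exposure score can be sketched as a text-similarity problem: compare the language of a job posting against a corpus of AI patent abstracts and treat high overlap as high exposure. The snippet below is a hypothetical minimal version using TF-IDF cosine similarity; the corpora, job titles, and the choice of similarity measure are all illustrative assumptions, not the ISI team’s actual method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the real corpora (LinkedIn postings, AI patent filings).
patent_abstracts = [
    "Large language model system for generating text and source code",
    "Neural network for detecting anomalies in medical images",
]
job_postings = {
    "data scientist": "Train and deploy large language models and data pipelines",
    "paralegal": "Prepare legal documents and coordinate with clients and courts",
}

# Fit one shared vocabulary, then score each job by its best match to any patent:
# higher similarity is read here as higher assumed AI exposure.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(patent_abstracts + list(job_postings.values()))
patent_vecs = matrix[: len(patent_abstracts)]
job_vecs = matrix[len(patent_abstracts):]

for title, sims in zip(job_postings, cosine_similarity(job_vecs, patent_vecs)):
    print(f"{title}: exposure ~ {sims.max():.2f}")
```

In the real study, scores of this kind would be computed over large numbers of postings and patents and then aggregated by industry and role.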

Which Industries Are Most Exposed to AI?

To understand how exposure shifted with new waves of innovation, the researchers compared patent data from before and after a major turning point. “We split the patent dataset into two parts, pre- and post-ChatGPT release, to see how job exposure scores changed in relation to fresh innovations,” Choi said. Released in late 2022, ChatGPT triggered a surge in generative AI development, investment, and patent filings.

Jobs in wholesale trade, transportation and warehousing, information, and manufacturing topped the list in both periods. Retail also showed high exposure early on, while healthcare and social assistance rose sharply after ChatGPT, likely due to new AI tools aimed at diagnostics, medical records, and clinical decision-making.

In contrast, education and real estate consistently showed low exposure, suggesting they are, at least for now, less likely to be reshaped by current AI technologies.

AI’s Reach Depends on the Role

AI exposure doesn’t just vary by industry; it also depends on the specific type of work. Jobs like software engineer and data scientist scored highest, since they involve building or deploying AI systems. Roles in manufacturing and repair, such as maintenance technician, also showed elevated exposure due to increased use of AI in automation and diagnostics.

At the other end of the spectrum, jobs like tax accountant, HR coordinator, and paralegal showed low exposure. They center on work that’s harder for AI to automate: nuanced reasoning, domain expertise, or dealing with people.

AI Exposure and Salary Don’t Always Move Together

The study also examined how AI exposure relates to pay. In general, jobs with higher exposure to current AI technologies were associated with higher salaries, likely reflecting the demand for new AI skills. That trend was strongest in the information sector, where software and data-related roles were both highly exposed and well compensated.

But in sectors like wholesale trade and transportation and warehousing, the opposite was true. Jobs with higher exposure in these industries tended to offer lower salaries, especially at the highest exposure levels. The researchers suggest this may signal the early effects of automation, where AI is starting to replace workers instead of augmenting them.

“In some industries, there may be synergy between workers and AI,” said Choi. “In others, it may point to competition or replacement.”
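To make that salary finding concrete, here is a hypothetical sketch of the per-industry check it implies: correlate exposure scores with posted salaries within each sector. The data frame, numbers, and column names are invented for illustration, not drawn from the paper.

```python
import pandas as pd

# Invented example rows: (industry, AI exposure score, posted salary in USD).
postings = pd.DataFrame(
    [
        ("information", 0.9, 165_000), ("information", 0.7, 140_000),
        ("information", 0.4, 110_000), ("transportation", 0.8, 42_000),
        ("transportation", 0.5, 55_000), ("transportation", 0.2, 61_000),
    ],
    columns=["industry", "exposure", "salary"],
)

# Within each industry, a positive correlation suggests AI skills command a premium;
# a negative one is consistent with the replacement pattern the study flags.
for industry, group in postings.groupby("industry"):
    print(industry, round(group["exposure"].corr(group["salary"]), 2))
```

In this toy data, the information sector shows a strongly positive exposure-salary correlation and transportation a negative one, mirroring the split the researchers describe.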

From Class Project to Ongoing Research

The contrast between industries where AI complements workers and those where it may replace them is something the team plans to investigate further. They hope to build on their framework by distinguishing between different types of impact — automation versus augmentation — and by tracking the emergence of new job categories driven by AI. “This kind of framework is exciting,” said Choi, “because it lets us capture those signals in real time.”

Luceri emphasized the value of hands-on research in the classroom: “It’s important to give students the chance to work on relevant and impactful problems where they can apply the theoretical tools they’ve learned to real-world data and questions,” he said. The paper, Mapping Labor Market Vulnerability in the Age of AI: Evidence from Job Postings and Patent Data, was co-authored by students Qingyu Cao, Qi Guan, Shengzhu Peng, and Po-Yuan Chen, and was presented at the 2025 International AAAI Conference on Web and Social Media (ICWSM), held June 23-26 in Copenhagen, Denmark.

Published on July 7th, 2025

Last updated on July 7th, 2025





SERAM collaborates on AI-driven clinical decision project

The Spanish Society of Medical Radiology (SERAM) has collaborated with six other scientific societies to develop an AI-supported urology clinical decision-making project called Uro-Oncogu(IA)s.

Uro-Oncogu(IA)s project team. Image courtesy of SERAM.

The initiative produced an algorithm that will “reduce time and clinical variability” in the management of urological patients, the society said. SERAM’s collaborators include the Spanish Urology Association (AEU), the Foundation for Research in Urology (FIU), the Spanish Society of Pathological Anatomy (SEAP), the Spanish Society of Hospital Pharmacy (SEFH), the Spanish Society of Nuclear Medicine and Molecular Imaging (SEMNIM), and the Spanish Society of Radiation Oncology (SEOR).

SERAM Secretary General Dr. María Luz Parra launched the project in Madrid on 3 July with AEU President Dr. Carmen González.

On behalf of SERAM, the following doctors participated in this initiative:

  • Prostate cancer guide: Dr. Joan Carles Vilanova, PhD, of the University of Girona,
  • Upper urinary tract guide: Dr. Richard Mast of University Hospital Vall d’Hebron in Barcelona,
  • Muscle-invasive bladder cancer guide: Dr. Eloy Vivas of the University of Malaga,
  • Non-muscle invasive bladder cancer guide: Dr. Paula Pelechano of the Valencian Institute of Oncology in Valencia,
  • Kidney cancer guide: Dr. Nicolau Molina of the University of Barcelona.


