‘The illusion of thinking’: Apple research finds AI models collapse and give up with hard puzzles

New artificial intelligence research from Apple shows AI reasoning models may not be “thinking” so well after all.

According to a paper published just days before Apple’s WWDC event, large reasoning models (LRMs) — like OpenAI o1 and o3, DeepSeek R1, Claude 3.7 Sonnet Thinking, and Google Gemini Flash Thinking — completely collapse when they’re faced with increasingly complex problems. The paper comes from the same researchers who found other reasoning flaws in LLMs last year.

The news was a bucket of cold water for artificial general intelligence (AGI) optimists (and welcome news for AI and AGI skeptics), as Apple’s research seemed to show damning evidence about the limitations of reasoning model intelligence. While the much-hyped LRMs performed better than standard LLMs on medium-difficulty puzzles, they performed worse on simple ones. And according to Apple’s research, when they faced hard puzzles, they collapsed completely, giving up on the problem prematurely.

Or, as the Apple researchers put it, while AI models perform extremely well at math and coding, when it comes to more complex problems, they only provide “The Illusion of Thinking.”

Apple was slow to develop large language models and implement AI in its devices, largely staying out of the conversation. The company has added Apple Intelligence AI features, though they have generally been considered underwhelming. In fact, after WWDC 2025, it’s clear that Apple is going in a different direction with AI than the rest of the industry. With that in mind, this research might explain some of Apple’s reticence to go all-in, unlike Google and Samsung, which have frontloaded their devices with AI capabilities.

How Apple researchers tested reasoning skills

The problems researchers used to evaluate the reasoning models are classic logic puzzles like the Tower of Hanoi. That puzzle consists of discs stacked largest to smallest on one of three pegs, and the goal is to move the whole stack to the third peg, moving one disc at a time and never placing a larger disc on top of a smaller one. Other puzzles included jumping checker pieces into empty spaces, the river-crossing problem (the one usually involving a fox, a chicken, and a bag of grain), and stacking blocks in a specific configuration.
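As a rough illustration (not code from the Apple paper), here is what the fixed procedure the models are asked to execute looks like: a minimal recursive Tower of Hanoi solver in Python. The function name and peg labels are arbitrary.

```python
def hanoi(n, source, target, spare, moves=None):
    """Return the full move list for n discs from `source` to `target`."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)   # clear the way to the largest disc
    moves.append((source, target))               # move the largest disc directly
    hanoi(n - 1, spare, target, source, moves)   # restack the smaller discs on top
    return moves

# Difficulty grows exponentially: 2**n - 1 moves are required.
print(len(hanoi(3, "A", "C", "B")))  # 7
print(len(hanoi(8, "A", "C", "B")))  # 255
```

The optimal solution takes 2^n - 1 moves, so each extra disc roughly doubles the length of the move sequence a model must produce without a single misstep, which is why the researchers could dial up difficulty simply by adding discs.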


You probably recognize these logic puzzles from math class or online games, since they’re a simple way of testing humans’ ability to reason and problem-solve. Once you figure out the trick, solving a puzzle is just a matter of following the same logic as the complexity increases, which in this case means more discs, checkers, animals, or blocks. However, researchers found that LRMs start to fail after a certain point.

“Results show that all reasoning models exhibit a similar pattern with respect to complexity: accuracy progressively declines as problem complexity increases until reaching complete collapse (zero accuracy) beyond a model-specific complexity threshold,” researchers wrote. In the results shown, Claude 3.7 Sonnet + thinking and DeepSeek R1 start to fail when a fifth disc is added to the Tower of Hanoi problem. Even when more computing power is applied to the LRMs, they still fail at the more complex puzzles.

What’s more, researchers found that reasoning models initially apply more thinking tokens as complexity increases, but they actually give up at a certain point. “Upon approaching a critical threshold — which closely corresponds to their accuracy collapse point — models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty,” the paper read. So when the problems get harder, the models spend fewer tokens, or “think” less.

But what about when the LRMs are given the answers? Nope, accuracy still doesn’t improve. Even when researchers included the solution algorithm in the prompt, so that all the models had to do was follow the steps, they continued to fail.

But before you fire up the grill because LLM reasoning is so cooked, season these findings with a grain of salt. The research doesn’t mean LRMs don’t reason at all; it just means they may not currently be much smarter than humans. As AI expert Gary Marcus pointed out on his blog, “(ordinary) humans actually have a bunch of (well-known) limits that parallel what the Apple team discovered. Many (not all) humans screw up on versions of the Tower of Hanoi with 8 discs.” As others have pointed out online, the research does not compare these results against human attempts at the same puzzles.

Essentially, LLMs have their uses for tasks like coding and writing, but they also have weaknesses. “What the Apple paper shows, most fundamentally, regardless of how you define AGI, is that LLMs are no substitute for good well-specified conventional algorithms,” wrote Marcus, who has been very vocal about the reasoning limitations of AI models.

That’s to say, take the findings from Apple researchers for what they are: important data to be considered within the context of other LLM research. It’s tempting to categorize AI’s overall advancements as overhyped when new research like this comes out. Or, on the flip side, for AGI boosters to claim victory when research has discovered new advancements. But the reality is usually somewhere in the boring middle.







Radiomics-Based Artificial Intelligence and Machine Learning Approach for the Diagnosis and Prognosis of Idiopathic Pulmonary Fibrosis: A Systematic Review – Cureus



A Real-Time Look at How AI Is Reshaping Work : Information Sciences Institute

Artificial intelligence may take over some tasks and transform others, but one thing is certain: it’s reshaping the job market. Researchers at USC’s Information Sciences Institute (ISI) analyzed LinkedIn job postings and AI-related patent filings to measure which jobs are most exposed, and where those changes are happening first. 

The project was led by ISI research assistant Eun Cheol Choi, working with students in a graduate-level USC Annenberg data science course taught by USC Viterbi Research Assistant Professor Luca Luceri. The team developed an “AI exposure” score to measure how closely each role is tied to current AI technologies. A high score suggests the job may be affected by automation, new tools, or shifts in how the work is done. 
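The article doesn’t include the team’s scoring code, but the general idea of an exposure score can be sketched as a text-similarity problem: compare the language of a job posting against a corpus of AI patent abstracts and treat high overlap as high exposure. The snippet below is a hypothetical minimal version using TF-IDF cosine similarity; the corpora, job titles, and the choice of similarity measure are all illustrative assumptions, not the ISI team’s actual method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the real corpora (LinkedIn postings, AI patent filings).
patent_abstracts = [
    "Large language model system for generating text and source code",
    "Neural network for detecting anomalies in medical images",
]
job_postings = {
    "data scientist": "Train and deploy large language models and data pipelines",
    "paralegal": "Prepare legal documents and coordinate with clients and courts",
}

# Fit one shared vocabulary, then score each job by its best match to any patent:
# higher similarity is read here as higher assumed AI exposure.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(patent_abstracts + list(job_postings.values()))
patent_vecs = matrix[: len(patent_abstracts)]
job_vecs = matrix[len(patent_abstracts):]

for title, sims in zip(job_postings, cosine_similarity(job_vecs, patent_vecs)):
    print(f"{title}: exposure ~ {sims.max():.2f}")
```

In the real study, scores of this kind would be computed over large numbers of postings and patents and then aggregated by industry and role.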

Which Industries Are Most Exposed to AI?

To understand how exposure shifted with new waves of innovation, the researchers compared patent data from before and after a major turning point. “We split the patent dataset into two parts, pre- and post-ChatGPT release, to see how job exposure scores changed in relation to fresh innovations,” Choi said. Released in late 2022, ChatGPT triggered a surge in generative AI development, investment, and patent filings.

Jobs in wholesale trade, transportation and warehousing, information, and manufacturing topped the list in both periods. Retail also showed high exposure early on, while healthcare and social assistance rose sharply after ChatGPT, likely due to new AI tools aimed at diagnostics, medical records, and clinical decision-making.

In contrast, education and real estate consistently showed low exposure, suggesting they are, at least for now, less likely to be reshaped by current AI technologies.

AI’s Reach Depends on the Role

AI exposure doesn’t just vary by industry; it also depends on the specific type of work. Jobs like software engineer and data scientist scored highest, since they involve building or deploying AI systems. Roles in manufacturing and repair, such as maintenance technician, also showed elevated exposure due to increased use of AI in automation and diagnostics.

At the other end of the spectrum, jobs like tax accountant, HR coordinator, and paralegal showed low exposure. They center on work that’s harder for AI to automate: nuanced reasoning, domain expertise, or dealing with people.

AI Exposure and Salary Don’t Always Move Together

The study also examined how AI exposure relates to pay. In general, jobs with higher exposure to current AI technologies were associated with higher salaries, likely reflecting the demand for new AI skills. That trend was strongest in the information sector, where software and data-related roles were both highly exposed and well compensated.

But in sectors like wholesale trade and transportation and warehousing, the opposite was true. Jobs with higher exposure in these industries tended to offer lower salaries, especially at the highest exposure levels. The researchers suggest this may signal the early effects of automation, where AI is starting to replace workers instead of augmenting them.

“In some industries, there may be synergy between workers and AI,” said Choi. “In others, it may point to competition or replacement.”
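To make that salary finding concrete, here is a hypothetical sketch of the per-industry check it implies: correlate exposure scores with posted salaries within each sector. The data frame, numbers, and column names are invented for illustration, not drawn from the paper.

```python
import pandas as pd

# Invented example rows: (industry, AI exposure score, posted salary in USD).
postings = pd.DataFrame(
    [
        ("information", 0.9, 165_000), ("information", 0.7, 140_000),
        ("information", 0.4, 110_000), ("transportation", 0.8, 42_000),
        ("transportation", 0.5, 55_000), ("transportation", 0.2, 61_000),
    ],
    columns=["industry", "exposure", "salary"],
)

# Within each industry, a positive correlation suggests AI skills command a premium;
# a negative one is consistent with the replacement pattern the study flags.
for industry, group in postings.groupby("industry"):
    print(industry, round(group["exposure"].corr(group["salary"]), 2))
```

In this toy data, the information sector shows a strongly positive exposure-salary correlation and transportation a negative one, mirroring the split the researchers describe.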

From Class Project to Ongoing Research

The contrast between industries where AI complements workers and those where it may replace them is something the team plans to investigate further. They hope to build on their framework by distinguishing between different types of impact — automation versus augmentation — and by tracking the emergence of new job categories driven by AI. “This kind of framework is exciting,” said Choi, “because it lets us capture those signals in real time.”

Luceri emphasized the value of hands-on research in the classroom: “It’s important to give students the chance to work on relevant and impactful problems where they can apply the theoretical tools they’ve learned to real-world data and questions,” he said. The paper, Mapping Labor Market Vulnerability in the Age of AI: Evidence from Job Postings and Patent Data, was co-authored by students Qingyu Cao, Qi Guan, Shengzhu Peng, and Po-Yuan Chen, and was presented at the 2025 International AAAI Conference on Web and Social Media (ICWSM), held June 23-26 in Copenhagen, Denmark.

Published on July 7th, 2025

Last updated on July 7th, 2025





SERAM collaborates on AI-driven clinical decision project

The Spanish Society of Medical Radiology (SERAM) has collaborated with six other scientific societies to develop an AI-supported urology clinical decision-making project called Uro-Oncogu(IA)s.

Uro-Oncogu(IA)s project team. Image courtesy of SERAM.

The initiative produced an algorithm that will “reduce time and clinical variability” in the management of urological patients, the society said. SERAM’s collaborators include the Spanish Urology Association (AEU), the Foundation for Research in Urology (FIU), the Spanish Society of Pathological Anatomy (SEAP), the Spanish Society of Hospital Pharmacy (SEFH), the Spanish Society of Nuclear Medicine and Molecular Imaging (SEMNIM), and the Spanish Society of Radiation Oncology (SEOR).

SERAM Secretary General Dr. María Luz Parra launched the project in Madrid on 3 July with AEU President Dr. Carmen González.

On behalf of SERAM, the following doctors participated in this initiative:

  • Prostate cancer guide: Dr. Joan Carles Vilanova, PhD, of the University of Girona,
  • Upper urinary tract guide: Dr. Richard Mast of University Hospital Vall d’Hebron in Barcelona,
  • Muscle-invasive bladder cancer guide: Dr. Eloy Vivas of the University of Malaga,
  • Non-muscle invasive bladder cancer guide: Dr. Paula Pelechano of the Valencian Institute of Oncology in Valencia,
  • Kidney cancer guide: Dr. Nicolau Molina of the University of Barcelona.


