The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models
While 2025 is shaping up to be the year of reasoning models, Apple has published a breakthrough paper that states, quite bluntly, that reasoning LLMs can't reason: they just mimic the patterns they learned during training, much like standard LLMs.
My 2nd book, “Model Context Protocol: Advanced AI Agents for Beginners,” is out now
The paper, “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity” (Apple, 2025), is a deep dive into Large Reasoning Models (LRMs) in which the authors set out to answer the following questions:
Do these models actually reason?
How do they behave as problems get more complex?
And most importantly, are they really better than standard LLMs?
The Experiment
The scientists wanted to test if new “reasoning” AI models (LRMs) are actually good at solving problems step by step, and how they behave when the problems get harder.
They weren’t just interested in the final answer. They also wanted to see how the models got there — what they “thought” while solving the problem.
The Problem with Normal Tests
Most models are tested on math or coding questions. But:
- Many of those questions may have already been seen by the models during training (called “data contamination”).
- They don’t allow full control over how hard the questions are.
- They don’t let us examine the model’s step-by-step thinking easily.
The Puzzle Setup

They used 4 different puzzles that are easy to control and analyze:
- Tower of Hanoi: Move disks between pegs following the rules (you can't put a bigger disk on a smaller one).
- Checker Jumping: Swap red and blue checkers using only jumps and slides.
- River Crossing: Safely move people across a river while following certain safety rules.
- Blocks World: Rearrange blocks from one stack to another in a specific order.
These puzzles are great because the difficulty can be increased gradually (e.g., by adding more disks or blocks), and we can track and check every move the model makes (see the quick sketch below).
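To make the "controllable difficulty" point concrete, here is a minimal Python sketch (mine, not the paper's code) that generates the optimal Tower of Hanoi solution: the shortest solution for n disks takes 2**n - 1 moves, so each extra disk roughly doubles the work while every move stays trivially machine-checkable.

```python
# Minimal sketch (not from the paper): the optimal Tower of Hanoi solution
# for n disks, returned as (disk, from_peg, to_peg) moves. Its length is
# 2**n - 1, so difficulty scales predictably with a single parameter.

def hanoi_moves(n, source="A", target="C", spare="B"):
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # move n-1 disks out of the way
        + [(n, source, target)]                      # move the largest disk
        + hanoi_moves(n - 1, spare, target, source)  # stack the n-1 disks back on top
    )

for n in range(1, 8):
    print(f"{n} disks -> {len(hanoi_moves(n))} moves")  # 1, 3, 7, 15, 31, 63, 127
```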
What Did the Researchers Do?

1. They gave these puzzles to two types of models:
   - Regular language models (LLMs).
   - Reasoning models (LRMs) that write down their thoughts before giving an answer.
2. They increased the difficulty of the puzzles (like adding more disks to Tower of Hanoi).
3. They watched how the models solved them by:
   - Checking the final answer (Did they reach the correct goal?).
   - Reading the thinking steps (Were they making sense along the way?).
4. They kept the compute (token) budget the same between models, so both types had the same resources to work with.
What Did They Measure?
- Accuracy: Did the model solve the puzzle?
- Thinking token usage: How much thinking did the model do?
- Correct vs. incorrect steps: Did the model find the correct steps early or late?
- When the first mistake happened: Especially in long puzzles, like Tower of Hanoi (see the sketch below).
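A puzzle simulator makes these metrics easy to compute. The snippet below is a hypothetical illustration in Python, not the authors' evaluation code: it replays a proposed move list against the Tower of Hanoi rules, reports whether the goal state is reached (accuracy), and records the index of the first illegal move.

```python
# Hypothetical sketch of a Tower of Hanoi simulator used to score a model's
# proposed solution. Moves are (disk, from_peg, to_peg) tuples; the move
# format is an assumption for illustration, not the paper's exact encoding.

def score_hanoi(moves, n_disks):
    """Replay moves; return (solved, index_of_first_illegal_move_or_None)."""
    pegs = {"A": list(range(n_disks, 0, -1)), "B": [], "C": []}  # bottom -> top
    for i, (disk, src, dst) in enumerate(moves):
        # Illegal if the source peg is empty, the named disk isn't on top,
        # or a larger disk would land on a smaller one.
        if not pegs[src] or pegs[src][-1] != disk:
            return False, i
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, i
        pegs[dst].append(pegs[src].pop())
    solved = pegs["C"] == list(range(n_disks, 0, -1))
    return solved, None

# Example: the optimal 2-disk solution is fully legal and solves the puzzle.
print(score_hanoi([(1, "A", "B"), (2, "A", "C"), (1, "B", "C")], 2))  # (True, None)
```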
How Was the Thinking Analyzed?
They looked at the position of the correct solution in the model’s reasoning:
- If it was early but the model kept going and got it wrong: that’s overthinking.
- If it came late: maybe the model needed time to explore.
- If it never came: the model completely failed.
They used simulators for each puzzle to automatically check whether each step was valid or not.
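Here is a rough, hypothetical sketch (again mine, not the paper's) of how such an analysis could work: extract every candidate solution the model writes in its thinking trace, validate each one with the puzzle simulator, and classify the trace by where the first correct candidate appears.

```python
# Hypothetical sketch: classify a reasoning trace by where the first correct
# candidate solution appears. `candidates` is a list of (position, moves)
# pairs, where position is the candidate's relative location in the trace
# (0.0 = start, 1.0 = end); `is_valid` is a boolean checker (for example,
# a thin wrapper around score_hanoi above).

def classify_trace(candidates, is_valid, early_cutoff=0.3):
    # early_cutoff is an arbitrary illustrative threshold, not the paper's.
    for position, moves in candidates:
        if is_valid(moves):
            if position <= early_cutoff:
                return "overthinking: correct answer found early, then kept exploring"
            return "late discovery: correct answer only after long exploration"
    return "failure: no correct solution anywhere in the trace"
```

The paper's actual analysis is richer (it looks at the distribution of both correct and incorrect intermediate solutions), but the basic idea of scoring positions inside the trace is the same.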
What Did the Scientists Discover?
Three levels of difficulty:
- On easy problems, regular models (without thinking steps) were better.
- On medium problems, reasoning models did better.
- On hard problems, both types of models failed — they gave completely wrong answers.
Thinking drops when it should rise:
- As problems get harder, the models actually start thinking less (they use fewer reasoning tokens), even though they still have plenty of token budget left. This is unexpected.
Strange thinking behavior:
- On easy problems, models often find the right answer early but then keep going and confuse themselves.
- On medium ones, they find the answer later.
- On hard ones, they never find the answer at all.
Can’t follow instructions well:
- Even when the prompt spelled out the exact steps for solving a puzzle, the models still messed up the execution; they don't reliably follow the rules they are given.
Custom puzzles give better insights:
- Instead of using school math problems, the authors made their own puzzles. This let them test the models more fairly and understand what’s really going on.
What Should Everyone Keep in Mind?
Here’s the juicy takeaway:
- Just because a model “thinks” doesn’t mean it reasons well. LRMs can look impressive with their long-winded thought traces, but they often don’t reach the correct conclusions, especially as task complexity increases.
- Evaluation needs to evolve. We can’t just look at the final answer — we need to examine the reasoning path to understand where things go wrong.
- There’s a ceiling to their reasoning. LRMs aren’t scaling their reasoning like we’d hoped. Fixing this might require rethinking model architectures, training objectives, or how we prompt them.
So, what to conclude?
This paper is a timely reminder that current AI models, even the ones designed to “reason,” still have serious limits. While they may appear thoughtful by writing out long explanations, that doesn’t mean they truly understand the problems they’re solving. Apple’s study shows that reasoning performance doesn’t scale well with complexity — and in many cases, the extra thinking just leads to more confusion, not better answers.
For researchers and builders, the message is clear: we need better ways to evaluate and train these models. For users, it’s a cue to stay cautious — just because an AI sounds smart doesn’t mean it is.