Recent claims argue that Large Language Models (LLMs) only simulate reasoning and that their “thinking” is mostly an illusion. But the paper “Thinking Isn’t an Illusion” by Zhao Song, Song Yue, and Jiahao Zhang strongly disagrees.
It makes the case that Large Reasoning Models (LRMs) can reason effectively, but they need help from external tools like calculators, memory, and structured templates. When you give them these tools, their reasoning isn’t just an illusion — it’s very real and very powerful.
Part 1: What’s the Big Debate?
The core question at the heart of this paper is:
Do LLMs truly reason, or do they just imitate reasoning by mimicking patterns in language?
Some recent papers (especially Apple’s “The Illusion of Thinking” study) argue that even though LLMs appear to reason by generating step-by-step answers (using Chain-of-Thought), they don’t actually do better on hard problems. In fact, models that guess the answer directly sometimes outperform those that reason step by step.
This created a wave of skepticism: Is reasoning in LLMs just a parlor trick?
Part 2: What This Paper Argues
This paper pushes back.
❝ LLMs don’t reason poorly because they’re incapable — they reason poorly because we don’t give them the right tools. ❞
It introduces the idea of tool-augmented LRMs — that is, giving reasoning-capable models access to external aids like:
- A Python interpreter (so they can do real calculations),
- A scratchpad (to remember intermediate steps),
- And template-based reasoning prompts (to structure their thoughts).
With these tools, the models:
- Solve harder problems,
- Make fewer mistakes,
- And outperform direct-answer models by wide margins.
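The idea can be made concrete with a small sketch. This is our own illustration, not code from the paper: a “Python interpreter” tool handles exact arithmetic, and a simple list plays the role of the scratchpad that stores intermediate results.

```python
# Minimal sketch of tool augmentation (hypothetical; not the paper's code).
# Instead of asking the model to multiply in its head, a reasoning step
# emits an expression and delegates the arithmetic to Python.

def run_tool(expression: str) -> str:
    """A 'Python interpreter' tool: evaluate an arithmetic expression exactly."""
    # Builtins are stripped so only plain arithmetic can run here.
    return str(eval(expression, {"__builtins__": {}}))

scratchpad = []  # a 'scratchpad' tool: remembers intermediate steps

# Imagine the model's chain of thought produced these two steps:
scratchpad.append(run_tool("123456789 * 987654321"))   # exact big-number product
scratchpad.append(run_tool(f"{scratchpad[-1]} % 97"))  # reuse the stored result

print(scratchpad)
```

The point of the sketch: the model never has to carry an 18-digit product through token-by-token generation, which is exactly where text-only chain-of-thought tends to slip.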
Part 3: Key Concepts and Definitions
Let’s walk through the key terminology used in this paper and what it really means:
- LLM (Large Language Model): a model trained to predict text, which may imitate reasoning patterns it has seen.
- LRM (Large Reasoning Model): a reasoning-capable model that produces step-by-step answers.
- Chain-of-Thought (CoT): prompting a model to reason step by step in natural language before answering.
- Tool-Augmented LRM: an LRM given external aids such as a Python interpreter, a scratchpad, and structured reasoning templates.
Part 4: The Problem with Previous Benchmarks
The Apple paper (“The Illusion of Thinking”) introduced a benchmark on which reasoning models (LRMs using CoT) surprisingly underperformed direct-answer models.
Why?
The authors of Thinking Isn’t an Illusion found three major reasons:
- Unfair Test Design: The benchmark rewards short, shallow reasoning, not deep logical steps.
- No Tool Support: Real reasoning is hard without tools. You wouldn’t solve a long division problem in your head — why expect AI to?
- Reasoning is Expensive: Step-by-step thinking uses more tokens, which increases the chance of small mistakes.
Part 5: Experiments and Setup
To prove their point, the authors created a new benchmark called TAR (Tool-Augmented Reasoning).
Benchmark Components:
- Symbolic Reasoning: Logic puzzles like Boolean expressions or decision problems.
- Arithmetic Tasks: Big number multiplication, nested math expressions, etc.
- Algorithmic Problems: Sorting, looping, dynamic programming — things that require stepwise computation.
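To make the arithmetic category concrete, here is a sketch of what one big-number multiplication item in such a benchmark might look like. The task format and the helper name `make_multiplication_item` are our assumptions for illustration, not the paper’s actual benchmark code.

```python
import random

def make_multiplication_item(digits: int, seed: int = 0) -> dict:
    """Generate one big-number multiplication task with its ground truth."""
    rng = random.Random(seed)  # seeded so the item is reproducible
    a = rng.randrange(10 ** (digits - 1), 10 ** digits)
    b = rng.randrange(10 ** (digits - 1), 10 ** digits)
    return {"prompt": f"Compute {a} * {b}.", "answer": str(a * b)}

item = make_multiplication_item(digits=12)

# A tool-augmented model can answer by executing the expression directly,
# while a text-only CoT model must carry every digit through generation.
expression = item["prompt"].removeprefix("Compute ").removesuffix(".")
model_answer = str(eval(expression))
print(model_answer == item["answer"])  # True
```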
Models Tested:
- Direct Models: just give final answers.
- LRMs (with Chain-of-Thought): reason step by step using natural language.
- Tool-Augmented LRMs:
  - Use scratchpads
  - Call a Python interpreter
  - Use modular templates like: [Breakdown] → [Plan] → [Code] → [Answer]
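The modular template above can be sketched as a tiny pipeline. The section names mirror the template in the text; the helper `solve_with_template` and its filler logic are our own illustration, not the paper’s implementation.

```python
# Hypothetical sketch of a [Breakdown] -> [Plan] -> [Code] -> [Answer] template.

def solve_with_template(question: str, code: str) -> str:
    """Assemble a structured reasoning trace, run the [Code] stage, fill [Answer]."""
    trace = {
        "Breakdown": f"Restate the task: {question}",
        "Plan": "Delegate the exact computation to the Python tool.",
        "Code": code,
    }
    namespace = {}
    exec(trace["Code"], namespace)        # the 'Python interpreter' tool call
    trace["Answer"] = str(namespace["result"])
    return "\n".join(f"[{k}] {v}" for k, v in trace.items())

print(solve_with_template(
    "What is the sum of the squares of 1..100?",
    "result = sum(i * i for i in range(1, 101))",
))
```

Each stage is explicit in the output, so a grader (human or automatic) can check where the reasoning went wrong rather than judging only the final answer.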
Tool Types:
- Python interpreter: performs exact calculations.
- Scratchpad: stores intermediate results between steps.
- Reasoning templates: give structure to the model’s thought process.
Part 6: Results


Here’s where it gets interesting.

Key Takeaways:
- Tool-Augmented LRMs dominated every task.
- Reasoning alone wasn’t enough — the tools made all the difference.
- Even small tool improvements (like structured prompts) made big gains in accuracy.
Part 7: Why Tools Matter for AI
The paper draws a powerful analogy:
Giving a reasoning model no tools is like asking a human to solve a complex puzzle without paper, calculator, or memory.
Humans don’t solve logic problems purely in their heads:
- We write things down
- Use calculators
- Break big problems into smaller parts
So, why expect AI to be different?
Tools give structure, computation, and memory — the scaffolding that makes real reasoning possible.
Part 8: Implications for the Future of AI
This work is not just an academic rebuttal — it points to a powerful future design for AI systems:
We shouldn’t just build smarter models — we should also build better tools around them.
What does this mean?
- Chatbots like GPT should come with built-in code runners, memory slots, and modular planners.
- AI agents can be structured like humans: plan → compute → decide.
- Hybrid architectures (language model + toolchain) are the future.
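The plan → compute → decide loop can be sketched in a few lines. This is our toy illustration of the hybrid idea: no real LLM is called, the `plan` function is hard-coded where a language model would generate the steps, and only `compute` uses the exact Python toolchain.

```python
# Illustrative plan -> compute -> decide loop (our sketch, not the paper's code).

def plan(task: str) -> list[str]:
    """A planner would come from the language model; here it is hard-coded."""
    # Hypothetical plan: sum the first n integers, then double the result.
    return ["n * (n + 1) // 2", "answer * 2"]

def compute(step: str, env: dict) -> int:
    """The toolchain: execute one plan step exactly with Python."""
    return eval(step, {"__builtins__": {}}, env)

def decide(value: int) -> str:
    """The model turns the final tool result into an answer."""
    return f"Final answer: {value}"

env = {"n": 100}
for step in plan("sum 1..n, then double it"):
    env["answer"] = compute(step, env)  # each step can reuse earlier results

print(decide(env["answer"]))  # prints "Final answer: 10100"
```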
Final Summary
Thinking is not an illusion — it’s just incomplete without help.
This paper shows that when you equip reasoning models with tools:
- They reason better
- Solve harder problems
- And stop relying on shallow tricks
If LLMs are like students, then tools are their notebooks, calculators, and study guides. Give them the right tools — and they’re not faking it. They’re thinking.
Thinking Isn’t an Illusion: Why Large Language Models Can Reason (When You Let Them Use Tools) was originally published in Data Science in Your Pocket on Medium.