Over the past few days, Apple’s provocatively titled paper, The Illusion of Thinking, has sparked fresh debate in AI circles. The claim is stark: today’s language models don’t really “reason”. Instead, they simulate the appearance of reasoning until complexity reveals the cracks in their logic. Not surprisingly, the paper has triggered a rebuttal, entitled The Illusion of the Illusion of Thinking, credited to “C. Opus”, a nod to Anthropic’s Claude Opus model, and Alex Lawsen, who originally published the commentary on the arXiv preprint server as a joke, apparently. The joke got out of hand and the response has been widely circulated. Joke or not, does the LLM actually debunk Apple’s thesis? Not quite.
What Apple shows

The Apple team set out to probe whether AI models can truly reason, or whether they are simply mimicking problem-solving based on memorized examples. To do this, the team designed tasks where complexity could be scaled in controlled increments: more disks in the Tower of Hanoi, more checkers in Checker Jumping, more characters in River Crossing, more blocks in Blocks World.
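It helps to keep in mind how steeply these puzzles scale. The Tower of Hanoi, for example, has a shortest solution of 2^n - 1 moves for n disks, so every added disk doubles the length of a correct answer. A quick back-of-the-envelope illustration (mine, not the paper’s):

```python
# The shortest Tower of Hanoi solution for n disks has 2**n - 1 moves,
# so each additional disk doubles the length of a correct answer.
for n in (3, 5, 10, 15, 20):
    print(f"{n} disks -> {2**n - 1:,} moves")
# 3 disks -> 7 moves ... 20 disks -> 1,048,575 moves
```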
The assumption is simple: if a model has mastered reasoning in simpler cases, it should be able to extend those same principles to more complex ones, especially when ample compute and context length remain available. But that is not what happens. The Apple paper finds that even when operating well within their token budgets and inference capabilities, models don’t rise to the challenge.
Instead, they generate shorter, less structured outputs as complexity increases. This suggests a kind of “giving up”, not a struggle against hard constraints. Even more telling, the paper finds that models often reduce their reasoning effort just when more effort is required. As further evidence, Apple references 2024 and 2025 benchmark questions from the American Invitational Mathematics Examination (AIME), a prestigious US mathematics competition for top-performing high-school students.
While human performance improves year-on-year, model scores decline on the newer, unseen 2025 batch, supporting the idea that AI success is still heavily reliant on memorized patterns, not flexible problem-solving.
Where Claude fails
The counterargument hinges on the idea that language models truncate responses not because they fail to reason, but because they “know” the output is becoming too long. One cited example shows a model halting mid-solution with a self-aware remark: “The pattern continues, but to avoid making this too long, I’ll stop here.”
This is presented as evidence that models understand the task but choose brevity.
But it is anecdotal at best, drawn from a single social media post, and makes a big inferential leap. Even the engineer who originally posted the example doesn’t fully endorse the rebuttal’s conclusion. They point out that higher generation randomness (“temperature”) leads to accumulated errors, especially over longer sequences, so stopping early may not indicate understanding, but entropy avoidance.
The rebuttal also invokes a probabilistic framing: every move in a solution is like a coin flip, and eventually even a small per-token error rate will derail a long sequence. But reasoning isn’t just probabilistic generation; it is pattern recognition and abstraction. Once a model identifies a solution structure, later steps shouldn’t be independent guesses; they should be deduced. The rebuttal doesn’t account for this.
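To see why the coin-flip framing sounds compelling, consider the arithmetic it rests on: if each move were an independent draw with per-step accuracy p, a T-move solution would survive with probability p^T. The sketch below (my illustration, not the rebuttal’s code) shows how fast that collapses, but note that it bakes in exactly the independence assumption contested above:

```python
# Under the rebuttal's framing, each move succeeds independently with
# probability p, so a full T-move solution survives with probability p**T.
# For the Tower of Hanoi, T = 2**n - 1 moves for n disks.
p = 0.999
for n in (5, 10, 15):
    moves = 2**n - 1
    print(f"{n} disks: {moves} moves, survival probability {p**moves:.3f}")
# 5 disks: ~0.97, 10 disks: ~0.36, 15 disks: ~0.00
```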
But the real miss for the rebuttal is its argument that models can succeed if prompted to generate code. This misses the whole point. Apple’s goal was not to test whether models could retrieve canned algorithms; it was to evaluate their ability to reason through the structure of the problem on their own. If a model solves a problem by simply recognizing that it should call or generate a particular tool or piece of code, then it isn’t really reasoning; it is just recalling a solution or a pattern.
In other words, if an AI model sees the Tower of Hanoi puzzle and responds by outputting Lua code it has ‘seen’ before, it is just matching the problem to a known template and retrieving the corresponding tool. It isn’t ‘thinking’ through the problem; it is just sophisticated library search.
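For illustration, the kind of ‘canned’ solution at issue is the textbook recursion below (sketched in Python rather than Lua). Reproducing it from memory demonstrates retrieval; it says nothing about whether the model grasps why the recursion works:

```python
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Textbook recursive Tower of Hanoi: move n disks from source to target."""
    if moves is None:
        moves = []
    if n > 0:
        hanoi(n - 1, source, spare, target, moves)  # park the n-1 smaller disks on the spare peg
        moves.append((source, target))              # move the largest disk to the target
        hanoi(n - 1, spare, target, source, moves)  # stack the smaller disks back on top
    return moves

print(len(hanoi(10)))  # 1023 moves, i.e. 2**10 - 1
```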
Where this leaves us
To be clear, the Apple paper is not bulletproof. Its treatment of the River Crossing puzzle is a weak point. Once enough people are added to the puzzle, the problem becomes unsolvable, and yet Apple’s benchmark marks a “no solution” response as incorrect. That’s an error. But the thing is, the model’s performance has already collapsed before the problem becomes unsolvable, which suggests the drop-off happens not at the edge of reason, but long before it.
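(For what it’s worth, such solvability claims are easy to check by brute force. The sketch below searches the classic ‘jealous couples’ formulation of River Crossing; it is my own illustration with made-up function names, not Apple’s benchmark code, and the paper’s actor/agent variant raises the same kind of combinatorial limit.)

```python
from collections import deque
from itertools import combinations

def crossing_solvable(n_couples, boat_capacity):
    """Brute-force BFS over the classic 'jealous couples' river-crossing puzzle.

    A bank (or the boat) is invalid whenever a wife is present without her
    husband while some other husband is also present.
    """
    people = frozenset([("H", i) for i in range(n_couples)] +
                       [("W", i) for i in range(n_couples)])

    def valid(group):
        husbands = {i for kind, i in group if kind == "H"}
        return not any(kind == "W" and husbands and i not in husbands
                       for kind, i in group)

    start = (people, "left")                  # everyone starts on the left bank
    seen, queue = {start}, deque([start])
    while queue:
        left, boat = queue.popleft()
        if not left:                          # left bank empty: puzzle solved
            return True
        bank = left if boat == "left" else people - left
        for k in range(1, boat_capacity + 1):
            for group in map(frozenset, combinations(bank, k)):
                new_left = left - group if boat == "left" else left | group
                state = (new_left, "right" if boat == "left" else "left")
                if state not in seen and all(map(valid, (group, new_left, people - new_left))):
                    seen.add(state)
                    queue.append(state)
    return False

print(crossing_solvable(3, 2))  # True: three couples fit a two-seat boat
print(crossing_solvable(4, 2))  # False: four couples with a two-seat boat has no solution
```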
In conclusion, the rebuttal, whether AI-assisted or AI-generated, raises important questions, especially around evaluation methods and model self-awareness. But it rests more on anecdote and hypothetical framing than on rigorous counter-evidence. Apple’s original claim, that current models simulate reasoning without scaling it, remains largely intact. And it isn’t actually new; data scientists have been saying this for a long time.
But it always helps, of course, when big companies like Apple back the prevailing science. Apple’s paper may sound confrontational at times, in the title alone if nothing else. But its analysis is thoughtful and well-supported. What it reveals is a truth the AI community must grapple with: reasoning is more than token generation, and without deeper architectural shifts, today’s models may remain trapped in this illusion of thinking.
Maria Sukhareva has been working in the field of AI for 15 years, in AI model training and product management. She is principal key expert in AI at Siemens. The views expressed above are hers, not her employer’s. Her Substack blog is here; her website is here.