
If you asked a 19th-century inventor how a machine could fly, they'd likely describe flapping wings and feathers. Nature's example – the bird – was the only model of flight we knew. Yet, on December 17, 1903, when the Wright brothers made their first successful flight, their airplane didn't mimic nature. No flapping, no feathers — just physics applied in ways birds never could.
Today, we keep asking whether AI models truly "think" or "reason," measuring them against the yardstick of human cognition. But what if that's the wrong question entirely? What if AI, like the airplane, achieves its goal through fundamentally different means? Flying without flapping – producing the appearance of reasoning without the real thing?
The Illusion of Thinking
A recent paper from Apple (bluntly titled “The Illusion of Thinking”) suggests that much of what we call AI “reasoning” is indeed an elaborate imitation – one that works only up to a point. The Apple researchers put cutting-edge “reasoning” AI models (OpenAI’s o1/o3, DeepSeek-R1, Claude 3.7 Sonnet, and Gemini) through their paces using carefully designed puzzles, and the results were eye-opening. They discovered three distinct performance regimes in these so-called Large Reasoning Models (LRMs):
- Overthinking the easy stuff: On simple tasks, the fancy reasoning models often second-guess themselves into the wrong answer. In fact, the standard non-reasoning models (the ones that don’t generate step-by-step “thoughts”) outperformed their reasoning counterparts on basic problems. The “smart” models would find a correct answer quickly, then waste time and tokens veering off track – like someone who solves a simple addition problem correctly, then talks themselves into a mistake. It’s a classic case of overthinking.
- Hitting a sweet spot: On moderately complex problems, this trend reverses. Here the reasoning-heavy approach pays off. Given a bit more puzzle complexity, the LRMs’ extra steps and self-reflection start to help – they check their work, backtrack on false paths, and outperform simpler models that would have rushed to an error. In this middle regime, the “thinking” AI shows its value by exploring multiple approaches and correcting course when needed.
- Collapse at the limits: Once the problems become very complex, performance falls off a cliff. Beyond a certain complexity threshold, all models fail miserably – and, shockingly, the ones designed to reason fail the hardest. The Apple team observed a complete accuracy collapse at high complexity. Even more perplexing, the supposed reasoning engines actually start thinking less as the puzzles get harder: instead of trying harder, their solutions become shorter and shallower when faced with real difficulty. In other words, these AIs “give up” – despite plenty of computing power left, they don’t even use their full reasoning budget when it matters most. (The sketch after this list shows just how quickly that complexity ramp steepens.)
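
To make “a few notches up in complexity” concrete: in Tower of Hanoi, one of the puzzles the Apple team used, adding a single disk doubles the length of the shortest correct solution. Here is a minimal back-of-the-envelope sketch in plain Python (my own illustration, not the paper’s evaluation code):

```python
# The shortest Tower of Hanoi solution for n disks has 2**n - 1 moves,
# so each extra disk doubles the number of error-free steps a solver
# must produce end to end.
for n_disks in (3, 5, 8, 10, 12, 15):
    min_moves = 2 ** n_disks - 1
    print(f"{n_disks:2d} disks -> {min_moves:6,d} moves minimum")
```

A solver that is merely 99% reliable per step clears the 7-move version most of the time and the 1,023-move version essentially never. That arithmetic alone doesn’t explain why the models also shorten their reasoning as puzzles get harder, but it shows how steep the difficulty ramp really is.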
In simple terms, today’s AI reasoning is fragile. On easy tasks, the AI can trip over its own thoughts (the authors call this the “overthinking” phenomenon). On hard tasks, it hits a wall where it stops even pretending to reason. This wasn’t a gradual decline – it was an abrupt collapse. One moment additional thinking improves results; a few notches up in complexity, and the AI’s performance plummets. “This isn’t a gradual degradation – it’s a cliff,” as one commentator on the study put it, with accuracy and reasoning effort suddenly dropping together. Tellingly, when researchers handed the AI an algorithm – essentially showing it exactly how to solve a complex puzzle – the model still couldn’t execute it for the hardest cases. In the Apple team’s words, these state-of-the-art models “still fail to develop generalizable problem-solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities”. In short, what we have now is an illusion of thinking: machines that can mimic the steps of reasoning, but break down when real, general reasoning is required.
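
For a sense of scale, the algorithm in question is tiny. Assuming the Tower of Hanoi setup the paper describes, the whole recursive procedure fits in a few lines; what failed wasn’t knowledge of the steps but executing thousands of them without slipping. A sketch of that procedure (again my own illustration, not the prompt the researchers used):

```python
def solve_hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the optimal move list for n disks as (disk, from_peg, to_peg) tuples."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    # Move the top n-1 disks out of the way, move the largest disk,
    # then park the n-1 disks back on top of it.
    solve_hanoi(n - 1, source, spare, target, moves)
    moves.append((n, source, target))
    solve_hanoi(n - 1, spare, target, source, moves)
    return moves

print(len(solve_hanoi(3)))   # 7 moves
print(len(solve_hanoi(12)))  # 4,095 moves -- trivial for code, brutal token by token
```

A conventional program runs this flawlessly at any depth; the unsettling finding is that a model handed the same recipe still loses the thread once the move list gets long.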
Should we despair, then, that AI will never truly think? Not so fast. History suggests another interpretation: maybe we’re expecting flapping wings when we should be looking for jet engines.
Beyond Biology’s Blueprint
Airplanes don’t flap. Submarines don’t wiggle fins. The first automobiles were even called “horseless carriages” because people initially conceived of them as mechanical horses – yet no one seriously tried to make cars gallop. In each case, technology achieved the function of a natural ability (flight, swimming, locomotion) by breaking away from nature’s blueprint. Once engineers stopped slavishly imitating biology, human flight took off – literally. A jumbo jet and a sparrow have almost nothing in common, yet both soar through the same sky. The machine approach can not only match the original – it can surpass it, doing things no bird could ever do (like carrying 300 people non-stop across an ocean).
Have you watched Fate/Stay Night: Unlimited Blade Works? It's a Japanese anime that explores the idea that an imitation can be better than the original. Here's a TLDR with spoilers.
- Gilgamesh—the self-proclaimed “King of Heroes”—owns the Gate of Babylon, a shimmering arsenal of the original legendary weapons. Every sword, every spear, the very prototypes of myth are his to deploy.
- Shirou Emiya—an earnest young mage—possesses nothing original. Instead, his magic “Unlimited Blade Works” lets him project perfect-looking copies of any weapon he has seen. Gilgamesh sneers: copies can never rival originals.
- In their climactic duel, Shirou’s flood of duplicates overwhelms the king. Copies, forged instantly to counter each attack, exploit a truth Gilgamesh overlooks: function can trump provenance. A replica that appears at the right place and time can outperform a dusty masterpiece locked in a treasury.
Yes, it’s dramatic fiction, but it touches on a real challenge to our intuitions. We instinctively assume that an imitation is inferior, yet we have seen many instances of AI out-reasoning its biological template. From Deep Blue’s checkmate of Garry Kasparov in 1997 to AlphaGo’s uncanny “Move 37” that upended Lee Sedol in 2016, we’ve watched systems that merely “imitate thinking” defeat grandmasters at games long thought to embody peak human reasoning.
The lesson from Unlimited Blade Works is not that copies are automatically superior, but that copies can exploit different rules. Shirou’s swords are cheaper, faster, and endlessly adaptable. Airplanes do the same: rigid wings, fuel, and turbines beat feathers for crossing oceans.
So dismissing AI reasoning because it is “just statistical mimicry” repeats Gilgamesh’s error. The key question is not whether a model retraces human logic step-for-step, but whether its method reliably lands on sound conclusions. If a non-human path proves more scalable—less prone to fatigue, bias, or oversight—do we still care that it isn’t “real” reasoning?
If It Works, Does It Matter How?
To many, true intelligence requires doing things the way humans do – anything less is just a clever trick. But functional reality tends to trump philosophical purity. We don’t insist that a plane prove its worth by flapping wings; we care that it flies us from New York to London safely. Likewise, if an AI system can diagnose diseases or navigate a car more reliably than a person, does it really matter that it arrived at the result via pattern recognition rather than a human-style thought process? In practical terms, outcomes often matter more than methods.
Some argue that unless an AI thinks exactly like a person, it’s not truly “reasoning” – that it’s just simulating thought. The Apple study gives ammunition to that criticism, showing that today's AIs are indeed mere simulators hitting hard limits. However, simulation can have its own power. A flight simulator isn’t a real plane, but it can teach a pilot to handle real aircraft. In a similar way, an AI’s shallow reasoning simulation might, with improvement, solve real problems even if it lacks a human’s understanding. The ultimate test of intelligence may not be whether the process feels like human reasoning on the inside, but whether the results work on the outside. As long as the function is fulfilled – as long as the plane flies – the underlying mechanism need not mirror nature’s. In fact, it might exceed nature’s. An AI that “thinks” differently could avoid the cognitive blind spots and biases we humans are prone to. Its very alienness could become an advantage, enabling it to find solutions we’d miss.
None of this is to say the limitations uncovered by Apple’s team are trivial – far from it. The current generation of reasoning models can often impress us with human-like planning and execution, then disappoint us with sub-human performance at the worst times. To trust AI with truly important tasks, we’ll need more than an illusion; we’ll need machines that can handle complexity without breaking. But achieving that might require embracing an AI unlike us – one that reasons in a way no person would, yet succeeds where we cannot.
Do airplanes fly? Yes – not by flapping wings, but by exploiting math and science to accomplish the same goal. In the coming years, I expect we will achieve AGI in a similar way. Today's models may not have the robustness of human reasoning, but early airplanes also stalled in mild gusts that birds handled with ease. Engineers answered with better airframes, autopilots, and fly-by-wire. Robustness came through iteration, driven by society's demands for safety and reliability. The same will happen with AI.
Perhaps reasoning in the age of AI will come to include any process – deterministic or probabilistic – that consistently yields good decisions and insights. The "illusion of thinking" could eventually evolve into something that surpasses organic cognition in both utility and reliability. And if that happens, we might just stop worrying about whether the AI truly thinks, in the same way we don't ask whether a Boeing 747 is truly flying. We’ll know that it gets us where we need to go – and ultimately, that was the point all along.