Why Large Reasoning Models Collapse Under Complex Tasks: Insights from Apple’s Study
Apple’s research reveals that large reasoning models, despite sophisticated self‑reflection mechanisms, experience a complete performance collapse when problem complexity exceeds a threshold, highlighting fundamental limits in their ability to achieve generalized reasoning.
Study Overview
Apple researchers examined large reasoning models (LRMs) such as o3‑mini, DeepSeek‑R1, and Claude‑3.7‑Sonnet. Although these models incorporate sophisticated self‑reflection mechanisms, their reasoning ability collapses once problem complexity exceeds a certain threshold, exposing a fundamental limitation in achieving generalized reasoning.
Controlled Evaluation Platform
The team built a controllable experimental platform based on algorithmic puzzle environments, allowing precise manipulation of problem complexity while avoiding data contamination present in existing benchmarks.
Four puzzle environments were used: Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World. In the paper's figures, each environment is shown progressing from an initial state (top) through intermediate states (middle) to a goal state (bottom).
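What makes such environments attractive is that every intermediate move a model proposes can be checked mechanically. As a minimal sketch (my own illustration, not Apple's code), here is a Tower of Hanoi simulator that validates a proposed move sequence and pinpoints the first illegal step:

```python
# Hypothetical sketch (not from the paper): a minimal Tower of Hanoi
# checker showing why algorithmic puzzles allow exact, automatic scoring
# of every intermediate move a model emits.

def apply_moves(n, moves):
    """Apply (src, dst) peg moves to an n-disk start state on peg 0.
    Returns (solved, index_of_first_illegal_move_or_None)."""
    pegs = {0: list(range(n, 0, -1)), 1: [], 2: []}  # peg 0 holds disks n..1
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:                       # nothing to move from src
            return False, i
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:  # larger disk onto smaller: illegal
            return False, i
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1)), None

# A full 2-disk solution is accepted; an illegal sequence is caught at step 1.
print(apply_moves(2, [(0, 1), (0, 2), (1, 2)]))  # (True, None)
print(apply_moves(2, [(0, 2), (0, 2)]))          # (False, 1)
```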
Complexity Levels and Performance
By scaling the problem size N (e.g., the number of disks or checkers), the researchers generated 25 puzzle instances at each complexity level for every model and identified three complexity regimes:
Low‑complexity tasks: Standard LLMs outperform LRMs; LRMs are less token‑efficient.
Medium‑complexity tasks: LRMs begin to show advantages, producing longer reasoning chains that improve performance.
High‑complexity tasks: Performance of both LRMs and standard LLMs collapses completely. LRMs delay the collapse but ultimately fail.
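One way to see how a single parameter can sweep all three regimes (my own illustration, not a result from the paper): for the Tower of Hanoi, the optimal solution requires 2^N − 1 moves, so each increment of N roughly doubles the required plan length.

```python
# Illustrative sketch: the optimal Tower of Hanoi solution for N disks
# has 2**N - 1 moves, so one knob (N) exponentially deepens the plan
# a model must produce -- spanning easy through intractable regimes.

def min_moves(n_disks: int) -> int:
    """Length of the optimal Tower of Hanoi solution for n disks."""
    return 2 ** n_disks - 1

for n in (3, 7, 10, 15):
    print(n, min_moves(n))  # 3->7, 7->127, 10->1023, 15->32767
```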
Performance Collapse and Reasoning Effort
Accuracy declines steadily with increasing complexity and drops to zero beyond a critical threshold. Counterintuitively, reasoning effort (measured in thinking tokens) also decreases as models approach the collapse point, even though they remain well below their generation-length limits, indicating a fundamental limitation in the models’ reasoning capacity rather than a budget constraint.
Analysis of Reasoning Traces
Extracting intermediate solutions revealed distinct patterns:
In low‑complexity scenarios, non‑thinking models are more accurate and token‑efficient.
As complexity rises, reasoning models achieve higher accuracy but require more tokens.
In high‑complexity cases, reasoning models either find the correct answer only very late in the trace or exhaust their tokens on incorrect answers, demonstrating inefficiency.
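The patterns above can be quantified with a simple metric. As a hypothetical sketch of this kind of trace analysis (the exact metric in the paper may differ): given the candidate solutions extracted, in order, from one reasoning trace, report where the first correct candidate appears as a fraction of the trace.

```python
# Hypothetical sketch: locate the first correct candidate solution
# within a reasoning trace. 0.25 means it appeared a quarter of the way
# in ("finds it early, keeps overthinking"); None means it never appeared.

def first_correct_position(candidates, is_correct):
    """Relative position of the first correct candidate, or None."""
    for i, cand in enumerate(candidates):
        if is_correct(cand):
            return (i + 1) / len(candidates)
    return None

# Example: correct answer (42) first appears as the 2nd of 4 candidates.
pos = first_correct_position([11, 42, 42, 7], lambda c: c == 42)
print(pos)  # 0.5
```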
Complexity‑Dependent Reasoning Behaviors
Simple problems: Reasoning models often locate the correct solution early but continue exploring wrong alternatives (“overthinking”).
Medium problems: Models explore many erroneous paths before arriving at the correct solution.
Hard problems: Models fail to find any correct solution.
Open Questions
Even when provided with explicit solution algorithms, reasoning models struggle with precise logical step execution, suggesting deeper verification issues. Different puzzle types expose varied behaviors; for example, Claude‑3.7‑Sonnet can perform up to 100 correct moves in the Tower of Hanoi but only four in the River‑crossing task, hinting at uneven training exposure.
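The point about explicit algorithms is striking because the Tower of Hanoi procedure is tiny: the entire optimal move sequence follows from a three-line recursion. A sketch of that standard recursion (my illustration of the kind of algorithm a model could be given, not the paper's prompt):

```python
# Standard Tower of Hanoi recursion: yields the optimal (src, dst) move
# sequence. Executing these steps faithfully is all that is required,
# yet the study reports models still fail at high N even when the
# procedure is spelled out for them.

def hanoi(n, src=0, aux=1, dst=2):
    """Yield the 2**n - 1 optimal moves for n disks from src to dst."""
    if n == 0:
        return
    yield from hanoi(n - 1, src, dst, aux)  # park n-1 disks on the spare peg
    yield (src, dst)                        # move the largest disk
    yield from hanoi(n - 1, aux, src, dst)  # re-stack the n-1 disks on top

moves = list(hanoi(3))
print(len(moves))  # 7, i.e. 2**3 - 1
```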
Original paper: https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf