
Why Large Reasoning Models Collapse Under Complex Tasks: Insights from Apple’s Study

Apple’s research reveals that large reasoning models, despite sophisticated self‑reflection mechanisms, experience a complete performance collapse when problem complexity exceeds a threshold, highlighting fundamental limits in their ability to achieve generalized reasoning.


Study Overview

Apple researchers examined large reasoning models (LRMs) such as o3‑mini, DeepSeek‑R1, and Claude‑3.7‑Sonnet. Although these models incorporate sophisticated self‑reflection mechanisms, their reasoning ability collapses once problem complexity exceeds a certain threshold, exposing a fundamental limitation in achieving generalized reasoning.

Controlled Evaluation Platform

The team built a controllable experimental platform based on algorithmic puzzle environments, allowing precise manipulation of problem complexity while avoiding data contamination present in existing benchmarks.

Four puzzle environments were used: Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World. Each environment defines a transition from an initial state through intermediate states to a goal state, with a single parameter controlling problem size.
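A minimal sketch of what such a controllable environment might look like, using Tower of Hanoi as the example. The class name and structure here are illustrative assumptions, not the authors' code; the key properties are a complexity knob (the number of disks N) and exact move validation, which together allow accuracy to be measured without contaminated benchmarks.

```python
class TowerOfHanoi:
    """Toy puzzle environment: complexity scales with the number of disks."""

    def __init__(self, n_disks):
        self.n = n_disks
        # Three pegs; disks are integers, larger number = bigger disk.
        # All disks start stacked on peg 0, goal is to move them to peg 2.
        self.pegs = [list(range(n_disks, 0, -1)), [], []]

    def is_legal(self, src, dst):
        # A move is legal if the source peg is non-empty and the moved
        # disk is smaller than the destination peg's top disk.
        if not self.pegs[src]:
            return False
        return not self.pegs[dst] or self.pegs[src][-1] < self.pegs[dst][-1]

    def move(self, src, dst):
        if not self.is_legal(src, dst):
            raise ValueError(f"illegal move {src}->{dst}")
        self.pegs[dst].append(self.pegs[src].pop())

    def is_solved(self):
        return len(self.pegs[2]) == self.n

# The optimal 7-move solution for N = 3, applied move by move.
env = TowerOfHanoi(3)
for src, dst in [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]:
    env.move(src, dst)
print(env.is_solved())  # True
```

Because every intermediate state is checkable, a model's full move sequence, not just its final answer, can be verified against the rules.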

Puzzle environments diagram

Complexity Levels and Performance

By scaling the problem size N (e.g., the number of disks or pieces), the researchers generated 25 samples per model at each complexity level and identified three complexity regimes:

Low‑complexity tasks: Standard LLMs outperform LRMs; LRMs are less token‑efficient.

Medium‑complexity tasks: LRMs begin to show advantages, producing longer reasoning chains that improve performance.

High‑complexity tasks: Performance of both LRMs and standard LLMs collapses completely. LRMs delay the collapse but ultimately fail.
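The sweep described above can be sketched as a small evaluation loop. `query_model` is a stand-in for a real model call (an assumption for illustration); here it simulates the three regimes with a toy success curve in which "thinking" models hold on longer before collapsing, mirroring the paper's finding that LRMs delay but do not avoid the collapse.

```python
import random

def query_model(n, thinking):
    # Toy stand-in for a real model call: success probability falls
    # linearly with problem size and hits zero at a hard threshold.
    # Thinking models get a later threshold, so they delay the collapse.
    threshold = 10 if thinking else 7
    p = max(0.0, 1.0 - n / threshold)
    return random.random() < p

def accuracy(n, thinking, samples=25):
    """Fraction of correct solutions over repeated samples, as in the study."""
    return sum(query_model(n, thinking) for _ in range(samples)) / samples

random.seed(0)
for n in (2, 6, 12):
    print(n, accuracy(n, thinking=False), accuracy(n, thinking=True))
```

With this toy curve, both variants score 0.0 at n = 12: past the threshold, no amount of sampling recovers accuracy, which is the collapse pattern the paper reports.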

Performance Collapse and Reasoning Effort

Accuracy declines steadily with increasing complexity and drops to zero beyond a critical threshold. Near the collapse point, the amount of reasoning effort (measured by tokens used) paradoxically decreases, indicating a fundamental limitation in the models’ reasoning capacity.
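Given per-complexity accuracy measurements, the collapse threshold can be read off mechanically. The sketch below (with made-up numbers for illustration) finds the smallest N at which accuracy reaches zero and never recovers.

```python
def collapse_threshold(acc_by_n):
    """Smallest N at which accuracy hits zero and stays there, else None."""
    ns = sorted(acc_by_n)
    for i, n in enumerate(ns):
        if all(acc_by_n[m] == 0.0 for m in ns[i:]):
            return n
    return None

# Illustrative accuracy-by-complexity data, not the paper's numbers.
acc = {1: 0.96, 3: 0.88, 5: 0.60, 7: 0.12, 9: 0.0, 11: 0.0}
print(collapse_threshold(acc))  # 9
```

Tracking token usage alongside this curve is what exposes the paradox: one would expect reasoning effort to keep rising with difficulty, yet it shrinks just before the threshold.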

Analysis of Reasoning Traces

Extracting intermediate solutions revealed distinct patterns:

In low‑complexity scenarios, non‑thinking models are more accurate and token‑efficient.

As complexity rises, reasoning models achieve higher accuracy but require more tokens.

In high‑complexity cases, reasoning models either find the correct answer much later or waste tokens on incorrect answers, demonstrating inefficiency.
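This kind of trace analysis can be sketched as follows. The `<answer>…</answer>` markup and the checker callback are assumptions for illustration (the paper extracts candidate solutions from reasoning traces; the exact format is not specified here); the idea is to locate where in the trace the first correct candidate appears.

```python
import re

def first_correct_position(trace, is_correct):
    """Return the fractional position (0..1) within the trace of the first
    correct candidate answer, or None if no candidate is correct."""
    candidates = [(m.start(), m.group(1))
                  for m in re.finditer(r"<answer>(.*?)</answer>", trace)]
    for pos, cand in candidates:
        if is_correct(cand):
            return pos / max(len(trace), 1)
    return None

# Hypothetical trace: an early candidate, then reconsideration, then another.
trace = "try A... <answer>12</answer> hmm, reconsider... <answer>15</answer> done"
print(first_correct_position(trace, lambda a: a == "15"))
```

A correct answer found late in the trace (a position near 1.0) corresponds to the inefficiency described above; a correct answer found early followed by continued exploration corresponds to the "overthinking" pattern discussed next.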

Complexity‑Dependent Reasoning Behaviors

Simple problems: Reasoning models often locate the correct solution early but continue exploring wrong alternatives (“overthinking”).

Medium problems: Models explore many erroneous paths before arriving at the correct solution.

Hard problems: Models fail to find any correct solution.

Open Questions

Even when provided with explicit solution algorithms, reasoning models struggle with precise logical step execution, suggesting deeper verification issues. Different puzzle types expose varied behaviors; for example, Claude‑3.7‑Sonnet can perform up to 100 correct moves in the Tower of Hanoi but only four in the River‑crossing task, hinting at uneven training exposure.
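The striking part is that the required procedure is short and fully mechanical. A standard recursive Tower of Hanoi solution (shown below as a sketch; the exact prompt given to the models is not reproduced here) emits all 2^N − 1 moves, yet the models cannot execute this step sequence faithfully at larger N even when the algorithm is handed to them.

```python
def hanoi_moves(n, src=0, aux=1, dst=2):
    """Classic recursion: move n disks from src to dst via aux.

    Move n-1 disks out of the way, move the largest disk, then move the
    n-1 disks on top of it. Produces exactly 2**n - 1 (src, dst) moves.
    """
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi_moves(n - 1, aux, src, dst))

moves = hanoi_moves(10)
print(len(moves))  # 1023
```

Executing this for N = 10 is trivial for any deterministic interpreter, which is why the models' failure to follow it points at verification and step-execution limits rather than at search difficulty.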

Performance collapse diagram
Original paper: https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf
Tags: AI evaluation, model limitations, large reasoning models, problem complexity, token efficiency
Written by Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
