The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Recent generations of frontier language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final-answer accuracy. However, this evaluation paradigm often suffers from data contamination and does not provide insights into the structure and quality of the reasoning traces. In this work, we systematically investigate these gaps using controllable puzzle environments that allow precise manipulation of compositional complexity while maintaining consistent logical structures. This setup enables analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs “think”. Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite an adequate token budget. By comparing LRMs with their standard LLM counterparts under equivalent inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where the additional thinking in LRMs demonstrates an advantage, and (3) high-complexity tasks where both models experience complete collapse. We find that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models’ computational behavior, shedding light on their strengths and limitations and ultimately raising crucial questions about their true reasoning capabilities.

*Equal contribution.
†Work done during an internship at Apple.
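
To make “precise manipulation of compositional complexity” concrete, the sketch below shows the kind of controllable puzzle environment the abstract describes, using Tower of Hanoi, one of the puzzle families studied in the paper. This is an illustrative reconstruction, not the authors’ evaluation harness: the function names (`hanoi_moves`, `is_valid_solution`) are hypothetical. The disk count n serves as a single complexity knob, since the optimal solution has exactly 2^n − 1 moves, while the logical structure of the task stays fixed across all n.

```python
# Illustrative sketch (not the authors' code): a controllable puzzle
# environment in the spirit of the abstract, using Tower of Hanoi.
# Difficulty scales predictably with n while the rules never change.

def hanoi_moves(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> list[tuple[int, int]]:
    """Return the optimal move sequence (from_peg, to_peg) for n disks."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi_moves(n - 1, aux, src, dst))

def is_valid_solution(n: int, moves: list[tuple[int, int]]) -> bool:
    """Check a proposed move list, e.g. one parsed from a model's answer."""
    pegs = [list(range(n, 0, -1)), [], []]  # peg 0 holds disks n..1, largest at bottom
    for src, dst in moves:
        if not pegs[src]:
            return False                     # moving from an empty peg
        disk = pegs[src].pop()
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                     # larger disk placed on a smaller one
        pegs[dst].append(disk)
    return pegs[2] == list(range(n, 0, -1))  # all disks on the target peg

for n in range(1, 6):
    moves = hanoi_moves(n)
    assert is_valid_solution(n, moves)
    print(f"n={n}: optimal length {len(moves)} (= 2^{n} - 1)")
```

Because a harness like this scores a proposed move sequence exactly, accuracy can be measured as a function of the complexity parameter alone, which is what makes claims such as a complete accuracy collapse beyond a specific complexity directly testable.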

