Recent generations of frontier language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations focus primarily on established mathematical and coding benchmarks, emphasizing final-answer accuracy. However, this evaluation paradigm often suffers from data contamination and does not provide insight into the structure and quality of the reasoning traces. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of compositional complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insight into how LRMs “think”. Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget. By comparing LRMs with their standard LLM counterparts under equivalent inference compute, we identify three performance regimes: (1) low-complexity tasks, where standard models surprisingly outperform LRMs; (2) medium-complexity tasks, where the additional thinking in LRMs shows an advantage; and (3) high-complexity tasks, where both models experience complete collapse. We find that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles.
We also investigate the reasoning traces in greater depth, studying the patterns of explored solutions and analyzing the models' computational behavior. This sheds light on their strengths and limitations, and ultimately raises important questions about their true reasoning capabilities.
*Equal contributions.
†Work done during an internship at Apple.
