Apple researchers find “major” flaws in AI reasoning models ahead of WWDC 2025

Newly released research from Apple Machine Learning Research challenges the prevailing idea that large language models (LLMs) like OpenAI's o1 and Claude's thinking variants are truly capable of “reasoning.” The study points to fundamental limitations in these AI systems. For the study, Apple researchers designed controllable puzzle environments such as Tower of Hanoi and River Crossing, avoiding standard math benchmarks that are susceptible to data contamination. According to the researchers, these custom environments allowed precise analysis of both the final answers the LLMs generated and their internal reasoning at various complexity levels.
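To make the idea of a controllable puzzle environment concrete, here is a minimal Python sketch of Tower of Hanoi. It is not the researchers' evaluation code; the function names (`solve_hanoi`, `first_invalid_move`, `is_solved`) and the peg numbering are assumptions made for illustration. The shortest solution for n disks takes 2^n - 1 moves, so difficulty can be dialed up one disk at a time, and a simulator can check every intermediate move, not only the final answer.

```python
from typing import List, Tuple

# A move is (source peg, destination peg); pegs are numbered 0, 1, 2.
# This representation is an illustrative assumption, not the study's format.
Move = Tuple[int, int]


def solve_hanoi(n: int, src: int = 0, dst: int = 2, aux: int = 1) -> List[Move]:
    """Return the shortest move list that transfers n disks from src to dst."""
    if n == 0:
        return []
    return (
        solve_hanoi(n - 1, src, aux, dst)    # park the n-1 smaller disks on aux
        + [(src, dst)]                       # move the largest disk
        + solve_hanoi(n - 1, aux, dst, src)  # bring the smaller disks back on top
    )


def first_invalid_move(n: int, moves: List[Move]) -> int:
    """Simulate the moves; return the index of the first illegal one, or -1."""
    pegs = [list(range(n, 0, -1)), [], []]  # peg 0 starts with disks n..1
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return i  # nothing to pick up from an empty peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return i  # a larger disk may not sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    return -1


def is_solved(n: int, moves: List[Move]) -> bool:
    """True if every move is legal and all n disks end up on peg 2."""
    if first_invalid_move(n, moves) != -1:
        return False
    pegs = [list(range(n, 0, -1)), [], []]
    for src, dst in moves:
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))


if __name__ == "__main__":
    for disks in range(1, 11):
        moves = solve_hanoi(disks)
        print(disks, len(moves), is_solved(disks, moves))
```

A checker like `first_invalid_move` shows why such puzzles suit this kind of analysis: a model's proposed move sequence can be replayed step by step, and the exact point where it goes wrong can be located.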

What Apple researchers found out from this study

According to a MacRumors report, the reasoning models tested by Apple researchers, including o3-mini, DeepSeek-R1 and Claude 3.7 Sonnet, collapsed completely once problem complexity exceeded a certain threshold. Success rates dropped to zero even though the models had sufficient computational resources. Counterintuitively, the models reduced their reasoning effort as problems became harder, which points to a fundamental scaling limitation rather than a lack of resources.

Even more telling, when researchers provided a complete solution algorithm, the models still failed at the same complexity points. This indicates that the limitation lies in executing basic logical steps rather than in choosing the right problem-solving strategy.

The models also showed puzzling inconsistencies. They succeeded on problems that required more than 100 moves, yet failed on simpler puzzles that needed only 11.

The study identified three performance regimes. Standard models outperformed reasoning models on low-complexity problems. Reasoning models had the advantage at medium complexity. Both types failed completely at high complexity. Researchers also found that the models exhibit inefficient “overthinking” patterns, often finding the correct solution early but then wasting computation exploring incorrect alternatives.

The key takeaway is that current “reasoning” models rely heavily on sophisticated pattern matching rather than true reasoning. They do not reason the way humans do: they tend to overthink easy problems and think less when faced with harder ones.

The timing is notable, as the study surfaced just days before WWDC 2025. According to Bloomberg, Apple is expected to focus on new software designs rather than on headline-grabbing AI features at this year's event.

