New paper from Turing Award winner Sutton: The next step for AI


From the maturing long-context processing of LLMs, realistic video generation models, and agents that plan and execute autonomously, to the arrival of VLA (vision-language-action) models and world models in the physical world, AI keeps expanding its capabilities.

Model release cycles keep getting shorter, and each new version dominates industry news and technical discussion. Amid this steady progress, we seem to be getting very close to AGI.

But the question remains: do these AIs running on servers really "understand" the world? Put differently, is the intelligence they display fundamentally the same as the cognition displayed by living things in the real physical world?

Recently, researcher Banafsheh Rafiee and Richard S. Sutton, the father of reinforcement learning, co-authored a paper. They systematically reflect on and critique the "passive representation" approach that current mainstream artificial intelligence relies on (including large language models, purely visual models, and even traditional symbolic systems), and bring the "active cognition" framework from cognitive science into the field of AI.

The paper argues that perception, cognition, and action form an inseparable, mutually constituted whole. It explores how AI can move from passive information-processing systems that rely on static data to intelligent agents that gain experience through interaction with the environment, embodied behavior, and self-evaluation.

Paper title: Toward revitalizing artificial intelligence

The world itself is the best model

A large part of current mainstream AI development still follows a classic concept called "representationalism."

In traditional artificial intelligence paradigms, whether in early symbolic systems or today’s deep learning models, cognition is typically understood as a linear process: first input, then processing, then action. The system first receives external signals, then processes these signals into internal representations, then makes inferences and decisions based on these representations, and finally outputs actions.

From this perspective, an intelligent system resembles a central processing unit: internally, it must build as accurate a "copy of the world" as possible. Successful cognition depends on whether this internal model accurately reproduces external reality.

But Rafiee and Sutton point out that this approach has fundamental limitations. The real world is open, dynamic, and infinitely complex. No finite internal model can completely capture all of its states. The world is not a static set of features waiting to be encoded, but a space of possibilities that constantly changes depending on the agent's actions, context, and interaction history.

The paper therefore invokes a famous line from roboticist Rodney Brooks: "The world is its own best model."

This means that the most reliable, up-to-date, and richest information always lies in the outside world rather than inside the agent. Agents should not try to replace reality wholesale with internal representations; they should maintain continuous interaction with the environment, adjusting their actions and expectations and shaping their understanding with real-time feedback.
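The difference between acting on a stored copy of the world and consulting the world itself can be illustrated with a toy example. The sketch below is not from the paper; the one-dimensional world, the drift rate, and the `resense` flag are all assumptions made for illustration. An agent chases a target that keeps moving, either from a single initial snapshot or by re-sensing the target at every step.

```python
def run(steps, resense):
    """Chase a drifting target on a number line; return the final distance to it."""
    target, agent = 10.0, 0.0
    believed_target = target          # internal "copy of the world", taken once
    for _ in range(steps):
        target += 0.5                 # the world keeps changing on its own
        if resense:
            believed_target = target  # consult the world instead of the copy
        # Move at most one unit toward where the agent believes the target is.
        delta = believed_target - agent
        agent += max(-1.0, min(1.0, delta))
    return abs(target - agent)


open_loop = run(40, resense=False)    # acts on the stale snapshot
closed_loop = run(40, resense=True)   # keeps interacting with the world
print(open_loop, closed_loop)
```

In this toy setting the snapshot-based agent arrives where the target used to be and ends up far from it, while the agent that keeps sensing closes the gap. The point is only qualitative: the stored model goes stale, the world does not.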

AI needs to not only “see the world” but also “understand the world in action”

"Active cognition" originates from enactivism in cognitive science. Its central idea is that cognition is generated in the interaction between the embodied subject and the environment, rather than being an internal reproduction of a pre-existing objective world.

It draws on ideas from phenomenology, Gestalt psychology, and ecological psychology. Phenomenology emphasizes that perception is not a reconstruction of the world in the mind, but a direct encounter with the world in the subject's lived experience. Gibson's ecological psychology proposes the concept of "affordances": whether an object in the environment can be grasped, climbed, or passed through depends on its relationship to the specific physical capabilities of the agent.

That is, the world is not passively presented to the agent in the form of abstract features, but has meaning through the actions that the agent can perform.

Bringing these ideas into AI, Rafiee and Sutton distill four key pillars: experience, the inseparability of perception and action, autonomy, and embodiment. All four point to the same conclusion: intelligence is not a static representation of the world, but a process of acting in the environment, obtaining feedback, and maintaining oneself.

Experience

In the active cognition framework, experience is not equivalent to data. Real experiences emerge from continuous, real-time, mutually influencing interactions between agents and the environment. Rather than passively receiving existing data, agents continually acquire skills through action, feedback, failure, and correction.

This also reveals the limitations of current mainstream machine learning. Supervised learning relies on data that has been previously collected and labeled by humans. The model learns only the traces left by the experience, not the experience itself. In contrast, reinforcement learning is closer to the requirements of active cognition. Agents continuously generate new data and capabilities during their interactions by actively exploring their environment, receiving feedback, and adjusting their strategies.

In other words, a truly autonomous system cannot always rely on static datasets prepared by humans and must be able to expand its capabilities through its own experience.
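To make the contrast with a fixed, human-prepared dataset concrete, here is a minimal sketch of an agent that produces its own training data by acting. It uses an epsilon-greedy multi-armed bandit, a textbook reinforcement learning setup chosen for illustration rather than anything taken from the paper; the function name, the success probabilities, and the parameter values are all assumptions.

```python
import random


def learn_from_interaction(success_prob, steps=2000, epsilon=0.1, seed=0):
    """Estimate the value of each action purely from the agent's own trials."""
    rng = random.Random(seed)
    n = len(success_prob)
    counts = [0] * n       # how often each action has been tried
    values = [0.0] * n     # running average reward per action
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                          # explore
        else:
            arm = max(range(n), key=lambda a: values[a])    # exploit
        # The environment answers the action with feedback; no label is given.
        reward = 1.0 if rng.random() < success_prob[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return counts, values


counts, values = learn_from_interaction([0.2, 0.5, 0.8])
best_arm = max(range(len(values)), key=lambda a: values[a])
print(counts, best_arm)
```

No dataset exists before the loop starts. Every data point is the consequence of an action the agent chose, which is the sense in which experience differs from pre-collected data.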

Inseparability of perception and action

Active cognition rejects the division of perception and action into two independent modules. Perception is not a preparatory step before action; perception is itself a capacity for action.

Humans do not passively receive images. Through eye, head, body, and hand movements, we constantly change our sensory input to make out space, sound, texture, and the shape of objects. In other words, perception does not wait for information to enter the brain, but instead reveals the structure of the environment through purposeful action.

This is especially relevant to today's video generation models. A purely observational system may learn a number of visual regularities, such as predicting how objects move or the order in which signals change, but this does not mean that it truly understands the physical world. When something goes wrong in the environment, such a system often lacks the ability to proactively intervene, experiment, and correct course.

Active cognition emphasizes this very point. In addition to predicting how the world will change, agents must be able to change the world through their actions and shape their understanding with feedback.

Autonomy

Active cognition considers agents to be self-organizing and self-maintaining systems, rather than simply machines that respond to external stimuli. Things in the environment have meaning because they relate to the agent’s own goals, needs, and continued existence.

This means that agents must have some internal criteria for success and failure. Food, obstacles, and energy are important not because they are inherently important, but because they influence whether an agent can continue an action, maintain a state, or achieve its goal.

From this perspective, many current AI systems still lack true autonomy. Supervised learning relies on external labels, large language models primarily imitate patterns in human data, and the goals of traditional planning systems are mostly preset by humans. Although reinforcement learning introduces behavioral evaluation through reward mechanisms, most reward functions are still specified by external designers and do not emerge naturally from the agent's self-maintenance process.

Therefore, current AI is still far from true autonomy.
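One way to picture an evaluation signal that comes from the agent's own self-maintenance, rather than from a designer's task score, is a drive-reduction reward. The sketch below is an illustrative assumption, not the mechanism proposed in the paper: the agent has a single internal variable (energy) with a setpoint, and an event is good or bad only insofar as it moves the agent toward or away from that setpoint. The names `drive` and `intrinsic_reward` and all numbers are hypothetical.

```python
def drive(energy, setpoint=1.0):
    """Distance of the internal state from the level the agent must maintain."""
    return abs(setpoint - energy)


def intrinsic_reward(energy_before, energy_after, setpoint=1.0):
    """Positive when an event brings the agent closer to its setpoint."""
    return drive(energy_before, setpoint) - drive(energy_after, setpoint)


# The same meal (+0.5 energy) evaluated by the same agent in two different states.
hungry_eats = intrinsic_reward(0.25, 0.75)   # moves toward the setpoint
sated_eats = intrinsic_reward(1.0, 1.5)      # overshoots the setpoint
print(hungry_eats, sated_eats)
```

The food is not rewarding in itself: the identical event is scored positively for a depleted agent and negatively for a sated one, because the score is defined relative to the agent's own condition.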

Embodiment

The final pillar of active cognition is embodiment. The body is not merely an execution tool used after an intelligent system has completed its reasoning, but a prerequisite for perceiving and understanding the world.

Body shape, sensor placement, locomotion capabilities, and modes of action directly determine how an agent explores the environment and how the world makes sense to it. The same chair is "sittable" for a human and a huge obstacle for an ant; for a robot, it depends on whether the robot has the appropriate height, joint structure, and control ability.
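The chair example can be written down as a relation between a body and an object rather than as a property of the object alone. The following is a deliberately simplified sketch: the `Body` and `Chair` types, the single leg-length parameter, and the ratio thresholds are all hypothetical values chosen for illustration, not measurements or anything specified in the paper.

```python
from dataclasses import dataclass


@dataclass
class Body:
    name: str
    leg_length_m: float    # hypothetical stand-in for the agent's physical build


@dataclass
class Chair:
    seat_height_m: float


def affords_sitting(body, chair, low=0.3, high=0.8):
    """True if the seat height falls in a usable range relative to the body."""
    ratio = chair.seat_height_m / body.leg_length_m
    return low <= ratio <= high


chair = Chair(seat_height_m=0.45)
human = Body("human", leg_length_m=0.9)
ant = Body("ant", leg_length_m=0.005)
small_robot = Body("small robot", leg_length_m=0.3)

results = {b.name: affords_sitting(b, chair) for b in (human, ant, small_robot)}
print(results)
```

The chair never changes, yet the answer does. "Sittable" cannot be stored as a feature of the chair; it only exists for a particular body.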

This explains why many mainstream AIs remain “disembodied.” Although they can process large amounts of text, images, and video, they lack the ability to modify perceptual input through their own movements and are unable to actively explore and adapt to changes in their real-world environment.

Even in the field of robotics, many systems still divide perception, planning, and control into independent modules. The body is only a hardware platform for executing strategies, not a core condition for forming cognition itself.

The next step for reinforcement learning?

Rafiee and Sutton offer a clear verdict on the current AI paradigm along four dimensions: experience, perception-action, autonomy, and embodiment. Mainstream AI, especially large language models and purely visual models, remains primarily at the level of passive representation and pattern prediction.

These systems can generate highly realistic text, images, and video, and they demonstrate powerful reasoning and planning abilities in complex tasks. However, because they lack continuous interaction with the environment, outcome-based evaluation of their own actions, and a truly embodied process of exploration, an important gap remains between them and a genuine "understanding of the world."

In contrast, reinforcement learning has a stronger structural affinity with active cognition. RL emphasizes action, feedback, exploration, adaptation, and long-term evaluation, making it the branch of AI closest to the idea of active cognition.

However, this closeness does not imply equivalence. Current reinforcement learning still has three shortcomings. First, most reward functions are specified externally, rather than arising from the agent's own self-maintenance and organization. Second, many systems still separate perception and action into relatively independent steps. Third, embodiment is often treated as an engineering constraint rather than as a basis for the formation of cognition.

Therefore, reinforcement learning also needs to evolve further: from external rewards to more internal self-evaluation, from task-driven operation to continuous survival and adaptation, and from mere policy optimization to the generation of truly embodied experience.

This article is from the official WeChat account “MachineHeart” (ID:mosthuman2014). The author has academic interests. Published with permission from 36Kr.


