From Code to Cognition: A Deep Dive into the Evolution of Robot Brains


Author: Matt White, Global Chief Technology Officer, Linux Foundation

Compiled by: Felix, PANews

Humanoid robot

Wang Xingxing (CEO of Unitree Robotics) and Matt White

A few weeks ago in Shanghai, over dinner, a travel companion (intelligent, well read on the news and observant of the world, but unfamiliar with robotics) asked the question I had been hoping to hear the entire trip.

The robot dogs we see running around, the humanoid robots performing kung fu on demonstration stages at Unitree’s office, and the robotic arms folding clothes—how do they do it? Are they powered by large language models (LLMs)? How exactly does this work? Is there some kind of language model controlling their movements?

That’s a great question, and to be honest: in a way, yes—but the real story is much more interesting. The bots you see on social media aren’t ChatGPT in metal shells. They run a stack of technologies (multiple AI layers working together), and this stack has changed more in the past three years than it did in the past thirty. Language models are just one part of it. Vision models, action models, behavior trees, classical control loops, and a growing family of systems called “world models” are all critical components—and “world models” may be the most important development of all.

This is a long article that will begin at the beginning, gradually describe each major transformation, and ultimately arrive at the current stage: robots can not only react to the world but also imagine it.

One: The Pre-LLM Era: When Bots Were Just Software

For decades, building robots meant writing vast amounts of code, almost all of which didn’t require learning.

Classic robots, such as the orange robotic arms that welded Toyota chassis in the 1990s or Boston Dynamics’ BigDog in the early 2000s, ran on a tower of carefully designed software modules stacked on top of one another.

  • Perception: Filter camera footage, perform edge detection, and use geometric matching to identify the position of the workpiece.
  • State estimation: Determine the robot’s position and velocity by combining wheel encoders, gyroscopes, and accelerometers (sensor fusion).
  • Planning: Given a target pose, compute a collision-free path through a known map using algorithms such as A* or RRT.
  • Control: At the lowest level, a PID controller adjusts motor torque hundreds or thousands of times per second to follow the path.
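
To make the control layer at the bottom of this list concrete, here is a minimal sketch of a PID loop. The gains, the 1 kHz time step, and the one-line toy joint model are assumptions chosen purely for illustration, not values from any real robot.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook PID controller; the gains used below are illustrative assumptions."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, target: float, measured: float, dt: float) -> float:
        error = target - measured
        self.integral += error * dt                    # accumulated error (I term)
        derivative = (error - self.prev_error) / dt    # rate of change (D term)
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps: int = 5000, dt: float = 0.001) -> float:
    """Drive a toy joint toward 1.0 rad for 5 simulated seconds at 1 kHz."""
    pid = PID(kp=8.0, ki=2.0, kd=0.1)
    angle = 0.0
    for _ in range(steps):
        command = pid.step(target=1.0, measured=angle, dt=dt)
        angle += command * dt  # hypothetical plant: joint velocity equals the command
    return angle


final_angle = simulate()
```

In this toy setup the joint ends up close to the 1.0 rad target; a real controller would be tuned against the actual motor and load.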

These layers are typically written by different personnel in different labs and meticulously stitched together. Behaviors (such as “pick up the cup if it is red, otherwise wait”) are encoded as state machines or behavior trees: flowcharts that the robot executes step by step.
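
The cup example above can be written as a tiny behavior tree. This is a sketch under stated assumptions: the node helpers (`condition`, `action`, `sequence`, `selector`) and the blackboard key `cup_color` are hypothetical names, not the API of any particular behavior-tree library.

```python
from typing import Callable, Dict, List

Blackboard = Dict[str, str]
Node = Callable[[Blackboard], str]
SUCCESS, FAILURE = "SUCCESS", "FAILURE"


def condition(key: str, expected: str) -> Node:
    """Leaf that succeeds only if the blackboard holds the expected value."""
    def tick(bb: Blackboard) -> str:
        return SUCCESS if bb.get(key) == expected else FAILURE
    return tick


def action(name: str, log: List[str]) -> Node:
    """Leaf that 'executes' a skill by recording its name."""
    def tick(bb: Blackboard) -> str:
        log.append(name)
        return SUCCESS
    return tick


def sequence(*children: Node) -> Node:
    """Run children in order; stop at the first failure."""
    def tick(bb: Blackboard) -> str:
        for child in children:
            if child(bb) == FAILURE:
                return FAILURE
        return SUCCESS
    return tick


def selector(*children: Node) -> Node:
    """Try children in order; stop at the first success."""
    def tick(bb: Blackboard) -> str:
        for child in children:
            if child(bb) == SUCCESS:
                return SUCCESS
        return FAILURE
    return tick


def run_cup_tree(cup_color: str) -> List[str]:
    """'Pick up the cup if it is red, otherwise wait.'"""
    log: List[str] = []
    tree = selector(
        sequence(condition("cup_color", "red"), action("pick_up_cup", log)),
        action("wait", log),
    )
    tree({"cup_color": cup_color})
    return log
```

Every branch the robot can take is visible in the tree, which is exactly why this style is predictable and also why it cannot handle anything the engineer did not anticipate.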

The advantages of this approach are clear. It is predictable and can be certified against safety standards. That is why the anti-lock braking system (ABS) in your car works reliably.

The drawbacks are equally obvious. Such a robot can only demonstrate its intelligence in scenarios pre-defined by engineers. Once placed in a new factory, under different lighting conditions, or with cups of a new color, it fails completely. Its ability to generalize is nearly zero.

Two: Machine learning is quietly stepping in

In the 2010s, deep learning began addressing perception-layer problems. Convolutional neural networks (CNNs) that outperformed humans on ImageNet image classification tasks could be retrained to detect grasp points on objects, segment furniture in a room, or recognize human poses. Suddenly, the “perception” layer at the top of the technology stack no longer required manual design—you could train it directly.

Subsequently, learning spread to the “control” layer. Researchers from Berkeley, DeepMind, and OpenAI demonstrated that reinforcement learning, in which a simulated robot tries millions of actions and successful behaviors are reinforced, can produce remarkably skilled gaits, in-hand object manipulation (OpenAI’s one-handed Rubik’s Cube solve in 2019 was a milestone), and adaptive locomotion strategies for varied terrain.

Another parallel research direction is imitation learning, often referred to as behavioral cloning: recording hundreds of attempts by a human remotely controlling a robot to complete a task, then training a neural network to predict what actions a human would take based on what the robot observes.
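
Behavioral cloning is, at its core, supervised learning on (observation, action) pairs. The sketch below fits the simplest possible policy, a straight line, to made-up demonstrations; real systems use neural networks and camera images, and the data here is a hypothetical stand-in for teleoperation logs.

```python
from typing import List, Tuple


def fit_linear_policy(demos: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares fit of action = w * observation + b to demonstration pairs."""
    n = len(demos)
    mean_obs = sum(obs for obs, _ in demos) / n
    mean_act = sum(act for _, act in demos) / n
    cov = sum((obs - mean_obs) * (act - mean_act) for obs, act in demos)
    var = sum((obs - mean_obs) ** 2 for obs, _ in demos)
    w = cov / var
    b = mean_act - w * mean_obs
    return w, b


# Hypothetical demonstrations: the operator always moves the gripper toward
# position 0.5, with a velocity proportional to the remaining distance.
demos = [(x / 10, 2.0 * (0.5 - x / 10)) for x in range(11)]
w, b = fit_linear_policy(demos)


def policy(observation: float) -> float:
    """Cloned policy: predicts what the human would have done."""
    return w * observation + b
```

The cloned policy reproduces the operator's behavior on the states it has seen, which also hints at the weakness: it has no basis for acting sensibly far outside the demonstrated range.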

The key issue is that each learned strategy is too narrow. Train a network to pick up a red block, and it won’t know how to handle a yellow cup. Train it to walk on grass, and it will fall on tile flooring. Generalization remains an urgent challenge.

Notably, during this period, an infrastructure emerged that still underpins nearly everything today: ROS, the Robot Operating System (first released in November 2007). ROS is not an operating system in the sense of Windows or Linux, but rather a middleware framework—a universal robotic pipeline system. It enables “camera nodes,” “navigation nodes,” “arm controller nodes,” and dozens of other nodes to publish and subscribe to messages through a shared bus.
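
The publish/subscribe idea can be sketched in a few lines. This is a toy in-process message bus written for illustration, not the ROS or ROS2 API; the class `MessageBus` and the topic names are assumptions.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List


class MessageBus:
    """Toy stand-in for a shared bus: topics map to lists of callbacks."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: Any) -> None:
        for callback in self._subscribers[topic]:
            callback(message)


bus = MessageBus()
received: List[Dict[str, float]] = []


def navigation_node(detection: Dict[str, bool]) -> None:
    """Hypothetical node: stop if the camera node reports an obstacle."""
    speed = 0.0 if detection["obstacle"] else 0.5
    bus.publish("/cmd_vel", {"linear_x": speed})


bus.subscribe("/camera/detections", navigation_node)
bus.subscribe("/cmd_vel", received.append)  # stands in for a motor controller node

bus.publish("/camera/detections", {"obstacle": False})
bus.publish("/camera/detections", {"obstacle": True})
```

The point of the design is decoupling: the camera node does not know who consumes its messages, so any node can be swapped out without touching the others.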

ROS2 now runs underneath the vast majority of research and commercial robots worldwide, from Stanford University’s labs to Chinese humanoid robotics startups. When people refer to a robot’s “operating system,” they almost always mean ROS2 along with the various perception, planning, and control packages running on top of it.

ROS2: It is not an operating system, but a universal pipeline that enables independent robot software to communicate with each other.

Three: Application of LLMs in Robotics

Then, ChatGPT was born.

Suddenly, there was this thing called an LLM. It could read plain English instructions, perform multi-step reasoning, write code, and call functions. Roboticists immediately realized this was the missing piece they had spent years looking for. The hardest part of getting a robot to perform useful tasks in a home or office is usually not motor control, but human-robot interaction: how do people tell the robot what to do, and how does the robot break that goal down into atomic actions it already knows how to execute?

The first wave of applying LLMs to robotics treated language models as natural language compilers layered on top of ROS. The pattern is as follows:

  1. A person gives an instruction in plain language, for example: “Bring the coffee cup from the kitchen counter and place it on my desk.”

  2. The LLM generates a plan based on the list of atomic skills available to the bot: it can be a sequence of function calls, a state machine, or a behavior tree written in XML.

  3. ROS2 nodes will execute the plan step by step. If any step fails, the failure information will be reported to the LLM for replanning.
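
The three steps above amount to a plan, execute, replan loop. The sketch below shows only the control flow: `stub_planner` is a hypothetical stand-in for the LLM call, the skill names are invented, and the "robot" is a simulated executor whose first grasp attempt fails.

```python
from typing import Callable, List, Optional

# Hypothetical plan for the coffee-cup instruction; a real system would ask an LLM.
FULL_PLAN = [
    "navigate_to(kitchen_counter)",
    "grasp(coffee_cup)",
    "navigate_to(desk)",
    "place(coffee_cup, desk)",
]


def stub_planner(instruction: str, failed_step: Optional[str]) -> List[str]:
    """Stand-in for the LLM: returns skill calls, replanning from a failed step."""
    if failed_step is None:
        return list(FULL_PLAN)
    # Pretend the model decided to reopen the gripper and retry from the failure.
    return ["open_gripper()"] + FULL_PLAN[FULL_PLAN.index(failed_step):]


def run(instruction: str,
        planner: Callable[[str, Optional[str]], List[str]],
        execute_step: Callable[[str], bool],
        max_replans: int = 2) -> List[str]:
    """Execute the plan step by step, reporting failures back to the planner."""
    executed: List[str] = []
    replans = 0
    plan = planner(instruction, None)
    while plan:
        step = plan.pop(0)
        if execute_step(step):
            executed.append(step)
        elif replans < max_replans:
            replans += 1
            plan = planner(instruction, step)
        else:
            raise RuntimeError(f"giving up after repeated failure at {step}")
    return executed


def make_flaky_executor() -> Callable[[str], bool]:
    """Simulated robot whose first grasp attempt fails."""
    grasp_attempts = 0

    def execute_step(step: str) -> bool:
        nonlocal grasp_attempts
        if step.startswith("grasp"):
            grasp_attempts += 1
            return grasp_attempts > 1
        return True

    return execute_step


history = run(
    "Bring the coffee cup from the kitchen counter and place it on my desk.",
    stub_planner,
    make_flaky_executor(),
)
```

Note that the planner never touches a motor: it only chooses among skills that already exist.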

Google’s 2022 SayCan project is a clean version of this idea: the LLM proposes skills, a separate “affordance” model estimates how likely each skill is to succeed in the current scene, and at each step the robot selects the skill with the highest combined score. Open frameworks such as ROS-LLM (led by Huawei’s research lab), ROSGPT, and ROSA have popularized this approach.
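
A rough sketch of that scoring rule, assuming the combined score is simply the product of the two numbers. The skill names and scores below are invented for illustration.

```python
from typing import Dict, Tuple


def select_skill(llm_scores: Dict[str, float],
                 affordance_scores: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Combine 'is this useful?' (LLM) with 'can I do it right now?' (affordance)."""
    combined = {
        skill: llm_scores[skill] * affordance_scores.get(skill, 0.0)
        for skill in llm_scores
    }
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical numbers for the instruction "wipe up the spill".
llm_scores = {"pick up sponge": 0.6, "go to table": 0.3, "pick up apple": 0.1}
# The sponge is not in view yet, so grasping it is unlikely to succeed.
affordance_scores = {"pick up sponge": 0.1, "go to table": 0.9, "pick up apple": 0.8}

best_skill, combined = select_skill(llm_scores, affordance_scores)
```

Here the language model prefers the sponge, but the affordance model vetoes it because the sponge is out of reach, so the robot walks to the table first.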

This is indeed a significant leap. Suddenly, you can tell a robot, “Clean the table and put the recyclables in the blue bin,” and it will attempt to perform some reasonable actions. But note that there are still some issues here: the language model remains at the planning level. The actual motion commands are still generated by underlying controllers that have been carefully designed or specifically trained. The language model is merely an intelligent scheduler—it does not handle actuation.

Four: Vision-Language-Action Models (VLA), when the brain begins to drive robots

The Keenon XMAN-R1 robot is retrieving medications from shelves at Galbot's automated pharmacy in Beijing—for just $100,000.

The next leap will be harder, but also more important. Researchers have posed a more ambitious question: What if a model could not only plan, but also directly generate action commands? What if, by feeding camera images and language instructions directly into a neural network, we could obtain the next millisecond of joint movements?

This is the vision-language-action model (VLA). It is now the dominant paradigm in the field of humanoid and quadruped robots.

The first widely known VLA is RT-2, launched by Google DeepMind in 2023. Its cleverness lies in taking a large vision-language model, already trained on image captioning and question answering, and further training it on robot demonstration data, treating robot actions as additional tokens to be predicted. The same neural network that once output “a cat sitting on a mat” can now output a sequence of tokens encoding “move the right gripper forward 3 centimeters, close the gripper, lift 5 centimeters.” Both reasoning and action are performed within the same model.
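
One way to picture "actions as tokens" is uniform binning: each continuous action value is mapped to one of a fixed number of integer bins, and the integer is treated like a word in the vocabulary. The sketch below assumes 256 bins and a range of plus or minus 10 centimeters; both are illustrative assumptions, not a description of any specific model's tokenizer.

```python
NUM_BINS = 256            # assumed vocabulary size per action dimension
LOW_M, HIGH_M = -0.10, 0.10   # assumed range of one end-effector axis, in meters


def encode(value: float, low: float = LOW_M, high: float = HIGH_M) -> int:
    """Map a continuous action value to a discrete token id in [0, NUM_BINS - 1]."""
    clipped = min(max(value, low), high)
    token = int((clipped - low) / (high - low) * NUM_BINS)
    return min(token, NUM_BINS - 1)  # the upper bound falls into the last bin


def decode(token: int, low: float = LOW_M, high: float = HIGH_M) -> float:
    """Map a token id back to the center of its bin."""
    bin_width = (high - low) / NUM_BINS
    return low + (token + 0.5) * bin_width


forward_3cm = 0.03
token = encode(forward_3cm)
recovered = decode(token)
```

The round trip loses at most half a bin width (under 0.4 millimeters with these assumed settings), which is the price of squeezing motion into a language model's vocabulary.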

Then, in mid-2024, a team led by Stanford University released OpenVLA, a 7-billion-parameter open-source VLA model trained on the Open X-Embodiment dataset. This dataset aggregates over one million robot episodes from 21 research institutions, covering 22 distinct robot bodies. For the first time outside of Google, people could download a general-purpose robot model and begin modifying it. It transformed the entire field overnight.

Today, leading VLAs, though few in number, are growing rapidly:

  • π0 and π0.5 from Physical Intelligence: Exceptional task adaptability.
  • NVIDIA Isaac GR00T N1.7: Open weights, commercial license, designed specifically for humanoid robots, and the model most Chinese hardware companies are currently fine-tuning using their own data.
  • Figure AI’s Helix and the updated Helix-02: proprietary technology, but critically important in architecture.
  • AgiBot’s Genie Envisioner: A world-model-based platform from China.
  • SmolVLA, NORA, ACoT-VLA, CogACT: An increasing number of VLAs are emerging in academia, exploring diverse design directions.

How VLA works (without mathematical formulas)

You can think of a VLA as three data streams: two flowing in and one flowing out.

The first data stream is vision. Images from RGB cameras (sometimes joined by depth sensors or LiDAR, and occasionally by tactile sensors on the fingertips) are processed by a visual encoder (typically a Transformer model such as DINOv2 or SigLIP), which compresses each image into hundreds of “visual tokens” that summarize what the robot sees.

The second data stream is language. Your instruction (“Pass me the screwdriver”) is converted into tokens, just like in ChatGPT.

These two streams are concatenated and fed into a Transformer “backbone” (typically a small open-source language model like Qwen3 or Llama). The backbone does the reasoning, relating what the robot sees to what it has been asked to do.

The third data stream is action, flowing out of the other end. This is where architectural designs diverge:

  • Discrete action tokens: The model directly generates tokens that can be decoded into joint angles or end-effector positions, similar to how ChatGPT generates words. This approach is simple but can cause stuttering during high-frequency operation.
  • Diffusion or flow-matching action head: A standalone mini-network takes the backbone’s output and denoises it to generate a smooth trajectory of joint positions, similar to image diffusion models but generating motion instead. This is what π0 does, producing smoother, more natural actions.
  • Action chunking: Instead of predicting a single next instruction, predict a set of instructions for the next half-second to smooth out jitter.
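
Action chunking, the last item above, is easy to sketch: the policy is queried once and returns a short run of future actions, which the controller plays back before asking again. The 50 Hz control rate (so that half a second equals 25 actions) and the stub policy are assumptions for illustration.

```python
from typing import Callable, List, Tuple

CONTROL_HZ = 50                 # assumed control rate
CHUNK_SIZE = CONTROL_HZ // 2    # half a second of actions per policy call


def stub_policy(step: int, chunk_size: int) -> List[float]:
    """Hypothetical policy: returns the next `chunk_size` joint targets."""
    return [0.01 * (step + k) for k in range(chunk_size)]


def run_with_chunking(policy: Callable[[int, int], List[float]],
                      total_steps: int,
                      chunk_size: int) -> Tuple[List[float], int]:
    """Execute `total_steps` control steps, querying the policy once per chunk."""
    executed: List[float] = []
    policy_calls = 0
    while len(executed) < total_steps:
        chunk = policy(len(executed), chunk_size)
        policy_calls += 1
        remaining = total_steps - len(executed)
        executed.extend(chunk[:remaining])
    return executed, policy_calls


executed, policy_calls = run_with_chunking(stub_policy, total_steps=100, chunk_size=CHUNK_SIZE)
```

Two seconds of motion now cost four slow model calls instead of one hundred, and the actions within a chunk are planned together, which is what smooths out the jitter.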

In the VLA model: two input streams are fed in, a motion command is output, and reasoning and action are integrated into a single network.

This is the crucial architectural shift: reasoning and action are no longer separated. Teaching the neural network to recognize a cup also teaches it how to grasp the cup. It is this coupling that enables VLAs to generalize, something their predecessors could not do.

Five: The Dual-Brain Strategy — How LLMs and VLAs Work Together

There is a detail rarely explained in marketing: today’s highest-performing humanoid robots do not run a single VLA system, but rather two models operating at different speeds that communicate with each other. This is sometimes called a dual-system or System 1/System 2 architecture, drawing from Daniel Kahneman’s psychological framework, which posits that humans possess a fast, intuitive brain and a slow, deliberate thinking brain.

Figure AI’s Helix made this design iconic, and now it (and its variants) are nearly universally imitated. Notably, NVIDIA’s GR00T N1.7 has adopted this design, as have most Chinese humanoid robots. Its structure is as follows:

  • System 2 (S2): The slow-thinking brain. A vision-language model with 7 billion parameters, running at a frequency of approximately 7–9 Hz (i.e., 7 to 9 times per second). Its role is to observe scenes, interpret instructions, perform multi-step reasoning (e.g., “The bowl is behind the cereal box; I need to move the box first”), and generate high-level intentions—typically a compact set of internal vectors rather than text itself.
  • System 1 (S1): The fast-reacting brain. A much smaller (approximately 80 million parameters) visuomotor policy model running at 200 Hz. It receives the intention vector from S2 plus the latest sensor data and outputs continuous joint commands. It does not engage in any meaningful “thinking”; it simply reacts.

Recently, Figure’s Helix-02 added System 0, a reflexive layer situated beneath the dual-brain system rather than a third cognitive layer. It is a network with 10 million parameters running at 1 kHz, responsible for low-level balance and full-body coordination, and it replaces over 100,000 lines of handwritten C++ motion control code with neural controllers. Think of S0 as a learned spinal cord: it does not reason or plan. It simply maintains posture and coordination, while higher-level thinking is handled by the dual-brain system above.

The dual-brain architecture of modern humanoid robots: System 2 thinks slowly, System 1 reacts quickly—beneath it lies a System 0 reflex layer for maintaining balance, tactile contact, and full-body coordination

This division stems from physical limitations. If motion commands are issued only every 200 milliseconds (the speed of a large VLA), the robot’s movements would be as sluggish as moving underwater. The update rate of motion commands must exceed the natural oscillation frequency of the joints it controls, meaning hundreds or thousands of updates per second are required. No 70-billion-parameter Transformer model can run this fast on a battery-powered robot.

Thus, cognitive tasks are divided: a large, slow model handles thinking; a small, fast model handles action. They do not communicate in English, but through learned latent vectors: the slow model emits abstract goals, and the fast model knows how to interpret them.
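
The two-speed arrangement can be sketched as a single loop in which the slow model refreshes a latent vector only every few ticks. This is a simplification: the rates (8 Hz, within the 7–9 Hz quoted above, and 200 Hz) come from the text, but the stub models are hypothetical, and in a real system the two models would more likely run asynchronously.

```python
from typing import Callable, List, Tuple

S2_HZ = 8      # slow model rate, within the 7-9 Hz range quoted above
S1_HZ = 200    # fast policy rate


def run_dual_system(duration_s: float,
                    slow_model: Callable[[int], List[float]],
                    fast_policy: Callable[[List[float], int], float]) -> Tuple[int, List[float]]:
    """Run S1 every tick and S2 once every S1_HZ // S2_HZ ticks."""
    ticks_per_s2_update = S1_HZ // S2_HZ   # 25 fast ticks per slow update
    latent: List[float] = []
    s2_calls = 0
    commands: List[float] = []
    for tick in range(int(duration_s * S1_HZ)):
        if tick % ticks_per_s2_update == 0:
            latent = slow_model(tick)      # new high-level intention vector
            s2_calls += 1
        commands.append(fast_policy(latent, tick))  # joint command on every tick
    return s2_calls, commands


def stub_slow_model(tick: int) -> List[float]:
    """Hypothetical S2: emits a one-dimensional latent 'goal'."""
    return [float(tick)]


def stub_fast_policy(latent: List[float], tick: int) -> float:
    """Hypothetical S1: reacts to the latest latent plus the current tick."""
    return latent[0] + 0.001 * tick


s2_calls, commands = run_dual_system(1.0, stub_slow_model, stub_fast_policy)
```

In one simulated second the fast policy issues 200 commands while the slow model is consulted only 8 times; between consultations, S1 keeps acting on the most recent latent goal.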

Six: The placement of cloud, edge computing, and the “brain”

Where exactly are all these calculations performed?

Today, there is a strong, almost ideological consensus among robot teams that the core, safety-critical control loops must run locally. There are two reasons for this:

Latency. The round-trip time over Wi-Fi or cellular networks is 30–80 milliseconds at best. Action commands, however, need to be updated every 1–5 milliseconds. A control loop simply cannot run across that kind of network latency.

Reliability. Robots operate in factories, warehouses, kitchens, hospitals, and other locations. Networks may drop at any time. If a robot stops functioning the moment Wi-Fi is lost, it becomes a safety hazard.
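
The latency argument is simple arithmetic. Using only the figures quoted above (30–80 ms round trip, 1–5 ms control period), this sketch counts how many control updates would pass while a single command crosses the network; the function name is a hypothetical helper.

```python
def stale_control_ticks(round_trip_ms: float, control_period_ms: float) -> float:
    """How many control updates elapse while one command makes a network round trip."""
    return round_trip_ms / control_period_ms


best_case = stale_control_ticks(30, 5)    # fastest network, slowest control loop
worst_case = stale_control_ticks(80, 1)   # slowest network, fastest control loop
```

Even in the most favorable combination, a cloud-issued command is already six control periods old when it arrives; in the least favorable, eighty.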

So, the modern classification is roughly as follows:

On-device (local), running on hardware such as the NVIDIA Jetson Thor or AGX Thor module (approximately 2,000 TFLOPS, 128 GB memory, 40–130 W power consumption):

  • All functions of S0/S1: balance, movement, fine motor control.
  • VLA itself (System 2) is increasingly quantized to FP8 or FP4 formats to accommodate hardware limitations. Models in the 2 billion to 7 billion parameter range can now run on-device.
  • Perception, sensor fusion, and a safety monitor that can override any other operation.
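
Why quantization matters for on-device models comes down to bytes per parameter. The sketch below estimates weight storage only, for a 7-billion-parameter model; it ignores activations, caches, and runtime overhead, and uses 1 GB = 10^9 bytes.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Storage needed for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


SEVEN_BILLION = 7_000_000_000
footprint = {
    name: weight_memory_gb(SEVEN_BILLION, bits)
    for name, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4)]
}
```

Going from 16-bit to 4-bit weights shrinks the same model from about 14 GB to about 3.5 GB, which leaves far more of a 128 GB module free for perception, the fast policy, and everything else that must share the device.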

Cloud or remote server (if any):

  • Conversational interfaces (“Hey, bot, what should I make for dinner?”): These interfaces can tolerate delays.
  • Fleet learning: Thousands of robots send teleoperation data back to the server, where it is aggregated into the next model version.
  • Large-scale, long-horizon planning that may require frontier-scale models.
  • Operator dashboard and monitoring.

Additionally, there is a growing intermediate layer: local edge servers located within factories or warehouses that communicate with robot clusters over local networks with single-digit millisecond latency. Larger LLMs may be deployed at this layer to handle advanced scheduling tasks that individual robots do not need to manage themselves.

China’s humanoid robot wave is built on this assumption: Unitree, AgiBot, Xpeng IRON, Fourier, and EngineAI all equip their robots with onboard computing (typically Jetson, sometimes domestic chips such as Huawei Ascend), while the cloud is used for fleet learning and conversational interfaces, not control loops.

Where the bot's brain actually runs: safety-critical loops run locally, while the cloud handles tasks that can wait.

Seven: Why open-source models are quietly becoming the focus

If you only look at the demonstrations, you might think this field is dominated by a few well-funded American companies. But the reality is far more complex. The pace of development in physical AI is largely determined by open-source weight models that anyone can download and fine-tune.

The models listed below, though few in number, are highly significant:

  • OpenVLA (Stanford University): The first open-source 7B general-purpose robot model.
  • NVIDIA Isaac GR00T (N1, N1.5, N1.7): Open weights with a commercial license; the model is trained on tens of thousands of hours of human egocentric video. GR00T N1.7, released in March 2026, is free to use for anyone with a humanoid robot and features the dual-system architecture.
  • Physical Intelligence’s π0: Weights released for research.
  • NVIDIA Cosmos: Open world foundation models.
  • AgiBot World: A large open-source dataset from a Shanghai startup featuring demonstrations of remotely controlled humanoid robots.
  • Hugging Face’s LeRobot: an open library that has become the hub for all of the above platforms.
  • Mimic Robotics’ mimic-video: an open-source video-to-action model that is 10 times more sample-efficient than traditional VLAs.

There are two reasons why this is important. First, robotics startups no longer need to spend tens of millions of dollars to pre-train a foundation model: they can take GR00T or π0 and fine-tune it with their own robot data. Unitree, AgiBot, Booster, Galbot, and dozens of smaller Chinese companies are doing exactly this. This is why a company with only a few hundred employees can produce humanoid robots that can walk, talk, and fold clothes: they are standing on the shoulders of an open-source tech stack.

Second, open-source models are the only realistic answer to safety concerns. If a completely closed-source model were running inside a robot in a factory, with no external insight into its reasoning logic, it would be a regulatory nightmare. Open models allow auditors, researchers, and operators to truly examine what the robot has been trained on.

Eight: What other issues remain unresolved?

If you’ve watched enough robot demonstration videos, you’ve also seen plenty of robot failure videos. The current generation of LLM+VLA robots is indeed impressive, but it also has clear limitations. Here are the issues it faces:

  • Mid-task recovery. VLAs handle unexpected changes better than any previous technology. But when things go seriously wrong (a failed grasp, an object rolling away, someone entering the workspace), getting back on track remains a weakness. The robot will often blindly repeat the failed action.
  • Sample efficiency. Training a VLA from scratch requires tens of thousands of hours of teleoperation data, while humans can learn to operate a new tool in just minutes. This efficiency gap is enormous.
  • Cross-embodiment generalization. A model trained on a Franka robotic arm in a Stanford lab does not transfer cleanly to a Unitree humanoid robot in a Shenzhen warehouse, because their physical forms differ.
  • Long-horizon tasks. Any task requiring more than 30-60 seconds of continuous behavior and involving multiple sub-goals is prone to drift. Tasks like “make me breakfast” remain out of reach.
  • Physical common sense. A VLA is trained through imitation, not understanding. It does not truly comprehend that water spills when a cup is knocked over. It has merely seen some examples and predicts what happens next by pattern matching.
  • Spatial reasoning. Despite being multimodal, VLAs are surprisingly weak at tasks such as “navigate around obstacles rather than through them” or “stack these items without toppling.”

This final series of weaknesses has prompted the industry to bet on a fundamentally different model.

Nine: World Model

Imagine this: What if, instead of training a robot to predict actions, you trained it to predict the consequences of those actions?

A world model is a neural network that predicts the future state of the world from the current state (typically a video or sequence of frames) and a proposed action. In simple terms, you can think of it as a learned video predictor with a steering wheel: you show it the last second of camera footage and tell it, “The robot will move its arm forward by 10 centimeters,” and it generates a realistic video predicting what the next second will look like.

Why is this important?

Once a world model is in place, robots can think before acting. They can mentally simulate three to four different candidate actions, predict the outcomes of each, score them, and select the optimal one—all before any motor movement occurs. This is exactly how chess engines operate: they don’t memorize moves; they simulate the future. This capability has never before existed in physical robotics, as sufficiently accurate models to simulate the complex real world have never been available.

World models enable robots to simulate multiple possible future scenarios, score them, and select the optimal one before any motors are activated.

What will the world model look like in 2026?

There is a wide variety of state-of-the-art world models, and they are evolving rapidly. Here are some examples:

  • NVIDIA Cosmos: A suite of open world foundation models, including Cosmos Predict 2.5 (generative model), Cosmos Transfer 2.5 (controllable simulation model), Cosmos Reason 2 (vision-language reasoner for robots), and the latest Cosmos Policy, which goes further by post-training the world model to output control actions directly. Cosmos is trained on tens of thousands of GPU hours of video data (Cosmos Predict 2.5 is the world model in this series).
  • DeepMind Genie 3: An interactive world model that generates fully navigable environments from text prompts, running at 24 frames per second and sustaining stable performance for several minutes. Originally designed for gaming environments.
  • Meta V-JEPA 2: Pretrained on over one million hours of web videos, then fine-tuned for action conditioning using only 62 hours of robot video. Achieved an 80% zero-shot pick-and-place success rate on real robot arms across different labs, without any task-specific training. The “JEPA” approach is architecturally distinct from other methods.
  • DeepMind Dreamer 4: Learned to collect diamonds in Minecraft (a 20,000-step task) using only offline data, without any environment interaction. This demonstrates that true reinforcement learning in virtual worlds is feasible.
  • AgiBot’s Genie Envisioner: A unified world model platform from China, trained on over 3,000 hours of real-world humanoid robot operation videos. It can generate both predicted rollout trajectories and executable action trajectories. AgiBot uses NVIDIA Cosmos Predict 2 as the backbone network and post-trains it on proprietary data. This is precisely the “open-source tech stack + proprietary data” model described earlier.
  • Toyota Research Institute’s Cosmos-based world model: Used for teleoperation data augmentation and navigation.

The six most important world models of 2025-2026, each proposing a different vision for how machines should learn physics.

Ten: Alternative architectures, as the field has not yet been settled

There is no universal standard for building world models. The debate over architectures is one of the most intriguing in AI today, directly influencing what robots will be able to do in the future. Three key camps are worth noting:

Pixel-level video diffusion (Cosmos/Sora school): Use diffusion models to predict actual pixels of future frames. The advantage is that it can serve as a synthetic data generator, rendering entirely new robot demonstrations that have never occurred. The drawbacks are high cost, occasional violations of physical laws, and the inefficiency of predicting pixels that will never be seen.

Joint Embedding Predictive Architecture, or JEPA (LeCun school): Instead of predicting pixels, it predicts abstract representations of the next frame. It discards texture details and retains only the semantic essence of objects in the scene. The advantage is efficiency, focusing on factors critical to action. The drawback is that it is more difficult to use. V-JEPA, V-JEPA 2, and the new JEPA-VLA hybrid models are exploring this area.

Latent action world models (Genie/Dreamer paradigm): Learn to compress entire video sequences into a latent “action language” that captures behavioral structure, then train a world model to predict the next latent state given the next latent action. The advantage is that you can train on unlabeled internet videos and then supplement with a small amount of real robot data. The drawback is that latent actions are not interpretable by humans, making safety analysis more complex.

Pixel diffusion, JEPA, and latent actions: same goal, radically different approaches to building world models

Eleven: Practical Applications of Robots Based on World Models

If you fast-forward a few years, the architecture of cutting-edge humanoid robots might look like this:

A world model sits on top of the VLA. When the robot encounters a new situation, it does something like the following:

  • The VLA proposes several candidate next actions (it is still the policy).
  • The world model takes each candidate action and simulates a hypothetical video lasting 1-3 seconds.
  • An evaluator scores the imagined outcomes: Was the cup picked up? Did anything fall? Was a person hit?
  • The robot selects the highest-scoring action and executes only its first part.
  • Real sensor data comes back, and the loop repeats.

This is model predictive control, a technology long used to stabilize rockets and quadcopters, but it replaces manually derived physical equations with learned world models. Its scalability comes from world models pre-trained on millions of hours of video, rather than from someone writing Navier-Stokes equations for a kitchen environment.
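
The loop described above can be sketched with a deliberately tiny stand-in for each component. Everything here is an assumption made for illustration: the "world model" is a one-line function instead of a learned video predictor, the candidates are enumerated exhaustively instead of being proposed by a VLA, and the task is pushing a cup toward a goal position without sending it off the edge of a table.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

GOAL_CM, TABLE_EDGE_CM = 80, 100
PUSHES_CM = (0, 10, 30, 50)   # candidate actions: how far to push the cup


def world_model(cup_cm: int, push_cm: int) -> int:
    """Hypothetical stand-in for a learned model: predicts the next cup position."""
    return cup_cm + push_cm


def score_rollout(states: Sequence[int]) -> float:
    """Evaluator: falling off the table is unacceptable; otherwise stay near the goal."""
    if any(x > TABLE_EDGE_CM for x in states):
        return float("-inf")
    return -float(sum(abs(GOAL_CM - x) for x in states))


def plan(cup_cm: int, horizon: int = 2) -> Tuple[int, ...]:
    """Imagine every candidate action sequence and return the best-scoring one."""
    best_plan: Optional[Tuple[int, ...]] = None
    best_score = float("-inf")
    for candidate in product(PUSHES_CM, repeat=horizon):
        imagined, states = cup_cm, []
        for push in candidate:
            imagined = world_model(imagined, push)   # rollout happens in imagination
            states.append(imagined)
        score = score_rollout(states)
        if best_plan is None or score > best_score:
            best_plan, best_score = candidate, score
    assert best_plan is not None
    return best_plan


def control_loop(cup_cm: int = 0, steps: int = 3) -> Tuple[int, List[int]]:
    """Plan, execute only the first action, observe, and repeat."""
    executed: List[int] = []
    for _ in range(steps):
        first_push = plan(cup_cm)[0]
        executed.append(first_push)
        # A real robot would act here and read its sensors; the toy reuses the model.
        cup_cm = world_model(cup_cm, first_push)
    return cup_cm, executed


final_position, executed_pushes = control_loop()
```

In this toy run the planner pushes 50 cm, then 30 cm, then holds still at the goal, and it never selects a sequence whose imagined rollout ends with the cup on the floor. Replanning after every step is what lets the loop absorb surprises.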

Its benefits build upon each other:

  • Recovery improves. If a grasp fails, the world model can imagine multiple correction paths and select the most promising one.
  • Generalization improves. A world model trained on internet videos has experienced several orders of magnitude more “physical phenomena” than any robotic teleoperation dataset.
  • Long-term planning becomes manageable. Plan in your imagination, not in reality.
  • The gap between simulation and reality has narrowed. Previously, training required using self-built simulators (such as Isaac Sim or Newton physics engine), with the hope that the trained results would transfer to real-world applications. Now, it is possible to train using simulators that have been trained to match real-world video. Therefore, the gap is smaller.
  • Synthetic data is growing exponentially. A world model can generate millions of distinct robot trajectories with varying lighting, materials, and object configurations at nearly zero cost. This addresses one of the field’s biggest bottlenecks.

In addition, it offers a significant safety advantage: robots capable of simulating the consequences of actions can refuse to perform hazardous operations, not because of predefined rules, but because they anticipate potential harm to people.

Two modes of movement: VLA reacts based on what it sees; world model robots think before moving.

Twelve: Other things you should know

The real core issue is data: no matter how innovative the architecture, it is useless without data to feed the model. Currently, teleoperation, in which humans wear VR gear and puppeteer robots remotely, is the main bottleneck. A robotics company’s competitive moat is increasingly determined by its data collection pipeline, not the model itself. AgiBot has already set up warehouses filled with operators. NVIDIA’s GR00T N1.7 dexterity scaling law shows that more human first-person video directly and predictably improves robot dexterity. This is also one reason China holds a structural advantage: lower labor costs for data collection, more permissive deployment environments, and active national coordination of supply chains.

Simulation is a parallel universe. NVIDIA’s Isaac Sim, the all-new open-source Newton physics engine (version 1.0 will be officially released in April 2026), and the Omniverse platform enable enterprises to train robots in millions of parallel simulated environments without deploying them into the real world. Most features that appear to be “robotic intelligence” are actually cultivated in simulation environments and then transferred to hardware.

Economic benefits are beginning to emerge. Unitree delivered approximately 5,500 humanoid robots in 2025 and plans to reach 10,000 to 20,000 units in 2026. The average price has dropped from $85,000 to $25,000 over two years. The Unitree R1 is priced at $5,900, while Noetix Bumi’s launch price is $1,400. The hardware cost of humanoid robots is approaching the price level of consumer electronics, while the underlying AI technology still lags behind demonstration products. This gap will eventually narrow, at which point the expansion of the market size will have a significant impact on the entire industry.

The failure modes are unusual. When LLM-based robots fail, they do so in ways that traditional robots cannot: confidently doing the wrong thing, “hallucinating” objects or features that are not there, or getting stuck in circular dialogues with their own planners. The traditional robotics community is quite skeptical of this, and with good reason: it insists that learned systems must be wrapped in safety monitoring and behavioral constraints. The most reliable deployed robots today are hybrid systems: a VLA brain enclosed in a hand-crafted safety cage.

The narrative of the “ChatGPT moment” is useful but misleading: Jensen Huang has been telling everyone that the ChatGPT moment for robotics has arrived. He says this because NVIDIA sells shovels and pickaxes. A more honest version is: we’re currently around the GPT-2 era of physical AI. It’s powerful enough to amaze you, but not yet powerful enough to be deployed unattended. It’s iterating rapidly, but hasn’t reached a viral tipping point—instead, it’s on a slow, steady upward trajectory.

Conclusion

Evolution of Unitree's quadruped robots (from right to left)

In a demonstration I saw at Unitree’s office, five G1 humanoid robots performed martial arts, their movements meticulously choreographed and fine-tuned by an onboard VLA-style controller, with a remote operator ensuring everything ran smoothly. Fundamentally, it was not fully autonomous. But every layer of the stack (perception, planning, motion control) is being replaced by neural networks. Two years from now, the same robots may perform the same movements without choreography, because they will have imagined the entire sequence in advance and selected the best version.

The entire development journey described in this article—from manually written controllers, to machine learning perception, to LLM planners, to VLA, to dual-system architectures, and finally to world models—represents a gradual shift in where robotic intelligence resides. It began in the minds of engineers, evolved into manually written code, then moved into the perception layer, into planners, and into the policy layer. Now, it is ultimately moving toward learning models of the world itself.

Each transformation makes robots more general, more adaptable, and more useful. If the shift to world models works, it will give robots capabilities so powerful that the question will no longer be “What can robots do?” but “What should we let them do?”
