Even though AI is learning through virtual interactions, it still struggles with basic physics

Machine Learning


Scientists are increasingly investigating whether artificial intelligence can gain a human-like understanding of how the physical world works. Luca M. Schulze Buschoff, Konstantinos Voudouris, and Can Demircan from the Helmholtz München Institute for Human-Centered AI, together with Eric Schulz and colleagues, present research exploring whether vision language models can develop intuitive physics through direct interaction with the environment. This work addresses a major gap in current AI capabilities, as pre-trained models often lack basic physical reasoning, and supervised fine-tuning alone proves insufficient for robust generalization. Their research investigates whether reinforcement learning, which allows models to learn from experience, can facilitate the development of transferable intuition about physical mechanics, an important step toward more adaptive and intelligent artificial systems.

Current pre-trained visual language models have limited intuitive physics and struggle with tasks that humans find simple. Although supervised fine-tuning can improve performance on specific tasks, it cannot reliably produce models that can generalize physical rules to new situations.

In this study, we consider an alternative approach and hypothesize that models require active engagement with the environment to learn its underlying dynamics, reflecting how humans acquire intuitive understanding. This research focuses on training a model using reinforcement learning, allowing it to learn through trial and error within a simulated environment.
The model was tasked with building a tower of colored blocks generated by a physics engine and received a reward based on the stability of the resulting structure. This interactive training was contrasted with a non-interactive method that simply showed the model an example of an optimal tower construction sequence.

The primary objective was to determine whether learning through interaction promotes more generalizable physical intuition compared to passive observation. The researchers specifically tested whether the interactively trained model showed improved performance both in constructing new and unseen towers and in determining the stability of existing structures.

The evaluation focused on the model’s textual output, assessing its ability to articulate solutions and predictions. To further investigate the learning process, the study also examined the model’s internal activations, seeking to determine how well the model represents important physical quantities such as the tower’s stability at different layers.

Surprisingly, the study revealed no significant differences between the interactive and non-interactive training conditions. While both approaches allowed the models to perform well on the specific tower construction task they were trained on, neither produced a model that could reliably handle new physical challenges. Although physical quantities could be decoded from the models’ activations, this internal representation did not translate into improved performance on unseen tasks, suggesting a disconnect between knowledge representation and its application.

Tower block dataset construction and model training parameters

This study of vision language models and intuitive physics is based on an experimental setup using 256×256-pixel RGB images. Two block tower datasets were built within the ThreeDWorld environment, each consisting of stacks of 2 to 4 randomly colored cubes photographed from a fixed camera angle.

The camera angle and block size remained constant throughout the study to facilitate learning the mapping between pixel space and ground-truth distance. Both datasets featured towers with a single intentionally displaced block, located either at the top of the tower or on the floor beside it. The model was trained on four combinations of datasets and action types using the GRPO algorithm to assess whether interaction with the environment promotes generalizable physical intuition.

One dataset featured the displaced top block, and the other the displaced side block. Action types included binary stability judgments, which require the model to evaluate tower stability, and x-only/xy tasks, which require precise displacement values that improve the tower’s stability. The xy task extended the x-only task by adding a vertical displacement component.
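Because the reward scheme described later penalizes unparsable answers, the model’s free-text output must first be mapped onto one of these action formats. A minimal sketch of such a parser; the class, function names, and the assumed answer formats (e.g. comma-separated integers) are illustrative assumptions, not the paper’s actual interface.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BlockAction:
    """One parsed action for the three task types described above."""
    task: str                      # "binary", "x_only", or "xy"
    stable: Optional[bool] = None  # binary stability judgment
    dx: Optional[int] = None       # horizontal displacement
    dy: Optional[int] = None       # vertical displacement (xy task only)

def parse_action(task: str, text: str) -> Optional[BlockAction]:
    """Map a model's textual answer to an action; None means unparsable."""
    text = text.strip().lower()
    if task == "binary":
        if text in ("stable", "unstable"):
            return BlockAction(task, stable=(text == "stable"))
        return None
    try:
        parts = [int(p) for p in text.split(",")]
    except ValueError:
        return None
    if task == "x_only" and len(parts) == 1:
        return BlockAction(task, dx=parts[0])
    if task == "xy" and len(parts) == 2:
        return BlockAction(task, dx=parts[0], dy=parts[1])
    return None
```

An unparsable output (wrong format, wrong number of values) simply yields `None`, which the reward function can then penalize.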

To further assess generalization, the model was also tested on an external dataset of real wooden block towers from Lerer et al. (2016). To probe the distinction between what the model represents internally and how it performs, a layer-by-layer decoding of model activations was carried out, examining in particular how well key physical quantities could be predicted.

The purpose of this analysis was to determine whether interactive training improves the decodability of these quantities in later model layers compared to non-interactive training. The study found that although both training methods achieved strong performance on the task they were trained on, neither reliably generalized to new physical tasks.
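The layer-wise decoding described here amounts to fitting linear probes on stored activations. A minimal sketch with NumPy, assuming activations are available as an (n_samples, hidden_dim) array for a given layer and the physical quantity (e.g. a stability measure) as a target vector; the ridge probe and R² metric are illustrative, not necessarily the paper’s exact method.

```python
import numpy as np

def fit_linear_probe(acts: np.ndarray, targets: np.ndarray, lam: float = 1e-3) -> np.ndarray:
    """Fit a ridge-regularized linear probe predicting a physical
    quantity (e.g. tower stability) from one layer's activations."""
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])  # append bias column
    # Closed-form ridge solution: (X^T X + lam I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ targets)

def probe_r2(acts: np.ndarray, targets: np.ndarray, w: np.ndarray) -> float:
    """R² of the probe: how decodable the quantity is at this layer."""
    X = np.hstack([acts, np.ones((acts.shape[0], 1))])
    pred = X @ w
    ss_res = np.sum((targets - pred) ** 2)
    ss_tot = np.sum((targets - targets.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

Repeating the fit per layer yields a decodability profile across depth, which is the kind of comparison the study draws between interactive and non-interactive training.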

Adapter-based reinforcement learning and supervised fine-tuning yield comparable performance on block manipulation tasks

The researchers evaluated the performance gains achieved through reinforcement learning and supervised fine-tuning of the vision language model. Group relative policy optimization (GRPO) and supervised fine-tuning (SFT) were both implemented using adapters inserted at each layer of the model, with an adapter size of 16 × 16 for all layers.

Training was run over 10,000 steps on a single 80 GB A100 GPU. The study employed the Adam optimizer, a stochastic gradient descent variant, with a GRPO objective based on normalized advantages and no KL divergence term. For the binary stability top-block task, the GRPO model achieved an average test accuracy of 0.969 after training, matching the SFT model, which also reached 0.969.
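The per-layer adapters can be pictured as a low-rank bottleneck in the LoRA style; interpreting the reported 16 × 16 adapter size as a rank-16 bottleneck is an assumption, and the class below is a NumPy sketch rather than the authors’ implementation.

```python
import numpy as np

class LowRankAdapter:
    """LoRA-style adapter sketch: h -> h + scale * (h @ A) @ B.

    Only A and B are trained; the frozen base model's hidden state h
    passes through unchanged plus a learned low-rank correction.
    """
    def __init__(self, hidden_dim: int, rank: int = 16, scale: float = 1.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(0.0, 0.02, size=(hidden_dim, rank))  # down-projection
        self.B = np.zeros((rank, hidden_dim))  # up-projection, zero-init
        self.scale = scale

    def __call__(self, h: np.ndarray) -> np.ndarray:
        return h + self.scale * (h @ self.A) @ self.B
```

Zero-initializing the up-projection means the adapter starts as an identity map, so fine-tuning begins exactly at the pre-trained model’s behavior; this is a common convention, assumed here rather than taken from the paper.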

For the x-only top-block task, where the model predicts a single integer to reposition the block, the GRPO model reached an average test reward of 19.999, the same score obtained by the SFT model. Similarly, in the x-only side-block task, which likewise requires an integer prediction, the GRPO model achieved an average test reward of 19.998, matching the SFT model’s 19.998.

The reward function was task-specific and assigned values of −1, 0, and 1 to unparsable, incorrect, and correct answers in the binary stability task, respectively. A Gaussian function was used to reward proximity to the optimal position in the x-only and xy tasks, with rewards ranging from −5 to 20 based on distance and stability.

Specifically, unstable towers received a weak Gaussian reward of 2·e^(−d²) − 2, while stable towers received a reward of 20·e^(−d²), where d is the distance from the optimal position. The study demonstrated consistent performance across both training methods on the tasks assessed.
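Putting the reward scheme together, a small sketch; the function names are assumptions, while the values and formulas follow the description above.

```python
import math
from typing import Optional

def binary_stability_reward(parsed: Optional[bool], truth: bool) -> int:
    """Binary stability task: -1 for an unparsable answer,
    0 for an incorrect one, 1 for a correct one."""
    if parsed is None:
        return -1
    return 1 if parsed == truth else 0

def placement_reward(d: float, stable: bool) -> float:
    """Gaussian placement reward for the x-only / xy tasks.

    d is the distance from the optimal position:
      stable tower:   20 * exp(-d^2)       (peaks at 20 when d = 0)
      unstable tower:  2 * exp(-d^2) - 2   (weak, at most 0)
    """
    if stable:
        return 20.0 * math.exp(-d * d)
    return 2.0 * math.exp(-d * d) - 2.0
```

The 2·e^(−d²) − 2 term keeps the unstable-tower reward non-positive, so a model is always better off producing a stable tower than a precisely placed but unstable one.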

Limited generalization of visual language models despite interactive training

Researchers investigated whether vision language models can develop generalizable physical intuition through interaction with the environment. The model was trained using both reinforcement learning and supervised fine-tuning, but neither method reliably enabled generalization to related intuitive physics tasks.

The findings show that these models, even when trained through interaction, learn task-specific shortcuts rather than acquiring robust, transferable physical understanding. This challenges the notion that simply exposing a model to an interactive environment, or employing parameter-efficient fine-tuning methods, is sufficient for acquiring human-like reasoning about the physical world.

The study focused on models of 7 billion, 8 billion, and 32 billion parameters trained on relatively small datasets, and the authors acknowledge the limitations associated with model size and the amount of training data used. Furthermore, the experiments were limited to single-step interactions, leaving the potential benefits of extended interaction sequences unexplored. Future research will investigate larger models, more data, and multi-step interactions to determine whether these factors can promote the development of more generalizable intuitions.


