The original version of this story appeared in Quanta Magazine.
Here is a test for infants: Show them a glass of water on a desk. Hide it behind a wooden board. Then move the board toward the glass. If the board keeps going past the glass, as if the glass weren't there, are they surprised? Many 6-month-olds are, and by the age of 1, almost all children have an intuitive notion of object permanence, learned through observation. Now some artificial intelligence models do too.
Researchers have developed an AI system that learns about the world through videos and shows a notion of “surprise” when presented with information that contradicts the knowledge it has gathered.
The model, created by Meta and called Video Joint Embedding Predictive Architecture (V-JEPA), makes no assumptions about the physics of the world contained in the videos. Nonetheless, it can begin to make sense of how the world works.
“Their claims are, a priori, very plausible, and the results are very interesting,” says Micha Heilbron, a cognitive scientist at the University of Amsterdam who studies how brains and artificial systems make sense of the world.
Higher-Level Abstractions
As the engineers who build self-driving cars know, it can be hard to get an AI system to reliably make sense of what it sees. Most systems designed to “understand” videos, whether to classify their content (“a person playing tennis,” for example) or to identify the contours of an object (say, a car up ahead), work in what is called “pixel space.” Such a model essentially treats every pixel in a video as equally important.
But these pixel-space models come with limitations. Imagine trying to make sense of a suburban street. If the scene has cars, traffic lights, and trees, the model might focus too much on irrelevant details such as the motion of the leaves. It might miss the color of the traffic light or the positions of nearby cars. “When you work with images and videos, you don't want to work in [pixel] space, because there are too many details that you don't want to model,” said Randall Balestriero, a computer scientist at Brown University.
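To make the limitation concrete, here is a minimal Python sketch of a pixel-space error in which every pixel counts equally. It is an illustration only, not code from V-JEPA or from the researchers: the function name `pixel_space_error`, the tiny 8×8 grayscale frame, and the specific pixel values are all assumptions chosen for the example.

```python
import numpy as np


def pixel_space_error(predicted, actual):
    """Mean squared error over all pixels, each weighted equally (hypothetical helper)."""
    return float(np.mean((predicted - actual) ** 2))


# A toy 8x8 grayscale "frame" that the model predicts will stay unchanged.
predicted = np.zeros((8, 8))

# Case A: leaves rustle. Half the frame (32 pixels) shifts slightly in brightness.
leaves_moved = predicted.copy()
leaves_moved[:4, :] = 0.5

# Case B: the traffic light changes. A single pixel flips completely.
light_changed = predicted.copy()
light_changed[7, 7] = 1.0

leaf_error = pixel_space_error(predicted, leaves_moved)
light_error = pixel_space_error(predicted, light_changed)

print(f"leaf motion error:   {leaf_error:.6f}")
print(f"traffic light error: {light_error:.6f}")
```

In this toy setup the rustling leaves produce a larger error than the traffic light, even though the light is the change that matters to a driver. That is the kind of imbalance the passage above describes.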

