Meta announces AI that thinks and sees the world like humans

Meta has introduced a new artificial intelligence model called V-JEPA 2, which is designed to help AI agents better understand and predict the real world. According to Meta, this new open-source AI model is a major step towards developing what it calls Advanced Machine Intelligence (AMI). AMI is Meta's vision for the future: AI that not only processes data, but also learns from its surroundings and predicts how things change, as humans do every day.

Meta calls V-JEPA 2 the most sophisticated world model to date. V-JEPA 2 stands for Video Joint Embedding Predictive Architecture 2. The model is trained mainly on a huge amount of video footage. By watching more than a million hours of video clips, the company explains, the AI has learned how people interact with objects, how things move, and how different actions affect the world around them. This training also allows robots and AI systems to predict how objects behave, how their environment responds to movement, and how different elements physically interact.

“As humans, we have the ability to predict how the physical world will evolve in response to our actions and the actions of others,” Meta said in an official blog post. “V-JEPA 2 helps AI agents mimic this intelligence and makes them smarter about the physical world.”

Giving an example, Meta explains that just as a person knows that a tennis ball thrown into the air will come back down, V-JEPA 2 can learn this kind of common-sense behavior by watching video. This video training and understanding of the world help the AI develop a mental map and further improve its understanding of how the physical world works.

Why is Meta's V-JEPA 2 different?

V-JEPA 2 is a 1.2-billion-parameter model built on its predecessor, V-JEPA, which Meta announced last year. This new generation is said to provide significant improvements in understanding, prediction and planning. The company emphasizes that, unlike previous systems, V-JEPA 2 does not just recognize images or respond to commands; it can actually make predictions. It can look at a situation and estimate what will happen next if a particular action is taken. According to Meta, these capabilities are essential for AI to function autonomously in real-world settings. For example, they allow a robot to navigate unfamiliar terrain and manipulate objects it has never seen before.

Meta also revealed that it tested this by deploying the model on robots in its labs. During testing, the company claims, these robots were able to complete basic tasks such as picking up unfamiliar objects and placing them in new spots. The robot used the model to plan its next movement based on its current view and a target image. It then chose the best action to take, step by step.
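The plan-then-act loop described above can be illustrated with a small sketch. This is not Meta's code or the V-JEPA 2 API: the `encode` and `predict` functions below are hypothetical toy stand-ins for a learned video encoder and a learned predictor, and the "world" is just a point moving on a 2-D plane. The sketch only shows the general idea of comparing predicted outcomes against a goal and picking the most promising action at each step.

```python
import random

def encode(observation):
    # Hypothetical stand-in for a learned encoder: here the "embedding"
    # is simply the 2-D position itself.
    return tuple(observation)

def predict(state, action):
    # Hypothetical stand-in for a learned predictor: the next state is
    # the current state shifted by the action.
    return (state[0] + action[0], state[1] + action[1])

def distance(a, b):
    # Euclidean distance between two 2-D states.
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def plan_step(state, goal, rng, num_candidates=64, max_step=0.2):
    # Sample random candidate actions, imagine the outcome of each one,
    # and keep the action whose predicted state lands closest to the goal.
    candidates = [
        (rng.uniform(-max_step, max_step), rng.uniform(-max_step, max_step))
        for _ in range(num_candidates)
    ]
    candidates.append((0.0, 0.0))  # "stay put" is always an option
    return min(candidates, key=lambda a: distance(predict(state, a), goal))

def run_episode(start, goal_observation, steps=30, seed=0):
    # Repeatedly plan one action and apply it; returns the final distance
    # to the goal. A real robot would execute the action and re-encode a
    # fresh camera view here instead of reusing the predicted state.
    rng = random.Random(seed)
    state = encode(start)
    goal = encode(goal_observation)
    for _ in range(steps):
        action = plan_step(state, goal, rng)
        state = predict(state, action)
    return distance(state, goal)

if __name__ == "__main__":
    print(run_episode((0.0, 0.0), (1.0, 1.0)))
```

Replanning after every single step, rather than committing to a long sequence up front, is what lets this kind of loop correct itself when a prediction turns out to be slightly off.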

To support the broader research community, Meta has released three new benchmarks to assess how well AI models learn and infer from videos. These benchmarks aim to standardize how researchers test world models, providing a clearer pathway to advancements in physical reasoning in AI.

“By sharing this work, we aim to provide researchers and developers with access to the best models and benchmarks that will help accelerate research and progress,” Meta said.

For now, the company is focusing on short tasks such as picking up and placing objects, but Meta says it wants to go further: building models that can create long-term plans, break complex tasks down into smaller steps, and, in the future, even use senses such as touch and sound.

Published by: Divya Bhati

Published on: June 12, 2025

https://www.youtube.com/watch?v=fdphqoy4vea
