AI can’t do physics well – and that’s an obstacle to autonomy

In a video of a pool table covered in green felt, several colorful balls roll across the screen. Most people can estimate the velocity of a billiard ball fairly accurately, but ask an AI to do the same and the results can be very different. It turns out that AI is not good at physics.
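The estimate a human makes intuitively reduces to simple arithmetic: track the ball's position across frames and divide displacement by elapsed time. A minimal sketch of that calculation, in which the pixel coordinates, frame rate, and pixels-per-meter scale are all invented illustrative values, not from the paper:

```python
# Toy estimate of a billiard ball's speed from two consecutive video frames.
# All numbers here are illustrative, not taken from the QuantiPhy benchmark.

def speed_m_per_s(p0, p1, fps, pixels_per_meter):
    """Displacement between two consecutive frames divided by elapsed time."""
    dx = p1[0] - p0[0]
    dy = p1[1] - p0[1]
    dist_px = (dx**2 + dy**2) ** 0.5
    dist_m = dist_px / pixels_per_meter
    return dist_m * fps  # frames are one apart, so elapsed time = 1 / fps

# Ball moves 60 px right and 80 px down between consecutive frames,
# at 30 fps, with 500 px spanning one meter of table.
print(speed_m_per_s((100, 100), (160, 180), fps=30, pixels_per_meter=500))
```

The hard part for an AI is not the division but the inputs: detecting the ball, tracking it across frames, and recovering the pixels-to-meters scale from visual context.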

AI’s inability to understand the physical world is holding back a new era in robotics, self-driving cars, and other visual recognition fields, developers say. QuantiPhy is a new test that shows how AI’s understanding of the physical world is lagging, but improving.

QuantiPhy evaluates an AI’s ability to numerically estimate physical properties such as the size, velocity, or acceleration of an object, say, the diameter of a billiard ball, and allows researchers to compare models to see which performs best and which is improving fastest. Most importantly, the authors say that thanks to QuantiPhy, they now know how to make AI better.
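Comparing models on a benchmark like this ultimately means scoring each model's numeric guess against ground truth. A generic way to do that, relative error, is sketched below; the metric and the model estimates are our illustrative assumptions, not QuantiPhy's published scoring rule:

```python
# Ranking models by how close their numeric estimates land to ground truth.
# Relative error is a generic metric choice here, not QuantiPhy's actual one,
# and the model estimates are invented for illustration.

def relative_error(estimate, truth):
    return abs(estimate - truth) / abs(truth)

# Hypothetical estimates of a billiard ball's diameter (ground truth: 0.057 m).
models = {"model_a": 0.060, "model_b": 0.090, "model_c": 0.057}
truth = 0.057

ranked = sorted(models, key=lambda m: relative_error(models[m], truth))
print(ranked)  # best (lowest relative error) first
```

Because the questions have numeric answers, a benchmark of this kind can rank models objectively and track improvement over time, which is exactly what the authors say QuantiPhy enables.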

“So far, the models seem to rely heavily on pre-trained knowledge of the world, i.e., memorized facts, rather than actual quantitative inference from visual and textual inputs,” explained Ehsan Adeli, director of the Stanford Translational Artificial Intelligence (STAI) Lab, faculty member of the Stanford Vision and Learning (SVL) Lab, and lead author of a new preprint paper introducing QuantiPhy. “This represents a significant advance in our ability to measure AI’s ability to understand and interact with the real world.”

“QuantiPhy is in part a benchmark that allows us to fairly assess physical understanding across today’s most popular models, but it is also itself a model of how all models can be improved,” added co-lead author Tiange Xiang, a PhD student and member of the SVL Lab.

As such, the authors say QuantiPhy could help move models that simultaneously understand video, images, and text (visual language models, or VLMs) beyond simple linguistic plausibility toward a numerically accurate understanding of the world that makes robots and self-driving cars smarter, more useful, and safer.

A quantitative difference

Generative AI models have great qualitative abilities to summarize large amounts of text, write essays and poems, and generate original images, but they consistently fall short in quantitative understanding of the physical world.

Qualitatively, AI can accurately describe how a coconut falls from a palm tree to the beach below, but it cannot accurately estimate the coconut’s speed. For these physics-related questions, “the AI generates answers that sound plausible, but a closer analysis shows that it is just speculation,” says Adeli.

“Even the best models rarely perform better than chance at estimating the distance, orientation, and size of objects in 2D videos,” Xiang says. “And this is no trivial shortcoming. QuantiPhy is an important step towards physically aware AI as we evaluate AI’s increasing ability to perform fundamental physics and help developers hone these skills.”

These capabilities matter for home robots and self-driving cars alike. A home robot needs to understand that cracking an egg requires less force than cutting a butternut squash, or that it should wait until mixer blades have stopped spinning before removing the bowl. Industrial robots require similar skills to navigate factory floors, manipulate objects, and assemble products. Autonomous security cameras require such capabilities to recognize threats to the valuable assets they protect.

“AI is best when it learns on its own”

To develop QuantiPhy, the research team took a multifaceted approach that combined real-world and simulated data. They collected more than 3,300 videos from the internet and recorded experiments in the lab. “By setting up a space with four or five cameras and manually recording some physical interactions, we were able to provide accurate 3D data to QuantiPhy,” Xiang recalls.

They then put QuantiPhy to work. As part of the evaluation, models were tasked with analyzing the videos and producing their own quantitative estimates through a kind of trial-and-error process. In one instance, the models were primed with a step-by-step process that humans apply to make accurate calculations. Surprisingly, an end-to-end learning approach that did not explicitly build in manual inference steps performed best. This result suggests that forcing a model to follow human-designed inference steps can impede quantitative learning.

“We tried to give the model a head start by first counting the number of pixels in the image frame to estimate the size of various objects in the image, and then encouraging it to convert that scale to real-world units,” Xiang said of his team’s process. “But surprisingly, a direct, unprompted approach worked better. The AI was most effective when it learned on its own.”
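The “head start” Xiang describes, counting pixels and then converting to real-world units, can be sketched as a two-step calculation. The reference object and every number below are invented for illustration; this is the human-designed pipeline the team prompted, not the end-to-end approach that ultimately won:

```python
# Two-step size estimate: measure an object in pixels, then convert to
# meters using a reference object of known real-world size in the frame.
# All values are invented for illustration.

def estimate_size_m(object_px, reference_px, reference_m):
    """Convert a pixel measurement to meters via a known reference scale."""
    pixels_per_meter = reference_px / reference_m
    return object_px / pixels_per_meter

# A billiard ball spans 30 px; a cue stick of known length 1.45 m spans 750 px.
print(estimate_size_m(object_px=30, reference_px=750, reference_m=1.45))
```

The team’s surprise was that spelling out this pipeline in the prompt hurt performance: models did better when left to map from raw inputs to the numeric answer directly.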

Li points out that the project’s main finding is that VLMs rely too heavily on pre-learned knowledge of the world; that is, they use memorized facts rather than visual input. “Their approach is more of a guess than an inference,” says co-lead author Puyin Li, a graduate student at the STAI/SVL Labs. “The evidence from our tests supports this.”

For example, Li said that in his tests, VLMs generally performed better in complex scenes, which raise the odds that a “guess” lands close while making accurate object detection and measurement harder. Similarly, VLMs perform “terribly” when presented with a counterfactual context. In one video, the team asked a VLM to assume the length of a car in the scene was 6,000 meters and to estimate its width. Where humans adapt and reason proportionally under such shifts of scale, VLMs tend to “hallucinate.” Finally, VLMs answered QuantiPhy’s questions fairly well even when no video was provided at all.
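The counterfactual test amounts to proportional reasoning: if you are told to assume the car is 6,000 meters long, every other dimension should scale by the same factor. A minimal sketch of the reasoning a human applies, where the car’s true dimensions are invented and only the 6,000 m figure comes from the article:

```python
# Proportional reasoning under a counterfactual scale assumption.
# True car dimensions are invented; 6000 m matches the article's example.

def counterfactual_width(true_length_m, true_width_m, assumed_length_m):
    """Scale the width by the same factor that the assumed length implies."""
    scale = assumed_length_m / true_length_m
    return true_width_m * scale

# A typical car: 4.5 m long, 1.8 m wide. Now assume its length is 6000 m;
# the width should scale by the same factor of ~1333x.
print(counterfactual_width(4.5, 1.8, 6000.0))
```

A model that genuinely measured the scene would apply this one-line rescaling; one that falls back on memorized facts about cars cannot, which is why the counterfactual setting exposes guessing.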

“VLMs are very successful guessers,” Li explains, coming up with plausible answers even when those answers are not based on visual measurements.

Tomorrow and beyond

Better physical reasoning could have a big impact down the road. In medicine, QuantiPhy could inform precision robotic surgery, and autonomous diagnostics could help analyze medical images and track changes in the body. In home robotics, physical understanding could improve a robot’s ability to interact with its environment, making it a better companion and collaborator. Self-driving cars should similarly benefit from improved spatial reasoning, increasing safety and efficiency.

Next, the team hopes to use multi-camera input to refine QuantiPhy’s reasoning in three dimensions, enabling more accurate spatial calculations than ever before. They also aim to improve visual language models in more complex settings: rotational mechanics (think spinning balls and turbines), deformable objects (in surgery and manufacturing), different camera perspectives, and complex multibody interactions (from cars to spacecraft to advanced robotics).

“We are excited to pioneer what we believe is a new field in AI,” concluded Xiang. “We believe the future of robotics relies on AI with the advanced physical reasoning that QuantiPhy is just beginning to uncover.”

For more information, visit the QuantiPhy website or read the paper.

Contributors include graduate students Ella Mao, Shirley Wei, and Xinye Chen; Dr. Adnan Masood of UST; and Fei-Fei Li, co-director of the Stanford Human-Centered AI Institute (HAI).
