Researchers evaluate AI spatial reasoning using 786 real-world videos

Researchers are addressing a significant limitation of current multimodal foundation models by introducing a new benchmark designed to test situational awareness: the ability to understand the surrounding environment and the actions possible within it. Chuhan Li and Ruilin Han of Yale University, Joy Hsu of Stanford University, Yongyuan Liang of the University of Maryland, College Park, Rajiv Dhawan of Amazon, Jiajun Wu and Ming-Hsuan Yang of the University of California, Merced, and Xin Eric Wang of the University of California, Santa Barbara, presented SAW-Bench, a dataset of 786 real-world videos captured using smart glasses, with 2,071 annotated question-answer pairs. The work matters because existing benchmarks focus primarily on relationships between objects and ignore the observer-centered perspective that true spatial understanding requires. The evaluation reveals a large performance gap between humans and even state-of-the-art models such as Gemini 3 Flash, highlighting the need for models that reason consistently about camera geometry and for improved algorithms that can infer grounded, observer-centered dynamics.

There is currently a gap of roughly 38 percentage points between artificial intelligence and humans in understanding everyday environments. This gap, measured through real-world video analysis, highlights important limitations in the way a machine perceives space relative to itself. Closing it is essential to building robots and virtual assistants that are truly aware of their surroundings. Scientists have introduced SAW-Bench, a new benchmark designed to assess how well artificial intelligence handles spatial awareness from a first-person perspective.

Current methods for evaluating multimodal foundation models (MFMs) mainly focus on the relationships between objects in a scene, neglecting the observer's own perspective and movement. The new benchmark aims to address this oversight by focusing on situational awareness: the ability to understand your surroundings in relation to your own position and movement.

SAW-Bench uses real-world video captured with Ray-Ban Meta smart glasses to provide a more realistic and dynamic evaluation setting for these models. Merely identifying objects is not enough to assess an agent's understanding of space; a model must also understand how those objects relate to the agent itself. Unlike existing benchmarks that treat models as detached observers, SAW-Bench asks models to reason about space from an embodied perspective, reflecting how humans perceive and interact with the world.
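To make "how objects relate to the agent itself" concrete, the following minimal sketch classifies a target as in front of, behind, left of, or right of an observer on a flat 2D plane. It is an illustration only, not part of SAW-Bench: the function name `egocentric_direction`, the coordinate convention, and the 90-degree sectors are all assumptions chosen for this example.

```python
import math


def egocentric_direction(observer_xy, heading_deg, target_xy):
    """Classify where a target lies relative to the observer's facing direction.

    Hypothetical helper for illustration. heading_deg is the direction the
    observer faces, measured counterclockwise from the +x axis.
    """
    dx = target_xy[0] - observer_xy[0]
    dy = target_xy[1] - observer_xy[1]
    # Bearing of the target in the fixed world frame.
    world_bearing = math.degrees(math.atan2(dy, dx))
    # Rotate into the observer's frame and wrap into [-180, 180).
    relative = (world_bearing - heading_deg + 180.0) % 360.0 - 180.0
    # Positive angles are counterclockwise, i.e. to the observer's left.
    if -45.0 <= relative <= 45.0:
        return "front"
    if 45.0 < relative < 135.0:
        return "left"
    if -135.0 < relative < -45.0:
        return "right"
    return "behind"


# The same object gets a different answer when only the observer turns.
target = (1, 0)
print(egocentric_direction((0, 0), 0, target))
print(egocentric_direction((0, 0), 90, target))
```

The object never moves in this example; only the observer's heading changes, which is the kind of dependence that an object-to-object benchmark does not test.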

Tasks within SAW-Bench require the model to determine relative orientation, plan routes, and evaluate spatial affordances (possibilities for action in the environment). These tasks require understanding the observer's position, orientation, and trajectory. Initial evaluation reveals a gap of 37.66 percentage points between humans and Gemini 3 Flash, currently the highest-performing MFM tested on SAW-Bench.

Accurately estimating an agent's location and orientation allows a system to interact more effectively with the physical world and create a more immersive experience for the user. Improving situational awareness is essential to building reliable and intelligent systems, because failures in spatial awareness can lead to cascading errors.

Detailed video annotations build benchmarks for spatial reasoning and contextual understanding.

The core dataset for assessing situational awareness consists of 786 first-person videos shot with Ray-Ban Meta (2nd generation) smart glasses. The videos were recorded in a variety of indoor and outdoor environments to provide a realistic egocentric perspective. Each video was then annotated in detail, resulting in 2,071 question-answer pairs designed to probe a model's understanding of spatial relationships and situational awareness.

The annotation was performed by human raters to establish the ground truth for performance evaluation. To benchmark situational awareness accurately, the researchers defined six tasks, each targeting a specific aspect of observer-centered understanding. These tasks require the model to infer the agent's perspective, pose, and movement relative to its surrounding environment.
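As a rough sketch of how annotated question-answer pairs could be scored per task, the snippet below computes accuracy for each task from a list of predictions. The record layout (`QAPair` and its fields), the exact-match comparison, and the sample questions are assumptions made for illustration; the article does not describe the benchmark's actual file format or scoring code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QAPair:
    """Hypothetical record for one annotated question (field names are assumed)."""
    video_id: str
    task: str
    question: str
    answer: str  # ground-truth label supplied by a human annotator


def accuracy_by_task(pairs, predictions):
    """Return percent accuracy per task, using case-insensitive exact match."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pair, prediction in zip(pairs, predictions):
        total[pair.task] += 1
        if prediction.strip().lower() == pair.answer.strip().lower():
            correct[pair.task] += 1
    return {task: 100.0 * correct[task] / total[task] for task in total}


# Made-up sample data, not taken from SAW-Bench.
pairs = [
    QAPair("vid_001", "relative orientation", "Is the door to your left or right?", "left"),
    QAPair("vid_002", "relative orientation", "Is the desk ahead of or behind you?", "behind"),
    QAPair("vid_003", "self-localization", "Which corner of the room are you in?", "northeast"),
]
predictions = ["left", "ahead", "Northeast"]
print(accuracy_by_task(pairs, predictions))
```

Reporting accuracy per task rather than as one pooled number keeps a strong result on one task from hiding a weak result on another.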

The experimental design involved careful selection of real-world videos. While datasets consisting of synthetic or staged scenes are common, using naturally captured footage presented challenges related to variations in lighting, occlusion, and camera movement. This realism was considered essential to accurately assess the model’s ability to generalize to real-world scenarios.

Smart glasses provided a unique data source that reflects the human visual experience more closely than traditional camera setups. Because of the complexity of accurately assessing spatial reasoning, the research team focused on observer-centered relationships, an aspect often overlooked by existing multimodal benchmarks. The work prioritized understanding how a model interprets the environment from an agent's perspective, rather than only evaluating its ability to identify objects and their relationships. This emphasis on egocentric awareness necessitated a new benchmark design, leading to the creation of SAW-Bench.

Human spatial awareness outperforms state-of-the-art AI on SAW-Bench benchmark tasks

The researchers measured a gap of 37.66 percentage points between human observers and the best-performing multimodal foundation model, Gemini 3 Flash, on the SAW-Bench benchmark. This figure, derived from an evaluation of observer-centered spatial awareness using real-world video, highlights the considerable difference in how well humans and artificial intelligence perceive and reason about their environments from a first-person perspective.

SAW-Bench consists of 786 first-person videos and 2,071 human-annotated question-answer pairs, providing detailed evaluations across six different tasks. Human baseline performance reached 91.55% overall, with peak accuracy of 94.00% on the self-localization task, demonstrating a strong ability to understand one's position in a scene.

The lowest human score was 79.01%, on the reverse route planning task, making it the most challenging task even for human observers. Gemini 3 Flash achieved an overall score of 53.89%, with 66.00% on the spatial affordance task and 64.84% on the relative orientation task. Qwen3-VL 235B-A22B achieved 41.40%, while the smaller Qwen3-VL 8B achieved only 36.12%.

Qwen2.5-VL 32B achieved 36.46% and LLaVA OneVision 72B achieved 33.70%. These results show a wide range of performance across models and highlight the challenge of developing AI systems that rival human-level spatial reasoning in dynamic, real-world environments.
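The reported gap is simply the human overall score minus a model's overall score, in percentage points. The short sketch below reproduces that arithmetic from the overall figures quoted in this article; the variable and function names are chosen for this example only.

```python
# Overall accuracy figures (percent) as quoted in this article.
HUMAN_OVERALL = 91.55
MODEL_OVERALL = {
    "Gemini 3 Flash": 53.89,
    "Qwen3-VL 235B-A22B": 41.40,
    "Qwen2.5-VL 32B": 36.46,
    "Qwen3-VL 8B": 36.12,
    "LLaVA OneVision 72B": 33.70,
}


def gap_to_human(scores, human=HUMAN_OVERALL):
    """Return each model's shortfall from the human baseline, in percentage points."""
    return {name: round(human - score, 2) for name, score in scores.items()}


gaps = gap_to_human(MODEL_OVERALL)
# List models from smallest to largest gap.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap:.2f} points below the human baseline")
```

For Gemini 3 Flash this gives 91.55 - 53.89 = 37.66 points, matching the headline figure.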

Evaluating artificial intelligence on first-person spatial reasoning and action

Scientists have created a new benchmark to test how well artificial intelligence understands the world from a human perspective. Advances in artificial intelligence have focused on identifying objects in a scene and their relationships, but less attention has been paid to how an agent perceives those objects relative to itself. The new test, called SAW-Bench, uses video recorded from wearable cameras to assess whether AI can accurately reason about space and action from an observer's perspective, something humans do easily.

Current models still struggle with this kind of situational awareness, resulting in large performance gaps when compared with human capabilities. The numbers reveal a gap of more than 37 percentage points, showing that even the most advanced systems fall short of replicating our basic spatial understanding. The importance of this result goes beyond achieving higher test scores: it speaks to the limitations of current AI when it comes to actually interacting with the real world.

Robots need more than object recognition to move around a house or help someone with a task. They need to understand where they stand in relation to those objects and how their own actions affect the environment. Unlike previous benchmarks, SAW-Bench forces AI to address these observer-centric challenges, exposing weaknesses in spatial reasoning that may not surface in more static scenarios.

Addressing these shortcomings could enable more natural and effective human-machine collaboration. The benchmark shows that models often rely on superficial cues rather than building a true understanding of camera geometry. An important question remains: can AI truly "see" the world the way we do, or will it remain limited to processing visual data without understanding the underlying spatial relationships? Future work could explore how AI can learn from dynamic environments and adapt to changing perspectives, moving us closer to truly intelligent systems.


