Egocentric spatial video question answering requires sophisticated reasoning about 3D object locations and scene affordances, and zero-shot settings make the task harder still. Current vision-language models (VLMs) are often unstable without task-specific fine-tuning or access to 3D sensor data. In this paper, we introduce SpatioRoute, a novel dynamic prompt generation approach that adapts the prompt to each incoming question without requiring additional training or 3D input.
Visual TL;DR
Zero-shot video QA motivates SpatioRoute VLM, which uses question-aware routing. Question-aware routing comprises SpatioRoute-R and SpatioRoute-L, and SpatioRoute-L enables dynamic prompts. SpatioRoute VLM delivers SOTA performance and improves spatial understanding.
Zero-shot video QA: Spatial video question answering without fine-tuning or 3D input
SpatioRoute VLM: A new dynamic prompt generation approach for video QA
Question-aware routing: Two complementary routing mechanisms for prompt adjustment
SpatioRoute-R: A rule-based system that maps question types to prompt templates
SpatioRoute-L: An LLM that generates task-specific prompts from the question and context
Dynamic prompts: Prompts adapted to each incoming question without any additional training
SOTA performance: State-of-the-art results without fine-tuning or 3D sensors
Improved spatial understanding: Better spatial understanding of videos without using 3D data
Question-aware routing for zero-shot efficiency
SpatioRoute works through two complementary routing mechanisms. SpatioRoute-R uses a rule-based system that deterministically maps question types (e.g., “what,” “is,” “how”) to specialized prompt templates. To complement this, SpatioRoute-L uses an LLM to generate task-specific prompts from the question and situational context alone; notably, it needs no video input during the routing stage. This flexibility lets SpatioRoute VLM adapt to diverse question types and contextual nuances, strengthening its zero-shot capabilities.
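To make the two routing paths concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the template wording, the fallback template, and the function names question_type, route_rule_based, and build_llm_routing_request are all hypothetical, and only the question types “what,” “is,” and “how” come from the text above. The rule-based path is a fixed lookup from the leading question word to a template; the LLM path only builds a text-only request, since the routing stage uses no video.

```python
import re

# Hypothetical templates: the source names the question types "what", "is",
# and "how" but does not give the template wording, so this text is invented.
TEMPLATES = {
    "what": (
        "Situation: {situation}\n"
        "Question: {question}\n"
        "Identify the object or attribute asked about and answer in a few words."
    ),
    "is": (
        "Situation: {situation}\n"
        "Question: {question}\n"
        "Check the described state in the video and answer yes or no."
    ),
    "how": (
        "Situation: {situation}\n"
        "Question: {question}\n"
        "Count or estimate what is asked and answer with a number or short phrase."
    ),
}

# Fallback for question types with no dedicated rule (also an assumption).
DEFAULT_TEMPLATE = (
    "Situation: {situation}\n"
    "Question: {question}\n"
    "Answer with a single word or short phrase."
)


def question_type(question: str) -> str:
    """Return the lowercased leading word if a rule exists for it, else "other"."""
    match = re.match(r"\s*([A-Za-z]+)", question)
    word = match.group(1).lower() if match else ""
    return word if word in TEMPLATES else "other"


def route_rule_based(question: str, situation: str) -> str:
    """SpatioRoute-R-style routing: a fixed lookup from question type to template."""
    template = TEMPLATES.get(question_type(question), DEFAULT_TEMPLATE)
    return template.format(situation=situation, question=question)


def build_llm_routing_request(question: str, situation: str) -> str:
    """SpatioRoute-L-style routing input: text only, no video frames.

    The returned string would be sent to an LLM of your choice; its reply
    becomes the prompt handed to the VLM together with the video.
    """
    return (
        "Write a short instruction prompt for a vision-language model that "
        "will watch an egocentric video and answer the question below.\n"
        f"Situation: {situation}\n"
        f"Question: {question}\n"
        "Return only the prompt."
    )


if __name__ == "__main__":
    situation = "I am standing by the desk, facing the window."
    print(route_rule_based("Is the door behind me open?", situation))
    print(build_llm_routing_request("How many chairs are to my left?", situation))
```

In this sketch the rule-based path is cheap and predictable but only as good as its hand-written templates, while the LLM path can phrase a prompt for question types no rule anticipates, at the cost of one extra text-only model call per question.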
Advancing spatial video understanding without 3D data
When evaluated on the SQA3D benchmark across several VLM families, SpatioRoute consistently improves accuracy over fixed-prompt baselines, by up to 5%. This establishes a new state of the art for zero-shot, video-only spatial VQA, a setting that requires no 3D point cloud input. The study also reports a notable finding: chain-of-thought (CoT) prompts, especially those using the Think it Twice architecture, degrade the performance of Qwen-series models in this setting. This suggests that question-aware routing is more effective than a uniform inference strategy for spatial video understanding.