SpatioRoute VLM: Dynamic Prompts for Video QA



Egocentric spatial video question answering requires sophisticated reasoning about 3D object locations and scene affordances, and zero-shot settings make the task harder still. Current vision-language models (VLMs) are often unstable without task-specific fine-tuning or access to 3D sensor data. In this paper, we introduce SpatioRoute, a novel dynamic prompt generation approach that tailors the prompt to each incoming question without requiring additional training or 3D input.

Visual TL;DR. Zero-shot video QA motivates SpatioRoute VLM. SpatioRoute VLM uses question-aware routing. Question-aware routing includes SpatioRoute-R and SpatioRoute-L. SpatioRoute-L enables dynamic prompts. SpatioRoute VLM delivers SOTA performance and advances spatial understanding.

  1. Zero-shot video QA: Spatial video question answering without fine-tuning or 3D input
  2. SpatioRoute VLM: A new dynamic prompt generation approach for video QA
  3. Question-aware routing: Two complementary routing mechanisms for prompt adjustment
  4. SpatioRoute-R: A rule-based system that maps question types to prompt templates
  5. SpatioRoute-L: An LLM generates task-specific prompts from the question and context
  6. Dynamic prompts: Prompts adapt to each incoming question without additional training
  7. SOTA performance: State-of-the-art results without fine-tuning or 3D sensors
  8. Improved spatial understanding: Better spatial understanding of videos without 3D data

From startuphub.ai · Publishers behind this format

Question-aware routing for zero-shot efficiency

SpatioRoute works through two complementary routing mechanisms. SpatioRoute-R uses a rule-based system that deterministically maps question types (e.g., “what,” “is,” “how”) to specialized prompt templates. To complement this, SpatioRoute-L uses an LLM to generate task-specific prompts from the question and situational context alone, so no video input is needed during the routing stage. This flexibility lets SpatioRoute VLM adapt to diverse question types and contextual nuances, improving its zero-shot capabilities.
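The two routing mechanisms described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the template wording, the `TEMPLATES` table, the function names, and the fallback to the rule-based route when the LLM returns nothing are all assumptions made for the example. The `generate` argument stands in for any text-only LLM call.

```python
from typing import Callable, Dict

# Hypothetical prompt templates keyed by the first word of the question.
# The wording is illustrative only and is not taken from the paper.
TEMPLATES: Dict[str, str] = {
    "what": (
        "Situation: {situation}\nQuestion: {question}\n"
        "Identify the object or attribute and answer with a short phrase."
    ),
    "is": (
        "Situation: {situation}\nQuestion: {question}\n"
        "Check the stated condition in the video and answer yes or no."
    ),
    "how": (
        "Situation: {situation}\nQuestion: {question}\n"
        "Count or estimate carefully and answer with a number or short phrase."
    ),
}
DEFAULT_TEMPLATE = "Situation: {situation}\nQuestion: {question}\nAnswer briefly."


def question_type(question: str) -> str:
    """Return the lowercased first word of the question, or '' if empty."""
    words = question.strip().lower().split()
    return words[0] if words else ""


def route_by_rule(question: str, situation: str) -> str:
    """Rule-based routing: pick a fixed template from the question type."""
    template = TEMPLATES.get(question_type(question), DEFAULT_TEMPLATE)
    return template.format(situation=situation, question=question)


def route_by_llm(
    question: str, situation: str, generate: Callable[[str], str]
) -> str:
    """LLM-based routing: ask a text-only LLM for task-specific instructions.

    Only the question and situation are passed in; no video is involved here.
    Falling back to the rule-based route on an empty reply is an assumption.
    """
    meta_prompt = (
        "Write brief instructions for a video model answering this "
        "spatial question.\n"
        f"Situation: {situation}\nQuestion: {question}"
    )
    instructions = generate(meta_prompt).strip()
    if not instructions:
        return route_by_rule(question, situation)
    return f"{instructions}\nSituation: {situation}\nQuestion: {question}"
```

Calling `route_by_rule("Is the door open?", "I am standing by the sofa.")` returns the yes/no template filled in with that question. In either case the resulting prompt would then be sent to the VLM together with the video frames, which is the only stage that needs visual input.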

Accelerating spatial video understanding without 3D data

When evaluated on the SQA3D benchmark across several VLM families, SpatioRoute consistently improves accuracy over fixed-prompt baselines, by up to 5%. This establishes a new state of the art for zero-shot, video-only spatial VQA, a setting that does not require 3D point cloud input. The study also reports a notable negative result: chain-of-thought (CoT) prompts, especially those using the Think it Twice architecture, actually degrade the performance of Qwen series models in this context. This supports question-aware routing over uniform inference strategies for spatial video understanding.
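A comparison of this kind, routed prompts against a fixed-prompt baseline, can be broken down by question type with a small helper like the one below. It is only a sketch: exact-match scoring after lowercasing, the first-word question typing, and the function names are assumptions for illustration, and the code carries no benchmark numbers.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One record per question: (question, predicted answer, gold answer).
Record = Tuple[str, str, str]


def accuracy_by_type(records: Iterable[Record]) -> Dict[str, float]:
    """Exact-match accuracy grouped by the first word of each question."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for question, prediction, gold in records:
        words = question.strip().lower().split()
        qtype = words[0] if words else "other"
        total[qtype] += 1
        # Assumed scoring rule: case-insensitive exact match.
        if prediction.strip().lower() == gold.strip().lower():
            correct[qtype] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


def accuracy_delta(
    routed: Dict[str, float], fixed: Dict[str, float]
) -> Dict[str, float]:
    """Per-type accuracy gain of routed prompts over the fixed baseline."""
    shared = routed.keys() & fixed.keys()
    return {qtype: routed[qtype] - fixed[qtype] for qtype in shared}
```

Running `accuracy_by_type` once on the fixed-prompt predictions and once on the routed predictions, then passing both results to `accuracy_delta`, shows which question types gain or lose from routing.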

© 2026 StartupHub.ai. Unauthorized reproduction is prohibited. Please do not type, scrape, copy, reproduce or republish this article in whole or in part. Use for AI training, fine-tuning, retrieval-augmented generation, or as input to any machine learning system is prohibited without a written license. Substantially similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer abuse laws. See our Clause.


