A streaming vision-language model (VLM) continuously generates responses to an online stream of instructional prompts and input frames, and is the core mechanism of real-time visual assistants. Existing VLM evaluation frameworks primarily assess models in offline settings. In contrast, the performance of a streaming VLM depends on metrics beyond pure video understanding, such as proactiveness, which reflects the timeliness of the model’s responses, and consistency, which captures the robustness of its responses over time. To address this gap, we propose VSAS-Bench, a new framework and benchmark for Visual Streaming Assistant. Unlike previous benchmarks, which primarily use single-turn question answering on video inputs, VSAS-Bench features temporally dense annotations, with over 18,000 annotations across a variety of input domains and task types. We introduce standardized synchronous and asynchronous evaluation protocols and metrics to isolate and measure the distinct features of streaming VLMs. Using this framework, we conduct a large-scale evaluation of recent video and streaming VLMs and analyze the accuracy-latency trade-off under key design factors such as memory buffer length, memory access policy, and input resolution, yielding several practical insights. Finally, we empirically show that traditional VLMs can be adapted to streaming settings without additional training, and that these adapted models outperform recent streaming VLMs. For example, Qwen3-VL-4B outperforms Dispider, the best streaming VLM in the benchmark, by 3% under asynchronous protocols.
