The inherent temporal redundancy in videos, where adjacent frames largely overlap, introduces fundamental inefficiencies for current video multimodal large language models (video MLLMs). These models typically treat each sampled frame as an independent image, which produces redundant visual tokens and high computational costs. A new approach described in an arXiv preprint challenges this paradigm by proposing a more dynamic and efficient video interface.
Visual TL;DR. Temporal redundancy makes video MLLMs inefficient, and AdaCodec was introduced to address this problem. AdaCodec uses predictive visual coding, which leads to selective frame encoding and produces compact P-tokens. Both reduce the number of visual tokens, and the reduced token count in turn increases efficiency and improves performance.
Video MLLM inefficiency: Treating adjacent frames as independent images produces redundant visual tokens.
Temporal redundancy: Adjacent video frames largely overlap, so encoding each one in full causes high computational costs.
Introducing AdaCodec: A new dynamic and efficient video interface for MLLMs.
Predictive visual coding: Manages visual token transmission based on how reliably the scene can be predicted.
Selective frame encoding: Sends a complete reference frame only when scene prediction is unreliable.
Compact P-tokens: Encode frame-to-frame changes such as motion and prediction residuals.
Reduced number of tokens: Substantially fewer visual tokens are required to understand the video.
Increased efficiency: Tokenization cost and latency for video MLLMs drop significantly.
Better performance: Better results are achieved with a fraction of the computational budget.
Reduce redundancy with predictive visual coding
The core of the innovation is "predictive visual coding," which decides how many visual tokens each frame needs to send. Instantiated as AdaCodec, the system does not fully encode every frame. Instead, it sends a complete reference frame only when the scene prediction is unreliable. Otherwise, it uses compact "P-tokens" that encode the changes between frames, including motion and prediction residuals. This adaptive strategy substantially reduces the number of visual tokens required to understand the video.
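The selection rule described above can be illustrated with a minimal sketch. This is not AdaCodec's actual implementation: the previous-frame predictor, the mean-absolute-error measure, the threshold, and the token costs (`REF_TOKENS`, `P_TOKENS`, `ERROR_THRESHOLD`) are all hypothetical values chosen for illustration, and frames are reduced to short lists of pixel intensities.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

# Hypothetical values, chosen only for illustration (not from the paper).
REF_TOKENS = 256        # tokens spent on a fully encoded reference frame
P_TOKENS = 16           # tokens spent on a compact P-token update
ERROR_THRESHOLD = 0.2   # error above which prediction counts as unreliable


@dataclass
class EncodedFrame:
    index: int
    kind: str          # "ref" for a reference frame, "p" for P-tokens
    num_tokens: int


def prediction_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute difference between the predicted and the actual frame."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def encode_video(frames: List[Sequence[float]]) -> List[EncodedFrame]:
    """Emit a reference frame when prediction is unreliable, P-tokens otherwise."""
    encoded: List[EncodedFrame] = []
    previous: Optional[Sequence[float]] = None
    for i, frame in enumerate(frames):
        # Simplest possible predictor (an assumption): the scene is unchanged
        # since the previous frame.
        unreliable = (
            previous is None
            or prediction_error(previous, frame) > ERROR_THRESHOLD
        )
        if unreliable:
            encoded.append(EncodedFrame(i, "ref", REF_TOKENS))
        else:
            encoded.append(EncodedFrame(i, "p", P_TOKENS))
        previous = frame
    return encoded


# Toy video: slow drift for three frames, a scene cut, then slow drift again.
frames = [[0.0] * 4, [0.05] * 4, [0.1] * 4, [0.9] * 4, [0.95] * 4]
encoded = encode_video(frames)
kinds = [e.kind for e in encoded]
total_tokens = sum(e.num_tokens for e in encoded)
naive_tokens = REF_TOKENS * len(frames)
print(kinds, total_tokens, naive_tokens)
```

In this toy video only the first frame and the frame after the scene cut are sent as reference frames, so the adaptive scheme spends 560 tokens where encoding every frame in full would spend 1,280.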
Significant improvements in efficiency and performance
AdaCodec shows significant improvement over the baseline Qwen3-VL-8B model across 11 benchmarks. Even with one-seventh of the token budget (32,000 tokens versus 224,000), AdaCodec outperforms the baseline on all long video benchmarks. On common video benchmarks, the average score improves while the time to first token drops from 9.26 seconds to 1.62 seconds. This gain in efficiency makes real-time video analysis and interaction much more achievable.
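As a quick check of the reported figures, the short calculation below derives the token budget ratio and the time-to-first-token speedup. It uses only the numbers quoted above, and the variable names are illustrative. Note that the token budgets and the latency figures are reported for different benchmark groups, so the two ratios should be read separately.

```python
# Figures quoted in the article.
adacodec_tokens = 32_000
baseline_tokens = 224_000
ttft_baseline_s = 9.26
ttft_adacodec_s = 1.62

# Fraction of the baseline token budget that AdaCodec uses.
budget_ratio = adacodec_tokens / baseline_tokens

# How many times faster the first token arrives, and the relative reduction.
ttft_speedup = ttft_baseline_s / ttft_adacodec_s
ttft_reduction = 1 - ttft_adacodec_s / ttft_baseline_s

print(f"token budget ratio: {budget_ratio:.3f}")
print(f"time-to-first-token speedup: {ttft_speedup:.2f}x")
print(f"time-to-first-token reduction: {ttft_reduction:.1%}")
```

The budget ratio works out to exactly one-seventh, and the latency figures correspond to a speedup of roughly 5.7 times.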