AdaCodec: Efficient video MLLM encoding



Videos carry inherent temporal redundancy: adjacent frames largely overlap. This makes current video multimodal large language models (video MLLMs) fundamentally inefficient, because they typically treat each sampled frame as an independent image, which produces redundant visual tokens and high computational costs. A new approach detailed in an arXiv paper challenges this paradigm by proposing a more dynamic and efficient video interface.

Visual TL;DR. Video MLLM inefficiency stems from temporal redundancy. AdaCodec addresses it with predictive visual coding, which combines selective frame encoding with compact P-tokens. Both reduce the number of visual tokens, which in turn increases efficiency and improves performance.

  1. Video MLLM inefficiency: Treating adjacent frames as independent images results in redundant tokens.
  2. Temporal redundancy: Adjacent video frames largely overlap, causing high computational costs.
  3. Introducing AdaCodec: A new dynamic and efficient video interface for MLLMs.
  4. Predictive visual coding: Manages visual token transmission based on scene prediction.
  5. Selective frame encoding: Sends a complete reference frame only when scene prediction is unreliable.
  6. Compact P-tokens: Encode frame-to-frame changes such as motion and prediction residuals.
  7. Reduced number of tokens: Far fewer visual tokens are required to understand the video.
  8. Increased efficiency: Tokenization costs and latency for video MLLMs drop significantly.
  9. Better performance: Better results are achieved with a fraction of the computational budget.


Reduce redundancy with predictive visual coding

The core of the innovation is "predictive visual coding", which manages how visual tokens are transmitted to the model. The system, instantiated as AdaCodec, does not fully encode every frame. Instead, it sends a complete reference frame only when the scene prediction is unreliable. Otherwise, it uses compact "P-tokens" to encode the changes between frames, including motion and prediction residuals. This adaptive strategy significantly reduces the number of visual tokens required to understand the video.
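The selection rule described above can be illustrated with a minimal Python sketch. It is not AdaCodec's implementation: the token costs, the error threshold, the mean-absolute-difference score, and all names (`encode_video`, `EncodedFrame`, and so on) are hypothetical stand-ins chosen for illustration, and frames are reduced to short lists of numbers.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

# All constants are illustrative assumptions, not values from the paper.
FULL_FRAME_TOKENS = 256   # hypothetical cost of a complete reference frame
P_TOKENS = 16             # hypothetical cost of the compact P-tokens for one frame
ERROR_THRESHOLD = 0.1     # hypothetical cutoff above which prediction is "unreliable"


@dataclass
class EncodedFrame:
    kind: str        # "ref" for a full reference frame, "P" for a predicted frame
    num_tokens: int  # visual tokens spent on this frame
    error: float     # prediction error behind the decision (0.0 for the first frame)


def prediction_error(reference: Sequence[float], frame: Sequence[float]) -> float:
    """Mean absolute difference; a stand-in for a real scene-prediction score."""
    return sum(abs(a - b) for a, b in zip(reference, frame)) / len(frame)


def encode_video(frames: List[Sequence[float]]) -> List[EncodedFrame]:
    """Send a full reference frame only when prediction from the last one fails."""
    encoded: List[EncodedFrame] = []
    reference: Optional[Sequence[float]] = None
    for frame in frames:
        if reference is None:
            # The first frame has nothing to be predicted from.
            encoded.append(EncodedFrame("ref", FULL_FRAME_TOKENS, 0.0))
            reference = frame
            continue
        err = prediction_error(reference, frame)
        if err > ERROR_THRESHOLD:
            # Prediction is unreliable: send a new complete reference frame.
            encoded.append(EncodedFrame("ref", FULL_FRAME_TOKENS, err))
            reference = frame
        else:
            # Prediction is good enough: send only compact P-tokens.
            encoded.append(EncodedFrame("P", P_TOKENS, err))
    return encoded


# Toy clip: three near-identical frames, a scene cut, then one similar frame.
clip = [[0.0] * 4, [0.01] * 4, [0.02] * 4, [0.9] * 4, [0.91] * 4]
result = encode_video(clip)
print([f.kind for f in result])
print(sum(f.num_tokens for f in result), "tokens vs", FULL_FRAME_TOKENS * len(clip))
```

In this toy clip only the first frame and the scene cut are sent as reference frames, so the token total falls well below the cost of encoding every frame in full. A real system would replace the difference score with a learned or codec-derived prediction and would encode motion and residuals inside the P-tokens rather than a fixed count.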

Significant improvements in efficiency and performance

AdaCodec improves on the baseline Qwen3-VL-8B model across 11 benchmarks. With only one-seventh of the token budget, AdaCodec at 32,000 tokens outperforms the 224,000-token baseline on all long video benchmarks. On common video benchmarks, the average score improves while the time to first token drops from 9.26 seconds to 1.62 seconds, roughly 5.7 times faster. This gain in efficiency makes real-time video analysis and interaction much more achievable.
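The ratios behind these figures can be checked with a few lines of Python. The four input numbers are taken from the paragraph above; the variable names are only illustrative, and the token budgets and latency figures come from different benchmark groups, so they are computed separately.

```python
# Figures quoted in the article.
baseline_tokens = 224_000   # baseline token budget on long video benchmarks
adacodec_tokens = 32_000    # AdaCodec token budget on the same benchmarks
baseline_ttft_s = 9.26      # baseline time to first token, in seconds
adacodec_ttft_s = 1.62      # AdaCodec time to first token, in seconds

# Token budget: how many times smaller AdaCodec's budget is.
token_ratio = baseline_tokens / adacodec_tokens

# Latency: speedup factor and fraction of waiting time removed.
ttft_speedup = baseline_ttft_s / adacodec_ttft_s
ttft_reduction = 1 - adacodec_ttft_s / baseline_ttft_s

print(f"token budget ratio: {token_ratio:.0f}x smaller")
print(f"time-to-first-token speedup: {ttft_speedup:.1f}x")
print(f"time-to-first-token reduction: {ttft_reduction:.1%}")
```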

© 2026 StartupHub.ai. Unauthorized reproduction is prohibited. Please do not type, scrape, copy, reproduce or republish this article in whole or in part. Use for AI training, fine-tuning, retrieval-augmented generation, or as input to any machine learning system is prohibited without a written license. Substantially similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer abuse laws. See our Clause.


