The increased computational demands of vision language models (VLMs) caused by large-scale visual token processing have become a critical bottleneck for scalability. Existing training-aware pruning methods often fail under aggressive compression because they rely on continuous approximations to inherently discrete problems.
Visual TL;DR

Limitations of existing pruning: Training-aware pruning fails under aggressive compression because it relies on continuous approximations to a discrete problem.
GRIP-VLM framework: A new framework for discrete visual token pruning in vision language models.
RL for discrete optimization: Visual token pruning is formulated as a Markov decision process.
GRPO paradigm: Group Relative Policy Optimization, enhanced with a supervised warmup.
Direct discrete search: The agent directly navigates the discrete search space for effective pruning decisions.
Greater efficiency: Unprecedented efficiency and adaptability for VLMs.
Unleash discrete optimization with reinforcement learning
The GRIP-VLM framework introduces a new approach that avoids the limitations of gradient-based methods, whose continuous relaxations often fall into local minima. By formulating visual token pruning as a Markov decision process, GRIP-VLM leverages the Group Relative Policy Optimization (GRPO) paradigm. This RL-driven strategy, bootstrapped by a supervised warmup, directly navigates the discrete search space, allowing for more effective and less constrained pruning decisions. This represents a major departure from previous attempts at pruning vision language models.
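To make the GRPO idea concrete, here is a minimal sketch (not the paper's implementation) of a group-relative policy update for discrete keep/drop decisions over visual tokens. The function name `grpo_prune_step`, the toy `importance` vector, and the budget-penalized reward are all hypothetical illustrations; the core mechanic shown is GRPO's group-normalized advantage driving a REINFORCE-style update on a Bernoulli pruning policy.

```python
import numpy as np

rng = np.random.default_rng(0)

def grpo_prune_step(logits, reward_fn, group_size=8, lr=0.5):
    """One GRPO-style update for a Bernoulli keep/drop policy over tokens.

    logits: (T,) per-token keep logits; reward_fn maps a binary mask to a scalar.
    Samples a group of candidate pruning masks, normalizes their rewards within
    the group (the group-relative advantage), and applies a REINFORCE step.
    """
    p = 1.0 / (1.0 + np.exp(-logits))              # keep probabilities
    masks = rng.random((group_size, p.size)) < p   # sample G pruning masks
    rewards = np.array([reward_fn(m) for m in masks], dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # For a Bernoulli policy, d log pi / d logits = mask - p.
    grad = (adv[:, None] * (masks - p)).mean(axis=0)
    return logits + lr * grad

# Toy reward: favor high-importance tokens, penalize exceeding a 3-token budget.
importance = np.array([0.9, 0.8, 0.1, 0.05, 0.7, 0.2])
def reward(mask):
    return importance[mask.astype(bool)].sum() - 2.0 * max(0, mask.sum() - 3)

logits = np.zeros(6)
for _ in range(200):
    logits = grpo_prune_step(logits, reward)
keep_prob = 1.0 / (1.0 + np.exp(-logits))
```

After a few hundred updates the policy concentrates keep probability on the high-importance tokens, without ever relaxing the discrete mask into a continuous surrogate.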
Adaptive pruning for unprecedented efficiency
The GRIP-VLM architecture features a lightweight agent with a budget-aware scorer. The agent dynamically evaluates the importance of each token and can adapt to any compression ratio without requiring a complete retraining cycle. Extensive evaluations across a variety of multimodal benchmarks confirm the superiority of GRIP-VLM over heuristic and supervised baselines. The framework consistently achieves a more favorable Pareto frontier, speeding up inference by up to 15% while maintaining accuracy, thereby addressing the core challenges in vision language model pruning.
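The budget-aware behavior described above can be sketched as follows. This is an illustrative assumption about the mechanism, not the paper's code: given per-token importance scores from a lightweight scorer, the keep budget is applied at inference time as a top-k selection, so changing the compression ratio requires no retraining. The helper `budget_aware_prune` and the toy inputs are hypothetical.

```python
import numpy as np

def budget_aware_prune(token_feats, scores, keep_ratio):
    """Keep the top-scoring fraction of visual tokens for a given budget.

    token_feats: (T, D) visual token features; scores: (T,) importance scores
    from a lightweight scorer. keep_ratio in (0, 1] is chosen at inference
    time, so the same scorer serves any compression ratio.
    """
    n_tokens = token_feats.shape[0]
    k = max(1, int(round(keep_ratio * n_tokens)))
    # Select the k highest-scoring tokens, then restore their original order.
    keep_idx = np.sort(np.argsort(scores)[-k:])
    return token_feats[keep_idx], keep_idx

feats = np.arange(12, dtype=float).reshape(6, 2)   # 6 toy tokens, dim 2
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3])
kept, idx = budget_aware_prune(feats, scores, keep_ratio=0.5)
# idx -> array([0, 2, 4]): the three highest-scoring tokens, in sequence order
```

Restoring the original token order after selection matters because the surviving tokens are fed back into the VLM, where positional structure is meaningful.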