MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding

Peiran Wu1,2*†, Zhuorui Yu2†, Yunze Liu2‡, Chi-Hao Wu2, Enmin Zhou2, Junxiao Shen1,2

1 University of Bristol    2 Memories.ai Research

* Project Leader. Work done during an internship at Memories.ai Research
† Equal contribution
‡ Corresponding author

Accepted at ICLR 2026


Abstract

The rapid progress of large language models (LLMs) has laid the foundation for multimodal models, yet vision-language models (VLMs) still incur substantial computational overhead when scaled from images to video. Because video inputs are large (high frame rates, long durations), inference cost rises sharply, which severely hinders deployment in settings that demand fast responses under limited compute. Token compression of the video input is a promising remedy, since an effective compression scheme can greatly reduce this overhead. Most existing methods rely on training-free token merging along the spatial or temporal dimension; they do cut computation, but their training-free nature inevitably discards information during compression and causes a significant performance drop. To address these challenges, we propose MARC, a Memory-Augmented Reinforcement Learning-based token Compression method for efficient video understanding that integrates structured retrieval with RL-based distillation. MARC follows a retrieve-then-compress paradigm built on two components: a Visual Memory Retriever (VMR) and a Compression Group Relative Policy Optimization (C-GRPO) training strategy. The VMR segments a video into event-level fragments and selects the clips most relevant to the query; C-GRPO then distills reasoning ability from a teacher network into a student network by rewarding the student's outputs for matching the teacher's performance. Extensive experiments on six video benchmarks show that MARC achieves nearly the same accuracy as the 64-frame Qwen2.5-VL-3B baseline while using only one frame's worth of tokens as input, a 95% reduction in visual tokens, and it also cuts GPU memory usage by 72% and generation latency by 23.9%. These results highlight the strong potential of MARC as a robust RL-based post-training compression solution for large models, enabling practical deployment in latency-sensitive, resource-constrained applications such as real-time video question answering, surveillance, and autonomous driving.
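To make the retrieve-then-compress idea concrete, below is a minimal NumPy sketch of the pipeline the abstract describes: segment frames into event-level fragments, retrieve the fragments most relevant to the query, and pool their features down to a small token budget. All names (segment_events, retrieve_clips, compress_tokens) and the thresholding and pooling choices are illustrative assumptions, not the paper's implementation.

# Minimal sketch of a retrieve-then-compress pipeline in the spirit of MARC.
# Function names and heuristics are illustrative assumptions, not the paper's code.
import numpy as np

def segment_events(frame_feats: np.ndarray, threshold: float = 0.85) -> list[tuple[int, int]]:
    """Split a video into event-level fragments by cutting wherever adjacent
    frame features fall below a cosine-similarity threshold."""
    normed = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sims = (normed[:-1] * normed[1:]).sum(axis=1)  # similarity of adjacent frames
    cuts = [0] + [i + 1 for i, s in enumerate(sims) if s < threshold] + [len(frame_feats)]
    return [(cuts[i], cuts[i + 1]) for i in range(len(cuts) - 1)]

def retrieve_clips(frame_feats, query_feat, segments, top_k=2):
    """Score each fragment by mean cosine similarity to the query; keep top-k."""
    q = query_feat / np.linalg.norm(query_feat)
    scores = []
    for s, e in segments:
        clip = frame_feats[s:e]
        clip = clip / np.linalg.norm(clip, axis=1, keepdims=True)
        scores.append(float((clip @ q).mean()))
    order = np.argsort(scores)[::-1][:top_k]
    return [segments[i] for i in order]

def compress_tokens(frame_feats, clips, token_budget=256):
    """Mean-pool the selected clips' frame features down to a fixed budget."""
    selected = np.concatenate([frame_feats[s:e] for s, e in clips], axis=0)
    bins = np.array_split(selected, min(token_budget, len(selected)))
    return np.stack([b.mean(axis=0) for b in bins])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(64, 512)).astype(np.float32)  # 64 frames, 512-d features
    query = rng.normal(size=512).astype(np.float32)
    segs = segment_events(feats)
    clips = retrieve_clips(feats, query, segs)
    tokens = compress_tokens(feats, clips, token_budget=16)
    print(tokens.shape)  # (n_tokens, 512) with n_tokens <= token_budget

The training signal can be sketched in the same hedged spirit: the abstract says C-GRPO rewards the student (which sees only compressed tokens) for matching the teacher, with a GRPO-style group-relative advantage. The exact-match reward below is a hypothetical stand-in for whatever matching criterion the paper actually uses.

# Hypothetical sketch of a group-relative, teacher-matching reward (C-GRPO-style).
import numpy as np

def c_grpo_advantages(student_answers: list[str], teacher_answer: str) -> np.ndarray:
    """Reward each sampled student answer 1.0 if it matches the teacher's
    answer, then normalize within the group (GRPO-style advantage)."""
    rewards = np.array([1.0 if a == teacher_answer else 0.0 for a in student_answers])
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: a group of 4 sampled student answers scored against the teacher.
print(c_grpo_advantages(["B", "A", "B", "C"], teacher_answer="B"))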

BibTeX

@article{wu2025marc,
  title={MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding},
  author={Wu, Peiran and Yu, Zhuorui and Liu, Yunze and Wu, Chi-Hao and Zhou, Enmin and Shen, Junxiao},
  journal={arXiv preprint arXiv:2510.07915},
  year={2025}
}