Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning

Xuehui Wang, Xuankun Yang, Wei Shen | July 2, 2026 | arXiv

Key Takeaway

When pruning visual tokens in VLMs, using entropy to filter out textual noise and treating token selection as a structured optimization problem, rather than simply taking the top-K scores, preserves fine-grained details better while still reducing computation.

Summary

This paper tackles the problem of compressing image tokens in vision-language models (VLMs) while preserving important visual details. The authors identify two failure modes in existing pruning methods: textual noise corrupts the scoring of visual tokens, and the tokens that get selected end up fragmented.
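To make the two ideas concrete, here is a minimal sketch in plain Python. It is not the authors' algorithm, since this summary does not give their formulation. It assumes that a text token whose attention over the visual tokens has high entropy counts as noise, and that structured selection can be approximated by greedily maximizing a relevance term plus a facility-location coverage term. The function names and the parameters `keep_ratio` and `lam` are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(ps):
    return -sum(p * math.log(p) for p in ps if p > 0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0

def entropy_filtered_scores(visual, text, keep_ratio=0.5):
    """Score visual tokens using only the lowest-entropy text tokens.

    Assumption: a text token whose attention over the visual tokens is
    close to uniform (high entropy) carries little grounding signal.
    """
    attn = [softmax([dot(t, v) for v in visual]) for t in text]
    ents = [entropy(a) for a in attn]
    n_keep = max(1, math.ceil(keep_ratio * len(text)))
    kept = sorted(range(len(text)), key=lambda i: ents[i])[:n_keep]
    return [sum(attn[i][j] for i in kept) / n_keep
            for j in range(len(visual))]

def greedy_select(visual, scores, k, lam=1.0):
    """Greedily maximize F(S) = sum of scores in S + lam * coverage(S).

    coverage(S) = sum_j max_{i in S} sim(i, j) is a facility-location
    term (an assumed stand-in for the paper's objective). With
    non-negative similarities, F is monotone submodular.
    """
    n = len(visual)
    sim = [[max(0.0, cosine(visual[i], visual[j])) for j in range(n)]
           for i in range(n)]
    covered = [0.0] * n   # best similarity to any selected token so far
    selected = []
    for _ in range(min(k, n)):
        best, best_gain = None, float("-inf")
        for c in range(n):
            if c in selected:
                continue
            # Marginal gain of adding candidate c to the current set.
            gain = scores[c] + lam * sum(
                max(sim[c][j] - covered[j], 0.0) for j in range(n))
            if gain > best_gain:
                best, best_gain = c, gain
        selected.append(best)
        covered = [max(covered[j], sim[best][j]) for j in range(n)]
    return selected

if __name__ == "__main__":
    # Toy data: two groups of near-duplicate visual tokens, one focused
    # text token and one diffuse (noisy) text token.
    visual = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
    text = [[3.0, 0.0], [1.0, 1.0]]
    scores = entropy_filtered_scores(visual, text, keep_ratio=0.5)
    print("scores:", [round(s, 3) for s in scores])
    print("greedy:", greedy_select(visual, scores, k=2))
```

On this toy input, taking the top two scores keeps both near-duplicate tokens from the first group, while the greedy selection keeps one token from each group. The coverage term is what discourages redundant picks; setting `lam=0` reduces the procedure to plain top-K.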

efficiency, multimodal, evaluation

Key Terms

token-pruning, submodular-optimization, entropy, vision-language-model, cross-modal-scoring