When pruning visual tokens in VLMs, filtering textual noise with entropy and treating token selection as a structured optimization problem (rather than simply picking the top-K) preserves fine-grained details better while reducing computation.
This paper tackles the problem of compressing image tokens in vision-language models (VLMs) while preserving important visual details. The authors argue that existing pruning methods fail for two reasons: textual noise corrupts the scoring process, and the selected tokens become fragmented.
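The summary above does not spell out the scoring rule or the optimization objective, so the following is only a minimal sketch of the two ideas under stated assumptions, not the authors' method. It assumes that each text token has an attention row over the visual tokens, that "textual noise" means text tokens whose attention is diffuse (high entropy), and that fragmentation can be reduced by a greedy objective with a bonus for grid-adjacent tokens. The function names, the `max_entropy` threshold, and the `bonus` weight are all hypothetical.

```python
import math


def attention_entropy(row):
    """Shannon entropy (in nats) of one text token's attention over visual tokens."""
    total = sum(row)
    probs = [w / total for w in row]
    return -sum(p * math.log(p) for p in probs if p > 0)


def score_visual_tokens(attn, max_entropy):
    """Average attention per visual token, using only low-entropy text tokens.

    attn: list of rows, one per text token, each a list of non-negative
    weights over the visual tokens. Rows with entropy above max_entropy are
    treated as noise and ignored (assumed filtering rule).
    """
    kept = [row for row in attn if attention_entropy(row) <= max_entropy]
    if not kept:  # every row looked noisy: fall back to using all of them
        kept = attn
    n_visual = len(attn[0])
    return [sum(row[j] for row in kept) / len(kept) for j in range(n_visual)]


def select_tokens(scores, width, k, bonus=0.1):
    """Greedily pick k tokens on a row-major grid of the given width.

    Each step maximizes score + bonus * (number of already selected
    4-neighbours), an assumed stand-in for a structured objective.
    With bonus=0 this reduces to plain top-K selection.
    """
    n = len(scores)
    chosen = set()
    for _ in range(min(k, n)):
        best, best_gain = None, -math.inf
        for j in range(n):
            if j in chosen:
                continue
            r, c = divmod(j, width)
            neighbours = 0
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if rr >= 0 and 0 <= cc < width and rr * width + cc < n:
                    if rr * width + cc in chosen:
                        neighbours += 1
            gain = scores[j] + bonus * neighbours
            if gain > best_gain:
                best, best_gain = j, gain
        chosen.add(best)
    return sorted(chosen)


# Two text tokens over a 2x2 grid of visual tokens: one focused, one diffuse.
attn = [
    [0.90, 0.05, 0.03, 0.02],  # focused: low entropy, kept
    [0.25, 0.25, 0.25, 0.25],  # uniform: high entropy, filtered out
]
scores = score_visual_tokens(attn, max_entropy=1.0)
kept_tokens = select_tokens(scores, width=2, k=2)
```

Setting `bonus=0` recovers ordinary top-K selection, which makes it easy to compare the two behaviours on the same scores; the adjacency bonus is just one simple way to discourage scattered selections and may differ from the structure the paper actually optimizes.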