Automatically optimizing per-layer token-pruning configurations can achieve better speed-accuracy trade-offs than fixed pruning strategies, and progressive pruning across multiple layers outperforms single-layer pruning in vision-language models.
This paper introduces VisPCO, a framework that automatically searches for the visual-token pruning configuration that best accelerates vision-language model inference while preserving accuracy.
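
The paper's exact importance scoring and configuration search are not reproduced here, but the core idea of progressive multi-layer pruning can be sketched. The PyTorch snippet below prunes visual tokens at several layers according to a hard-coded keep-ratio schedule, scoring tokens by the attention they receive from text tokens; the schedule, the scoring rule, and all names (`prune_visual_tokens`, `SCHEDULE`) are illustrative assumptions rather than VisPCO's actual design.

```python
import torch

def prune_visual_tokens(hidden, attn, visual_mask, keep_ratio):
    """Drop low-importance visual tokens after one decoder layer.

    hidden:      (seq, dim) hidden states
    attn:        (heads, seq, seq) attention weights from this layer
    visual_mask: (seq,) bool, True where the token is a visual token
    keep_ratio:  fraction of the current visual tokens to keep
    """
    # Score each token by the mean attention it receives from text queries.
    scores = attn[:, ~visual_mask, :].mean(dim=(0, 1))            # (seq,)
    vis_idx = visual_mask.nonzero(as_tuple=True)[0]
    txt_idx = (~visual_mask).nonzero(as_tuple=True)[0]
    n_keep = max(1, int(keep_ratio * vis_idx.numel()))
    top = scores[vis_idx].topk(n_keep).indices
    # Text tokens are always kept; only visual tokens are pruned.
    keep_idx = torch.cat([txt_idx, vis_idx[top]]).sort().values
    return hidden[keep_idx], visual_mask[keep_idx]

# Hypothetical progressive schedule: layer index -> keep ratio at that layer.
# VisPCO's point is to find such a configuration automatically; this one
# is hard-coded purely for illustration.
SCHEDULE = {4: 0.75, 8: 0.5, 16: 0.25}

if __name__ == "__main__":
    torch.manual_seed(0)
    dim, heads = 64, 8
    hidden = torch.randn(600, dim)
    visual_mask = torch.zeros(600, dtype=torch.bool)
    visual_mask[:576] = True                     # e.g. 576 image-patch tokens
    for ratio in SCHEDULE.values():              # one pruning step per chosen layer
        seq = hidden.shape[0]
        attn = torch.softmax(torch.randn(heads, seq, seq), dim=-1)
        hidden, visual_mask = prune_visual_tokens(hidden, attn, visual_mask, ratio)
    print(hidden.shape, int(visual_mask.sum()))  # far fewer visual tokens remain
```

In this sketch the schedule is fixed; the trade-off described above comes from replacing that hard-coded dict with an automatic search over which layers to prune at and how many tokens to keep at each, rather than committing to a single fixed strategy.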