Automatically optimizing per-layer token-pruning configurations can achieve better speed-accuracy trade-offs than fixed pruning strategies, and progressive pruning across multiple layers outperforms single-layer pruning in vision-language models.
This paper introduces VisPCO, a framework that automatically searches for the visual-token pruning configuration that best accelerates vision-language model inference while preserving accuracy.
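
The paper's exact importance scoring and configuration search are not reproduced here, but the core idea of progressive multi-layer pruning can be sketched. The PyTorch snippet below prunes visual tokens at several layers according to a hard-coded keep-ratio schedule, scoring tokens by the attention they receive from text tokens; the schedule, the scoring rule, and all names (`prune_visual_tokens`, `SCHEDULE`) are illustrative assumptions rather than VisPCO's actual design.

```python
import torch

def prune_visual_tokens(hidden, attn, visual_mask, keep_ratio):
    """Drop low-importance visual tokens after one decoder layer.

    hidden:      (seq, dim) hidden states
    attn:        (heads, seq, seq) attention weights from this layer
    visual_mask: (seq,) bool, True where the token is a visual token
    keep_ratio:  fraction of the current visual tokens to keep
    """
    # Score each token by the mean attention it receives from text queries.
    scores = attn[:, ~visual_mask, :].mean(dim=(0, 1))            # (seq,)
    vis_idx = visual_mask.nonzero(as_tuple=True)[0]
    txt_idx = (~visual_mask).nonzero(as_tuple=True)[0]
    n_keep = max(1, int(keep_ratio * vis_idx.numel()))
    top = scores[vis_idx].topk(n_keep).indices
    # Text tokens are always kept; only visual tokens are pruned.
    keep_idx = torch.cat([txt_idx, vis_idx[top]]).sort().values
    return hidden[keep_idx], visual_mask[keep_idx]

# Hypothetical progressive schedule: layer index -> keep ratio at that layer.
# VisPCO's point is to find such a configuration automatically; this one
# is hard-coded purely for illustration.
SCHEDULE = {4: 0.75, 8: 0.5, 16: 0.25}

if __name__ == "__main__":
    torch.manual_seed(0)
    dim, heads = 64, 8
    hidden = torch.randn(600, dim)
    visual_mask = torch.zeros(600, dtype=torch.bool)
    visual_mask[:576] = True                     # e.g. 576 image-patch tokens
    for ratio in SCHEDULE.values():              # one pruning step per chosen layer
        seq = hidden.shape[0]
        attn = torch.softmax(torch.randn(heads, seq, seq), dim=-1)
        hidden, visual_mask = prune_visual_tokens(hidden, attn, visual_mask, ratio)
    print(hidden.shape, int(visual_mask.sum()))  # far fewer visual tokens remain
```

In this sketch the schedule is fixed; the trade-off described above comes from replacing that hard-coded dict with an automatic search over which layers to prune at and how many tokens to keep at each, rather than committing to a single fixed strategy.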