Removing redundant or low-importance visual tokens from image or video inputs to reduce the computational cost of vision-language models.
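
As an illustration, one common way to do this is to rank visual tokens by an importance score and keep only the top fraction. The sketch below is a minimal pure-Python example under stated assumptions, not the method of any particular model: the function name `prune_visual_tokens`, the `keep_ratio` parameter, and the per-token `scores` input are all hypothetical. How the scores are obtained (for example, from attention weights) is model-specific and left out.

```python
from typing import List, Sequence, Tuple


def prune_visual_tokens(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep_ratio: float = 0.5,
) -> Tuple[List[Sequence[float]], List[int]]:
    """Keep the highest-scoring visual tokens, preserving their original order.

    `scores` is an assumed per-token importance value supplied by the caller;
    this sketch does not define how it is computed.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Keep at least one token so some visual input always survives.
    num_keep = max(1, int(len(tokens) * keep_ratio))

    # Rank token indices by score, highest first, and take the top num_keep.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)

    # Re-sort the kept indices so spatial/temporal order is preserved.
    kept_indices = sorted(ranked[:num_keep])

    return [tokens[i] for i in kept_indices], kept_indices


# Toy usage: four 2-dimensional "visual tokens" with made-up scores.
tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
scores = [0.05, 0.60, 0.10, 0.25]
kept, kept_indices = prune_visual_tokens(tokens, scores, keep_ratio=0.5)
print(kept_indices)  # [1, 3]
```

Because the cost of self-attention grows quadratically with sequence length, halving the number of visual tokens in this way reduces the attention work on those tokens by more than half. Returning the kept indices lets a caller keep positional information aligned with the surviving tokens.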