You can improve VLM grounding without any training by using entropy gradients to locate uncertain image regions and then iteratively refining the model's focus, which helps on detail-heavy tasks such as document QA and compositional reasoning.
This paper proposes a training-free method to improve vision-language models by automatically identifying which visual regions matter most for answering a question. Instead of relying on external tools, it uses the model's own uncertainty about its next token to build relevance maps, then iteratively zooms into the most relevant areas until the model is confident.
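The loop described above can be sketched in a toy form. This is not the paper's implementation: it replaces entropy gradients with a simpler proxy (the drop in next-token entropy when zooming into each sub-region), and `vlm_logits` is a hypothetical stub standing in for a real VLM forward pass. The names `answer_entropy`, `relevance_map`, and `iterative_zoom` are illustrative, not from the paper.

```python
import numpy as np

def answer_entropy(logits):
    """Entropy of the model's next-token distribution."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def vlm_logits(image, box):
    """Hypothetical stand-in for a VLM forward pass on an image crop.

    Here, crops with higher mean intensity yield sharper (lower-entropy)
    logits, mimicking a model that grows more confident on informative regions.
    """
    r0, r1, c0, c1 = box
    return image[r0:r1, c0:c1].mean() * np.arange(10.0)

def relevance_map(image, box, grid=2):
    """Score each sub-region by how much zooming into it reduces entropy."""
    r0, r1, c0, c1 = box
    dr, dc = (r1 - r0) // grid, (c1 - c0) // grid
    base = answer_entropy(vlm_logits(image, box))
    scores = {}
    for i in range(grid):
        for j in range(grid):
            sub = (r0 + i * dr, r0 + (i + 1) * dr,
                   c0 + j * dc, c0 + (j + 1) * dc)
            scores[sub] = base - answer_entropy(vlm_logits(image, sub))
    return scores

def iterative_zoom(image, steps=3, min_size=8):
    """Repeatedly zoom into the sub-region with the largest entropy drop."""
    box = (0, image.shape[0], 0, image.shape[1])
    for _ in range(steps):
        if box[1] - box[0] < 2 * min_size:
            break
        best, gain = max(relevance_map(image, box).items(),
                         key=lambda kv: kv[1])
        if gain <= 0:  # stop once zooming no longer reduces uncertainty
            break
        box = best
    return box
```

For example, on a synthetic 32x32 image where only the bottom-right quadrant carries signal, the loop zooms into that quadrant and then stops, since further zooming yields no entropy reduction.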