You can improve VLM grounding without any training by using entropy gradients to locate uncertain image regions and then iteratively refining the model's focus, which helps on detail-heavy tasks such as document QA and compositional reasoning.
This paper proposes a training-free method to improve vision-language models by automatically identifying which visual regions matter most for answering a question. Instead of relying on external tools, it uses the model's own uncertainty about its next token to build relevance maps, then iteratively zooms into the most relevant areas until the model is confident.
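The loop described above can be sketched in a toy form. This is not the paper's implementation: it replaces entropy gradients with a simpler proxy (the drop in next-token entropy when zooming into each sub-region), and `vlm_logits` is a hypothetical stub standing in for a real VLM forward pass. The names `answer_entropy`, `relevance_map`, and `iterative_zoom` are illustrative, not from the paper.

```python
import numpy as np

def answer_entropy(logits):
    """Entropy of the model's next-token distribution."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def vlm_logits(image, box):
    """Hypothetical stand-in for a VLM forward pass on an image crop.

    Here, crops with higher mean intensity yield sharper (lower-entropy)
    logits, mimicking a model that grows more confident on informative regions.
    """
    r0, r1, c0, c1 = box
    return image[r0:r1, c0:c1].mean() * np.arange(10.0)

def relevance_map(image, box, grid=2):
    """Score each sub-region by how much zooming into it reduces entropy."""
    r0, r1, c0, c1 = box
    dr, dc = (r1 - r0) // grid, (c1 - c0) // grid
    base = answer_entropy(vlm_logits(image, box))
    scores = {}
    for i in range(grid):
        for j in range(grid):
            sub = (r0 + i * dr, r0 + (i + 1) * dr,
                   c0 + j * dc, c0 + (j + 1) * dc)
            scores[sub] = base - answer_entropy(vlm_logits(image, sub))
    return scores

def iterative_zoom(image, steps=3, min_size=8):
    """Repeatedly zoom into the sub-region with the largest entropy drop."""
    box = (0, image.shape[0], 0, image.shape[1])
    for _ in range(steps):
        if box[1] - box[0] < 2 * min_size:
            break
        best, gain = max(relevance_map(image, box).items(),
                         key=lambda kv: kv[1])
        if gain <= 0:  # stop once zooming no longer reduces uncertainty
            break
        box = best
    return box
```

For example, on a synthetic 32x32 image where only the bottom-right quadrant carries signal, the loop zooms into that quadrant and then stops, since further zooming yields no entropy reduction.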