VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning

Yucheng Shen, Jiulong Wu, Jizhou Huang, Dawei Yin, Lingyong Yan et al. | April 10, 2026 | arXiv

Key Takeaway

Agentic systems that reason over visual documents become both more accurate and more efficient when they maintain structured evidence across pages and actively counter context drift with sliding windows and intent injection.
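
To make the takeaway concrete, here is a minimal Python sketch of the three ideas it names: an evidence store keyed by page, a sliding window over recent steps, and re-injection of the original intent into every prompt. This is an illustration under stated assumptions, not VISOR's implementation. The class `AgentContext`, its methods, and the prompt layout are all hypothetical names chosen for this example.

```python
from collections import deque


class AgentContext:
    """Hypothetical context manager for a multi-step document agent.

    Illustrative only: none of these names come from the VISOR paper.
    """

    def __init__(self, intent, window_size=3):
        self.intent = intent                     # original user question
        self.evidence = {}                       # page id -> short evidence note
        self.steps = deque(maxlen=window_size)   # sliding window of recent steps

    def add_step(self, observation):
        # Once the window is full, the oldest step is dropped automatically.
        self.steps.append(observation)

    def record_evidence(self, page_id, note):
        # Evidence lives outside the window, so it survives step eviction.
        self.evidence[page_id] = note

    def build_prompt(self):
        # Intent injection: the question is restated at the top of every prompt.
        lines = [f"Question: {self.intent}"]
        lines += [f"Evidence[{page}]: {note}" for page, note in self.evidence.items()]
        lines += [f"Recent step: {step}" for step in self.steps]
        return "\n".join(lines)


if __name__ == "__main__":
    ctx = AgentContext("What was 2023 revenue?", window_size=2)
    ctx.record_evidence("p4", "Revenue table found")
    for step in ["searched 'revenue'", "opened page 4", "zoomed into table"]:
        ctx.add_step(step)
    print(ctx.build_prompt())
```

The design choice worth noting is the separation of concerns: the window bounds how much step history reaches the model, while evidence and intent are kept outside it so that neither is lost as older steps fall away.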

Summary

VISOR is an agentic visual retrieval-augmented generation (RAG) system that helps vision-language models retrieve and reason over visually rich documents by combining iterative search with multi-step reasoning.

agents reasoning multimodal

Key Terms

rag agentic-ai visual-tokens grpo sliding-window-attention