Context-Aware RL for Agentic and Multimodal LLMs

Peiyang Xu, Bangzheng Li, Sijia Liu, Karthik R. Narasimhan, Pramod Viswanath et al. | June 15, 2026 | arXiv

Key Takeaway

Training models to select the context that supports an answer, not only to produce the correct answer, improves long-horizon reasoning and multimodal performance without requiring additional data.

Summary

ContextRL trains LLMs to better handle long contexts and multimodal inputs by rewarding them for selecting the correct supporting context from a set of similar alternatives, rather than just supervising final answers. This indirect approach improves reasoning on coding tasks and visual question answering by encouraging fine-grained evidence grounding.
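
The selection-based reward described above can be illustrated with a minimal sketch. It is not the authors' implementation: the binary reward, the function names, and the toy data are all assumptions made for illustration, and the summary does not specify the actual reward design. The only standard ingredient is the group-relative advantage used in GRPO, where each sampled response is scored against the mean and standard deviation of its own group.

```python
from statistics import mean, pstdev


def selection_reward(chosen_index: int, correct_index: int) -> float:
    """Hypothetical binary reward: 1.0 if the model picked the supporting context."""
    return 1.0 if chosen_index == correct_index else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style normalization: center and scale rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every sample earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy data (assumed): one query with several similar candidate contexts,
# of which index 2 is the true supporting context, and four sampled choices.
correct_index = 2
sampled_choices = [2, 0, 2, 3]

rewards = [selection_reward(c, correct_index) for c in sampled_choices]
advantages = group_relative_advantages(rewards)

for choice, reward, adv in zip(sampled_choices, rewards, advantages):
    print(f"choice={choice} reward={reward:.1f} advantage={adv:+.3f}")
```

In this sketch, samples that pick the supporting context receive a positive advantage and the rest a negative one, so the policy update favors grounded selections even though no final answer is scored.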

training reasoning multimodal

Key Terms

reinforcement-learning, contrastive-learning, grpo, long-context-reasoning, multimodal-reasoning