Training models to select the evidence that supports an answer, rather than rewarding answer correctness alone, improves long-horizon reasoning and multimodal performance without requiring more data.
ContextRL trains LLMs to better handle long contexts and multimodal inputs by rewarding them for selecting the correct supporting context from a set of similar alternatives, rather than supervising only the final answer. This indirect approach improves reasoning on coding tasks and visual question answering by encouraging fine-grained evidence grounding.
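The context-selection reward described above can be illustrated with a minimal sketch. This is not ContextRL's actual implementation: the summary does not specify the reward shape, so the binary 0/1 reward here is an assumption, and the names `ContextSelectionExample` and `selection_reward` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ContextSelectionExample:
    """Hypothetical training item: a question plus candidate contexts."""
    question: str
    candidates: Sequence[str]  # one supporting context plus similar distractors
    gold_index: int            # position of the context that supports the answer


def selection_reward(example: ContextSelectionExample, chosen_index: int) -> float:
    """Return 1.0 if the chosen context is the supporting one, else 0.0.

    Assumption: a binary reward. The source does not state the reward shape.
    """
    if not 0 <= chosen_index < len(example.candidates):
        return 0.0  # out-of-range selections count as incorrect
    return 1.0 if chosen_index == example.gold_index else 0.0


# Toy example: the distractors are deliberately similar to the gold context.
example = ContextSelectionExample(
    question="Which function closes the file handle?",
    candidates=[
        "def close_handle(f): f.close()",
        "def close_socket(s): s.close()",
        "def flush_handle(f): f.flush()",
    ],
    gold_index=0,
)

# Reward for each possible selection the model could make.
rewards = [selection_reward(example, i) for i in range(len(example.candidates))]
```

Because the distractors resemble the gold context, a model can only earn the reward by attending to the fine-grained differences between candidates, which is the grounding behavior the summary attributes to the approach.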