Using process-level rewards (tracking entities along reasoning chains) instead of outcome-only rewards significantly improves how well LLMs reason through long documents with many distractors.
This paper tackles long-context reasoning in language models by combining reinforcement learning with a new reward design. The approach has two key parts: search agent trajectories are used to build challenging training data with realistic distractors, and a 'rubric reward' provides fine-grained feedback on reasoning steps rather than only on final answers.
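To make the contrast between an outcome-only reward and a rubric-style reward concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names, the case-insensitive substring matching of entities, and the `weight` blending parameter are all hypothetical choices made for this example.

```python
def outcome_reward(answer: str, gold: str) -> float:
    """Outcome-only signal: 1.0 for an exact (case-insensitive) match, else 0.0."""
    return 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0


def rubric_reward(reasoning: str, rubric_entities: list[str]) -> float:
    """Process-level signal: fraction of rubric entities mentioned in the reasoning.

    Assumption: an entity counts as 'tracked' if its name appears as a
    case-insensitive substring of the reasoning text.
    """
    if not rubric_entities:
        return 0.0
    text = reasoning.lower()
    hits = sum(1 for entity in rubric_entities if entity.lower() in text)
    return hits / len(rubric_entities)


def combined_reward(
    reasoning: str,
    answer: str,
    gold: str,
    rubric_entities: list[str],
    weight: float = 0.5,  # hypothetical blending weight, not taken from the paper
) -> float:
    """Blend the outcome reward with the rubric reward."""
    return (1.0 - weight) * outcome_reward(answer, gold) + weight * rubric_reward(
        reasoning, rubric_entities
    )


if __name__ == "__main__":
    reasoning = "Marie Curie was born in Warsaw, so the answer follows."
    entities = ["Marie Curie", "Warsaw", "Poland"]
    print(combined_reward(reasoning, "Poland", "Poland", entities))
```

In this sketch, a response that reaches the right answer while skipping intermediate entities scores lower than one that traces the full chain, which is the kind of step-level feedback an outcome-only reward cannot provide.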