LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li | May 29, 2026 | arXiv

Key Takeaway

Process-level rewards that track entities along the reasoning chain, used in place of outcome-only rewards, significantly improve how well LLMs reason through long documents containing many distractors.

Summary

This paper tackles long-context reasoning in language models by combining reinforcement learning with a novel reward design. It makes two main contributions: it uses search agent trajectories to build challenging training data with realistic distractors, and it introduces a "rubric reward" that gives fine-grained feedback on intermediate reasoning steps rather than on the final answer alone.
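To make the idea of a rubric reward concrete, here is a minimal sketch, not the paper's actual formulation. It assumes the rubric is a list of entities expected along the reasoning chain, that matching is case-insensitive substring search, and that the process and outcome terms are mixed by a fixed weight. The names `Rubric`, `rubric_reward`, and `process_weight` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rubric:
    # Hypothetical container: entities expected along the reasoning chain,
    # plus the gold final answer.
    entities: list[str]
    answer: str


def rubric_reward(response: str, rubric: Rubric, process_weight: float = 0.5) -> float:
    """Blend an outcome score with an entity-coverage (process) score.

    Assumption: simple case-insensitive substring matching stands in for
    whatever matching or judging procedure the paper actually uses.
    """
    text = response.lower()

    # Process term: fraction of rubric entities the response mentions.
    hits = sum(1 for entity in rubric.entities if entity.lower() in text)
    process = hits / len(rubric.entities) if rubric.entities else 0.0

    # Outcome term: 1 if the gold answer appears in the response, else 0.
    outcome = 1.0 if rubric.answer.lower() in text else 0.0

    return (1.0 - process_weight) * outcome + process_weight * process
```

Under these assumptions, a response that reaches the right answer without mentioning any intermediate entity earns only the outcome share of the reward, while a response that follows the chain but ends on a wrong answer still receives partial credit. An outcome-only reward cannot express that distinction.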

reasoning, training, evaluation

Key Terms

reinforcement-learning-from-verifiable-rewards, process-reward-model, long-context-reasoning, multi-hop-retrieval, reward-hacking