Reinforcement Learning without Ground-Truth Solutions can Improve LLMs

Yingyu Lin, Qiyue Gao, Nikki Lijing Kuang, Xunpeng Huang, Kun Zhou et al. | June 25, 2026 | arXiv

Key Takeaway

You can train LLMs with RL on open-ended optimization tasks using only execution feedback, with no ground-truth solutions required, and the improvements transfer to problems with exact solutions. This suggests that score-based tasks are valuable for general capability development.

Summary

This paper introduces RiVER, a method for training language models using reinforcement learning on tasks without ground-truth answers. Instead of requiring correct solutions, it uses continuous feedback from execution scores (like how well a heuristic algorithm performs).
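To make the idea of learning from continuous execution scores concrete, here is a minimal sketch of group-relative advantage normalization, the core step of group relative policy optimization (listed under Key Terms). This summary does not give RiVER's actual reward formula, so this is a generic illustration rather than the paper's method; the function name `group_relative_advantages`, the `eps` parameter, and the example scores are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(scores, eps=1e-8):
    """Center and scale one group of execution scores.

    `scores` holds one continuous score per sampled solution to the same
    task (hypothetical values, e.g. the objective reached by a heuristic).
    No ground-truth answer is involved: each sample is judged only
    relative to the other samples in its group.
    """
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation of the group
    # eps avoids division by zero when every sample scores the same.
    return [(s - mu) / (sigma + eps) for s in scores]


# Hypothetical execution scores for four sampled programs on one task.
scores = [0.62, 0.70, 0.55, 0.81]
advantages = group_relative_advantages(scores)
```

Samples that score above the group mean receive a positive advantage and are reinforced; those below receive a negative one. When all samples score the same, every advantage is zero and the group contributes no learning signal.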

training reasoning

Key Terms

reinforcement learning from verifiable rewards, reward shaping, group relative policy optimization, execution feedback