You can train LLMs with RL on open-ended optimization tasks using only execution feedback, with no ground truth needed. The improvements transfer to exact-solution problems, which suggests that score-based tasks are valuable for general capability development.
This paper introduces RiVER, a method for training language models using reinforcement learning on tasks without ground-truth answers. Instead of requiring correct solutions, it uses continuous execution scores as feedback (such as how well a heuristic algorithm performs).
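To make the idea of score-based feedback concrete, here is a minimal sketch of how an execution score could be turned into a scalar reward. It is not RiVER's actual reward function, which this summary does not describe. The bin-packing task, the `pack` entry-point name, the `execution_reward` helper, and the lower-bound normalization are all assumptions chosen for illustration.

```python
import math
from typing import List


def execution_reward(candidate_src: str, items: List[int], capacity: int) -> float:
    """Run a candidate bin-packing heuristic and map its score to [0, 1].

    Hypothetical helper: assumes `items` is non-empty and that the candidate
    source defines `pack(items, capacity)` returning a list of bins.
    """
    namespace: dict = {}
    try:
        # Illustration only: a real system would execute this in a sandbox.
        exec(candidate_src, namespace)
        bins = namespace["pack"](list(items), capacity)
        # Validity check: every item packed exactly once, no bin over capacity.
        packed = sorted(x for b in bins for x in b)
        valid = packed == sorted(items) and all(sum(b) <= capacity for b in bins)
    except Exception:
        return 0.0  # code that crashes or returns malformed output earns nothing
    if not valid:
        return 0.0
    # No optimal answer is needed: a cheap lower bound on the number of bins
    # is enough to normalize the score into a continuous reward.
    lower_bound = math.ceil(sum(items) / capacity)
    return lower_bound / len(bins)


# Three hypothetical model outputs of differing quality.
FIRST_FIT = '''
def pack(items, capacity):
    bins = []
    for x in items:
        for b in bins:
            if sum(b) + x <= capacity:
                b.append(x)
                break
        else:
            bins.append([x])
    return bins
'''

ONE_PER_BIN = '''
def pack(items, capacity):
    return [[x] for x in items]
'''

DROPS_ITEMS = '''
def pack(items, capacity):
    return []
'''

ITEMS = [4, 8, 1, 4, 2, 1]
CAPACITY = 10

rewards = {
    name: execution_reward(src, ITEMS, CAPACITY)
    for name, src in [
        ("first_fit", FIRST_FIT),
        ("one_per_bin", ONE_PER_BIN),
        ("drops_items", DROPS_ITEMS),
    ]
}
```

The point of the sketch is that the reward is graded rather than binary: a valid but wasteful packing still earns partial credit, so a policy can be pushed toward better heuristics even when no reference solution exists.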