ExpRL: Exploratory RL for LLM Mid-Training

Violet Xiang, Amrith Setlur, Chase Blagden, Nick Haber, Aviral Kumar | June 15, 2026 | arXiv

Key Takeaway

Using reference solutions as reward signals rather than imitation targets helps models learn reusable reasoning skills that sparse rewards alone miss, making them better prepared for downstream RL.

Summary

ExpRL is a reinforcement learning method applied to language models during mid-training. Instead of having the model imitate reference solutions, it uses those solutions to build grading rubrics that reward useful reasoning steps, such as breaking a problem down or verifying an answer, even when the final answer is wrong.
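To make the idea of a rubric-based reward concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: the `RubricItem` type, the `rubric_reward` function, the keyword checks (a stand-in for whatever grader scores responses against the rubric), and the `outcome_weight` blend are all assumptions introduced for illustration. It only shows how a response with a wrong final answer can still earn partial reward for exhibiting rubric behaviors.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    """One hypothetical rubric criterion derived from a reference solution."""
    description: str
    weight: float
    check: Callable[[str], bool]  # stand-in for a real grader (assumption)


def rubric_reward(response: str,
                  rubric: List[RubricItem],
                  final_answer_correct: bool,
                  outcome_weight: float = 0.5) -> float:
    """Blend a weighted rubric score with a sparse outcome reward.

    The blending rule and its default weight are illustrative assumptions.
    """
    total = sum(item.weight for item in rubric)
    if total > 0:
        satisfied = sum(item.weight for item in rubric if item.check(response))
        process = satisfied / total
    else:
        process = 0.0
    outcome = 1.0 if final_answer_correct else 0.0
    return (1.0 - outcome_weight) * process + outcome_weight * outcome


# Toy rubric: keyword matching is a crude placeholder for real grading.
rubric = [
    RubricItem("breaks the problem into sub-steps", 1.0,
               lambda r: "step 1" in r.lower()),
    RubricItem("verifies the result", 1.0,
               lambda r: "check" in r.lower()),
]

response = "Step 1: factor the expression. Step 2: solve. Answer: 7"

# The final answer is wrong, so a purely sparse reward would be 0.0,
# but the rubric still credits the decomposition step.
reward = rubric_reward(response, rubric, final_answer_correct=False)
print(reward)
```

In this sketch the response satisfies one of two equally weighted criteria, so it receives a nonzero reward despite the incorrect answer, which is the contrast with sparse outcome-only rewards that the summary describes.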

training reasoning

Key Terms

reinforcement-learning reward-scaffolds process-reward-model mid-training sparse-reward