Using reference solutions as reward signals rather than imitation targets helps models learn reusable reasoning skills that sparse rewards alone miss, making them better prepared for downstream RL.
ExpRL is a method for improving language models through reinforcement learning during mid-training. Instead of having models imitate reference solutions, it uses those solutions to build grading rubrics that reward the model for showing useful reasoning steps, such as breaking down problems or verifying answers, even when the final answer is wrong.
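To make the idea of a rubric-based reward concrete, here is a minimal sketch in Python. It is not ExpRL's actual implementation: the `RubricItem` type, the `rubric_reward` function, the weights, and the keyword checks are all hypothetical stand-ins chosen for illustration. In practice a rubric would presumably be derived from the reference solution and scored by a grader model rather than by string matching.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    """One rubric criterion (hypothetical structure, assumed for illustration)."""
    description: str
    weight: float
    check: Callable[[str], bool]  # stand-in for a grader's judgment


def rubric_reward(response: str, rubric: List[RubricItem]) -> float:
    """Return the weighted fraction of rubric items the response satisfies."""
    total = sum(item.weight for item in rubric)
    if total <= 0:
        return 0.0
    earned = sum(item.weight for item in rubric if item.check(response))
    return earned / total


# Toy rubric for the problem "3 * 4". The keyword checks are assumptions
# that stand in for a real grader.
rubric = [
    RubricItem("breaks the problem into sub-steps", 1.0,
               lambda r: "step 1" in r.lower()),
    RubricItem("verifies the result", 1.0,
               lambda r: "check" in r.lower()),
    RubricItem("reaches the reference final answer", 2.0,
               lambda r: "answer: 12" in r.lower()),
]

structured_but_wrong = "Step 1: rewrite 3*4 as 3+3+3+3. Check by division. Answer: 11"
bare_wrong = "Answer: 11"
structured_and_right = "Step 1: rewrite 3*4 as 3+3+3+3. Check by division. Answer: 12"

print(rubric_reward(structured_but_wrong, rubric))
print(rubric_reward(bare_wrong, rubric))
print(rubric_reward(structured_and_right, rubric))
```

Under a sparse, final-answer-only reward, the first two responses would score the same. With the rubric, the response that decomposes and verifies earns partial credit despite the wrong answer, which is the kind of denser signal described above.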