By routing training samples to different optimization strategies based on their correctness, you can combine fast learning with stable training, a practical improvement for post-training large language models.
This paper proposes Sample-Routed Policy Optimization (SRPO), a training method that combines two approaches to fine-tuning language models: correct outputs are optimized with a reward-based method, while incorrect outputs are trained via distillation.
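The routing idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: all names and the two loss functions are hypothetical placeholders standing in for a policy-gradient objective and a distillation objective respectively.

```python
# Illustrative sketch of per-sample routing (hypothetical names, not
# from the SRPO paper): correct rollouts receive a reward-based loss,
# incorrect ones a distillation loss against a stronger teacher.

from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    prompt: str
    output: str
    is_correct: bool  # e.g. verified against a ground-truth answer


def reward_based_loss(sample: Sample) -> float:
    # Placeholder: in practice a policy-gradient objective (e.g. PPO/GRPO)
    return 0.1


def distillation_loss(sample: Sample) -> float:
    # Placeholder: in practice a KL or cross-entropy term toward a teacher
    return 0.5


def route_batch(samples: List[Sample]) -> float:
    """Average per-sample loss, choosing the objective by correctness."""
    losses = [
        reward_based_loss(s) if s.is_correct else distillation_loss(s)
        for s in samples
    ]
    return sum(losses) / len(losses)
```

The key design point is that the routing decision is made per sample, so a single batch can mix both objectives rather than switching the whole training run between them.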