Multi-token prediction with rejection sampling can accelerate RL training by 1.8x, but requires a specialized loss function (TV loss) and pre-training strategy to maintain high acceptance rates as model entropy increases during RL.
This paper tackles a major bottleneck in RL training for large language models: slow rollout generation. The authors show that Multi-Token Prediction (MTP) with rejection sampling can speed up rollout generation by proposing several tokens per decoding step instead of one, but the acceptance rate of the proposed tokens falls during RL training as model entropy increases.
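To make the acceptance-rate issue concrete, here is a minimal sketch of the standard speculative rejection-sampling rule for a single token. It is an illustration under stated assumptions, not the paper's implementation: `p` stands for the main model's next-token distribution, `q` for the MTP head's draft distribution, and all function names are hypothetical. Under this rule the expected acceptance probability equals one minus the total variation distance between `p` and `q`, which is presumably what "TV" refers to in the loss name (an assumption; the summary does not expand the abbreviation).

```python
import random

def acceptance_rate(p, q):
    """Expected probability that a token drafted from q is accepted under p."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def speculative_step(p, q, rng):
    """Draft one token from q, then accept it or resample so the output follows p.

    Returns (token, accepted).
    """
    tokens = range(len(q))
    x = rng.choices(tokens, weights=q)[0]
    # Accept the drafted token with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the residual max(p - q, 0); random.choices
    # accepts unnormalized weights, so no explicit normalization is needed.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(tokens, weights=residual)[0], False

if __name__ == "__main__":
    p = [0.5, 0.5]   # toy target distribution (main model)
    q = [0.9, 0.1]   # toy draft distribution (MTP head)
    print(acceptance_rate(p, q), 1.0 - total_variation(p, q))
```

In this toy case the draft is far more peaked than the target, so a large share of drafted tokens is rejected. A mismatch of this kind between the two distributions is what lowers the acceptance rate, whatever its cause during training.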