SFT works better when you treat the training target as a flexible probability distribution rather than a hard one-hot label, which lets you trade off trust in the demonstrations against the model's prior knowledge.
This paper reframes supervised fine-tuning (SFT) as a problem of designing target probability distributions rather than fitting one-hot labels. Instead of forcing the model to match each observed token exactly, the authors propose Q-target, a framework that lets you control how much to trust the observed token and how to distribute the remaining probability across alternative tokens.
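To make the idea concrete, here is a minimal sketch of one way such a target could be built. It is not the paper's exact formulation: the `trust` parameter, the function names, and the choice to spread the leftover mass in proportion to the model's own probabilities are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_target(model_probs, observed, trust):
    """Hypothetical soft target (illustrative, not the paper's formula).

    Places `trust` mass on the observed token and spreads the remaining
    (1 - trust) over the other tokens in proportion to the model's own
    probabilities. trust=1.0 recovers the usual one-hot label.
    """
    if not 0.0 <= trust <= 1.0:
        raise ValueError("trust must be in [0, 1]")
    if len(model_probs) < 2:
        raise ValueError("need at least two tokens")
    # Total model probability assigned to the non-observed tokens.
    rest = sum(p for i, p in enumerate(model_probs) if i != observed)
    n_alt = len(model_probs) - 1
    target = []
    for i, p in enumerate(model_probs):
        if i == observed:
            target.append(trust)
        elif rest > 0:
            target.append((1.0 - trust) * p / rest)
        else:
            # Model puts no mass on alternatives: fall back to uniform.
            target.append((1.0 - trust) / n_alt)
    return target


def cross_entropy(target, model_probs):
    """Cross-entropy H(target, model) for one token position."""
    return -sum(q * math.log(p) for q, p in zip(target, model_probs) if q > 0)


probs = softmax([2.0, 1.0, 0.0])      # toy 3-token vocabulary
soft = build_target(probs, observed=1, trust=0.8)
hard = build_target(probs, observed=1, trust=1.0)
print("soft target:", soft, "loss:", cross_entropy(soft, probs))
print("hard target:", hard, "loss:", cross_entropy(hard, probs))
```

With `trust=1.0` the loss reduces to the standard negative log-likelihood of the observed token; lowering `trust` shifts some target mass toward tokens the model already considers plausible.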