A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design

Tong Xie, Yuanhao Ban, Yunqi Hong, Sohyun An, Yihang Chen et al. | June 9, 2026 | arXiv

Key Takeaway

SFT works better when you treat the training target as a flexible probability distribution rather than a hard one-hot label, letting you balance between trusting demonstrations and leveraging the model's prior knowledge.

Summary

This paper reframes supervised fine-tuning (SFT) as a problem of designing target probability distributions rather than fitting one-hot labels. Instead of forcing the model to match each observed token exactly, the authors propose Q-target, a framework that lets you control how much to trust the observed token and how to distribute the remaining probability mass across alternative tokens.
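To make the idea of a designed target concrete, here is a minimal sketch of one possible soft target for a single token position. It is not the paper's formulation: the function names, the `trust` parameter, and the choice to spread the leftover mass in proportion to the model's own probabilities are all assumptions made for illustration. Setting `trust=1.0` recovers the usual one-hot target.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target(model_probs, observed, trust):
    """Hypothetical target distribution (an assumption, not the paper's definition).

    Puts `trust` mass on the observed token and spreads the remaining
    1 - trust over the other tokens in proportion to the model's own
    probabilities, falling back to uniform if those probabilities are all zero.
    """
    if not 0.0 <= trust <= 1.0:
        raise ValueError("trust must lie in [0, 1]")
    vocab = len(model_probs)
    rest = 1.0 - model_probs[observed]  # model mass on the alternatives
    target = []
    for i, p in enumerate(model_probs):
        if i == observed:
            target.append(trust)
        elif rest > 0.0:
            target.append((1.0 - trust) * p / rest)
        else:
            target.append((1.0 - trust) / (vocab - 1))
    return target


def cross_entropy(target, model_probs):
    """Cross-entropy of the model's distribution against a target distribution."""
    return -sum(q * math.log(p) for q, p in zip(target, model_probs) if q > 0.0)


# Toy 4-token vocabulary; the demonstration contains token 1.
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)

hard = soft_target(probs, observed=1, trust=1.0)  # standard one-hot SFT target
soft = soft_target(probs, observed=1, trust=0.8)  # keeps 20% on alternatives

print("one-hot loss:", cross_entropy(hard, probs))
print("soft loss:   ", cross_entropy(soft, probs))
```

In this sketch, lowering `trust` leans more on the model's prior, while raising it leans more on the demonstration. In an actual training loop the target would typically be treated as a constant when computing gradients, which is a further assumption here.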

training reasoning

Key Terms

supervised-fine-tuning target-distribution soft-labels token-level-supervision