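A multi-stage fine-tuning schedule that applies different training objectives one after another: supervised fine-tuning (SFT), then offline Direct Preference Optimization (DPO), then online DPO. Each stage optimizes a single objective and starts from the checkpoint produced by the previous stage, so the objectives never compete within the same training run.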
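
One way to picture such a schedule is as an ordered list of stages, each tied to exactly one objective, with every stage receiving the model state left by the one before it. The sketch below is only an illustration under stated assumptions: `Stage`, `make_placeholder_trainer`, `run_schedule`, and the objective labels are hypothetical names, and the "trainers" are placeholders that record which objective was applied rather than performing any real optimization.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical model state: a dict that records which objectives were applied.
ModelState = Dict[str, List[str]]


@dataclass(frozen=True)
class Stage:
    """One stage of the schedule, bound to a single training objective."""
    name: str
    train: Callable[[ModelState], ModelState]


def make_placeholder_trainer(objective: str) -> Callable[[ModelState], ModelState]:
    """Return a stand-in trainer that only logs the objective it represents.

    A real trainer would update model weights; this one does not.
    """
    def train(model: ModelState) -> ModelState:
        return {**model, "objectives": model["objectives"] + [objective]}
    return train


# Assumed ordering, taken from the description: SFT, offline DPO, online DPO.
SCHEDULE = [
    Stage("sft", make_placeholder_trainer("supervised_cross_entropy")),
    Stage("offline_dpo", make_placeholder_trainer("dpo_on_fixed_preference_pairs")),
    Stage("online_dpo", make_placeholder_trainer("dpo_on_freshly_sampled_pairs")),
]


def run_schedule(model: ModelState, stages: List[Stage]) -> ModelState:
    """Run stages strictly in order; each starts from the previous stage's output."""
    for stage in stages:
        model = stage.train(model)
    return model


final_state = run_schedule({"objectives": []}, SCHEDULE)
print(final_state["objectives"])
```

Because only one objective is active per stage, the loss being minimized at any moment is unambiguous; the trade-off is that the order of the stages becomes a design decision in its own right.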