Adding an explicit distribution-alignment stage between supervised fine-tuning (SFT) and RL training significantly reduces model drift in multimodal models, with the gains attributed to disentangled feedback on perception versus reasoning failures.
PRISM targets a key failure mode in training multimodal models: when a model is fine-tuned on supervised examples and then trained with reinforcement learning, its output distribution drifts away from what it learned during fine-tuning.
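The text does not spell out how the alignment stage is implemented, but a common way to limit this kind of drift is to penalize the RL policy's divergence from its frozen SFT checkpoint. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch; the names (`kl_anchored_loss`, `kl_coef`) are assumptions for illustration, not PRISM's actual API.

```python
# Hypothetical sketch of KL-anchored RL training, NOT PRISM's actual method:
# the RL objective is augmented with a KL penalty toward the frozen SFT policy.
import torch
import torch.nn.functional as F

def kl_anchored_loss(policy_logits: torch.Tensor,
                     sft_logits: torch.Tensor,
                     rl_loss: torch.Tensor,
                     kl_coef: float = 0.1) -> torch.Tensor:
    """Combine an RL loss with a drift penalty toward the SFT checkpoint.

    policy_logits: logits from the policy being trained, shape (batch, vocab)
    sft_logits:    logits from the frozen SFT checkpoint, same shape
    rl_loss:       scalar RL loss (e.g., negative expected reward)
    kl_coef:       assumed hyperparameter controlling the drift penalty
    """
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    sft_logp = F.log_softmax(sft_logits, dim=-1)
    # KL(policy || sft): grows as the policy drifts from the distribution
    # it learned during supervised fine-tuning.
    kl = torch.sum(policy_logp.exp() * (policy_logp - sft_logp), dim=-1).mean()
    return rl_loss + kl_coef * kl
```

In a setup like this, `kl_coef` trades off reward maximization against staying close to the SFT distribution: larger values hold the policy closer to its fine-tuned behavior at the cost of slower reward improvement.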