Adding an explicit distribution-alignment stage between supervised fine-tuning (SFT) and RL training significantly reduces model drift in multimodal models, with the gains attributed to disentangled feedback on perception versus reasoning failures.
PRISM targets a key failure mode in training multimodal models: when a model is fine-tuned on supervised examples and then trained with reinforcement learning, its output distribution drifts away from what it learned during fine-tuning.
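The text does not spell out how the alignment stage is implemented, but a common way to limit this kind of drift is to penalize the RL policy's divergence from its frozen SFT checkpoint. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch; the names (`kl_anchored_loss`, `kl_coef`) are assumptions for illustration, not PRISM's actual API.

```python
# Hypothetical sketch of KL-anchored RL training, NOT PRISM's actual method:
# the RL objective is augmented with a KL penalty toward the frozen SFT policy.
import torch
import torch.nn.functional as F

def kl_anchored_loss(policy_logits: torch.Tensor,
                     sft_logits: torch.Tensor,
                     rl_loss: torch.Tensor,
                     kl_coef: float = 0.1) -> torch.Tensor:
    """Combine an RL loss with a drift penalty toward the SFT checkpoint.

    policy_logits: logits from the policy being trained, shape (batch, vocab)
    sft_logits:    logits from the frozen SFT checkpoint, same shape
    rl_loss:       scalar RL loss (e.g., negative expected reward)
    kl_coef:       assumed hyperparameter controlling the drift penalty
    """
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    sft_logp = F.log_softmax(sft_logits, dim=-1)
    # KL(policy || sft): grows as the policy drifts from the distribution
    # it learned during supervised fine-tuning.
    kl = torch.sum(policy_logp.exp() * (policy_logp - sft_logp), dim=-1).mean()
    return rl_loss + kl_coef * kl
```

In a setup like this, `kl_coef` trades off reward maximization against staying close to the SFT distribution: larger values hold the policy closer to its fine-tuned behavior at the cost of slower reward improvement.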