When training reasoning models with sparse rewards, you can escape cold-start failure by interpolating between RL and supervised learning via the Tsallis loss family; intermediate values of q trade learning speed against training stability.
This paper addresses a key problem in training reasoning models: when a model rarely succeeds at first, standard reinforcement learning stalls, since almost no rollouts earn a reward and the gradient signal vanishes. The authors introduce a family of loss functions based on Tsallis statistics, parameterized by q, that smoothly blends between two extremes, pure RL and pure supervised learning, letting practitioners choose how aggressively to commit to learning from early successes.
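To make the interpolation concrete, below is a minimal PyTorch sketch under one assumption: that the loss family amounts to the negative Tsallis q-logarithm of the model's success probability, -ln_q(p), where ln_q(x) = (x^(1-q) - 1)/(1 - q) recovers ln(x) as q approaches 1. Under that construction, q = 0 yields the expected-reward (RL) objective and q = 1 yields log-likelihood (supervised learning), matching the blend described above. The function names `tsallis_q_log` and `tsallis_success_loss`, and the exact placement of q, are illustrative choices, not taken from the paper.

```python
import torch

def tsallis_q_log(x: torch.Tensor, q: float) -> torch.Tensor:
    """Tsallis q-logarithm: ln_q(x) = (x^(1-q) - 1) / (1 - q); recovers ln(x) as q -> 1."""
    if abs(q - 1.0) < 1e-6:
        return torch.log(x)
    return (x.pow(1.0 - q) - 1.0) / (1.0 - q)

def tsallis_success_loss(success_prob: torch.Tensor, q: float) -> torch.Tensor:
    """Hypothetical loss -ln_q(p) on the model's per-problem success probability.

    q = 0: loss is 1 - p, i.e. one minus expected reward -- the pure-RL
           objective, whose gradient magnitude stays at 1 however small p is.
    q = 1: loss is -log(p), i.e. maximum likelihood on successes -- pure
           supervised learning, whose gradient blows up as 1/p when p is tiny.
    Intermediate q scales the gradient by p^(-q), trading cold-start escape
    speed against stability.
    """
    return -tsallis_q_log(success_prob.clamp_min(1e-8), q).mean()

# Gradient magnitude at a rare-success probability p = 0.01:
p = torch.tensor([0.01], requires_grad=True)
for q in (0.0, 0.5, 1.0):
    (g,) = torch.autograd.grad(tsallis_success_loss(p, q), p)
    print(f"q={q}: |dL/dp| = {g.abs().item():.0f}")
# q=0.0 -> 1   (RL: weak signal at cold start)
# q=0.5 -> 10  (interpolated)
# q=1.0 -> 100 (supervised: strong but high-variance signal)
```

The printed gradients show the trade-off in miniature: at q = 0 a model that almost never succeeds receives almost no usable signal, at q = 1 rare successes are amplified a hundredfold (fast escape, noisy updates), and intermediate q sits between the two.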