When training reasoning models with sparse rewards, you can escape cold-start failure by interpolating between RL and supervised learning via the Tsallis loss family; intermediate values of q trade learning speed against training stability.
This paper addresses a key problem in training reasoning models: when a model rarely succeeds at first, standard reinforcement learning stalls, since almost no rollouts earn a reward and the gradient signal vanishes. The authors introduce a family of loss functions based on Tsallis statistics, parameterized by q, that smoothly blends between two extremes, pure RL and pure supervised learning, letting practitioners choose how aggressively to commit to learning from early successes.
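To make the interpolation concrete, below is a minimal PyTorch sketch under one assumption: that the loss family amounts to the negative Tsallis q-logarithm of the model's success probability, -ln_q(p), where ln_q(x) = (x^(1-q) - 1)/(1 - q) recovers ln(x) as q approaches 1. Under that construction, q = 0 yields the expected-reward (RL) objective and q = 1 yields log-likelihood (supervised learning), matching the blend described above. The function names `tsallis_q_log` and `tsallis_success_loss`, and the exact placement of q, are illustrative choices, not taken from the paper.

```python
import torch

def tsallis_q_log(x: torch.Tensor, q: float) -> torch.Tensor:
    """Tsallis q-logarithm: ln_q(x) = (x^(1-q) - 1) / (1 - q); recovers ln(x) as q -> 1."""
    if abs(q - 1.0) < 1e-6:
        return torch.log(x)
    return (x.pow(1.0 - q) - 1.0) / (1.0 - q)

def tsallis_success_loss(success_prob: torch.Tensor, q: float) -> torch.Tensor:
    """Hypothetical loss -ln_q(p) on the model's per-problem success probability.

    q = 0: loss is 1 - p, i.e. one minus expected reward -- the pure-RL
           objective, whose gradient magnitude stays at 1 however small p is.
    q = 1: loss is -log(p), i.e. maximum likelihood on successes -- pure
           supervised learning, whose gradient blows up as 1/p when p is tiny.
    Intermediate q scales the gradient by p^(-q), trading cold-start escape
    speed against stability.
    """
    return -tsallis_q_log(success_prob.clamp_min(1e-8), q).mean()

# Gradient magnitude at a rare-success probability p = 0.01:
p = torch.tensor([0.01], requires_grad=True)
for q in (0.0, 0.5, 1.0):
    (g,) = torch.autograd.grad(tsallis_success_loss(p, q), p)
    print(f"q={q}: |dL/dp| = {g.abs().item():.0f}")
# q=0.0 -> 1   (RL: weak signal at cold start)
# q=0.5 -> 10  (interpolated)
# q=1.0 -> 100 (supervised: strong but high-variance signal)
```

The printed gradients show the trade-off in miniature: at q = 0 a model that almost never succeeds receives almost no usable signal, at q = 1 rare successes are amplified a hundredfold (fast escape, noisy updates), and intermediate q sits between the two.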