You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

Zhepei Wei, Xinyu Zhu, Wei-Lin Chen, Chengsong Huang, Jiaxin Huang et al. | May 20, 2026 | arXiv

Key Takeaway

RLVR training produces predictable, low-rank weight changes that can be extrapolated mathematically, letting you skip 85% of training compute while matching or exceeding the performance of full RLVR training on reasoning tasks.

Summary

This paper reveals that language models trained with reinforcement learning from verifiable rewards (RLVR) follow surprisingly simple, low-rank weight trajectories.
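To make the idea of a low-rank weight trajectory concrete, here is a minimal sketch on synthetic data. It is not the paper's algorithm: the function name `rank1_extrapolate`, the toy checkpoints, and in particular the assumption that the rank-1 coefficient grows linearly with training step are all illustrative choices made here. The sketch stacks the weight deltas from a few early checkpoints, keeps only the top singular direction, fits the coefficient along that direction against the step count, and projects it forward to a later step.

```python
import numpy as np


def rank1_extrapolate(checkpoints, steps, target_step):
    """Extrapolate flattened weights to `target_step` along a rank-1 direction.

    checkpoints: list of 1-D arrays; checkpoints[0] is the base model.
    steps: training step of each checkpoint (same length as checkpoints).
    Assumption (illustrative only): the rank-1 coefficient is linear in step.
    """
    base = checkpoints[0]
    # Rows are weight deltas relative to the base model, shape (T, D).
    deltas = np.stack([w - base for w in checkpoints[1:]])
    # The top singular vector gives the best rank-1 fit to the deltas.
    U, S, Vt = np.linalg.svd(deltas, full_matrices=False)
    direction = Vt[0]            # shared update direction, shape (D,)
    coeffs = U[:, 0] * S[0]      # per-checkpoint coefficient, shape (T,)
    # Fit coefficient vs. step with a straight line (hypothetical choice).
    slope, intercept = np.polyfit(steps[1:], coeffs, deg=1)
    predicted_coeff = slope * target_step + intercept
    return base + predicted_coeff * direction


# Synthetic demo: weights drift along one fixed direction plus small noise.
rng = np.random.default_rng(0)
dim = 64
base_weights = rng.normal(size=dim)
true_direction = rng.normal(size=dim)
true_direction /= np.linalg.norm(true_direction)

steps = np.array([0, 100, 200, 300])
checkpoints = [
    base_weights + 0.01 * s * true_direction + 1e-4 * rng.normal(size=dim)
    for s in steps
]
checkpoints[0] = base_weights  # the base model itself carries no noise

target_step = 2000
extrapolated = rank1_extrapolate(checkpoints, steps, target_step)
fully_trained = base_weights + 0.01 * target_step * true_direction

# Relative error of the extrapolated update against the synthetic "full" update.
rel_err = np.linalg.norm(extrapolated - fully_trained) / np.linalg.norm(
    fully_trained - base_weights
)
print(f"relative error of extrapolated update: {rel_err:.2e}")
```

In this toy setting only the first 300 of 2000 steps are "trained", which mirrors the 85% compute saving described above. On a real model the deltas would be taken per weight matrix, and whether a linear coefficient fit is adequate is exactly the kind of question the paper itself would have to answer.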

training efficiency reasoning

Key Terms

reinforcement-learning-from-verifiable-rewards rank-1-approximation parameter-trajectory extrapolation