Hypersphere-constrained optimization enables predictable scaling of language models with a single transferable learning rate, eliminating expensive hyperparameter retuning when scaling up and improving training stability.
This paper introduces HyperP, a framework for scaling language models more efficiently by constraining weights to a hypersphere during training. The key result is that a single learning rate tuned at small scale transfers reliably across model sizes, depths, and training durations, yielding 1.58× better compute efficiency while maintaining training stability.
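To make the core mechanism concrete, below is a minimal sketch of hypersphere-constrained training: an ordinary optimizer step followed by projecting weights back onto the unit hypersphere. The per-row normalization granularity, the function names (`project_to_hypersphere`, `train_step`), and the projection-after-update ordering are illustrative assumptions, not the paper's exact method.

```python
import torch
import torch.nn.functional as F


def project_to_hypersphere(model: torch.nn.Module) -> None:
    """Renormalize each 2-D weight matrix so its rows lie on the unit hypersphere.

    Per-row granularity is an assumption for illustration; the paper may
    normalize at a different granularity (e.g. whole matrices or columns).
    """
    with torch.no_grad():
        for param in model.parameters():
            if param.dim() == 2:
                param.copy_(F.normalize(param, dim=1))


def train_step(model, optimizer, loss_fn, batch):
    # Hypothetical training step: a standard update followed by projection,
    # so weights stay on the hypersphere. The paper's claim is that this
    # constraint lets one learning rate transfer across model scales.
    optimizer.zero_grad()
    loss = loss_fn(model(batch["input"]), batch["target"])
    loss.backward()
    optimizer.step()
    project_to_hypersphere(model)  # constrain weights after the update
    return loss.item()
```

In this sketch the projection acts as a hard norm constraint applied after every step; whether HyperP enforces the constraint this way or folds it into the optimizer itself is not specified here.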