Hypersphere-constrained optimization enables predictable scaling of language models with a single transferable learning rate, eliminating expensive hyperparameter retuning when scaling up and improving training stability.
This paper introduces HyperP, a framework for scaling language models more efficiently by constraining weights to a hypersphere during training. The key result is that a single learning rate tuned at small scale transfers reliably across model sizes, depths, and training durations, yielding 1.58× better compute efficiency while maintaining training stability.
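To make the core mechanism concrete, below is a minimal sketch of hypersphere-constrained training: an ordinary optimizer step followed by projecting weights back onto the unit hypersphere. The per-row normalization granularity, the function names (`project_to_hypersphere`, `train_step`), and the projection-after-update ordering are illustrative assumptions, not the paper's exact method.

```python
import torch
import torch.nn.functional as F


def project_to_hypersphere(model: torch.nn.Module) -> None:
    """Renormalize each 2-D weight matrix so its rows lie on the unit hypersphere.

    Per-row granularity is an assumption for illustration; the paper may
    normalize at a different granularity (e.g. whole matrices or columns).
    """
    with torch.no_grad():
        for param in model.parameters():
            if param.dim() == 2:
                param.copy_(F.normalize(param, dim=1))


def train_step(model, optimizer, loss_fn, batch):
    # Hypothetical training step: a standard update followed by projection,
    # so weights stay on the hypersphere. The paper's claim is that this
    # constraint lets one learning rate transfer across model scales.
    optimizer.zero_grad()
    loss = loss_fn(model(batch["input"]), batch["target"])
    loss.backward()
    optimizer.step()
    project_to_hypersphere(model)  # constrain weights after the update
    return loss.item()
```

In this sketch the projection acts as a hard norm constraint applied after every step; whether HyperP enforces the constraint this way or folds it into the optimizer itself is not specified here.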