Muown Implicitly Performs Angular Step-size Decay

Florian Hübler, Kai Lion, Antonio Orvieto, Niao He | June 22, 2026 | arXiv

Key Takeaway

Muown's effectiveness comes from an implicit decay of its angular step size. AngularMuown makes that decay explicit, yielding a faster, more controllable optimizer for Transformer pre-training in which the learning-rate schedules for directions and magnitudes are decoupled.

Summary

This paper analyzes how Muown, a matrix-aware optimizer for training Transformers, implicitly controls step sizes through angular (directional) updates. The authors turn this insight into AngularMuown, which explicitly separates angular step-size scheduling from magnitude updates, improving training speed and stability across model sizes from nanoGPT to 1.1B-parameter models.
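To make the idea of separating angular steps from magnitude updates concrete, here is a minimal NumPy sketch. It is not the AngularMuown algorithm from the paper, whose update rule this summary does not give. It only illustrates the general decoupling under stated assumptions: the weight matrix is split into a Frobenius norm and a unit-norm direction, the direction is rotated by an exactly prescribed angle `theta`, and the norm is updated with its own step size `eta_r`. The names `angular_step`, `cosine_schedule`, `theta`, and `eta_r` are hypothetical, and the update direction `D` stands in for whatever matrix-aware update the optimizer would actually produce.

```python
import numpy as np


def angular_step(W, D, theta, eta_r):
    """Illustrative decoupled update (hypothetical, not the paper's rule).

    W     : current weight matrix
    D     : update direction to descend along (e.g. a gradient-like matrix)
    theta : angular step size in radians, applied to the direction of W
    eta_r : separate step size for the magnitude (Frobenius norm) of W
    """
    r = np.linalg.norm(W)          # magnitude: Frobenius norm of W
    U = W / r                      # direction: unit-norm matrix

    # Split D into a radial part (along U) and a tangential part (orthogonal to U).
    radial = np.sum(D * U)
    T = D - radial * U
    t_norm = np.linalg.norm(T)

    if t_norm > 0:
        T_hat = T / t_norm
        # Geodesic step on the unit sphere: the angle between the old and
        # new direction is exactly theta, regardless of the size of D.
        U = np.cos(theta) * U - np.sin(theta) * T_hat

    # The magnitude follows its own step size, independent of theta.
    r_new = r - eta_r * radial
    return r_new * U


def cosine_schedule(step, total_steps, peak):
    """Cosine decay from `peak` at step 0 to 0 at `total_steps`."""
    return 0.5 * peak * (1.0 + np.cos(np.pi * step / total_steps))
```

In a training loop, `theta` could follow one schedule (for example `cosine_schedule(step, total_steps, peak=0.1)`) while `eta_r` follows another, which is the kind of decoupled scheduling the summary describes. The choice of Frobenius geometry and of a cosine schedule here is purely illustrative.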

training efficiency

Key Terms

optimizer riemannian-geometry matrix-aware-optimization step-size-decay angular-step-size