Muown's effectiveness comes from an implicit decay of its angular step size. AngularMuown makes that decay explicit and schedules the learning rates for weight directions and weight magnitudes separately, giving a faster, more controllable optimizer for Transformer pre-training.
This paper analyzes how Muown, a matrix-aware optimizer for training Transformers, implicitly controls step sizes through angular (directional) updates. The authors turn this insight into AngularMuown, which explicitly separates angular step-size scheduling from magnitude updates, improving training speed and stability across model sizes from nanoGPT to 1.1B-parameter models.
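
To make the idea of separating angular and magnitude updates concrete, here is a minimal NumPy sketch. It is not the paper's AngularMuown algorithm: the function names `angular_step` and `cosine_angle_schedule`, the cosine decay, and the magnitude rule are all assumptions chosen for illustration. It shows only the general pattern of rotating a weight matrix's direction by a scheduled angle while updating its Frobenius norm with an independent learning rate.

```python
import math
import numpy as np


def cosine_angle_schedule(step, total_steps, max_angle):
    """Hypothetical schedule: decay the rotation angle from max_angle to 0."""
    progress = min(step / total_steps, 1.0)
    return max_angle * 0.5 * (1.0 + math.cos(math.pi * progress))


def angular_step(weight, update, angle, magnitude_lr):
    """Rotate the weight's direction by `angle` radians; update its norm separately.

    `update` is the direction to move in (for example a negated, preconditioned
    gradient). Assumes `weight` has a nonzero Frobenius norm.
    """
    norm = np.linalg.norm(weight)              # Frobenius norm of the matrix
    direction = weight / norm                  # unit-norm "direction" of the weight

    # Split the update into a radial part (along the weight) and a tangent part.
    radial = np.sum(update * direction)
    tangent = update - radial * direction
    tangent_norm = np.linalg.norm(tangent)

    if tangent_norm > 1e-12:
        tangent = tangent / tangent_norm
        # Rotating within the plane spanned by direction and tangent changes
        # the direction by exactly `angle`, regardless of the update's scale.
        new_direction = math.cos(angle) * direction + math.sin(angle) * tangent
    else:
        new_direction = direction              # update is purely radial

    # Illustrative magnitude rule (an assumption): a plain step along the radial part.
    new_norm = norm + magnitude_lr * radial
    return new_norm * new_direction


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3))
u = rng.standard_normal((4, 3))
angle = cosine_angle_schedule(step=10, total_steps=100, max_angle=0.05)
w = angular_step(w, u, angle=angle, magnitude_lr=1e-3)
```

In this sketch the angular step size is set entirely by the schedule, so how fast the direction changes can be tuned without touching how the norm evolves, and vice versa.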