HiMuon makes Muon optimization 10-100x faster by processing weight-matrix tiles independently rather than as full matrices, enabling practical use of this advanced optimizer on large models without sacrificing training quality.
This paper introduces Hierarchical Muon (HiMuon), a faster variant of the Muon optimizer for training neural networks. Instead of processing each weight matrix as a whole, HiMuon splits it into tiles and updates each tile independently, reducing the computational cost from O(r²sK) to O(HWTK) while maintaining similar training performance.
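The tile-wise idea can be illustrated with a minimal NumPy sketch. It is not the paper's algorithm: the function names `newton_schulz` and `tiled_orthogonalize`, the tile-size parameters `tile_h` and `tile_w`, and the requirement that the tile size evenly divide the matrix are all assumptions made for illustration. The sketch also uses the plain cubic Newton-Schulz iteration as a stand-in for whatever orthogonalization step HiMuon applies per tile, since this summary does not specify it.

```python
import numpy as np


def newton_schulz(g, steps=10, eps=1e-7):
    """Approximately orthogonalize g with the cubic Newton-Schulz iteration.

    Illustrative stand-in only; the summary does not say which
    iteration HiMuon runs on each tile.
    """
    # Dividing by the Frobenius norm keeps all singular values <= 1,
    # which is inside the convergence region of the iteration.
    x = g / (np.linalg.norm(g) + eps)
    for _ in range(steps):
        # Each step pushes every singular value toward 1.
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x


def tiled_orthogonalize(grad, tile_h, tile_w, steps=10):
    """Apply newton_schulz to each (tile_h x tile_w) tile independently.

    Hypothetical helper: assumes the tile size divides the matrix shape.
    """
    grad = np.asarray(grad, dtype=float)
    rows, cols = grad.shape
    if rows % tile_h or cols % tile_w:
        raise ValueError("tile size must evenly divide the matrix shape")

    out = np.empty(grad.shape, dtype=float)
    for i in range(0, rows, tile_h):
        for j in range(0, cols, tile_w):
            # No tile reads from or writes to any other tile, so the
            # iterations of this loop could run in parallel.
            tile = grad[i:i + tile_h, j:j + tile_w]
            out[i:i + tile_h, j:j + tile_w] = newton_schulz(tile, steps)
    return out
```

Calling `tiled_orthogonalize(grad, 2, 3)` on a 4×6 matrix processes four 2×3 tiles. Because every matrix product inside the loop involves only tile-sized operands, the per-tile cost depends on the tile dimensions rather than on the full matrix dimensions, which is the intuition behind the complexity reduction stated above.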