The Muon optimizer can be understood as Hamiltonian dynamics on probability measures, a view that yields theoretical convergence guarantees and allows large-scale neural network training to be analyzed through mean-field theory.
This paper analyzes the Muon optimizer through the lens of Hamiltonian dynamics and probability flows. The authors show that Muon's orthogonalization step is a mirror descent update, then extend this insight to neural network training by deriving a mean-field equation that describes how the probability distribution over parameters evolves.
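To make the orthogonalization step concrete, here is a minimal NumPy sketch of a Muon-style update on a single weight matrix. It is an illustration under stated assumptions, not the paper's formulation: the function names `orthogonalize` and `muon_like_step`, the learning rate, and the momentum coefficient are all hypothetical, and an exact SVD stands in for the iterative Newton-Schulz approximation that practical Muon implementations typically use.

```python
import numpy as np


def orthogonalize(m: np.ndarray) -> np.ndarray:
    """Return the orthogonal polar factor of m, computed exactly via SVD.

    If m = U @ diag(s) @ Vt, the result is U @ Vt: same singular vectors,
    with every singular value replaced by 1.
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def muon_like_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One hypothetical Muon-style step (lr and beta are illustrative values).

    Accumulate momentum, orthogonalize it, then move the weights along the
    orthogonalized direction.
    """
    momentum = beta * momentum + grad
    update = orthogonalize(momentum)
    return weight - lr * update, momentum


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3))
g = rng.standard_normal((4, 3))

w_new, buf = muon_like_step(w, g, np.zeros_like(w))

# Recover the applied direction; its columns should be orthonormal.
step = (w - w_new) / 0.02
print(np.round(step.T @ step, 6))
```

Because the update has all singular values equal to one, its size is set by the learning rate alone rather than by the scale of the gradient, which is the property the orthogonalization step is meant to provide.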