You can improve LLM training stability and speed by controlling weight matrix conditioning during training, then discard the mechanism at inference with no added inference cost.
This paper introduces a PC layer that transforms weight matrices during training using polynomial preconditioning to keep them well-conditioned; the layer is removed after training, so it adds no inference cost. Experiments on Llama-1B show faster convergence with both the AdamW and Muon optimizers, and the accompanying theory proves that this approach ensures stable gradient descent in deep networks.
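
To make the idea concrete, here is a minimal NumPy sketch of a polynomially preconditioned linear layer. It is an illustration under stated assumptions, not the paper's implementation: the summary above does not specify the polynomial or the normalization, so this sketch substitutes the cubic Newton-Schulz polynomial p(s) = 1.5s - 0.5s^3 and spectral-norm scaling, and the names `poly_precondition`, `PCLinear`, and `fold` are hypothetical. Applying an odd matrix polynomial to W changes only its singular values, and this particular one pulls values in (0, 1] toward 1, which lowers the condition number.

```python
import numpy as np


def poly_precondition(W, coeffs=(1.5, -0.5)):
    """Apply the odd matrix polynomial sum_k coeffs[k] * W (W^T W)^k.

    With W = U diag(s) V^T this equals U diag(p(s)) V^T, where
    p(s) = coeffs[0]*s + coeffs[1]*s^3 + ...
    The default coefficients are the cubic Newton-Schulz polynomial,
    an assumption standing in for the paper's (unspecified) choice.
    """
    gram = W.T @ W
    term = W.copy()
    out = np.zeros_like(W)
    for c in coeffs:
        out += c * term
        term = term @ gram  # raises the power of (W^T W) by one
    return out


class PCLinear:
    """Hypothetical linear layer whose raw weight is preconditioned on the fly."""

    def __init__(self, W):
        self.W = W  # raw trainable parameter

    def effective_weight(self):
        # Assumption: scale so the largest singular value is 1, which keeps
        # all singular values inside the range where p is increasing.
        W_scaled = self.W / np.linalg.norm(self.W, 2)
        return poly_precondition(W_scaled)

    def forward(self, x):
        # Training-time forward pass uses the preconditioned weight.
        return x @ self.effective_weight().T

    def fold(self):
        # After training, materialize the weight once; inference then uses
        # an ordinary dense matrix and the PC mechanism disappears.
        return self.effective_weight().copy()


rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
layer = PCLinear(W)
W_folded = layer.fold()

x = rng.standard_normal((3, 4))
same_output = np.allclose(layer.forward(x), x @ W_folded.T)
cond_before = np.linalg.cond(W)
cond_after = np.linalg.cond(W_folded)
```

In this sketch `cond_after` is smaller than `cond_before`, and `same_output` confirms that the folded matrix reproduces the training-time forward pass, which is why the mechanism can be dropped at inference. In a real training loop the gradient would flow through `effective_weight` to the raw parameter `W`; that part is omitted here.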