One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining

Philip Zmushko, Egor Petrov, Nursultan Abdullaev, Mikhail Khrushchev, Samuel Horváth | June 29, 2026 | arXiv

Key Takeaway

Asynchronous pipeline parallelism with one-step gradient delay is practical for large LLM training if you use the right optimizer; the performance gap with synchronous training can be closed with modern optimizers and error feedback corrections.

Summary

This paper shows that stale gradients do not fundamentally limit asynchronous pipeline parallelism for LLM training: how much the delay hurts depends on the optimizer. Modern optimizers such as Muon handle a one-step gradient delay well, while older ones such as AdamW struggle.
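To make "one-step gradient delay" concrete, here is a minimal toy sketch. It is not the paper's setup: it uses plain SGD on the scalar quadratic f(w) = 0.5 * w^2 rather than Muon, AdamW, or error feedback, and the function name `delayed_sgd` and all parameter values are assumptions chosen for illustration. With `delay=1`, each update applies a gradient computed at the weights from one step earlier, which is the staleness an asynchronous pipeline introduces.

```python
def delayed_sgd(grad_fn, w0, lr, steps, delay=1):
    """Toy SGD where the gradient applied at step t is evaluated at the
    weights from step t - delay (delay=0 is ordinary synchronous SGD)."""
    history = [w0]  # history[t] holds the weights after t updates
    for t in range(steps):
        stale_w = history[max(0, t - delay)]  # weights the gradient was computed on
        history.append(history[-1] - lr * grad_fn(stale_w))
    return history


def grad(w):
    # Gradient of the toy objective f(w) = 0.5 * w**2 (an illustrative choice).
    return w


# Same learning rate, with and without a one-step delay.
sync_run = delayed_sgd(grad, w0=1.0, lr=1.2, steps=50, delay=0)
async_run = delayed_sgd(grad, w0=1.0, lr=1.2, steps=50, delay=1)

# A smaller learning rate restores convergence under the same delay.
async_small_lr = delayed_sgd(grad, w0=1.0, lr=0.5, steps=50, delay=1)

print("sync, lr=1.2:", abs(sync_run[-1]))
print("delay=1, lr=1.2 (largest of last 5):", max(abs(w) for w in async_run[-5:]))
print("delay=1, lr=0.5:", abs(async_small_lr[-1]))
```

In this toy problem the synchronous run converges at a learning rate of 1.2, while the delayed run oscillates with growing amplitude and only converges once the learning rate is reduced. This shows how staleness can interact with the update rule in the simplest case. It says nothing about how Muon or AdamW behave at scale, which is what the paper studies.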

training efficiency scaling

Key Terms

pipeline-parallelism gradient-staleness error-feedback muon-optimizer