Asynchronous pipeline parallelism with a one-step gradient delay is practical for large-scale LLM training if you use the right optimizer; the performance gap with synchronous training can be closed with modern optimizers and error-feedback corrections.
This paper shows that asynchronous pipeline parallelism for LLM training isn't fundamentally limited by stale gradients; how much the staleness hurts depends on which optimizer you use. Modern optimizers like Muon handle one-step gradient delays well, while older ones like AdamW struggle.
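
To make "one-step gradient delay" concrete, here is a minimal sketch, not the paper's setup: plain gradient descent on a toy quadratic loss, where the update at step t uses the gradient evaluated at the weights from one step earlier. The names `grad` and `run_gd`, the loss, and the learning rates are illustrative assumptions; the sketch does not model Muon, AdamW, pipeline stages, or error feedback.

```python
def grad(w, curvature=1.0):
    # Gradient of the toy loss f(w) = 0.5 * curvature * w**2 (an assumption).
    return curvature * w


def run_gd(steps, lr, delay, w0=1.0):
    """Gradient descent where step t applies the gradient computed at the
    weights from `delay` steps earlier. delay=0 is the synchronous case."""
    history = [w0]  # history[t] holds the weights before step t
    for t in range(steps):
        stale_index = max(0, t - delay)  # no older weights exist at the start
        g = grad(history[stale_index])
        history.append(history[-1] - lr * g)
    return history


# Small step size: both the synchronous and the delayed run shrink toward 0.
sync_small = run_gd(steps=50, lr=0.5, delay=0)
stale_small = run_gd(steps=50, lr=0.5, delay=1)

# Larger step size: the synchronous run still shrinks toward 0, while the
# delayed run oscillates with growing amplitude on this toy problem.
sync_large = run_gd(steps=50, lr=1.5, delay=0)
stale_large = run_gd(steps=50, lr=1.5, delay=1)
```

On this toy quadratic, a one-step delay narrows the range of step sizes for which plain gradient descent is stable. That is only an illustration of why staleness can matter for a simple update rule; it says nothing about how a specific optimizer behaves under delay.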