Open Problem: Is AdamW Effective Under Heavy-Tailed Noise?

Dingzhi Yu, Hongyi Tao, Yuanyu Wan, Luo Luo, Lijun Zhang | June 22, 2026 | arXiv

Key Takeaway

AdamW's convergence under heavy-tailed noise remains unproven. This gap matters because real LLM training exhibits heavy-tailed gradients, while current theory assumes finite-variance noise.

Summary

This paper investigates whether AdamW, the standard optimizer for training large language models, provably converges when gradient noise is heavy-tailed, a realistic scenario in LLM training. The authors prove some positive results, identify potential obstacles, and frame the general question as an open theoretical problem.
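
To make the setting concrete, here is a minimal sketch of one AdamW update run on a toy problem with heavy-tailed gradient noise. It is not taken from the paper. The quadratic objective, the Student-t noise with 1.5 degrees of freedom (finite mean, infinite variance), the hyperparameter defaults, and the names `adamw_step` and `run_toy_problem` are all assumptions chosen for illustration. The update follows the usual textbook form of AdamW with decoupled weight decay, where `v` is the second-moment accumulator.

```python
import numpy as np


def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update (textbook form; hyperparameter defaults are assumptions)."""
    # First-moment (momentum) and second-moment accumulators.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias correction for the zero-initialized accumulators (t starts at 1).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Decoupled weight decay: applied to the parameters, not added to the gradient.
    theta = theta - lr * weight_decay * theta
    # Adaptive step scaled by the square root of the second-moment estimate.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


def run_toy_problem(steps=2000, df=1.5, seed=0):
    """Minimize 0.5 * ||theta||^2 with Student-t gradient noise (illustrative only)."""
    rng = np.random.default_rng(seed)
    theta = np.full(10, 5.0)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        # Student-t noise with 1 < df <= 2 has a finite mean but infinite variance.
        noise = rng.standard_t(df, size=theta.shape)
        grad = theta + noise  # true gradient of the quadratic plus noise
        theta, m, v = adamw_step(theta, grad, m, v, t)
    return theta
```

Running `run_toy_problem()` with different values of `df` is one way to observe how occasional very large noise samples enter `v` through the squared gradient. A simulation like this only illustrates the setting and says nothing about whether convergence can be proven.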

training

Key Terms

adamw heavy-tailed-noise convergence second-moment-accumulator