AdamW's convergence under heavy-tailed noise remains unproven. The gap matters because gradients in real LLM training are heavy-tailed, whereas existing theory assumes finite variance.
This paper investigates whether AdamW, the standard optimizer for training large language models, can be proven to converge when gradient noise is heavy-tailed, a realistic scenario in LLM training. The authors prove some positive results and identify potential obstacles, and they frame the general question as an open theoretical problem.
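To make the setting concrete, here is a minimal sketch of AdamW run on noisy gradients whose noise has infinite variance. It is an illustration only, not the paper's construction or experiments: the function names (`student_t`, `adamw_step`, `run`), the one-dimensional quadratic objective, the Student-t noise with 1.5 degrees of freedom, and all hyperparameter values are assumptions chosen for simplicity.

```python
import math
import random


def student_t(rng, df):
    """Draw one Student-t sample; its variance is infinite when df <= 2."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return z / math.sqrt(chi2 / df)


def adamw_step(x, g, m, v, t, lr=1e-2, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One scalar AdamW update with decoupled weight decay; t is 1-indexed."""
    m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)           # bias corrections
    v_hat = v / (1.0 - beta2 ** t)
    x = x - lr * weight_decay * x            # decay applied to the weight directly
    x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v


def run(noise_scale, df=1.5, steps=2000, seed=0):
    """Minimize f(x) = x^2 / 2 from x = 1 using gradients g = x + noise.

    The objective and the noise model are hypothetical stand-ins for
    an LLM training loss and its stochastic gradients.
    """
    rng = random.Random(seed)
    x, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = x + noise_scale * student_t(rng, df)
        x, m, v = adamw_step(x, g, m, v, t)
    return x
```

Calling `run(0.0)` gives the noise-free baseline, and `run(0.1, seed=s)` for several seeds `s` shows how the final iterate varies under heavy-tailed noise. Such a simulation can suggest behavior on one toy problem, but it says nothing about whether a general convergence guarantee exists, which is the question the paper addresses.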