Outlier tokens in Diffusion Transformers are not just extreme values; they represent corrupted local information. Controlling them with register tokens significantly improves image generation quality.
This paper identifies and addresses a problem in Diffusion Transformers: certain tokens develop unusually high values that degrade image quality. The authors show that this happens in both the image encoder and the generation model itself, and they propose Dual-Stage Registers, a technique that uses learnable tokens to stabilize these problematic values and improve image generation.
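To make the register idea concrete, the sketch below shows the general pattern for register tokens: extra tokens are appended to the patch sequence before a layer runs and are discarded afterwards. It illustrates that pattern only and is not the authors' implementation. The function names, the register count, the initialization scale, and the `mean_mixing_block` stand-in for an attention layer are all assumptions made for illustration. In a real model the registers would be trained parameters.

```python
import random


def make_registers(num_registers, dim, seed=0):
    """Create register tokens (hypothetical init; trained parameters in a real model)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_registers)]


def with_registers(patch_tokens, registers, block):
    """Append registers, apply the block, then drop the register outputs."""
    sequence = patch_tokens + registers   # new list; the inputs are not mutated
    mixed = block(sequence)               # registers take part in token mixing
    return mixed[:len(patch_tokens)]      # keep only the patch positions


def mean_mixing_block(sequence):
    """Toy stand-in for attention: move each token halfway toward the sequence mean."""
    dim = len(sequence[0])
    mean = [sum(tok[d] for tok in sequence) / len(sequence) for d in range(dim)]
    return [[0.5 * tok[d] + 0.5 * mean[d] for d in range(dim)] for tok in sequence]


# Assumption: "dual-stage" is modeled here as one register set per stage,
# one for the image encoder and one for the generation model.
encoder_registers = make_registers(num_registers=2, dim=2, seed=0)
generator_registers = make_registers(num_registers=2, dim=2, seed=1)

patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoded = with_registers(patches, encoder_registers, mean_mixing_block)
generated = with_registers(encoded, generator_registers, mean_mixing_block)
```

The output of each stage has the same number of tokens as the input, so the registers act as scratch space that the rest of the pipeline never sees.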