Selectively looping transformer layers in masked diffusion models improves both training efficiency and reasoning capability: you can match baseline performance with far fewer computations, or spend more compute for better results.
This paper introduces LoopMDM, a technique that reuses the early-to-middle transformer layers of a masked diffusion model by looping over them during training and inference. The approach trains more efficiently (3.3× fewer FLOPs) and reasons better than standard models, while allowing compute to be scaled flexibly at inference time without adding parameters.
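The looping idea can be illustrated with a minimal sketch. It assumes a model is an ordered list of layer functions and that one contiguous block of layers is repeated a chosen number of times. The names `looped_forward`, `effective_depth`, `loop_start`, `loop_end`, and `n_loops` are hypothetical, and the toy layers only record the order in which they run. Which layers LoopMDM loops, and how many times, is not specified here.

```python
from typing import Callable, List, Sequence

# A toy "layer" maps a trace of applied layer indices to a longer trace.
# Real transformer layers would map hidden states to hidden states.
Layer = Callable[[List[int]], List[int]]


def looped_forward(
    x: List[int],
    layers: Sequence[Layer],
    loop_start: int,
    loop_end: int,
    n_loops: int,
) -> List[int]:
    """Apply layers in order, repeating layers[loop_start:loop_end] n_loops times.

    Hypothetical sketch: the looped block reuses the same layer objects,
    so no parameters are added when n_loops grows.
    """
    if not 0 <= loop_start < loop_end <= len(layers):
        raise ValueError("loop range must be a non-empty slice of layers")
    if n_loops < 1:
        raise ValueError("n_loops must be at least 1")

    for layer in layers[:loop_start]:          # layers before the loop, run once
        x = layer(x)
    for _ in range(n_loops):                   # looped block, weights shared
        for layer in layers[loop_start:loop_end]:
            x = layer(x)
    for layer in layers[loop_end:]:            # layers after the loop, run once
        x = layer(x)
    return x


def effective_depth(n_layers: int, loop_start: int, loop_end: int, n_loops: int) -> int:
    """Number of layer applications in one forward pass."""
    return n_layers + (n_loops - 1) * (loop_end - loop_start)


# Six toy layers; layer i appends its own index to the trace.
layers: List[Layer] = [lambda trace, i=i: trace + [i] for i in range(6)]

trace = looped_forward([], layers, loop_start=1, loop_end=3, n_loops=3)
print(trace)
print(effective_depth(6, 1, 3, 3))
```

In this sketch, `n_loops` is an ordinary argument, so it can be changed at inference time: a larger value adds layer applications (more compute) while the set of layers, and hence the parameter count, stays fixed.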