TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation

Ruotong Liao, Guowen Huang, Qing Cheng, Guangyao Zhai, Lei Zhang et al. | May 29, 2026 | arXiv

Key Takeaway

You can steer multi-event video generation at inference time by identifying and exploiting natural turning points in the diffusion denoising process. No retraining is needed, and the approach scales better as the number of events grows.

Summary

This paper presents TunerDiT, a method for generating videos with multiple sequential events from text descriptions without requiring additional training. By identifying key moments in the diffusion process where text conditioning affects different aspects of video generation, the authors use strategic masking and prompt fusion to control event boundaries and transitions in long-form videos.
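
To make the masking and prompt-fusion idea concrete, here is a minimal sketch of one plausible reading of the summary above. It is not the authors' implementation. The function names, the even split of frames across events, the single `turning_step` threshold, and the choice to use a global prompt before that step and event-specific prompts after it are all assumptions made for illustration.

```python
import numpy as np

def event_frame_masks(num_frames: int, num_events: int) -> np.ndarray:
    """Hypothetical helper: one binary mask per event over contiguous frames.

    Assumes events split the clip evenly; the paper may place boundaries differently.
    """
    bounds = np.linspace(0, num_frames, num_events + 1).round().astype(int)
    masks = np.zeros((num_events, num_frames), dtype=np.float32)
    for e in range(num_events):
        masks[e, bounds[e]:bounds[e + 1]] = 1.0
    return masks

def fuse_prompts(global_emb: np.ndarray,
                 event_embs: np.ndarray,
                 masks: np.ndarray,
                 step: int,
                 turning_step: int) -> np.ndarray:
    """Hypothetical helper: per-frame text conditioning for one denoising step.

    Before `turning_step` (an assumed threshold) every frame sees the global
    prompt embedding; from that step on, each frame sees the embedding of the
    event whose mask covers it.
    """
    num_frames = masks.shape[1]
    if step < turning_step:
        return np.repeat(global_emb[None, :], num_frames, axis=0)
    # (frames, events) @ (events, dim) -> (frames, dim)
    return masks.T @ event_embs

# Toy usage: 8 frames, 2 events, 4-dimensional placeholder embeddings.
masks = event_frame_masks(num_frames=8, num_events=2)
global_emb = np.zeros(4, dtype=np.float32)
event_embs = np.stack([np.full(4, 1.0), np.full(4, 2.0)]).astype(np.float32)
early = fuse_prompts(global_emb, event_embs, masks, step=0, turning_step=10)
late = fuse_prompts(global_emb, event_embs, masks, step=10, turning_step=10)
```

In a real pipeline the per-frame conditioning would be fed to the diffusion transformer's text-conditioning pathway at each denoising step. How TunerDiT actually locates its turning points and blends prompts near event boundaries is not specified in this summary.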

Key Terms

diffusion-process, text-to-video-generation, diffusion-transformer, prompt-fusion, inference-time-steering