You can steer video generation at inference time by identifying and exploiting natural turning points in the diffusion denoising process. No retraining is needed, and the approach scales better as the number of events grows.
This paper presents TunerDiT, a method for generating videos with multiple sequential events from text descriptions without requiring additional training. By identifying key moments in the diffusion process where text conditioning affects different aspects of video generation, the authors use strategic masking and prompt fusion to control event boundaries and transitions in long-form videos.
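To make the idea of step-dependent conditioning concrete, here is a minimal sketch of how prompt fusion and temporal masking could be switched at a turning point in the denoising schedule. It is an illustration under stated assumptions, not the authors' implementation. The function names, the equal-length event segments, fusion by simple averaging, and the single `turning_step` threshold are all hypothetical choices made for this example.

```python
from typing import List

Vector = List[float]


def frame_event_ids(num_frames: int, num_events: int) -> List[int]:
    """Assign each frame to an event by splitting the clip into equal segments.

    Assumption: events occupy equal-length, contiguous spans of frames.
    """
    return [min(f * num_events // num_frames, num_events - 1)
            for f in range(num_frames)]


def fuse(embeddings: List[Vector]) -> Vector:
    """Fuse several prompt embeddings by averaging them element-wise.

    Assumption: averaging stands in for whatever fusion rule is actually used.
    """
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def conditioning_for_step(step: int,
                          turning_step: int,
                          event_embeddings: List[Vector],
                          num_frames: int) -> List[Vector]:
    """Return one conditioning vector per frame for a given denoising step.

    Before `turning_step`, every frame is conditioned on the fused prompt.
    From `turning_step` onward, each frame is conditioned only on the prompt
    of its own event, which acts as a hard temporal mask over the prompts.
    """
    if step < turning_step:
        fused = fuse(event_embeddings)
        return [list(fused) for _ in range(num_frames)]
    ids = frame_event_ids(num_frames, len(event_embeddings))
    return [list(event_embeddings[i]) for i in ids]


if __name__ == "__main__":
    # Two toy "event prompt" embeddings and a four-frame clip.
    events = [[1.0, 0.0], [0.0, 1.0]]
    print(conditioning_for_step(0, 5, events, num_frames=4))
    print(conditioning_for_step(5, 5, events, num_frames=4))
```

In this toy setup, the early steps share one fused conditioning vector across all frames, while later steps restrict each frame to its own event prompt. A real pipeline would apply the mask inside the model's text cross-attention and would likely soften the boundary between events to control transitions.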