You can train video models on short clips and generate much longer videos by using a three-tier memory strategy that compresses historical context without losing quality.
PackForcing does this by compressing old frames selectively: the earliest frames are kept as global context, the middle of the history is heavily compressed, and the most recent frames are preserved for smooth transitions. With this scheme, a model trained only on 5-second clips can generate 2-minute videos on a single GPU, 24x longer than its training data.
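As a rough sketch of how such a three-tier context might be assembled (the function name, tensor shapes, tier sizes, and the simple frame-striding used for the middle tier are illustrative assumptions, not PackForcing's actual compression scheme):

```python
import torch

def build_context(frames: torch.Tensor,
                  n_early: int = 4,
                  n_recent: int = 16,
                  mid_stride: int = 8) -> torch.Tensor:
    """Assemble a three-tier context from a history of latent frames.

    frames: (T, C, H, W) tensor of all previously generated latent frames.
    Tier 1: the first `n_early` frames, kept as-is for global context.
    Tier 2: the middle frames, compressed by keeping every `mid_stride`-th frame.
    Tier 3: the last `n_recent` frames, kept as-is for smooth transitions.
    """
    T = frames.shape[0]
    if T <= n_early + n_recent:
        return frames  # history still fits; nothing to compress

    early = frames[:n_early]                           # tier 1: anchor frames
    middle = frames[n_early:T - n_recent:mid_stride]   # tier 2: heavily compressed
    recent = frames[T - n_recent:]                     # tier 3: full-fidelity tail
    return torch.cat([early, middle, recent], dim=0)


# Example: a 720-frame history (~30 s at 24 fps) shrinks to a short context.
history = torch.randn(720, 16, 32, 32)
context = build_context(history)
print(context.shape[0])  # 4 early + 88 strided middle + 16 recent = 108 frames
```

The point of the sketch is only that the context length grows far more slowly than the raw history, which is what lets a model trained on short clips keep attending over minutes of generated video.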