Envisioning the Future, One Step at a Time

Stefan Andreas Baumann, Jannik Wiese, Tommaso Martorella, Mahdi M. Kalayeh, Björn Ommer|April 10, 2026arXiv

Key Takeaway

Predicting sparse point trajectories instead of dense pixels makes future scene simulation orders of magnitude faster while maintaining accuracy—enabling practical exploration of many possible futures with uncertainty quantification.

Summary

This paper predicts how scenes will evolve by tracking sparse point trajectories instead of predicting dense pixel values. An autoregressive diffusion model generates thousands of plausible futures from a single image while explicitly modeling uncertainty growth over time, achieving faster simulation than dense video prediction methods.

reasoning efficiency multimodal

Key Terms

diffusion-model trajectory-generation uncertainty-quantification autoregressive-decoding physical-plausibility