OrbitForge: Text-to-3D Scene Generation via Reconstruction-Anchored Video Synthesis

Chenrui Fan, Paolo Favaro | June 23, 2026 | arXiv

Key Takeaway

Anchoring video generation to a 3D reconstruction yields better 3D consistency than generating videos alone, and it works with existing video models, with no task-specific training.

Summary

OrbitForge converts text descriptions into 3D scenes using frozen video generation models and Gaussian Splatting reconstruction. It generates a video from text, identifies the viewpoints missing from a complete orbit, fills those gaps with the video model, and reconstructs everything into a consistent 3D scene, all without fine-tuning or slow step-by-step generation.
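The "identify missing viewpoints around a complete orbit" step can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not say how coverage is measured, so this example assumes each frame is described only by a camera azimuth in degrees, and the names `find_orbit_gaps`, `fill_azimuths`, `max_gap_deg`, and `step_deg` are hypothetical.

```python
from typing import List, Tuple


def find_orbit_gaps(
    azimuths_deg: List[float], max_gap_deg: float = 30.0
) -> List[Tuple[float, float]]:
    """Return (start, end) azimuth arcs that no existing view covers.

    Assumption: a view is summarised by its azimuth alone. An arc is
    reported when two neighbouring views on the orbit are more than
    max_gap_deg apart. An arc may wrap past 360 degrees (end <= start).
    """
    if not azimuths_deg:
        return [(0.0, 360.0)]
    # Normalise to [0, 360), drop duplicates, and walk the orbit in order.
    angles = sorted({a % 360.0 for a in azimuths_deg})
    gaps = []
    for i, start in enumerate(angles):
        end = angles[(i + 1) % len(angles)]
        # A single view leaves the whole remaining orbit uncovered.
        span = 360.0 if len(angles) == 1 else (end - start) % 360.0
        if span > max_gap_deg:
            gaps.append((start, end))
    return gaps


def fill_azimuths(start: float, end: float, step_deg: float) -> List[float]:
    """Propose target azimuths strictly inside the arc from start to end."""
    span = (end - start) % 360.0 or 360.0
    count = int(span // step_deg)
    return [
        (start + k * step_deg) % 360.0
        for k in range(1, count + 1)
        if k * step_deg < span
    ]


if __name__ == "__main__":
    # Hypothetical initial video covering azimuths 0 to 100 degrees.
    covered = [0, 20, 40, 60, 80, 100]
    for gap_start, gap_end in find_orbit_gaps(covered):
        print(gap_start, gap_end, fill_azimuths(gap_start, gap_end, 65.0))
```

In a pipeline of the kind described, the proposed azimuths would be the views requested from the video model before the final Gaussian Splatting reconstruction; a real system would likely also account for elevation, distance, and occlusion.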

multimodal

Key Terms

gaussian-splatting, text-to-video, 3d-scene-reconstruction, view-coverage, deformable-gaussian-splatting