Robot perception improves significantly when visual encoders learn dynamics and motion during pre-training instead of relying on static image recognition. This upstream motion understanding boosts downstream policy performance by up to 22.5% in out-of-distribution settings.
DynaFLIP is a pre-training method that teaches robot vision systems to understand motion and dynamics, not just static scenes. By training on image-language-3D flow triplets from videos, it creates visual representations that capture action-relevant changes in the world, making robots better at manipulation tasks across different scenarios.
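The summary above does not state the training objective, so the following is only a minimal sketch of how image-language-3D flow triplets could be aligned in one embedding space. It assumes a CLIP-style symmetric contrastive (InfoNCE) loss averaged over the three modality pairs, which may differ from what DynaFLIP actually uses. The function names (`info_nce`, `triplet_alignment_loss`), the temperature value, and the random arrays standing in for encoder outputs are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `a` is the positive match for row i of `b`."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b retrieval
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a retrieval
    return 0.5 * (loss_ab + loss_ba)


def triplet_alignment_loss(img, txt, flow, temperature=0.07):
    """Assumed objective: average the pairwise losses over all three modality pairs."""
    return (
        info_nce(img, txt, temperature)
        + info_nce(img, flow, temperature)
        + info_nce(txt, flow, temperature)
    ) / 3.0


# Toy stand-ins for encoder outputs (hypothetical; real inputs would be
# embeddings of video frames, captions, and 3D flow fields).
rng = np.random.default_rng(0)
batch, dim = 8, 32
shared = rng.normal(size=(batch, dim))
aligned = [shared + 0.01 * rng.normal(size=(batch, dim)) for _ in range(3)]
unrelated = [rng.normal(size=(batch, dim)) for _ in range(3)]

aligned_loss = triplet_alignment_loss(*aligned)
unrelated_loss = triplet_alignment_loss(*unrelated)
print(f"aligned: {aligned_loss:.4f}  unrelated: {unrelated_loss:.4f}")
```

In this toy setup the loss is lower when the three embeddings of a sample agree than when they are unrelated. Under the stated assumption, that is the pressure that would push the image encoder to encode the motion information carried by the flow and language inputs.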