Training simultaneously on three complementary data types: images, text descriptions, and motion flow.