A geometric constraint that encourages the embeddings of three modalities (image, language, and flow) to align tightly in a shared embedding space.
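
The description does not give the exact form of the constraint, so the following is only a minimal sketch of one way such a constraint could look. It assumes that "aligning tightly" means a small pairwise cosine distance between the three embeddings of the same sample. The function names `cosine_similarity` and `tri_modal_alignment_loss` are hypothetical and chosen for illustration; the actual formulation may differ (for example, it could be area-based or volume-based).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def tri_modal_alignment_loss(image, language, flow):
    """Hypothetical alignment penalty (an assumption, not the source's formula).

    Averages the cosine distance (1 - cosine similarity) over the three
    modality pairs. It is 0 when all three embeddings point in the same
    direction and grows as they spread apart.
    """
    pairs = [(image, language), (language, flow), (image, flow)]
    return sum(1.0 - cosine_similarity(a, b) for a, b in pairs) / 3.0
```

Under this assumed form, the penalty depends only on the directions of the embeddings, not on their magnitudes, so minimizing it pulls the three modality vectors toward a common direction in the shared space.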