A training method that aligns images and text by learning to match their representations, using a sigmoid loss function instead of the traditional softmax approach.
Quality of vision, audio, and image understanding (distinct from modality support)