Training a model on paired images and text data so it learns to connect visual and language understanding together.
Quality of vision, audio, and image understanding (distinct from modality support)