Training a model to understand the relationship between images and their text descriptions so it can match them together effectively.
Quality of vision, audio, and image understanding (distinct from modality support)