Training a model to jointly understand images and text, so that it can reason about visual content using language.
Quality of vision, audio, and image understanding (distinct from modality support)