The ability to measure how closely content from different input types (such as images and text) is related.
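
One common way to make this kind of measurement concrete, though not necessarily the method intended here, is to map each input into a shared vector space and compare the vectors with cosine similarity. The sketch below assumes such embeddings already exist; the vectors and the names image_embedding, caption_match, and caption_unrelated are made up for illustration and do not come from any real model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors (range -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: in practice these would come from an image encoder
# and a text encoder trained to share one vector space. The numbers are invented.
image_embedding = [0.9, 0.1, 0.3]
caption_match = [0.8, 0.2, 0.4]        # text assumed to describe the image
caption_unrelated = [-0.2, 0.9, -0.1]  # text assumed to be unrelated

score_match = cosine_similarity(image_embedding, caption_match)
score_unrelated = cosine_similarity(image_embedding, caption_unrelated)

# A higher score means the two inputs are treated as more closely related.
print(score_match > score_unrelated)
```

Cosine similarity ignores vector length and compares direction only, which is why it is a frequent choice when embeddings from different encoders have different scales.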
Quality of vision, audio, and image understanding (distinct from modality support)