The ability to connect specific words or concepts in text to the actual objects or regions they refer to in an image.
Quality of vision, audio, and image understanding (distinct from modality support, i.e., whether a given input type is accepted at all).