A model that processes both images and text together, understanding the relationship between visual content and language to answer questions about images or describe what it sees.
Quality of vision, audio, and image understanding (distinct from modality support)