A task that requires a model to understand and reason about both visual information (images) and textual information together.
Quality of vision, audio, and image understanding (distinct from modality support)