AI tasks that require jointly understanding visual information from images and accompanying text, such as describing images or answering questions about them.
Quality of vision, audio, and image understanding (distinct from modality support)