A model that can process and understand more than one type of input, such as text and images.
Quality of a model's vision, audio, and image understanding (distinct from modality support, which only indicates whether the model accepts a given input type).
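
The difference between supporting a modality and understanding it well can be sketched as two separate fields on a model record. This is only an illustration under stated assumptions: `ModelProfile`, its field names, the 0 to 1 score scale, and the example values are hypothetical and do not come from any real API or benchmark.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional


@dataclass
class ModelProfile:
    """Hypothetical record; names and score scale are illustrative assumptions."""
    name: str
    # Modality support: which input types the model accepts at all.
    supported_inputs: FrozenSet[str]
    # Understanding quality: modality -> score in [0, 1] (assumed scale).
    understanding_quality: Dict[str, float] = field(default_factory=dict)


def is_multimodal(profile: ModelProfile) -> bool:
    """A model is multimodal if it accepts more than one input type."""
    return len(profile.supported_inputs) > 1


def quality_for(profile: ModelProfile, modality: str) -> Optional[float]:
    """Return the quality score for a modality, or None if the modality
    is unsupported or no score has been recorded for it."""
    if modality not in profile.supported_inputs:
        return None
    return profile.understanding_quality.get(modality)


# Made-up example: accepts text and images, but image understanding is weak.
demo = ModelProfile(
    name="demo-model",
    supported_inputs=frozenset({"text", "image"}),
    understanding_quality={"image": 0.4},
)
```

Here `is_multimodal(demo)` is true because two input types are accepted, while `quality_for(demo, "image")` reports a separate, possibly low, score. Support and quality are tracked independently.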