When a model can directly understand different types of input (like images or audio) without needing to convert them to text first.
Quality of vision, audio, and image understanding (distinct from modality support)