A model that processes both images and text together to create shared numerical representations, rather than generating new text like a full language model would.
Quality of vision, audio, and image understanding (distinct from modality support)