The ability to mix images and text in any order within a single prompt, rather than requiring all images first or all text first.
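As an illustration of the ordering this definition describes, here is a minimal sketch. The `TextPart` and `ImagePart` types, the helper functions, and the file names are all hypothetical, invented for this example; they do not correspond to any particular model API. The sketch treats a prompt as an ordered list of parts and calls it interleaved when the modality changes more than once, which is what separates it from an "all images first" or "all text first" layout.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextPart:
    """Hypothetical text segment of a prompt."""
    text: str


@dataclass
class ImagePart:
    """Hypothetical image segment, referenced by a placeholder path."""
    path: str


Part = Union[TextPart, ImagePart]


def count_modality_switches(parts: List[Part]) -> int:
    """Count how often adjacent parts differ in modality (text vs. image)."""
    kinds = [type(p) for p in parts]
    return sum(1 for a, b in zip(kinds, kinds[1:]) if a is not b)


def is_interleaved(parts: List[Part]) -> bool:
    """A prompt with at most one switch is 'all text first' or 'all images
    first' (or single-modality); two or more switches means the modalities
    are mixed in arbitrary order."""
    return count_modality_switches(parts) >= 2


# Text and images alternate freely within one prompt.
prompt = [
    TextPart("Compare this chart"),
    ImagePart("chart_a.png"),
    TextPart("with this one"),
    ImagePart("chart_b.png"),
    TextPart("and summarize the differences."),
]

# All images first, then text: only one modality switch.
images_first = [ImagePart("a.png"), ImagePart("b.png"), TextPart("Describe both.")]

print(is_interleaved(prompt))        # True
print(is_interleaved(images_first))  # False
```

The two-switch threshold is a simplification chosen for this sketch; a real system would define supported orderings in its own terms.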
The quality of vision, audio, and image understanding, as distinct from whether a modality is supported at all.