A technique in which text descriptions guide or control how a generative model produces images, letting users steer the output through natural language.
Quality of vision, audio, and image understanding, that is, how well a model interprets these inputs (distinct from modality support, which is whether it accepts them at all)