A model that accepts flexible user inputs (like text descriptions, points, or bounding boxes) to guide what it should identify or process in an image.
Adhering to complex, structured, or constrained instructions
Quality of vision, audio, and image understanding (distinct from modality support)