A training technique that teaches a model to follow instructions about images by learning from examples of image-text instruction pairs.
Adhering to complex, structured, or constrained instructions
Quality of vision, audio, and image understanding (distinct from modality support)