A neural network architecture that processes images by splitting them into small patches and treating each patch as a token, much as language models treat words.
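The patch-splitting step can be illustrated with a minimal sketch. It assumes a single-channel image stored as a list of rows and non-overlapping square patches; the function name `split_into_patches` and the toy 4x4 image are hypothetical, chosen only for illustration. Architectures of this kind typically go on to project each patch into an embedding and add position information before the transformer layers; those steps are omitted here.

```python
def split_into_patches(image, patch_size):
    """Split a 2D image (list of rows) into flattened square patches.

    Patches are returned in row-major order, so the result is a sequence
    of "tokens" analogous to the words in a sentence.
    Assumption: non-overlapping patches, single-channel image.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch_size x patch_size block into one list.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 grayscale image with pixel values 0..15 (illustrative only).
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(image, patch_size=2)
```

With a patch size of 2, the 4x4 image yields four patches of four pixels each, so the model would see a sequence of four tokens instead of a grid of sixteen pixels.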
Quality of vision, audio, and image understanding (distinct from modality support).