A neural network architecture that processes images by breaking them into small patches and analyzing them similarly to how language models process text.
Quality of vision, audio, and image understanding (distinct from modality support)