Vision Transformer

architecture

A neural network architecture that processes an image by splitting it into small fixed-size patches, treating each patch as a token, and applying a Transformer to the resulting sequence, much as language models process sequences of text tokens.
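
The patch-splitting step can be illustrated with a minimal NumPy sketch. It covers only the conversion of an image into a sequence of flattened patches, not the Transformer itself. The helper name `patchify`, the 224x224 RGB input, and the 16-pixel patch size are assumptions chosen for illustration; they are not taken from this entry.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Hypothetical helper for illustration. Returns an array of shape
    (num_patches, patch_size * patch_size * C), one row per patch.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # Expose the patch grid: (rows, P, cols, P, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    # Group each patch's pixels together: (rows, cols, P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten every patch into a single vector (one "token" per patch)
    return patches.reshape(rows * cols, patch_size * patch_size * c)


# Assumed example input: a 224x224 RGB image and 16x16 patches.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image, 16)
print(tokens.shape)  # (196, 768)
```

Each row of `tokens` plays the role that a word or subword token plays in a language model. In a full model, these vectors would typically be passed through a learned linear projection and combined with position information before entering the Transformer layers.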


Related Capabilities

Multimodal

Quality of vision, audio, and image understanding (distinct from modality support)
