Vision Transformer (ViT)

architecture

A neural network architecture that processes images by splitting them into small, fixed-size patches and treating the resulting sequence of patches much as a language model treats a sequence of word tokens.
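The patch-splitting step can be illustrated with a minimal NumPy sketch. The function name `image_to_patches`, the 224x224 RGB input, and the 16-pixel patch size are assumptions chosen for illustration, not details given in this entry. A full ViT would go on to apply a learned linear projection and position embeddings to these patch vectors before passing them to a transformer encoder; that part is omitted here.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch_size, cols, patch_size, c).transpose(0, 2, 1, 3, 4)
    # One flattened vector per patch; these play the role of "word" tokens.
    return grid.reshape(rows * cols, patch_size * patch_size * c)

# Hypothetical input: a 224x224 RGB image filled with dummy values.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches = image_to_patches(image, patch_size=16)
print(patches.shape)  # (196, 768): a 14 x 14 grid of patches, each 16 * 16 * 3 values
```

Each row of `patches` corresponds to one image patch in left-to-right, top-to-bottom order, which is the sequence a transformer would consume in place of a sentence of word tokens.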

Related Capabilities

Multimodal

Quality of vision, audio, and image understanding (distinct from modality support)
