Vision-Language Encoder

architecture

A model that processes images and text together and maps them into shared numerical representations (embeddings). Unlike a full language model, it does not generate new text.
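As a rough illustration of what "shared numerical representations" means, the toy sketch below maps an image and a piece of text to unit-length vectors of the same size, so the two can be compared directly. Everything in it is an assumption made for illustration: the names `encode_image`, `encode_text`, and `similarity`, the dimension `EMBED_DIM`, and the fixed random projections that stand in for trained neural networks. It does not reflect any particular model's implementation.

```python
import math
import random

EMBED_DIM = 8        # hypothetical size of the shared embedding space
IMAGE_FEATURES = 16  # hypothetical: a flattened 4x4 grayscale image
TEXT_FEATURES = 26   # hypothetical: one count per letter a-z


def _projection(rows, cols, seed):
    """Fixed random matrix standing in for learned encoder weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


W_IMAGE = _projection(EMBED_DIM, IMAGE_FEATURES, seed=0)
W_TEXT = _projection(EMBED_DIM, TEXT_FEATURES, seed=1)


def _project_and_normalize(features, matrix):
    """Project features into the shared space and scale to unit length."""
    vec = [sum(w * x for w, x in zip(row, features)) for row in matrix]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid division by zero
    return [v / norm for v in vec]


def encode_image(pixels):
    """Map a list of IMAGE_FEATURES pixel values to an embedding."""
    if len(pixels) != IMAGE_FEATURES:
        raise ValueError(f"expected {IMAGE_FEATURES} pixel values")
    return _project_and_normalize(pixels, W_IMAGE)


def encode_text(text):
    """Map a string to an embedding using simple letter counts."""
    counts = [0.0] * TEXT_FEATURES
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return _project_and_normalize(counts, W_TEXT)


def similarity(a, b):
    """Cosine similarity; inputs are already unit length, so a dot product."""
    return sum(x * y for x, y in zip(a, b))


image_vec = encode_image([0.5] * IMAGE_FEATURES)
text_vec = encode_text("a photo of a cat")
score = similarity(image_vec, text_vec)
```

Note that the output is a pair of vectors and a similarity score, not generated text. In a trained encoder the weights would be learned so that matching image and text pairs score higher than mismatched ones. With the random weights used here the score carries no meaning.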

Related Capabilities

Multimodal

Quality of vision, audio, and image understanding (distinct from modality support)
