Vision Encoder-Decoder

architecture

A neural network architecture in which an encoder converts an input image into feature representations and a decoder generates text conditioned on those features. It is commonly used for tasks such as document understanding and image captioning.
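
The data flow described above can be illustrated with a toy sketch in plain Python. Everything here is an assumption made for illustration: the function names (`encode`, `attend`, `decode`), the patch-statistics "encoder", the tiny vocabulary, and the random, untrained weights are all hypothetical and do not correspond to any particular model or library. The sketch shows only the shape of the computation: image to patch features, then step-by-step text generation that attends to those features.

```python
import math
import random


def encode(image, patch=2):
    """Encoder (toy): split a 2D grayscale image into patch x patch tiles
    and return one feature vector [mean, max] per tile."""
    feats = []
    for r in range(0, len(image), patch):
        for c in range(0, len(image[0]), patch):
            tile = [
                image[i][j]
                for i in range(r, min(r + patch, len(image)))
                for j in range(c, min(c + patch, len(image[0])))
            ]
            feats.append([sum(tile) / len(tile), max(tile)])
    return feats


def attend(query, feats):
    """Cross-attention (toy): softmax-weighted average of encoder features,
    with weights given by the dot product between query and each feature."""
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in feats]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(feats[0])
    return [sum(w * feat[k] for w, feat in zip(weights, feats)) / total
            for k in range(dim)]


def decode(feats, vocab, embed, out_proj, max_len=5):
    """Decoder (toy): greedy autoregressive generation. Each step embeds the
    previous token, attends to the image features, and picks the best token."""
    tokens = ["<bos>"]
    while len(tokens) <= max_len:
        query = embed[tokens[-1]]
        context = attend(query, feats)
        hidden = context + query  # list concatenation: [context, query]
        logits = {tok: sum(w * x for w, x in zip(out_proj[tok], hidden))
                  for tok in vocab}
        nxt = max(logits, key=logits.get)
        if nxt == "<eos>":
            break
        tokens.append(nxt)
    return tokens[1:]  # drop the start-of-sequence marker


# Hypothetical, untrained parameters: the output is arbitrary tokens.
rng = random.Random(0)
vocab = ["a", "cat", "dog", "<eos>"]
embed = {tok: [rng.uniform(-1, 1) for _ in range(2)] for tok in ["<bos>"] + vocab}
out_proj = {tok: [rng.uniform(-1, 1) for _ in range(4)] for tok in vocab}

image = [
    [0.0, 0.1, 0.9, 1.0],
    [0.2, 0.1, 0.8, 0.9],
    [0.5, 0.5, 0.3, 0.2],
    [0.4, 0.6, 0.1, 0.0],
]
features = encode(image)  # 4 patch features for a 4x4 image
caption = decode(features, vocab, embed, out_proj)
print(len(features), caption)
```

Because the weights are random, the generated tokens carry no meaning. In a real model of this kind, the encoder and decoder would be learned networks trained on image and text pairs, but the split between an image encoder and a text decoder that attends to its output follows the same pattern.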

Related Capabilities

Multimodal

Quality of vision, audio, and image understanding (distinct from modality support)
