A neural network architecture that processes images through an encoder component and generates text through a decoder component, commonly used for tasks like document understanding and image captioning.
Quality of vision, audio, and image understanding (distinct from modality support)