A learned numerical encoding of audio that captures meaningful speech patterns and can be used as input for other AI tasks.
Quality of vision, audio, and image understanding (distinct from modality support).