Specialized models for distinct kinds of speech style (stable speaker traits versus per-utterance characteristics) outperform a single unified model on each task in isolation, but a combined model performs better when both kinds of style must be understood together.
ParaSpeechCLAP is a dual-encoder model that learns to match speech audio with text descriptions of speaking style (such as pitch, emotion, and texture). It maps both modalities into a shared embedding space, enabling retrieval of similar-sounding speech, classification of speaker characteristics, and improved text-to-speech synthesis, all without retraining.
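The dual-encoder retrieval idea can be sketched as follows. This is a minimal illustration, not ParaSpeechCLAP's actual architecture: the trained audio and text encoders are stood in for by random linear projections, and the input features are hypothetical pre-extracted vectors. What it shows is the core mechanism, that both modalities are mapped into one shared space where cosine similarity scores audio clips against style descriptions.

```python
import numpy as np

# Stand-in for trained encoders: random linear projections into a shared
# d-dimensional space (a real dual-encoder model learns these weights
# contrastively so matching audio/text pairs score highest).
rng = np.random.default_rng(0)
d = 8

def embed(x, W):
    """Project a feature vector into the shared space and L2-normalize it."""
    z = W @ x
    return z / np.linalg.norm(z)

# Hypothetical pre-extracted features for 3 speech clips and 3 style
# descriptions (e.g., spectrogram statistics and text embeddings).
audio_feats = rng.normal(size=(3, 16))
text_feats = rng.normal(size=(3, 16))

W_audio = rng.normal(size=(d, 16))  # audio-encoder stand-in
W_text = rng.normal(size=(d, 16))   # text-encoder stand-in

A = np.stack([embed(a, W_audio) for a in audio_feats])  # (3, d)
T = np.stack([embed(t, W_text) for t in text_feats])    # (3, d)

# Cosine similarity matrix: entry [i, j] scores clip i against description j.
sim = A @ T.T

# Retrieval: for each clip, pick the best-matching style description.
best = sim.argmax(axis=1)
print(sim.shape, best)
```

Because similarity is computed in the shared space, the same embeddings support all the downstream uses mentioned above (retrieval, classification by nearest text label, or conditioning a synthesizer) with no further training.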