Specialized models for distinct kinds of speech style (stable speaker traits versus per-utterance characteristics) outperform a single unified model on each task in isolation, but a combined model performs better when both kinds of style must be understood together.
ParaSpeechCLAP is a dual-encoder model that learns to match speech audio with text descriptions of speaking style (such as pitch, emotion, and texture). It maps both modalities into a shared embedding space, enabling retrieval of similar-sounding speech, classification of speaker characteristics, and improved text-to-speech synthesis, all without retraining.
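The dual-encoder retrieval idea can be sketched as follows. This is a minimal illustration, not ParaSpeechCLAP's actual architecture: the trained audio and text encoders are stood in for by random linear projections, and the input features are hypothetical pre-extracted vectors. What it shows is the core mechanism, that both modalities are mapped into one shared space where cosine similarity scores audio clips against style descriptions.

```python
import numpy as np

# Stand-in for trained encoders: random linear projections into a shared
# d-dimensional space (a real dual-encoder model learns these weights
# contrastively so matching audio/text pairs score highest).
rng = np.random.default_rng(0)
d = 8

def embed(x, W):
    """Project a feature vector into the shared space and L2-normalize it."""
    z = W @ x
    return z / np.linalg.norm(z)

# Hypothetical pre-extracted features for 3 speech clips and 3 style
# descriptions (e.g., spectrogram statistics and text embeddings).
audio_feats = rng.normal(size=(3, 16))
text_feats = rng.normal(size=(3, 16))

W_audio = rng.normal(size=(d, 16))  # audio-encoder stand-in
W_text = rng.normal(size=(d, 16))   # text-encoder stand-in

A = np.stack([embed(a, W_audio) for a in audio_feats])  # (3, d)
T = np.stack([embed(t, W_text) for t in text_feats])    # (3, d)

# Cosine similarity matrix: entry [i, j] scores clip i against description j.
sim = A @ T.T

# Retrieval: for each clip, pick the best-matching style description.
best = sim.argmax(axis=1)
print(sim.shape, best)
```

Because similarity is computed in the shared space, the same embeddings support all the downstream uses mentioned above (retrieval, classification by nearest text label, or conditioning a synthesizer) with no further training.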