How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech

Nityanand Mathur, Hamees Sayed, Wasim Madha, Apoorv Singh, Sameer Khurana et al. | June 18, 2026 | arXiv

Key Takeaway

Style instructions in TTS are processed differently from content words: they influence acoustic properties such as pitch and energy globally rather than locally, with the strongest effect in early generation steps and mid-depth network layers.

Summary

By analyzing cross-attention patterns in a text-to-speech system, this paper shows how individual words in a style description influence the generated speech.
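To make the idea of attention-based attribution concrete, here is a minimal sketch in Python with NumPy. It is not the paper's procedure, which this summary does not specify. It assumes a cross-attention tensor of shape (heads, frames, tokens) whose rows sum to 1 over tokens, and the function names `token_attribution` and `temporal_entropy` are hypothetical. The toy input is hand-built so that one "style" token receives the same attention at every frame while two "content" tokens are each attended in only half of the frames.

```python
import numpy as np


def token_attribution(attn):
    """Average attention mass per text token.

    attn: array of shape (heads, frames, tokens); each row sums to 1 over tokens.
    Returns an array of shape (tokens,).
    """
    return attn.mean(axis=(0, 1))


def temporal_entropy(attn, eps=1e-12):
    """Entropy (in nats) of each token's attention profile over output frames.

    High entropy means the token is attended evenly across time (global
    influence); low entropy means attention is concentrated on few frames.
    """
    per_token = attn.mean(axis=0)  # (frames, tokens), averaged over heads
    # Normalise each token's column into a distribution over frames.
    p = per_token / (per_token.sum(axis=0, keepdims=True) + eps)
    return -(p * np.log(p + eps)).sum(axis=0)


# Toy example (assumed data): 1 head, 4 frames, 3 tokens.
# Token 0 plays the role of a style word, tokens 1 and 2 of content words.
attn = np.array([[
    [0.2, 0.8, 0.0],
    [0.2, 0.8, 0.0],
    [0.2, 0.0, 0.8],
    [0.2, 0.0, 0.8],
]])

attribution = token_attribution(attn)
entropy = temporal_entropy(attn)
```

In this toy case the style token has the highest temporal entropy even though it receives the least total attention, which is the kind of global-versus-local contrast the takeaway describes. A real analysis would read the attention tensor from the model's cross-attention layers and would likely repeat the computation per layer and per generation step.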

multimodal evaluation

Key Terms

cross-attention, diffusion-language-model, attention-attribution, style-conditioning, attention-entropy