Style instructions in TTS are processed differently from content words: they influence acoustic properties such as pitch and energy globally rather than locally, with the strongest effect in early generation steps and mid-depth network layers.
By analyzing attention patterns in a text-to-speech system, this paper shows how individual words in a style description influence speech generation.
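To make the idea of an attention-based analysis concrete, the sketch below shows one plausible way to measure how much attention mass falls on style-description tokens at each generation step and layer. It is an illustration under stated assumptions, not the paper's procedure: the tensor layout `(steps, layers, heads, tokens)`, the helper names `style_token_influence` and `peak_step_and_layer`, and the synthetic attention values are all hypothetical.

```python
import numpy as np


def style_token_influence(attn, style_idx):
    """Share of attention mass on style tokens, per (step, layer).

    attn: array of shape (steps, layers, heads, tokens); each row over
          the token axis is assumed to sum to 1 (hypothetical layout).
    style_idx: indices of the tokens belonging to the style description.
    """
    attn = np.asarray(attn, dtype=float)
    head_mean = attn.mean(axis=2)                  # average over heads
    return head_mean[..., style_idx].sum(axis=-1)  # shape (steps, layers)


def peak_step_and_layer(influence):
    """Return the (step, layer) pair where style-token influence is largest."""
    step, layer = np.unravel_index(np.argmax(influence), influence.shape)
    return int(step), int(layer)


# Synthetic example: 4 steps, 6 layers, 2 heads, 5 text tokens.
# Tokens 0 and 1 play the role of style words; the rest are content words.
logits = np.zeros((4, 6, 2, 5))
logits[0, 3, :, :2] += 5.0  # plant a peak at step 0, layer 3

# Softmax over the token axis so each attention row sums to 1.
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

influence = style_token_influence(attn, [0, 1])
peak = peak_step_and_layer(influence)
```

In this synthetic setup the peak is planted by hand, so the result says nothing about a real model; with attention maps exported from an actual TTS system, the same aggregation would show at which steps and layers the style words receive the most attention.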