Separating content and emotion into distinct latent spaces during training prevents reward conflicts and enables better emotional control in TTS systems without sacrificing intelligibility.
This paper addresses emotional expressiveness in LLM-based text-to-speech. It proposes HPRO, a hierarchical reward optimization framework that works in two stages. First, it separates emotional and semantic information so that the two objectives do not produce conflicting gradients. Second, it progressively aligns rewards at the frame, word, and sentence levels, improving emotional control while maintaining speech clarity.
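
To make the idea of progressive, multi-level reward alignment concrete, here is a minimal sketch. It is not taken from the paper: the function names, the finest-to-coarsest activation order, the equal-share weighting, and the linear schedule are all assumptions chosen for illustration. It only shows one plausible way that per-level rewards could be phased in and combined during training.

```python
from typing import Dict

# Granularity levels named in the summary; the ordering used for the
# schedule (finest first) is an assumption, not a detail from the paper.
LEVELS = ("frame", "word", "sentence")


def progressive_weights(step: int, total_steps: int) -> Dict[str, float]:
    """Hypothetical schedule: activate one more level per third of training.

    Active levels share the weight equally; inactive levels get 0.0.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp training progress to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    # 1 active level at the start, up to len(LEVELS) by the end.
    active = min(int(progress * len(LEVELS)) + 1, len(LEVELS))
    return {
        level: (1.0 / active if index < active else 0.0)
        for index, level in enumerate(LEVELS)
    }


def combined_reward(rewards: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-level rewards (illustrative aggregation only)."""
    return sum(weights[level] * rewards[level] for level in LEVELS)


if __name__ == "__main__":
    # Made-up reward values, used only to exercise the functions.
    example_rewards = {"frame": 0.2, "word": 0.4, "sentence": 0.9}
    for step in (0, 150, 300):
        weights = progressive_weights(step, total_steps=300)
        print(step, weights, combined_reward(example_rewards, weights))
```

Under these assumptions, early training optimizes only the frame-level reward and coarser levels are added later, so the sentence-level signal does not dominate before finer-grained alignment is in place. The actual schedule and aggregation used by HPRO may differ.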