Phonetic information in speech representations is crucial for accurate 3D facial animation; discrete token-based representations can serve as an effective bridge between speech and facial motion synthesis.
This paper investigates which speech representations work best for animating 3D faces from audio. The researchers compare four types of speech encodings, including self-supervised learning features, neural codec outputs, and ASR-based representations, and find that representations capturing phonetic information produce the most accurate facial animations.