From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation

Pedro Correa, Olivier Perrotin, Samir Sadok, Paula Costa, Thomas Hueber | June 11, 2026 | arXiv

Key Takeaway

Phonetic information in speech representations is crucial for accurate 3D facial animation; discrete token-based representations can serve as an effective bridge between speech and facial motion synthesis.

Summary

This paper investigates which speech representations work best for animating 3D faces from audio. The researchers compare several types of speech encodings, including self-supervised learning features, neural codec outputs, and ASR-based representations, and find that representations capturing phonetic information produce the most accurate facial animations.
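
To make the idea of a discrete speech token concrete, here is a minimal sketch of nearest-centroid vector quantization, one common way to turn continuous speech features into token indices. This summary does not say how the paper derives its tokens, so the technique, the function name `quantize`, the toy codebook, and the toy feature frames are all assumptions for illustration only.

```python
from math import dist

def quantize(frames, codebook):
    """Map each continuous feature frame to the index of its nearest codebook vector.

    Hypothetical helper for illustration; not the paper's method.
    """
    tokens = []
    for frame in frames:
        # Choose the codebook entry with the smallest Euclidean distance.
        nearest = min(range(len(codebook)), key=lambda k: dist(frame, codebook[k]))
        tokens.append(nearest)
    return tokens

# Assumed toy data: 2-D stand-ins for speech features and a 3-entry codebook.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
frames = [(0.1, -0.2), (0.9, 1.2), (3.5, 0.3), (1.1, 0.8)]

tokens = quantize(frames, codebook)
print(tokens)  # [0, 1, 2, 1]
```

Each frame is replaced by a single integer, so a downstream animation model would see a token sequence rather than raw feature vectors. Real systems use much larger codebooks and higher-dimensional features.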

multimodal evaluation architecture

Key Terms

self-supervised-learning, neural-codec, phonetic-representation, facial-animation, discrete-tokens