A single pretrained model can now generate realistic digital humans across speech, movement, and appearance by treating these modalities as one unified problem rather than as separate tasks, which makes it practical to build avatar systems without specialized sub-models.
Archon is a unified AI model that generates digital humans across multiple modalities (text, audio, motion, and video) from a single system. It uses a novel video compression technique to handle high-resolution talking videos efficiently, and a step-by-step reasoning approach that switches between modalities to improve output quality.