LeVo 2: Stable and Melodious Song Generation via Hierarchical Representation Modeling and Progressive Post-Training

Shun Lei, Huaicheng Zhang, Dapeng Wu, Yaoxun Xu, Lishi Zuo et al. | June 29, 2026 | arXiv

Key Takeaway

For music generation at scale, separating semantic planning (what to generate) from acoustic refinement (how to generate it), and training the two sequentially rather than simultaneously, improves both coherence and sound quality.

Summary

LeVo 2 generates full-length songs by combining language models and diffusion models in a hierarchical approach: first predicting mixed vocal-instrument tokens for overall coherence, then refining each track separately for acoustic detail.
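
The two-stage flow described above can be illustrated with a toy sketch. This is not LeVo 2's implementation: every name here (`plan_mixed_tokens`, `refine_track`, `generate_song`, `Song`) is hypothetical, the "language model" is a seeded random sampler, and the "diffusion" step is a simple interpolation from noise toward a target derived from the tokens. It only shows the order of operations: one mixed token sequence is planned first, then each track is refined separately while conditioned on that shared plan.

```python
import random
from dataclasses import dataclass


@dataclass
class Song:
    mixed_tokens: list[int]      # stage 1 output: shared plan for the whole song
    vocal: list[float]           # stage 2 output: refined vocal track
    instrumental: list[float]    # stage 2 output: refined instrumental track


def plan_mixed_tokens(n_tokens: int, vocab_size: int = 1024, seed: int = 0) -> list[int]:
    """Stage 1 stand-in (assumption): a real system would use a language model
    conditioned on lyrics and prompts; here tokens are drawn at random."""
    rng = random.Random(seed)
    return [rng.randrange(vocab_size) for _ in range(n_tokens)]


def refine_track(mixed_tokens: list[int], track: str,
                 vocab_size: int = 1024, steps: int = 4) -> list[float]:
    """Stage 2 stand-in (assumption): start from noise and move step by step
    toward a target derived from the mixed tokens, loosely mimicking
    iterative denoising. Each track gets its own noise, seeded by its name."""
    rng = random.Random(track)
    target = [t / vocab_size for t in mixed_tokens]
    x = [rng.gauss(0.0, 1.0) for _ in mixed_tokens]
    for _ in range(steps):
        # Close half of the remaining gap to the target on every step.
        x = [xi + 0.5 * (ti - xi) for xi, ti in zip(x, target)]
    return x


def generate_song(n_tokens: int = 8) -> Song:
    # Plan once, then refine the vocal and instrumental tracks independently.
    mixed = plan_mixed_tokens(n_tokens)
    return Song(
        mixed_tokens=mixed,
        vocal=refine_track(mixed, "vocal"),
        instrumental=refine_track(mixed, "instrumental"),
    )


if __name__ == "__main__":
    song = generate_song()
    print(len(song.mixed_tokens), len(song.vocal), len(song.instrumental))
```

In this toy both tracks drift toward the same target, which a real system would not do; the point is only that the shared plan is fixed before any per-track refinement begins.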

architecture training multimodal

Key Terms

hierarchical-representation-extraction preference-optimization music-codec progressive-post-training aesthetics-guided-training