For music generation at scale, separating semantic planning (what to generate) from acoustic refinement (how to generate it), and training the two stages sequentially rather than jointly, improves both coherence and sound quality.
LeVo 2 generates full-length songs by combining language models and diffusion models in a hierarchical approach: first predicting mixed vocal-instrument tokens for overall coherence, then refining each track separately for acoustic detail.
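
The two-stage flow described above can be illustrated with a toy sketch. This is not LeVo 2's implementation: the function names (`plan_mixed_tokens`, `refine_track`, `generate_song`), the track names, and the arithmetic inside each stage are hypothetical stand-ins chosen only to show the control flow, in which one shared plan of mixed tokens is produced first and each track is then refined separately while conditioned on that plan.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed track names, for illustration only.
TRACKS = ("vocal", "instrument")


@dataclass
class Song:
    mixed_tokens: List[int]          # stage 1 output: shared semantic plan
    tracks: Dict[str, List[float]]   # stage 2 output: one refined signal per track


def plan_mixed_tokens(prompt: str, length: int, vocab_size: int = 16) -> List[int]:
    """Stage 1 stand-in. A real system would sample these tokens from a
    language model; here a deterministic recurrence plays that role."""
    state = sum(ord(c) for c in prompt)
    tokens = []
    for _ in range(length):
        state = (state * 31 + 7) % vocab_size
        tokens.append(state)
    return tokens


def refine_track(mixed_tokens: List[int], track: str, steps: int = 4) -> List[float]:
    """Stage 2 stand-in. A real system would run a diffusion model; here each
    step moves the signal halfway toward a track-specific target derived
    from the shared plan."""
    offset = 0.0 if track == "vocal" else 0.5
    target = [t + offset for t in mixed_tokens]
    signal = [0.0] * len(mixed_tokens)
    for _ in range(steps):
        signal = [s + 0.5 * (g - s) for s, g in zip(signal, target)]
    return signal


def generate_song(prompt: str, length: int = 8) -> Song:
    # Plan once for the whole mix, then refine every track from the same plan.
    mixed = plan_mixed_tokens(prompt, length)
    tracks = {name: refine_track(mixed, name) for name in TRACKS}
    return Song(mixed_tokens=mixed, tracks=tracks)


if __name__ == "__main__":
    song = generate_song("upbeat pop chorus")
    print(song.mixed_tokens)
    for name, signal in song.tracks.items():
        print(name, [round(x, 3) for x in signal])
```

Because both tracks are refined from the same `mixed_tokens`, they stay aligned with a single plan, which is the role the mixed vocal-instrument stage plays for coherence in the description above. Sequential training would correspond to fitting the stage 1 model first and then fitting the stage 2 model on its outputs; the sketch covers inference only.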