When training models to generate audio and video jointly, routing each modality's learning signal separately and shielding audio-specific layers from video interference yields better results than standard single-objective RL approaches.
OmniNFT improves joint audio-video generation with reinforcement learning built on three key techniques: routing rewards separately to each modality, blocking video gradients from interfering with audio-specific layers, and concentrating optimization on synchronization regions. Together, these address the practical need for high-quality audio, high-quality video, and tight audio-video alignment at the same time.
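Two of those techniques, per-modality reward routing and sync-region focus, can be sketched in a few lines. This is a minimal illustration with assumed names (`routed_advantages`, the 0.5 sync threshold, and the `sync_weight` heuristic are hypothetical, not the authors' actual formulation): each modality gets its own baseline and advantage so audio and video signals never mix, and timesteps scored as well-synchronized are upweighted.

```python
# Hypothetical sketch of OmniNFT-style reward routing. All names and the
# sync-weighting heuristic are assumptions for illustration, not the paper's API.

def routed_advantages(audio_rewards, video_rewards, sync_scores, sync_weight=2.0):
    """Return per-timestep (audio_adv, video_adv) pairs.

    audio_rewards / video_rewards: per-timestep scalar rewards, one per modality.
    sync_scores: per-timestep audio-video alignment scores in [0, 1]; timesteps
        above 0.5 are treated as synchronization regions and upweighted by
        `sync_weight` (assumed heuristic).
    """
    def baseline(rewards):
        # Mean-reward baseline, computed per modality so signals stay separate.
        return sum(rewards) / len(rewards)

    a_base = baseline(audio_rewards)
    v_base = baseline(video_rewards)

    advantages = []
    for r_a, r_v, s in zip(audio_rewards, video_rewards, sync_scores):
        w = sync_weight if s > 0.5 else 1.0  # focus optimization on sync regions
        advantages.append((w * (r_a - a_base), w * (r_v - v_base)))
    return advantages

adv = routed_advantages(
    audio_rewards=[1.0, 0.0, 1.0],
    video_rewards=[0.0, 1.0, 1.0],
    sync_scores=[0.9, 0.2, 0.6],
)
```

The audio advantage would then update only audio-specific parameters and the video advantage only video-specific ones; in an autograd framework the third technique (gradient blocking) would typically amount to a stop-gradient such as PyTorch's `detach()` on the video pathway's inputs to audio layers.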