Jointly encoding text and images in MLLMs before conditioning diffusion models preserves subject identity better than separate encoding, while a multi-stage denoising strategy balances semantic instruction-following with fine-detail preservation.
This paper improves subject-driven image generation by using multimodal large language models (MLLMs) to encode the text prompt and the reference images jointly rather than separately. The approach adds a VAE-based identity module and a novel aggregation technique that together balance semantic understanding against preservation of the subject's identity, reducing unwanted copy-paste artifacts.
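The summary does not say how the denoising stages are divided or how the two conditioning signals are aggregated, so the following is only an illustrative sketch under stated assumptions, not the paper's method. It assumes a hard two-stage schedule in which early (noisy) steps are conditioned on the MLLM's semantic features and late steps on the VAE identity features. The names `stage_weights`, `aggregate`, and `switch_fraction` are hypothetical, and the feature vectors are plain Python lists standing in for real embeddings.

```python
from typing import List, Tuple


def stage_weights(step: int, total_steps: int,
                  switch_fraction: float = 0.5) -> Tuple[float, float]:
    """Return (semantic_weight, identity_weight) for one denoising step.

    Assumption: step 0 is the noisiest step and total_steps - 1 the cleanest.
    Steps before the switch point use only semantic (MLLM) conditioning;
    steps at or after it use only identity (VAE) conditioning.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    # Fraction of the denoising trajectory completed, in [0, 1].
    progress = step / max(total_steps - 1, 1)
    if progress < switch_fraction:
        return 1.0, 0.0  # early stage: follow the instruction semantics
    return 0.0, 1.0      # late stage: restore fine subject detail


def aggregate(semantic: List[float], identity: List[float],
              step: int, total_steps: int,
              switch_fraction: float = 0.5) -> List[float]:
    """Combine two same-length feature vectors with the stage weights."""
    if len(semantic) != len(identity):
        raise ValueError("feature vectors must have the same length")
    w_sem, w_id = stage_weights(step, total_steps, switch_fraction)
    return [w_sem * s + w_id * i for s, i in zip(semantic, identity)]


if __name__ == "__main__":
    sem = [1.0, 2.0]   # placeholder for MLLM-derived features
    ident = [3.0, 4.0]  # placeholder for VAE identity features
    for t in (0, 4, 5, 9):
        print(t, aggregate(sem, ident, t, total_steps=10))
```

A smooth ramp between the two weights would be an equally plausible reading; the hard switch is used here only because it makes the stage boundary easy to see.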