Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

Shuhong Zheng, Aashish Kumar Misraa, Yu-Teng Li, Yu-Jhe Li, Igor Gilitschenski | May 25, 2026 | arXiv

Key Takeaway

Jointly encoding text and reference images in an MLLM before conditioning a diffusion model preserves subject identity better than encoding them separately, while a multi-stage denoising strategy balances semantic instruction-following with fine-detail preservation.

Summary

This paper improves subject-driven image generation by using a multimodal large language model (MLLM) to encode the text prompt and the reference images jointly rather than separately. The approach adds a VAE-based identity module and a novel aggregation technique to balance semantic understanding with preservation of the subject's identity, reducing unwanted copy-paste artifacts.
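To make the joint-versus-separate distinction concrete, here is a toy sketch, not the paper's model. It assumes a single-head self-attention layer with identity projections stands in for the MLLM, and that random vectors stand in for text and image tokens. The names `self_attention`, `encode_separately`, and `encode_jointly` are hypothetical. The point it illustrates is only that image-token features can depend on the text when both share one sequence.

```python
import numpy as np

def self_attention(tokens):
    """Toy single-head self-attention with identity projections (an assumption)."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

def encode_separately(text_tokens, image_tokens):
    # Each modality attends only to itself, then the results are stacked.
    return np.concatenate(
        [self_attention(text_tokens), self_attention(image_tokens)], axis=0
    )

def encode_jointly(text_tokens, image_tokens):
    # Both modalities share one sequence, so every token can attend to every other.
    return self_attention(np.concatenate([text_tokens, image_tokens], axis=0))

rng = np.random.default_rng(0)
n_text, n_image, dim = 4, 6, 8
text_a = rng.normal(size=(n_text, dim))   # stand-in for one instruction
text_b = rng.normal(size=(n_text, dim))   # stand-in for a different instruction
image = rng.normal(size=(n_image, dim))   # stand-in for the reference image

# Compare the image-token rows of the output under two different prompts.
sep_a = encode_separately(text_a, image)[n_text:]
sep_b = encode_separately(text_b, image)[n_text:]
joint_a = encode_jointly(text_a, image)[n_text:]
joint_b = encode_jointly(text_b, image)[n_text:]

image_rows_change_separate = not np.allclose(sep_a, sep_b)
image_rows_change_joint = not np.allclose(joint_a, joint_b)
print("separate encoding, image features depend on text:", image_rows_change_separate)
print("joint encoding, image features depend on text:", image_rows_change_joint)
```

In this toy setting, separate encoding yields the same image features for every prompt, whereas joint encoding lets the prompt reshape them. How the actual method turns such features into conditioning for the diffusion model is not specified in this summary.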

multimodal architecture applications

Key Terms

diffusion-models multimodal-large-language-model variational-autoencoder identity-preservation cross-modal-reasoning