Using segmentation as a generative training task bridges the gap between visual understanding and generation in multimodal models, improving both capabilities simultaneously rather than training them separately.
This paper shows how to train unified multimodal models, which perform both image understanding and image generation, more effectively by using image segmentation as a training task. Rather than training the two capabilities separately, the authors use segmentation to align them, which improves both the model's understanding of images and the accuracy of the images it generates.