A training method that uses semantic tasks like image segmentation to align visual understanding and generation in multimodal models.