Separating structural planning from appearance rendering in image generation improves prompt following for complex spatial and compositional requirements, without requiring explicit intermediate outputs.
This paper improves text-to-image generation by separating structural planning from appearance rendering. IV-CoT uses two types of queries, structural and semantic, that work together in a single pass: structural queries first create a latent visual plan (like an invisible sketch), and semantic queries then render the final image based on that plan.
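To make the two-query idea concrete, here is a minimal toy sketch of the data flow described above. It is not the paper's architecture: the single-head dot-product attention, the vector sizes, and every name (`cross_attend`, `two_stage_queries`, `structural_q`, `semantic_q`) are assumptions chosen for illustration, and random vectors stand in for learned queries and prompt embeddings. The only point it tries to show is that the plan stays latent and is consumed in the same pass.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Single-head dot-product attention (an assumed stand-in for the real model).

    Each query is replaced by a softmax-weighted average of the context rows.
    """
    dim = len(context[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        out.append([sum(w * c[j] for w, c in zip(weights, context)) for j in range(dim)])
    return out


def random_vectors(n, dim, rng):
    """Placeholder for learned queries or prompt embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


def two_stage_queries(text_tokens, structural_q, semantic_q):
    # Step 1: structural queries read the prompt and form a latent plan.
    plan = cross_attend(structural_q, text_tokens)
    # Step 2: semantic queries read the prompt together with the plan.
    # The plan is never decoded into a sketch or layout; it stays latent.
    features = cross_attend(semantic_q, text_tokens + plan)
    return plan, features


rng = random.Random(0)
DIM = 8  # hypothetical embedding width
text_tokens = random_vectors(5, DIM, rng)   # stand-in for 5 prompt tokens
structural_q = random_vectors(4, DIM, rng)  # hypothetical count of structural queries
semantic_q = random_vectors(6, DIM, rng)    # hypothetical count of semantic queries

plan, features = two_stage_queries(text_tokens, structural_q, semantic_q)
print(len(plan), len(features), len(features[0]))  # prints "4 6 8"
```

In this sketch the semantic queries see the plan only as extra context rows, which is one simple way to condition rendering on structure; the paper may wire the two query sets together differently.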