Quantizing image channels instead of patches improves codebook efficiency and enables a more intuitive generation process that mirrors human artistic creation, achieving strong text-to-image results.
This paper introduces Channel-wise Vector Quantization (CVQ), a new way to convert images into discrete tokens by quantizing color channels instead of spatial patches. It enables a new image generation model (CAR) that builds images progressively—first sketching overall structure, then adding fine details—similar to how artists work.