For building image generation models, GPIC provides a legally usable, large-scale alternative to web-scraped datasets. It removes licensing concerns and offers standardized evaluation benchmarks.
GPIC is a dataset of 100M training images (28 trillion pixels in total) with AI-generated captions, all permissively licensed for research and commercial use. The dataset is deduplicated and safety-filtered, and it is hosted on Hugging Face alongside benchmarking tools and baseline models for training visual generation systems.
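To put the two headline figures in perspective, the short sketch below derives the average image size they imply. It assumes that both numbers are round approximations and that the 28 trillion pixels refer to the same 100M training images. The function names are illustrative and not part of any GPIC tooling.

```python
import math

# Figures quoted in the summary above; both are treated as approximate.
TOTAL_PIXELS = 28 * 10**12   # 28 trillion pixels
NUM_IMAGES = 100 * 10**6     # 100M training images


def mean_pixels_per_image(total_pixels: int, num_images: int) -> float:
    """Average pixel count per image implied by the two totals."""
    return total_pixels / num_images


def equivalent_square_side(pixels: float) -> float:
    """Side length of a square image containing the given number of pixels."""
    return math.sqrt(pixels)


avg = mean_pixels_per_image(TOTAL_PIXELS, NUM_IMAGES)
side = equivalent_square_side(avg)

print(f"{avg:,.0f} pixels per image on average")
print(f"roughly {side:.0f} x {side:.0f} if every image were square")
```

Under these assumptions the average comes to 280,000 pixels per image, or about 0.28 megapixels. The real images presumably vary in resolution and aspect ratio, so this is only an order-of-magnitude guide.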