GPIC: A Giant Permissive Image Corpus for Visual Generation

Keshigeyan Chandrasegaran, Kyle Sargent, Suchir Agarwal, Michael Jang, Michael Poli et al. | May 28, 2026 | arXiv

Key Takeaway

For building image generation models, GPIC provides a legally usable, large-scale alternative to web-scraped datasets, removing licensing concerns while offering standardized evaluation benchmarks.

Summary

GPIC is a massive dataset of 100M training images (28 trillion pixels total) with AI-generated captions, all permissively licensed for research and commercial use. The dataset is deduplicated, safety-filtered, and hosted on Hugging Face with benchmarking tools and baseline models for training visual generation systems.
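
To put the scale in perspective, the two figures above imply an average image size. The following is a back-of-envelope sketch only: it assumes the 28 trillion pixel total covers the same 100M training images, and the variable names are illustrative. The square side length is an equivalent for intuition, not a statement about the actual resolutions or aspect ratios in the corpus.

```python
import math

# Figures taken from the summary above.
NUM_IMAGES = 100_000_000      # 100M training images
TOTAL_PIXELS = 28 * 10**12    # 28 trillion pixels in total

# Assumption: the pixel total refers to the same 100M images.
mean_pixels = TOTAL_PIXELS / NUM_IMAGES

# Side length of a square image with that many pixels (for intuition only).
square_side = math.sqrt(mean_pixels)

print(f"Mean pixels per image: {mean_pixels:,.0f}")
print(f"Equivalent square side: {square_side:.0f} px")
```

Under that assumption, the average image holds about 280,000 pixels, roughly the area of a 529 x 529 square.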

data, evaluation, applications

Key Terms

vision-language-model, flow-matching, deduplication, safety-filtered