PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs

Rim Assouel, Amir Bar, Michal Drozdzal, Adriana Romero-Soriano | May 22, 2026 | arXiv

Key Takeaway

Adding synthetic geometric overlays during training helps MLLMs learn better spatial and quantitative reasoning, which suggests that many visual understanding failures stem from insufficient training data rather than from architectural limits.

Summary

This paper introduces Procedurally Generated Tasks (PGT), a method that overlays geometric shapes on images to create training data for multimodal large language models (MLLMs). The resulting data improves how these models handle fine-grained visual details such as spatial relationships and quantities. Testing shows improvements of up to 20% on visual reasoning benchmarks while general capabilities remain intact.
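To make the overlay idea concrete, the sketch below generates one counting task: it paints a few non-overlapping shapes onto an image and derives the question and answer directly from the placement, so the label is correct by construction. This is an illustration under assumptions, not the authors' pipeline. The shape vocabulary, the question template, the fixed red color, and the names `draw_shape` and `make_counting_task` are all hypothetical, and the image is a plain nested list of RGB tuples to keep the example dependency-free.

```python
import random

# Hypothetical shape vocabulary; the summary does not list the paper's actual shapes.
SHAPES = ("circle", "square")
RED = (255, 0, 0)


def draw_shape(image, shape, cx, cy, r, color):
    """Paint a filled circle or square centered at (cx, cy) onto image, in place."""
    height, width = len(image), len(image[0])
    for y in range(max(0, cy - r), min(height, cy + r + 1)):
        for x in range(max(0, cx - r), min(width, cx + r + 1)):
            # Squares fill the whole bounding box; circles fill the inscribed disc.
            if shape == "square" or (x - cx) ** 2 + (y - cy) ** 2 <= r * r:
                image[y][x] = color


def make_counting_task(image, rng, max_shapes=5, radius=4):
    """Overlay random shapes on a copy of image; return (image, question, answer, placed)."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # copy so the source image is untouched
    placed = []
    for _ in range(rng.randint(1, max_shapes)):
        shape = rng.choice(SHAPES)
        for _attempt in range(50):
            cx = rng.randint(radius, width - radius - 1)
            cy = rng.randint(radius, height - radius - 1)
            # Reject positions whose bounding box overlaps an earlier shape,
            # so every counted shape stays fully visible.
            if all(abs(cx - px) > 2 * radius or abs(cy - py) > 2 * radius
                   for _, px, py in placed):
                draw_shape(out, shape, cx, cy, radius, RED)
                placed.append((shape, cx, cy))
                break
    # The answer is computed from the placement record, not read back from pixels.
    target = rng.choice(SHAPES)
    answer = sum(1 for s, _, _ in placed if s == target)
    question = f"How many red {target}s are overlaid on the image?"
    return out, question, str(answer), placed


if __name__ == "__main__":
    gray = [[(128, 128, 128)] * 32 for _ in range(32)]
    _, q, a, shapes = make_counting_task(gray, random.Random(0))
    print(q, a, shapes)
```

Because the overlay is generated procedurally, the same placement record could also back spatial questions (for example, whether one shape lies to the left of another) without any manual annotation.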

multimodal training evaluation

Key Terms

multimodal-large-language-model visual-grounding instruction-tuning dense-supervision