Adding synthetic geometric overlays during training helps multimodal large language models (MLLMs) learn better spatial and quantitative reasoning, which suggests that many visual understanding failures stem from insufficient training data rather than from architectural limits.
This paper introduces Procedurally Generated Tasks (PGT), a method that overlays geometric shapes on images to create training data aimed at fine-grained visual details such as spatial relationships and quantities. Testing shows improvements of up to 20% on visual reasoning benchmarks, with general capabilities kept intact.
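To make the overlay idea concrete, here is a minimal sketch of how such a training example might be generated. It is not the paper's implementation: the function name `make_overlay_task`, the use of filled circles, the grayscale list-of-rows image format, and the two question templates are all assumptions chosen for illustration. The point it shows is that because the shapes are placed by the program, the answers to counting and spatial questions are known exactly and need no human annotation.

```python
import random


def make_overlay_task(image, num_shapes=3, radius=4, seed=0):
    """Overlay filled circles on a grayscale image and derive Q/A labels.

    Hypothetical task design for illustration only; `image` is a list of
    rows of pixel intensities (0-255). Returns the new image, the circle
    centers, and a list of (question, answer) pairs.
    """
    if num_shapes < 2:
        raise ValueError("need at least two shapes for a spatial question")

    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # copy so the input stays untouched

    centers = []
    for i in range(num_shapes):
        # Keep each circle fully inside the image bounds.
        cx = rng.randint(radius, width - radius - 1)
        cy = rng.randint(radius, height - radius - 1)
        centers.append((cx, cy))
        value = 255 if i == 0 else 128  # first circle bright, the rest dim
        for y in range(cy - radius, cy + radius + 1):
            for x in range(cx - radius, cx + radius + 1):
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                    out[y][x] = value

    # Ground truth comes from the generator itself, not from annotation.
    # Simplification: circles may overlap; a real pipeline would likely
    # reject or resample overlapping placements.
    bright_is_leftmost = all(centers[0][0] < c[0] for c in centers[1:])
    qa = [
        ("How many circles are overlaid on the image?", str(num_shapes)),
        ("Is the bright circle the leftmost shape?",
         "yes" if bright_is_leftmost else "no"),
    ]
    return out, centers, qa


# Usage: a blank 64x48 canvas stands in for a real photograph.
blank = [[0] * 64 for _ in range(48)]
image, centers, qa = make_overlay_task(blank, num_shapes=3, seed=42)
```

Under these assumptions, each call yields an image plus question/answer pairs that could be added to a fine-tuning set; varying the seed, the shape count, and the base image produces as many examples as needed.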