Synthetic data targeted at specific visual skills can significantly improve VLM performance on perception tasks, suggesting that natural images alone don't provide enough supervision for low-level visual understanding.
VisionFoundry is a system that generates synthetic training data for vision-language models to improve their visual perception skills like spatial understanding and 3D reasoning.