High accuracy on clean images doesn't guarantee robustness to visual corruption. Vision-language models (VLMs) struggle significantly with degraded text-rich content, especially structured formats like charts and tables, and this gap matters for real-world deployment.
This paper introduces OCR-Robust, a benchmark for testing how well vision-language models handle text recognition and reasoning when images are corrupted or degraded.