For training visual verifiers, structured symbolic explanations such as bounding boxes are more effective than free-text explanations, and separating the learning objective for the binary decision from the one for detailed error analysis improves performance.
This paper introduces OmniVerifier-M1, a visual verification system for multimodal AI models that explains errors with symbolic outputs (such as bounding boxes) rather than text, and that trains the judgment and the error explanation with separate reward signals. The approach enables fine-grained error localization and self-correction in vision-language tasks.
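
To make the idea of separate reward signals concrete, the following is a minimal sketch and not the paper's actual formulation. It assumes the verifier emits a boolean verdict plus a list of bounding boxes marking erroneous regions, and that the explanation is scored by intersection-over-union (IoU) against annotated error boxes. The function names `iou`, `judgment_reward`, and `localization_reward`, the `(x1, y1, x2, y2)` box layout, and the IoU-based scoring are all illustrative assumptions.

```python
from typing import List, Tuple

# Assumed box layout: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def judgment_reward(pred_ok: bool, true_ok: bool) -> float:
    """Hypothetical reward for the binary decision only: 1 if the verdict matches."""
    return 1.0 if pred_ok == true_ok else 0.0


def localization_reward(pred_boxes: List[Box], true_boxes: List[Box]) -> float:
    """Hypothetical reward for the explanation only.

    For each annotated error box, take the best IoU over the predicted boxes,
    then average. Predicting no boxes when none are annotated scores 1.
    """
    if not true_boxes:
        return 1.0 if not pred_boxes else 0.0
    if not pred_boxes:
        return 0.0
    best = [max(iou(t, p) for p in pred_boxes) for t in true_boxes]
    return sum(best) / len(best)


# The two signals are kept apart rather than summed into one scalar,
# so each objective can be optimized and inspected on its own.
r_judge = judgment_reward(pred_ok=False, true_ok=False)
r_local = localization_reward([(0.0, 0.0, 2.0, 2.0)], [(1.0, 1.0, 3.0, 3.0)])
```

Keeping `r_judge` and `r_local` as two values mirrors the separation described above: a verifier can be credited for a correct verdict even when its boxes are imprecise, and the reverse.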