You can train medical vision-language models to perform spatial grounding (locating regions in images) alongside report generation without sacrificing language quality, using automatically curated training data instead of expensive manual annotations.
This paper introduces RefRad2D, a large-scale bilingual dataset of 1.2M medical images paired with text, and RadGrounder, a vision-language model trained to generate radiology reports, answer visual questions, and locate anatomical regions via bounding boxes or segmentation within a single model, all without manual spatial annotations.