LLM-as-a-judge evaluations strongly favor LLM-generated content and do not align with expert human judgment, so automated evaluation alone is insufficient for medical translation quality assurance.
This study compares how radiologists and AI judges evaluate machine-translated Japanese versions of chest CT reports. Radiologists showed poor inter-rater agreement with each other and near-zero agreement with the AI judges, while the AI judges consistently favored AI-generated translations. The findings indicate that automated AI evaluation of translations is unreliable on its own for medical education.
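Agreement between raters in studies like this is typically quantified with a chance-corrected statistic such as Cohen's kappa, where values near zero mean the raters agree no more often than chance. The sketch below is illustrative only: the rating data is hypothetical, not taken from the study, and the study itself may have used a different agreement statistic.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters scoring the same items.

    kappa = (p_observed - p_expected) / (1 - p_expected),
    where p_expected is chance agreement from each rater's marginals.
    """
    n = len(ratings_a)
    # Observed agreement: fraction of items where both raters gave the same score
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's score frequencies
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical 5-point quality scores from two raters on ten reports
rater1 = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]
rater2 = [3, 4, 2, 3, 4, 2, 5, 2, 4, 4]
print(round(cohens_kappa(rater1, rater2), 2))  # low kappa despite 40% raw agreement
```

Note that raw percent agreement can look respectable while kappa stays near zero, which is why chance-corrected statistics are the standard for inter-rater reliability.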