LLM-as-a-judge evaluations strongly favor LLM-generated content and do not align with expert human judgment, so automated evaluation alone is insufficient for medical translation quality assurance.
This study compares how radiologists and AI judges evaluate machine-translated Japanese versions of chest CT reports. Radiologists showed poor inter-rater agreement with each other and near-zero agreement with the AI judges, while the AI judges consistently favored AI-generated translations. The findings indicate that automated AI evaluation of translations is unreliable on its own for medical education.
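Agreement between raters in studies like this is typically quantified with a chance-corrected statistic such as Cohen's kappa, where values near zero mean the raters agree no more often than chance. The sketch below is illustrative only: the rating data is hypothetical, not taken from the study, and the study itself may have used a different agreement statistic.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters scoring the same items.

    kappa = (p_observed - p_expected) / (1 - p_expected),
    where p_expected is chance agreement from each rater's marginals.
    """
    n = len(ratings_a)
    # Observed agreement: fraction of items where both raters gave the same score
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's score frequencies
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical 5-point quality scores from two raters on ten reports
rater1 = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]
rater2 = [3, 4, 2, 3, 4, 2, 5, 2, 4, 4]
print(round(cohens_kappa(rater1, rater2), 2))  # low kappa despite 40% raw agreement
```

Note that raw percent agreement can look respectable while kappa stays near zero, which is why chance-corrected statistics are the standard for inter-rater reliability.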