MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

Valentina Bui Muti, Eugénie Dulout, Ziquan Fu | May 28, 2026 | arXiv

Key Takeaway

LLMs show significantly lower diagnostic accuracy when working with structured healthcare data formats (FHIR) compared to plain text, meaning benchmarks must match actual clinical system requirements to predict real-world performance.

Summary

This paper creates a dataset of realistic medical records in FHIR format (a standard healthcare data structure) to test how well AI models perform clinical reasoning.
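To make the text-versus-FHIR contrast concrete, the sketch below expresses one clinical finding both as a plain sentence and as a minimal FHIR Observation resource. It is an illustration only and is not taken from the dataset: the function name `temperature_observation`, the patient identifier, and the temperature value are all hypothetical, and a real EHR record would carry many more fields.

```python
import json


def temperature_observation(patient_id: str, celsius: float) -> dict:
    """Build a minimal FHIR Observation for a body temperature reading.

    Hypothetical helper for illustration; not part of the paper's pipeline.
    """
    return {
        "resourceType": "Observation",
        "status": "final",
        # LOINC 8310-5 is the standard code for body temperature.
        "code": {
            "coding": [
                {
                    "system": "http://loinc.org",
                    "code": "8310-5",
                    "display": "Body temperature",
                }
            ]
        },
        # Assumed patient identifier, made up for this example.
        "subject": {"reference": f"Patient/{patient_id}"},
        # Units are expressed in UCUM, where "Cel" means degrees Celsius.
        "valueQuantity": {
            "value": celsius,
            "unit": "Cel",
            "system": "http://unitsofmeasure.org",
            "code": "Cel",
        },
    }


# The same finding as free text and as structured data.
plain_text = "The patient has a body temperature of 38.6 degrees Celsius."
resource = temperature_observation("example-1", 38.6)

print(plain_text)
print(json.dumps(resource, indent=2))
```

In the plain sentence the finding can be read directly, while in the structured form a model has to resolve the coded concept, the unit system, and the patient reference before it can reason about the same fact.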

evaluation, applications, data

Key Terms

fhir, ehr-embedded-ai-agent, diagnostic-reasoning, terminology-grounded-validation, synthetic-data