LLMs show significantly lower diagnostic accuracy on structured healthcare data (FHIR) than on plain text, so benchmarks must reflect actual clinical system requirements to predict real-world performance.
The paper builds a dataset of realistic medical records in FHIR (Fast Healthcare Interoperability Resources, a standard for structuring and exchanging healthcare data) to test how well AI models perform clinical reasoning.
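
To make the structured-versus-plain-text contrast concrete, here is a minimal Python sketch of one clinical fact in both forms. It is an illustration only, not the paper's dataset or evaluation pipeline: the record values, the `Patient/example` reference, and the helper `observation_to_text` are hypothetical. The field names follow the FHIR Observation resource.

```python
import json

# Hypothetical record: a single heart-rate measurement as a FHIR Observation.
# The values are made up for illustration and do not come from the paper.
observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {
        "coding": [
            {"system": "http://loinc.org", "code": "8867-4", "display": "Heart rate"}
        ]
    },
    "subject": {"reference": "Patient/example"},
    "valueQuantity": {"value": 112, "unit": "beats/minute"},
}


def observation_to_text(obs: dict) -> str:
    """Flatten a simple quantity-valued Observation into one plain-text line.

    Hypothetical helper: assumes the first coding entry has a display name
    and that the value is stored under valueQuantity.
    """
    name = obs["code"]["coding"][0]["display"]
    qty = obs["valueQuantity"]
    return f"{name}: {qty['value']} {qty['unit']}"


# What a model would receive when given the structured record.
structured_input = json.dumps(observation, indent=2)

# What a model would receive when given the plain-text form.
plain_input = observation_to_text(observation)

print(structured_input)
print(plain_input)  # Heart rate: 112 beats/minute
```

The structured form spreads the same fact across nested keys, coding systems, and references. A model has to navigate that nesting before it can reason about the value, which is one plausible reason accuracy could differ between the two input styles.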