For healthcare AI, smaller fine-tuned models often outperform large reasoning models in both accuracy and confidence estimation, and a model's stated confidence is not a reliable indicator of whether it is actually uncertain.
MADE is a continuously updated benchmark for multi-label classification of medical device adverse events that also measures prediction confidence. It addresses real-world healthcare challenges such as label imbalance and data contamination, evaluating 20+ language models under different uncertainty quantification methods to show which approaches work best for high-stakes medical decisions.
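To make the uncertainty-quantification idea concrete, here is a minimal sketch of one simple approach: thresholded multi-label prediction with a per-label confidence derived from normalized binary entropy. The function name, labels, and scoring scheme are illustrative assumptions for this sketch, not MADE's actual pipeline or metrics.

```python
import math

def predict_with_confidence(label_probs, threshold=0.5):
    """Multi-label prediction with a simple entropy-based confidence.

    label_probs: dict mapping label -> model probability (hypothetical
    classifier output; MADE's real methods may differ).
    Returns (predicted_labels, confidences), where confidence is
    1 - normalized binary entropy: 1.0 means certain, 0.0 maximally unsure.
    """
    predicted, confidences = [], {}
    for label, p in label_probs.items():
        # Binary entropy in bits; 0 at p in {0, 1}, 1 at p = 0.5.
        if p in (0.0, 1.0):
            h = 0.0
        else:
            h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
        confidences[label] = 1.0 - h
        if p >= threshold:
            predicted.append(label)
    return predicted, confidences

probs = {"device malfunction": 0.92, "patient injury": 0.55, "user error": 0.08}
labels, conf = predict_with_confidence(probs)
```

Note that a probability of 0.55 crosses the decision threshold yet carries near-zero confidence, which is exactly the gap between a model's stated answer and its actual uncertainty that the benchmark probes.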