For healthcare AI, smaller fine-tuned models often outperform large reasoning models in both accuracy and confidence estimation, and a model's stated confidence is not a reliable indicator of whether it is actually uncertain.
MADE is a continuously updated benchmark for multi-label classification of medical device adverse events that also measures prediction confidence. It addresses real-world healthcare challenges such as label imbalance and data contamination, evaluating 20+ language models under different uncertainty quantification methods to show which approaches work best for high-stakes medical decisions.
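To make the uncertainty-quantification idea concrete, here is a minimal sketch of one simple approach: thresholded multi-label prediction with a per-label confidence derived from normalized binary entropy. The function name, labels, and scoring scheme are illustrative assumptions for this sketch, not MADE's actual pipeline or metrics.

```python
import math

def predict_with_confidence(label_probs, threshold=0.5):
    """Multi-label prediction with a simple entropy-based confidence.

    label_probs: dict mapping label -> model probability (hypothetical
    classifier output; MADE's real methods may differ).
    Returns (predicted_labels, confidences), where confidence is
    1 - normalized binary entropy: 1.0 means certain, 0.0 maximally unsure.
    """
    predicted, confidences = [], {}
    for label, p in label_probs.items():
        # Binary entropy in bits; 0 at p in {0, 1}, 1 at p = 0.5.
        if p in (0.0, 1.0):
            h = 0.0
        else:
            h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
        confidences[label] = 1.0 - h
        if p >= threshold:
            predicted.append(label)
    return predicted, confidences

probs = {"device malfunction": 0.92, "patient injury": 0.55, "user error": 0.08}
labels, conf = predict_with_confidence(probs)
```

Note that a probability of 0.55 crosses the decision threshold yet carries near-zero confidence, which is exactly the gap between a model's stated answer and its actual uncertainty that the benchmark probes.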