Before deploying LLMs in clinical settings, you need model-specific fairness audits based on counterfactual testing: demographic parity alone doesn't guarantee fair decisions at the individual level, and interventions like demographic blinding work differently across models.
Researchers audited five large language models for gender bias in emergency department triage decisions and found that every model showed concerning flip rates, changing its triage decision in 9.9-43.8% of cases when only the patient's gender was swapped.
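To make the counterfactual test concrete, here is a minimal sketch of a flip-rate audit in Python. Everything below (the `TriageCase` structure, the `dummy_triage` stand-in, the vignette wording) is a hypothetical illustration, not the researchers' actual protocol: the core idea is simply to query the model twice on vignettes that differ only in gender and count how often the decision changes.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TriageCase:
    """A clinical vignette written with a {gender} placeholder."""
    template: str   # e.g. "A 54-year-old {gender} presents with chest pain."
    gender: str     # gender in the original case ("male" or "female")

def swap_gender(gender: str) -> str:
    """Counterfactual intervention: flip only the gender attribute."""
    return "female" if gender == "male" else "male"

def flip_rate(cases: List[TriageCase],
              triage_fn: Callable[[str], str]) -> float:
    """Fraction of cases whose triage decision changes when only the
    patient's gender is swapped (every other detail held fixed)."""
    flips = 0
    for case in cases:
        original = triage_fn(case.template.format(gender=case.gender))
        counterfactual = triage_fn(
            case.template.format(gender=swap_gender(case.gender)))
        if original != counterfactual:
            flips += 1
    return flips / len(cases)

# Hypothetical stand-in for the LLM call; a real audit would prompt the
# model under test and parse an acuity level (e.g. ESI 1-5) from its reply.
def dummy_triage(vignette: str) -> str:
    return "ESI-3" if "female" in vignette else "ESI-2"

cases = [TriageCase("A 54-year-old {gender} presents with chest pain.", "male")]
print(f"Flip rate: {flip_rate(cases, dummy_triage):.1%}")  # 100.0% for this biased stub
```

Holding every non-demographic detail fixed is what distinguishes this individual-level counterfactual check from a group-level parity comparison: a model can pass demographic parity in aggregate while still flipping decisions for matched pairs of patients.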