When deploying LLMs in new languages or sectors that lack safety benchmarks, you cannot collapse safety comparisons into a single score. Instead, you must report the full context: which scenarios were tested, which judge scored them, which risk measure was used, and the uncertainty around each comparison.
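As an illustration, here is a minimal sketch of what reporting one comparison with its full context might look like. The model names, judge, and scenario set are hypothetical, the risk measure is assumed to be a judge-flagged failure rate, and the uncertainty is a percentile bootstrap interval over scenarios; none of this is taken from the paper itself.

```python
# Hedged sketch: report a pairwise safety comparison with its context
# (scenarios, judge, risk measure) and a bootstrap uncertainty interval.
# All names and data below are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-scenario failure indicators from one judge (1 = unsafe).
judge_scores = {
    "model_a": np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0]),
    "model_b": np.array([1, 1, 0, 1, 1, 0, 1, 0, 1, 1]),
}

def bootstrap_ci(diff, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for the mean of per-scenario differences."""
    n = len(diff)
    means = np.array([diff[rng.integers(0, n, n)].mean() for _ in range(n_boot)])
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

# Pair scores by scenario, then summarize the difference with its CI.
diff = judge_scores["model_b"] - judge_scores["model_a"]
lo, hi = bootstrap_ci(diff)
print(
    "scenarios=toy_set_v1 judge=judge_x risk=failure_rate "
    f"delta={diff.mean():.2f} 95%CI=[{lo:.2f}, {hi:.2f}]"
)
```

If the interval straddles zero, the comparison is reported as inconclusive rather than being folded into a headline score.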
This paper tackles a real-world problem: comparing AI models for safety when no labeled benchmark exists yet. Instead of relying on ground-truth labels, the authors validate safety scores through three checks: whether scores respond to genuine safety changes, whether differences between models outweigh measurement noise, and whether results stay consistent across retests.
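A rough sketch of how those three checks might be operationalized, assuming repeated evaluation runs per model and a before/after score for a model with a known safety fix. The data, names, and thresholds here are hypothetical illustrations, not the paper's method.

```python
# Hedged sketch of three label-free validity checks for a safety score.
# `runs[model]` holds scores from repeated evaluations of the same model;
# `baseline`/`patched` are scores before/after a known safety fix.
import numpy as np

runs = {
    "model_a": np.array([0.20, 0.22, 0.19]),  # repeated eval runs
    "model_b": np.array([0.55, 0.53, 0.57]),
}
baseline, patched = 0.55, 0.30  # hypothetical before/after a safety fix

# 1) Sensitivity: the score should move when safety genuinely changes.
sensitive = patched < baseline

# 2) Discrimination: between-model gaps should exceed rerun noise.
between = abs(runs["model_a"].mean() - runs["model_b"].mean())
noise = max(r.std(ddof=1) for r in runs.values())
discriminates = between > 2 * noise  # hypothetical 2-sigma threshold

# 3) Reliability: repeated evaluations of the same model should agree.
reliable = all(r.std(ddof=1) / r.mean() < 0.1 for r in runs.values())

print(f"sensitive={sensitive} discriminates={discriminates} reliable={reliable}")
```

A score that passes all three checks can be trusted to rank models even without ground-truth labels; failing any one of them flags the evaluation setup, not the models, as the problem.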