When deploying LLMs in new languages or sectors that lack safety benchmarks, you cannot collapse safety comparisons into a single score. Instead, you must report the full context: which scenarios were tested, which judge scored them, which risk measure was used, and the uncertainty around each comparison.
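As an illustration, here is a minimal sketch of what reporting one comparison with its full context might look like. The model names, judge, and scenario set are hypothetical, the risk measure is assumed to be a judge-flagged failure rate, and the uncertainty is a percentile bootstrap interval over scenarios; none of this is taken from the paper itself.

```python
# Hedged sketch: report a pairwise safety comparison with its context
# (scenarios, judge, risk measure) and a bootstrap uncertainty interval.
# All names and data below are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-scenario failure indicators from one judge (1 = unsafe).
judge_scores = {
    "model_a": np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0]),
    "model_b": np.array([1, 1, 0, 1, 1, 0, 1, 0, 1, 1]),
}

def bootstrap_ci(diff, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for the mean of per-scenario differences."""
    n = len(diff)
    means = np.array([diff[rng.integers(0, n, n)].mean() for _ in range(n_boot)])
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

# Pair scores by scenario, then summarize the difference with its CI.
diff = judge_scores["model_b"] - judge_scores["model_a"]
lo, hi = bootstrap_ci(diff)
print(
    "scenarios=toy_set_v1 judge=judge_x risk=failure_rate "
    f"delta={diff.mean():.2f} 95%CI=[{lo:.2f}, {hi:.2f}]"
)
```

If the interval straddles zero, the comparison is reported as inconclusive rather than being folded into a headline score.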
This paper tackles a real-world problem: comparing AI models for safety when no labeled benchmark exists yet. Instead of relying on ground-truth labels, the authors validate safety scores through three checks: whether scores respond to genuine safety changes, whether differences between models outweigh measurement noise, and whether results stay consistent across retests.
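A rough sketch of how those three checks might be operationalized, assuming repeated evaluation runs per model and a before/after score for a model with a known safety fix. The data, names, and thresholds here are hypothetical illustrations, not the paper's method.

```python
# Hedged sketch of three label-free validity checks for a safety score.
# `runs[model]` holds scores from repeated evaluations of the same model;
# `baseline`/`patched` are scores before/after a known safety fix.
import numpy as np

runs = {
    "model_a": np.array([0.20, 0.22, 0.19]),  # repeated eval runs
    "model_b": np.array([0.55, 0.53, 0.57]),
}
baseline, patched = 0.55, 0.30  # hypothetical before/after a safety fix

# 1) Sensitivity: the score should move when safety genuinely changes.
sensitive = patched < baseline

# 2) Discrimination: between-model gaps should exceed rerun noise.
between = abs(runs["model_a"].mean() - runs["model_b"].mean())
noise = max(r.std(ddof=1) for r in runs.values())
discriminates = between > 2 * noise  # hypothetical 2-sigma threshold

# 3) Reliability: repeated evaluations of the same model should agree.
reliable = all(r.std(ddof=1) / r.mean() < 0.1 for r in runs.values())

print(f"sensitive={sensitive} discriminates={discriminates} reliable={reliable}")
```

A score that passes all three checks can be trusted to rank models even without ground-truth labels; failing any one of them flags the evaluation setup, not the models, as the problem.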