Online Safety Monitoring for LLMs

Mona Schirmer, Metod Jazbec, Alexander Timans, Christian Naesseth, Maja Waldron et al. | July 2, 2026 | arXiv

Key Takeaway

Simple threshold-based monitoring with statistical risk control can effectively catch unsafe LLM outputs in production without requiring complex sequential testing methods.

Summary

This paper presents a real-time safety monitoring system for LLMs that uses a verifier model to detect unsafe outputs at deployment time. The approach calibrates decision thresholds using risk control methods and proves competitive with more complex alternatives on reasoning and adversarial datasets.
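To make the threshold calibration step concrete, here is a minimal sketch. It is not the paper's procedure, which this summary does not spell out. It swaps in a standard split-conformal quantile as one simple form of risk control. The assumptions are that the verifier emits a scalar score where higher means more likely unsafe, that a held-out calibration set of scores for known-unsafe outputs is available, and that calibration and deployment data are exchangeable. The names `calibrate_threshold`, `flag_unsafe`, and `alpha` are hypothetical.

```python
import math


def calibrate_threshold(unsafe_scores, alpha):
    """Pick a threshold tau from verifier scores of known-unsafe outputs.

    Assumption: higher score = more likely unsafe, and an output is
    flagged when score >= tau. Under exchangeability, a new unsafe
    output is missed (score < tau) with probability at most alpha.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    m = len(unsafe_scores)
    # Number of calibration points allowed to fall below the threshold.
    k = math.floor(alpha * (m + 1))
    if k == 0:
        # Too few calibration points for this alpha: flag everything.
        return float("-inf")
    # k-th smallest unsafe score; at most k of m + 1 ranks lie below it.
    return sorted(unsafe_scores)[k - 1]


def flag_unsafe(score, tau):
    """Deployment-time decision: flag the output if its score reaches tau."""
    return score >= tau


# Illustrative (made-up) calibration scores for unsafe outputs.
calibration = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.95, 0.85, 0.75]
tau = calibrate_threshold(calibration, alpha=0.25)
```

A smaller `alpha` lowers the threshold and flags more outputs, trading a lower miss rate for more false alarms on safe outputs. The guarantee in this sketch is marginal over calibration and test draws, not conditional on a particular prompt.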

safety evaluation efficiency

Key Terms

safety-monitoring risk-control verifier-signal threshold-calibration