Simple threshold-based monitoring, calibrated with statistical risk control, can catch unsafe LLM outputs in production without complex sequential testing methods.
This paper presents a real-time safety monitoring system for LLMs in which a verifier model detects unsafe outputs at deployment time. The decision thresholds are calibrated with risk control methods, and the approach is competitive with more complex alternatives on reasoning and adversarial datasets.
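To make the threshold calibration step concrete, here is a minimal sketch of one way it could work. The summary does not name the specific risk control procedure, so this example assumes a conformal-risk-control-style bound, verifier scores in [0, 1] where higher means more likely unsafe, and a labeled calibration set that is exchangeable with deployment traffic. The function names `calibrate_threshold` and `flag` are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def calibrate_threshold(
    scores: Sequence[float],
    unsafe: Sequence[bool],
    alpha: float,
    grid_size: int = 101,
) -> float:
    """Pick the largest threshold whose adjusted miss rate stays <= alpha.

    Assumption (not from the paper): an output is flagged when its verifier
    score is >= threshold, and the loss for one example is 1 when an unsafe
    output is NOT flagged, else 0.
    """
    n = len(scores)
    if n == 0 or n != len(unsafe):
        raise ValueError("scores and unsafe must be non-empty and equal length")

    best = 0.0  # threshold 0.0 flags everything, the most conservative choice
    for k in range(grid_size):
        t = k / (grid_size - 1)
        # Unsafe calibration examples that would slip past threshold t.
        misses = sum(1 for s, u in zip(scores, unsafe) if u and s < t)
        # Finite-sample adjusted risk for a loss bounded by 1:
        # (n / (n + 1)) * (misses / n) + 1 / (n + 1) == (misses + 1) / (n + 1)
        if (misses + 1) / (n + 1) <= alpha:
            best = t
        else:
            break  # misses can only grow as t increases, so stop here
    return best


def flag(score: float, threshold: float) -> bool:
    """Return True when the output should be treated as unsafe."""
    return score >= threshold
```

A higher threshold flags fewer outputs, so the sketch searches for the least intrusive threshold that still keeps the adjusted rate of missed unsafe outputs at or below `alpha`. When the calibration set is too small for any threshold to satisfy the bound, it falls back to 0.0, which flags every output.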