LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems

Hanwool Lee, Dasol Choi, Bokyeong Kim, Seung Geun Kim, Haon Park | June 18, 2026 | arXiv

Key Takeaway

LLM agents in safety-critical systems have overlapping but distinct failure modes under adversarial pressure, meaning you can't rely on a single defense strategy across different models.

Summary

NRT-Bench is a benchmark that tests how safely LLM agents can operate a simulated nuclear power plant while facing sustained, adaptive attacks over multiple turns. It shows that current frontier models fail 8.7% to 12.1% of the time under attack and, crucially, that each model has different vulnerabilities: a defense that helps one model can actually hurt another.

safety evaluation agents

Key Terms

red-team jailbreaking adversarial-robustness safety-critical-systems multi-turn-dialogue