Different LLM agents in safety-critical systems fail in overlapping but distinct ways under adversarial pressure, so you can't rely on a single defense strategy across models.
NRT-Bench is a benchmark that tests how well LLM agents can safely operate a simulated nuclear power plant under sustained, adaptive attacks spanning multiple turns. It shows that current frontier models fail 8.7-12.1% of the time under attack and that each model has different vulnerabilities: a defense that helps one model can hurt another.
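
To make the "helps one model, hurts another" point concrete, here is a minimal sketch of how a per-model defense effect could be computed from episode outcomes. It is not NRT-Bench's scoring code: the record layout, the helper names (`make_episodes`, `failure_rates`, `defense_effect`), the model labels, and every count are assumptions invented for illustration.

```python
from collections import defaultdict


def make_episodes(model, defended, failures, total):
    """Build toy episode records of the form (model, defended, failed)."""
    return [(model, defended, i < failures) for i in range(total)]


# Invented counts for illustration only; these are NOT NRT-Bench results.
EPISODES = (
    make_episodes("model_a", False, 3, 20)
    + make_episodes("model_a", True, 1, 20)
    + make_episodes("model_b", False, 2, 20)
    + make_episodes("model_b", True, 4, 20)
)


def failure_rates(episodes):
    """Return the failure rate for each (model, defended) pair."""
    counts = defaultdict(lambda: [0, 0])  # key -> [failures, total]
    for model, defended, failed in episodes:
        cell = counts[(model, defended)]
        cell[0] += int(failed)
        cell[1] += 1
    return {key: fails / total for key, (fails, total) in counts.items()}


def defense_effect(rates):
    """Defended minus undefended failure rate per model (negative = helps)."""
    models = sorted({model for model, _ in rates})
    return {m: rates[(m, True)] - rates[(m, False)] for m in models}


rates = failure_rates(EPISODES)
effects = defense_effect(rates)
for model, delta in effects.items():
    verdict = "helps" if delta < 0 else "hurts" if delta > 0 else "no change"
    print(f"{model}: {delta:+.2%} ({verdict})")
```

With these toy numbers the same defense lowers the failure rate for `model_a` and raises it for `model_b`, which is the pattern the summary describes. Comparing the sign of the effect per model, rather than averaging across models, is what keeps such a regression from being hidden.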