Even state-of-the-art LLMs with safety training remain vulnerable to sustained automated attacks, particularly adaptive search methods that iteratively refine prompts; static defenses alone are insufficient.
This study systematically tests two advanced AI models (Anthropic's Fable 5 and Opus 4.8) against thousands of automated jailbreak attacks across harmful scenarios. Despite strong defenses, both models can still be broken, especially by adaptive, iterative attacks: fully automated red-teaming, with no human experts involved, produced hundreds of confirmed harmful outputs.