A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models

Nicola Franco | June 16, 2026 | arXiv

Key Takeaway

Even state-of-the-art LLMs with safety training remain vulnerable to sustained automated attacks, particularly adaptive search methods that iteratively refine prompts; static defenses alone are insufficient.

Summary

This study systematically tests two advanced AI models, Anthropic's Fable 5 and Opus 4.8, against thousands of automated jailbreak attacks spanning a range of harmful scenarios. Despite strong defenses, both models can still be broken, especially by adaptive, iterative attacks. These attacks produced hundreds of confirmed harmful outputs even though the red-teaming was fully automated, with no human experts involved.
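The card does not describe the attack algorithms themselves. As a hypothetical illustration of why adaptive search can succeed where a single static attempt fails, here is a toy sketch of greedy hill climbing against a scoring judge. Everything in it is an assumption made for illustration: abstract tokens stand in for prompt edits, `mock_judge` is a stand-in scorer rather than a real model or classifier, and `adaptive_search`, `append_token`, `VOCAB`, and `TARGET_TOKENS` are hypothetical names, not the study's method.

```python
import random
from typing import Callable, Tuple

# Hypothetical toy setup: abstract tokens stand in for prompt edits, and the
# "judge" is a mock scorer, not a real model or classifier.
VOCAB = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]
TARGET_TOKENS = {"alpha", "beta", "gamma"}


def mock_judge(prompt: str) -> float:
    """Mock score in [0, 1]: fraction of target tokens present in the prompt."""
    return len(set(prompt.split()) & TARGET_TOKENS) / len(TARGET_TOKENS)


def append_token(prompt: str, rng: random.Random) -> str:
    """Hypothetical mutation operator: append one random vocabulary token."""
    return f"{prompt} {rng.choice(VOCAB)}"


def adaptive_search(
    seed: str,
    mutate: Callable[[str, random.Random], str],
    score: Callable[[str], float],
    max_queries: int = 200,
    threshold: float = 1.0,
    rng_seed: int = 0,
) -> Tuple[str, float, int]:
    """Greedy hill climbing: keep a mutated candidate only if it scores higher.

    Returns the best candidate, its score, and the number of judge queries used.
    """
    rng = random.Random(rng_seed)
    best, best_score = seed, score(seed)
    queries = 1
    # Keep refining until the judge threshold is met or the query budget runs out.
    while best_score < threshold and queries < max_queries:
        candidate = mutate(best, rng)
        candidate_score = score(candidate)
        queries += 1
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score, queries


if __name__ == "__main__":
    # A static (non-adaptive) attempt is scored once and never refined.
    static_score = mock_judge("start")
    best, best_score, queries = adaptive_search("start", append_token, mock_judge)
    print(f"static score: {static_score:.2f}")
    print(f"adaptive score: {best_score:.2f} after {queries} queries")
```

Greedy acceptance is the simplest possible adaptive rule: the search only needs feedback from the judge, not knowledge of the defense, which is the intuition behind the takeaway that static defenses alone are insufficient. Real adaptive attacks typically use richer mutation and scoring strategies; this sketch makes no claim about the setup used in the study.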

safety evaluation

Key Terms

red-team, jailbreaking, adversarial-robustness, harm-taxonomy, adaptive-attack