Misdirection-based defenses, which give automated attackers false feedback, are more effective than simple refusals, which an automated search can easily detect and learn from.
This paper analyzes how AI systems can defend against automated jailbreak attacks by using misdirection instead of simple refusals. Rather than blocking attacks in a predictable way, the system returns misleading but safe responses that confuse the attacker's automated evaluation tools, making it harder for the attacker to tell whether a prompt actually worked.
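
To make the mechanism concrete, the following is a minimal toy sketch, not the paper's implementation. Every name in it (`is_flagged`, `refusal_defense`, `misdirection_defense`, `naive_judge`, `attack_search`) is hypothetical, and the keyword-based detector and refusal-matching judge are deliberately simplistic stand-ins for the real components on either side. It only illustrates why a predictable refusal gives an automated search a clean failure signal, while a safe decoy can make the search stop on a prompt that achieved nothing.

```python
# Toy illustration only: all names and heuristics here are assumptions,
# not components described in the source.

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry")


def is_flagged(prompt: str) -> bool:
    """Hypothetical stand-in for the defender's harmful-prompt detector."""
    return "forbidden" in prompt.lower()


def refusal_defense(prompt: str) -> str:
    """Baseline defense: answer flagged prompts with a predictable refusal."""
    if is_flagged(prompt):
        return "I'm sorry, but I can't help with that."
    return f"Here is a helpful answer to: {prompt}"


def misdirection_defense(prompt: str) -> str:
    """Misdirection: answer flagged prompts with a compliant-sounding but harmless decoy."""
    if is_flagged(prompt):
        return "Sure, here is an overview: general safety practices matter in this area."
    return f"Here is a helpful answer to: {prompt}"


def naive_judge(response: str) -> bool:
    """Toy attacker-side evaluator: scores any non-refusal as a success."""
    lowered = response.lower()
    return not any(marker in lowered for marker in REFUSAL_MARKERS)


def attack_search(defense, candidates):
    """Return the first candidate prompt the judge scores as a success, else None."""
    for prompt in candidates:
        if naive_judge(defense(prompt)):
            return prompt
    return None


if __name__ == "__main__":
    candidates = ["forbidden request v1", "forbidden request v2"]
    # Refusals give the attacker accurate feedback: every candidate is scored as a failure.
    print(attack_search(refusal_defense, candidates))
    # The decoy is scored as a success, so the search stops on a prompt that yielded nothing harmful.
    print(attack_search(misdirection_defense, candidates))
```

In this sketch the refusal defense lets the search correctly conclude that every candidate failed, which is exactly the signal an attacker would use to keep mutating prompts. The misdirection defense causes the same search to report its first candidate as a success even though the response contains nothing harmful. A real attacker would likely use a stronger judge than substring matching, so how well a decoy holds up in practice depends on how convincing it is to that judge.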