Multi-turn attacks leave detectable signatures in LLM activations that text-level defenses miss—you can catch covert attacks by monitoring how the model's internal states shift across conversation turns, but detection models don't transfer between different LLM architectures.
This paper detects multi-turn prompt injection attacks by analyzing patterns in a language model's internal activations rather than just the text. The researchers found that adversarial attacks create a distinctive 'restlessness' signature in the model's activation patterns as attackers progress through trust-building, pivoting, and escalation phases.