Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection

Prashant Kulkarni|April 30, 2026arXiv

Key Takeaway

Multi-turn attacks leave detectable signatures in LLM activations that text-level defenses miss—you can catch covert attacks by monitoring how the model's internal states shift across conversation turns, but detection models don't transfer between different LLM architectures.

Summary

This paper detects multi-turn prompt injection attacks by analyzing patterns in a language model's internal activations rather than just the text. The researchers found that adversarial attacks create a distinctive 'restlessness' signature in the model's activation patterns as attackers progress through trust-building, pivoting, and escalation phases.

safety evaluation

Key Terms

activation-patching residual-stream prompt-injection multi-turn-dialogue