Steering vectors act through a specific attention pathway, the OV (output-value) circuit, rather than broadly reshaping model behavior. This insight enables dramatic compression and explains why different steering methods produce similar results.
This paper reveals how steering vectors work inside language models by studying refusal behavior. Using activation patching, the researchers found that steering vectors primarily affect the attention mechanism's output-value (OV) circuit rather than the query-key (QK) circuit, and can be compressed by 90-99% while maintaining performance.
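As a rough sketch (not the paper's code), the snippet below illustrates the two ingredients this summary describes: adding a steering vector into a layer's output via a forward hook, and compressing that vector by keeping only its largest-magnitude components (here ~5%, i.e. 95% compression). The names `steering_vec`, `compress`, and `keep_frac`, and the toy linear "block", are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64

# Toy stand-in for one transformer layer's residual-stream update.
block = nn.Linear(d_model, d_model)

# Hypothetical steering vector, e.g. the mean activation difference
# between refusal and compliance prompts at this layer.
steering_vec = torch.randn(d_model)

def compress(vec: torch.Tensor, keep_frac: float = 0.05) -> torch.Tensor:
    """Zero all but the largest-magnitude components of the vector
    (keep_frac=0.05 corresponds to ~95% compression)."""
    k = max(1, int(keep_frac * vec.numel()))
    idx = vec.abs().topk(k).indices
    sparse = torch.zeros_like(vec)
    sparse[idx] = vec[idx]
    return sparse

compressed = compress(steering_vec, keep_frac=0.05)

def steer_hook(module, inputs, output):
    # Nudge the layer's output along the (compressed) steering direction.
    return output + compressed

handle = block.register_forward_hook(steer_hook)
x = torch.randn(1, d_model)       # stand-in residual-stream activation
steered = block(x)                # steered forward pass
handle.remove()
unsteered = block(x)              # baseline forward pass
print("steering delta norm:", (steered - unsteered).norm().item())
```

Comparing the steered and unsteered outputs of the same layer is the same kind of intervention-and-compare logic that activation patching uses, just at the level of a single added vector rather than swapped activations.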