Steering vectors work mainly by changing what attention layers write to their output, not which tokens they attend to, and they can be compressed by 90-99% without losing performance, which makes them more practical for deployment.
This paper investigates how steering vectors work inside language models, using refusal behavior as its case study. The researchers find that steering vectors primarily affect the attention mechanism's output-value (OV) circuit rather than the query-key (QK) circuit, and that the vectors can be compressed by 90-99% while remaining effective.
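To make the OV/QK distinction and the compression claim concrete, here is a minimal NumPy sketch on a single toy attention head with random weights. It rests on assumptions that this summary does not confirm: that the QK path can be isolated by freezing the attention pattern to the unsteered run, and that "compression" means keeping only the largest-magnitude coordinates of the vector. Every name here (`attention_head`, `sparsify_top_k`, `steer`) is hypothetical, and the random weights mean the printed numbers say nothing about the paper's results.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(x_qk, x_v, W_Q, W_K, W_V, W_O):
    """Single causal attention head with separable inputs.

    x_qk feeds the query-key path (decides where to attend);
    x_v feeds the output-value path (decides what gets written).
    """
    d_head = W_Q.shape[1]
    scores = (x_qk @ W_Q) @ (x_qk @ W_K).T / np.sqrt(d_head)
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # causal mask
    pattern = softmax(scores, axis=-1)          # each row sums to 1
    return pattern @ (x_v @ W_V) @ W_O

def sparsify_top_k(v, keep_fraction):
    """Zero all but the largest-magnitude entries (assumed compression scheme)."""
    k = max(1, int(round(keep_fraction * v.size)))
    keep = np.argsort(np.abs(v))[-k:]
    out = np.zeros_like(v)
    out[keep] = v[keep]
    return out

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 20, 4
x = rng.normal(size=(seq_len, d_model))           # toy residual stream
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(d_head, d_model))
steer = rng.normal(size=d_model)                  # toy steering vector

x_steered = x + steer                             # added at every position

base = attention_head(x, x, W_Q, W_K, W_V, W_O)
full = attention_head(x_steered, x_steered, W_Q, W_K, W_V, W_O)
ov_only = attention_head(x, x_steered, W_Q, W_K, W_V, W_O)  # pattern frozen
qk_only = attention_head(x_steered, x, W_Q, W_K, W_V, W_O)  # values frozen

print("full effect norm:", np.linalg.norm(full - base))
print("OV-only effect norm:", np.linalg.norm(ov_only - base))
print("QK-only effect norm:", np.linalg.norm(qk_only - base))

# Keep 10% of coordinates, i.e. 90% compression under the assumed scheme.
sparse_steer = sparsify_top_k(steer, keep_fraction=0.10)
cosine = sparse_steer @ steer / (np.linalg.norm(sparse_steer) * np.linalg.norm(steer))
print("nonzero entries kept:", np.count_nonzero(sparse_steer))
print("cosine similarity to dense vector:", cosine)
```

One property of this toy setup holds exactly: with the attention pattern frozen, the steering vector shifts the head output by `steer @ W_V @ W_O` at every position, because each softmax row sums to 1. Comparing `ov_only` and `qk_only` against `full` on a real model would be one way to probe which path carries most of the steering effect.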