Steering vectors act through a specific attention pathway, the OV (output-value) circuit, rather than broadly reshaping model behavior. This insight enables dramatic compression and explains why different steering methods produce similar results.
This paper reveals how steering vectors work inside language models by studying refusal behavior. Using activation patching, the researchers found that steering vectors primarily affect the attention mechanism's output-value (OV) circuit rather than the query-key (QK) circuit, and can be compressed by 90-99% while maintaining performance.
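As a rough sketch (not the paper's code), the snippet below illustrates the two ingredients this summary describes: adding a steering vector into a layer's output via a forward hook, and compressing that vector by keeping only its largest-magnitude components (here ~5%, i.e. 95% compression). The names `steering_vec`, `compress`, and `keep_frac`, and the toy linear "block", are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64

# Toy stand-in for one transformer layer's residual-stream update.
block = nn.Linear(d_model, d_model)

# Hypothetical steering vector, e.g. the mean activation difference
# between refusal and compliance prompts at this layer.
steering_vec = torch.randn(d_model)

def compress(vec: torch.Tensor, keep_frac: float = 0.05) -> torch.Tensor:
    """Zero all but the largest-magnitude components of the vector
    (keep_frac=0.05 corresponds to ~95% compression)."""
    k = max(1, int(keep_frac * vec.numel()))
    idx = vec.abs().topk(k).indices
    sparse = torch.zeros_like(vec)
    sparse[idx] = vec[idx]
    return sparse

compressed = compress(steering_vec, keep_frac=0.05)

def steer_hook(module, inputs, output):
    # Nudge the layer's output along the (compressed) steering direction.
    return output + compressed

handle = block.register_forward_hook(steer_hook)
x = torch.randn(1, d_model)       # stand-in residual-stream activation
steered = block(x)                # steered forward pass
handle.remove()
unsteered = block(x)              # baseline forward pass
print("steering delta norm:", (steered - unsteered).norm().item())
```

Comparing the steered and unsteered outputs of the same layer is the same kind of intervention-and-compare logic that activation patching uses, just at the level of a single added vector rather than swapped activations.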