SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

Hao Li, Jingkun An, Zijun Song, Pengyu Zhu, Rui Li et al. | June 1, 2026 | arXiv

Key Takeaway

You can align LLMs for safety without the usual trade-off in general capabilities by restricting safety training to specific tokens rather than retraining globally, and the approach works with minimal data.

Summary

SafeSteer is a method that makes LLMs safer without hurting their general abilities by focusing safety training only on the specific tokens that matter for safety decisions.
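
To make the idea of token-localized training concrete, here is a minimal sketch of a reverse-KL distillation loss that is averaged only over selected token positions. It is an illustration under assumptions, not the paper's implementation: this summary does not state how safety tokens are chosen or what the exact objective is, so the mask is taken as a given input, and the names `log_softmax`, `reverse_kl`, `localized_reverse_kl`, and `safety_mask` are hypothetical. In an on-policy setting the logits would be computed on sequences sampled from the student itself; that sampling step is omitted here.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits: Sequence[float],
               teacher_logits: Sequence[float]) -> float:
    """KL(student || teacher) at a single token position."""
    log_p = log_softmax(student_logits)  # student distribution
    log_q = log_softmax(teacher_logits)  # teacher distribution
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def localized_reverse_kl(student_logits: Sequence[Sequence[float]],
                         teacher_logits: Sequence[Sequence[float]],
                         safety_mask: Sequence[bool]) -> float:
    """Mean reverse KL over positions where safety_mask is True.

    Assumption: safety_mask comes from some token-selection rule that
    this sketch does not implement. Unselected positions contribute
    nothing, so they receive no training signal from this loss.
    """
    selected = [
        reverse_kl(s, t)
        for s, t, keep in zip(student_logits, teacher_logits, safety_mask)
        if keep
    ]
    if not selected:
        return 0.0
    return sum(selected) / len(selected)


if __name__ == "__main__":
    # Toy example: two positions over a two-token vocabulary.
    student = [[0.0, 0.0], [0.0, 0.0]]
    teacher = [[0.0, 0.0], [math.log(3.0), 0.0]]
    # Only the second position is treated as safety-relevant.
    print(localized_reverse_kl(student, teacher, [False, True]))
```

Because the loss is zero wherever the mask is off, the student is only pulled toward the teacher at the selected positions, which is one plausible way a localized objective could leave general behavior largely untouched.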

safety alignment efficiency

Key Terms

activation-steering, alignment-tax, on-policy-distillation, reverse-kl-divergence, safety-token-selection