Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes

Jeremias Ferrao, Niclas Müller-Hof, Iustin Sîrbu, Traian Rebedea, Yftah Ziser | June 25, 2026 | arXiv

Key Takeaway

Training safety classifiers to explicitly model user intent, rather than only analyzing the prompt text, produces more robust safety decisions across different training approaches and on external benchmarks.

Summary

This paper shows that safety classifiers perform better when they explicitly model what users intend to do, not just what they say. The authors created AIMS, a dataset of 1,724 challenging safety prompts paired with intent descriptions, and tested intent-aware training across several training regimes (fine-tuning, preference learning, reasoning distillation, and reinforcement learning).
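To make the distinction concrete, here is a minimal sketch of how an intent-aware fine-tuning target might differ from a label-only one. Everything in it is an assumption for illustration: the `SafetyExample` fields, the `Intent:`/`Label:` template, and the sample prompt are hypothetical and do not reflect the actual AIMS schema or the authors' training format.

```python
from dataclasses import dataclass


@dataclass
class SafetyExample:
    # Hypothetical record layout; not the real AIMS schema.
    prompt: str   # user prompt to classify
    intent: str   # free-text description of what the user is trying to do
    label: str    # "safe" or "unsafe"


def direct_target(ex: SafetyExample) -> str:
    """Label-only target: the classifier emits just the decision."""
    return f"Label: {ex.label}"


def intent_aware_target(ex: SafetyExample) -> str:
    """Intent-aware target: state the inferred intent first, then the decision."""
    return f"Intent: {ex.intent}\nLabel: {ex.label}"


def to_sft_pair(ex: SafetyExample, intent_aware: bool = True) -> dict:
    """Build a prompt/completion pair for supervised fine-tuning (assumed template)."""
    target = intent_aware_target(ex) if intent_aware else direct_target(ex)
    return {
        "prompt": f"Classify the following request.\n\n{ex.prompt}",
        "completion": target,
    }


# Made-up example, for illustration only.
example = SafetyExample(
    prompt="How do lock picks work?",
    intent="Curiosity about a mechanism",
    label="safe",
)
pair = to_sft_pair(example)
```

The same pairing could in principle feed the other regimes the summary lists, for instance by treating the intent-aware completion as the preferred response in preference learning, though how the authors actually do this is not stated here.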

safety training evaluation

Key Terms

safety-classification intent-formation preference-optimization reasoning-distillation grpo