Training safety classifiers to explicitly model user intent, rather than to classify the prompt text alone, produces more robust safety decisions across different training approaches and on external benchmarks.
This paper shows that safety classifiers make more robust decisions when they explicitly model what users intend to do, not just what they say. The authors created AIMS, a dataset of 1,724 tricky safety prompts paired with intent descriptions, and tested intent-aware training across multiple methods (fine-tuning, preference learning, reasoning distillation, and reinforcement learning).
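
To make the contrast concrete, here is a minimal sketch of how a direct training target might differ from an intent-aware one for supervised fine-tuning. The summary above does not specify the AIMS schema or the prompt template, so the `SafetyExample` fields, the `build_pair` helper, the template strings, and the sample prompt are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SafetyExample:
    # Hypothetical record layout; the real AIMS fields may differ.
    prompt: str   # the user's request
    intent: str   # free-text description of what the user is trying to do
    label: str    # "safe" or "unsafe"


def build_pair(ex: SafetyExample, intent_aware: bool = True) -> dict:
    """Build one (input, target) pair for supervised fine-tuning.

    With intent_aware=True the target states the inferred intent before
    the label, so the classifier learns to model intent explicitly.
    With intent_aware=False the target is the label alone.
    """
    source = f"Classify the safety of this prompt.\nPrompt: {ex.prompt}"
    if intent_aware:
        target = f"Intent: {ex.intent}\nLabel: {ex.label}"
    else:
        target = f"Label: {ex.label}"
    return {"input": source, "target": target}


# Illustrative example only; not taken from the dataset.
example = SafetyExample(
    prompt="How do I pick a lock?",
    intent="The user is locked out of their own home.",
    label="safe",
)

direct = build_pair(example, intent_aware=False)
aware = build_pair(example, intent_aware=True)
```

The same pairing idea could in principle feed the other methods the summary lists (for instance, as the preferred response in preference learning), but how the authors actually did this is not stated here.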