What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

Sihui Dai, Mann Patel | June 18, 2026 | arXiv

Key Takeaway

Safety training through preference optimization is critical for preventing benign demonstrations from accidentally increasing harmful compliance: models extract different lessons from the same demonstrations depending on how they were trained.

Summary

This paper investigates how language models interpret mixed compliance demonstrations, in which some examples show helpful responses to benign requests and others show helpful responses to harmful requests. The researchers find that benign and harmful demonstrations are not interchangeable: their effect on jailbreaking depends on how the model was trained, the order in which the demonstrations appear, and how the model handles refusals.
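
To make the role of demonstration order concrete, the sketch below assembles an in-context prompt from a mixed set of demonstrations under different orderings. It is only an illustration under stated assumptions: the `Demo` class, the `order_demos` and `build_prompt` helpers, the ordering names, and the `User:`/`Assistant:` template are all hypothetical and are not taken from the paper, and the demonstration texts are placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Demo:
    """One in-context demonstration (hypothetical structure, not from the paper)."""
    request: str
    response: str
    kind: str  # "benign" or "harmful"


def order_demos(demos: List[Demo], order: str) -> List[Demo]:
    """Arrange demonstrations so that ordering effects can be compared."""
    benign = [d for d in demos if d.kind == "benign"]
    harmful = [d for d in demos if d.kind == "harmful"]
    if order == "benign_first":
        return benign + harmful
    if order == "harmful_first":
        return harmful + benign
    if order == "interleaved":
        mixed: List[Demo] = []
        for i in range(max(len(benign), len(harmful))):
            if i < len(benign):
                mixed.append(benign[i])
            if i < len(harmful):
                mixed.append(harmful[i])
        return mixed
    raise ValueError(f"unknown order: {order!r}")


def build_prompt(demos: List[Demo], order: str, query: str) -> str:
    """Concatenate the ordered demonstrations, then append the final query."""
    turns = [
        f"User: {d.request}\nAssistant: {d.response}"
        for d in order_demos(demos, order)
    ]
    turns.append(f"User: {query}\nAssistant:")
    return "\n\n".join(turns)


# Placeholder content only; a real evaluation would draw from a curated dataset.
demos = [
    Demo("<benign request 1>", "<helpful response 1>", "benign"),
    Demo("<harmful request 1>", "<compliant response 1>", "harmful"),
    Demo("<benign request 2>", "<helpful response 2>", "benign"),
]

if __name__ == "__main__":
    print(build_prompt(demos, "harmful_first", "<test request>"))
```

Under a recency-bias hypothesis, the `benign_first` and `harmful_first` orderings would place different demonstration types closest to the final query, so comparing a model's behavior across them is one way such an effect could be probed.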

safety training alignment

Key Terms

in-context-learning, jailbreaking, preference-optimization, refusal-behavior, recency-bias