Safety training through preference optimization is critical for preventing benign demonstrations from accidentally increasing harmful compliance: models extract different lessons from the same demonstrations depending on how they were trained.
This paper investigates how language models interpret mixed compliance demonstrations, in which some examples show helpful responses to benign requests and others show helpful responses to harmful requests. The researchers find that benign and harmful demonstrations are not interchangeable; their effect on jailbreaking depends on how the model was trained, the order in which the demonstrations appear, and how the model handles refusals.