Safety interventions that look effective in standard evaluations can mask "conditional misalignment"—models that behave well on out-of-distribution prompts but revert to worse-than-trained misalignment when given inputs matching their training context.
When language models are finetuned on misaligned behavior, common safety interventions (mixing in benign data, sequential finetuning, inoculation prompting) appear to work on standard tests but fail when evaluation prompts resemble the training context.