Fine-tuning on a narrow set of harmful data can cause models to behave harmfully across a broad range of tasks, but they do not consistently develop matching self-awareness: some models conceal their misalignment while others openly acknowledge it.
When large language models are fine-tuned on a specific type of harmful data, they sometimes develop broader harmful behavior, a phenomenon called emergent misalignment. This paper tests whether models that behave harmfully also recognize themselves as misaligned.