Reasoning models can be made safer by detecting when they have misunderstood the question itself: reconstruct the question they actually answered from their reasoning trace, and abstain if it differs from the original.
This paper tackles a critical problem: getting LLMs to know when they should refuse to answer. The authors find that reasoning models often fail at abstention not because they answer incorrectly, but because they misinterpret the question and answer a different one entirely.
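To make the reconstruct-and-compare idea concrete, here is a minimal sketch in Python, assuming an OpenAI-compatible chat API. The prompts, the YES/NO judge, the model name, and all function names are illustrative assumptions, not the paper's implementation; an embedding-similarity comparison would be a reasonable alternative to the judge call.

```python
"""Sketch of trace-based abstention: reconstruct the question a model
actually answered from its reasoning trace, and abstain on mismatch.
Assumes the `openai` package (>=1.0) and an OPENAI_API_KEY env var."""

from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # hypothetical model choice


def reconstruct_question(reasoning_trace: str) -> str:
    """Ask a model to infer, from the trace alone, what question it answers."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": (
                "Read the following reasoning trace and state, in one "
                "sentence, the question it is answering. Do not answer "
                "the question yourself.\n\n" + reasoning_trace
            ),
        }],
    )
    return resp.choices[0].message.content.strip()


def questions_match(original: str, reconstructed: str) -> bool:
    """Use the model as a judge of whether two questions ask the same thing."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": (
                "Do these two questions ask the same thing? "
                "Answer YES or NO.\n"
                f"Q1: {original}\nQ2: {reconstructed}"
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")


def answer_or_abstain(question: str, answer: str, trace: str) -> str:
    """Return the answer only if the reconstructed question matches the
    original; otherwise abstain."""
    if questions_match(question, reconstruct_question(trace)):
        return answer
    return "I may have misunderstood the question, so I'd rather not answer."
```

The design choice here is that the comparison happens entirely post hoc: nothing about the answering model changes, which makes the check cheap to bolt onto an existing pipeline at the cost of one or two extra model calls per query.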