Fixed counterfactual explanations taken from earlier model checkpoints can train language models to generate faithful explanations of their own behavior even as the model changes during training, offering a scalable approach to interpretability that does not require updated labels.
This paper shows that language models trained to explain their own predictions can learn faithful self-explanations even when the training targets are fixed explanations produced by earlier versions of the same model. The key finding is that the learned explanations track the model's current behavior rather than mimicking those fixed targets, so post-training can scale without the supervision data being regenerated each time the model changes.