Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision

Zifan Carl Guo, Laura Ruis, Jacob Andreas, Belinda Z. Li | June 30, 2026 | arXiv

Key Takeaway

Fixed counterfactual explanations from earlier model checkpoints can effectively train language models to generate faithful explanations of their own behavior, even as the model changes during training. This offers a scalable approach to interpretability that does not require updated labels.

Summary

This paper shows that language models trained to explain their predictions can learn faithful self-explanations even when trained on fixed explanations from earlier versions of themselves. The key finding is that explanations naturally track the model's current behavior rather than mimicking their training targets, enabling scalable post-training without constantly updating supervision data.
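
To make the distinction between tracking current behavior and mimicking training targets concrete, here is a minimal sketch of how the two could be compared. It is not the paper's evaluation code: the `Record` fields, the exact-match comparison, and the function name `agreement_rates` are all hypothetical, chosen only for illustration. Each record holds what a self-explanation claims the model would output on a counterfactual input, what the current model actually outputs there, and what the earlier checkpoint output (the fixed label).

```python
from dataclasses import dataclass


@dataclass
class Record:
    # All three fields are illustrative assumptions, not the paper's data format.
    explained: str    # output the self-explanation claims for a counterfactual input
    current: str      # output the current model actually gives on that input
    fixed_label: str  # output of the earlier checkpoint (the fixed training target)


def agreement_rates(records):
    """Return (faithfulness, target_mimicry), each a fraction in [0, 1].

    faithfulness:   how often the explanation matches the current model.
    target_mimicry: how often the explanation matches the fixed label.
    """
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    faithful = sum(r.explained == r.current for r in records)
    mimic = sum(r.explained == r.fixed_label for r in records)
    return faithful / n, mimic / n


if __name__ == "__main__":
    toy = [
        Record("A", "A", "B"),
        Record("B", "B", "B"),
        Record("C", "C", "A"),
        Record("A", "B", "A"),
    ]
    faithfulness, mimicry = agreement_rates(toy)
    print(f"faithfulness={faithfulness:.2f} target_mimicry={mimicry:.2f}")
```

Under these assumptions, the two rates only come apart on inputs where the current model and the earlier checkpoint disagree, so those are the cases that would distinguish behavior tracking from target mimicry.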

alignment training

Key Terms

counterfactual-explanation, introspection, faithfulness, post-training