Fine-tuning on new knowledge disrupts the specific neural directions a model uses to retrieve existing facts; sparse autoencoders can identify these disrupted directions, which can then be repaired without retraining the entire model.
This paper investigates why fine-tuning language models on new facts causes hallucinations. The researchers fine-tuned three models on controlled QA datasets and used sparse autoencoders to identify specific neural directions responsible for hallucinations.
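To make the mechanism concrete, here is a minimal sketch of how one might use a sparse autoencoder (SAE) to localize and ablate residual-stream directions that shift during fine-tuning. This is an illustrative reconstruction under stated assumptions, not the paper's actual code: the SAE architecture, the feature-ranking heuristic (largest mean-activation change on fact-recall prompts before vs. after fine-tuning), and all names (`SparseAutoencoder`, `find_candidate_features`, `ablate_features`) are hypothetical.

```python
# Illustrative sketch only: a toy SAE over residual-stream activations,
# a heuristic for finding features that shift after fine-tuning, and an
# inference-time ablation. Not the paper's implementation.

import torch


class SparseAutoencoder(torch.nn.Module):
    """Toy SAE that encodes d_model activations into a sparse d_sae basis."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.W_dec = torch.nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_enc = torch.nn.Parameter(torch.zeros(d_sae))
        self.b_dec = torch.nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU enforces sparse, non-negative feature activations.
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return f @ self.W_dec + self.b_dec


def find_candidate_features(sae, acts_before, acts_after, top_k=10):
    """Rank SAE features whose mean activation changes most after fine-tuning.

    acts_before / acts_after: [n_tokens, d_model] residual-stream activations
    collected on the same fact-recall prompts, pre- and post-fine-tuning.
    """
    delta = sae.encode(acts_after).mean(0) - sae.encode(acts_before).mean(0)
    return torch.topk(delta.abs(), k=top_k).indices


def ablate_features(sae, x, feature_ids):
    """Remove selected feature directions from an activation vector.

    Subtracts each feature's contribution (its activation times its decoder
    direction), leaving the rest of the representation intact.
    """
    f = sae.encode(x)
    contribution = f[..., feature_ids] @ sae.W_dec[feature_ids]
    return x - contribution
```

Subtracting a feature's decoder direction scaled by its activation is a standard SAE-style ablation; because it operates on activations at inference time, it is the kind of targeted fix that avoids retraining the whole model.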