Fine-tuning on new knowledge disrupts the specific neural directions a model uses to retrieve existing facts; sparse autoencoders can identify these disrupted directions, which can then be repaired without retraining the entire model.
This paper investigates why fine-tuning language models on new facts causes hallucinations. The researchers fine-tuned three models on controlled QA datasets and used sparse autoencoders to identify specific neural directions responsible for hallucinations.
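To make the mechanism concrete, here is a minimal sketch of how one might use a sparse autoencoder (SAE) to localize and ablate residual-stream directions that shift during fine-tuning. This is an illustrative reconstruction under stated assumptions, not the paper's actual code: the SAE architecture, the feature-ranking heuristic (largest mean-activation change on fact-recall prompts before vs. after fine-tuning), and all names (`SparseAutoencoder`, `find_candidate_features`, `ablate_features`) are hypothetical.

```python
# Illustrative sketch only: a toy SAE over residual-stream activations,
# a heuristic for finding features that shift after fine-tuning, and an
# inference-time ablation. Not the paper's implementation.

import torch


class SparseAutoencoder(torch.nn.Module):
    """Toy SAE that encodes d_model activations into a sparse d_sae basis."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.W_dec = torch.nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_enc = torch.nn.Parameter(torch.zeros(d_sae))
        self.b_dec = torch.nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU enforces sparse, non-negative feature activations.
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return f @ self.W_dec + self.b_dec


def find_candidate_features(sae, acts_before, acts_after, top_k=10):
    """Rank SAE features whose mean activation changes most after fine-tuning.

    acts_before / acts_after: [n_tokens, d_model] residual-stream activations
    collected on the same fact-recall prompts, pre- and post-fine-tuning.
    """
    delta = sae.encode(acts_after).mean(0) - sae.encode(acts_before).mean(0)
    return torch.topk(delta.abs(), k=top_k).indices


def ablate_features(sae, x, feature_ids):
    """Remove selected feature directions from an activation vector.

    Subtracts each feature's contribution (its activation times its decoder
    direction), leaving the rest of the representation intact.
    """
    f = sae.encode(x)
    contribution = f[..., feature_ids] @ sae.W_dec[feature_ids]
    return x - contribution
```

Subtracting a feature's decoder direction scaled by its activation is a standard SAE-style ablation; because it operates on activations at inference time, it is the kind of targeted fix that avoids retraining the whole model.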