Rather than enumerating harmful content to remove, you can build safer models by keeping only the knowledge domains you need and forgetting everything else; this inverted approach holds up better against diverse harms and jailbreaks.
This paper introduces Exclusive Unlearning, a technique that makes language models safer by forgetting everything outside the specific domains you choose to keep. Instead of hunting down harmful content one piece at a time, the model retains only what is useful (say, medical knowledge) and discards the rest, making it resistant to jailbreak attempts: knowledge that has been removed cannot be elicited by clever prompting.
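To make the inversion concrete, here is a minimal training-step sketch, not the paper's exact method. It assumes a Hugging Face-style causal LM (outputs expose `.loss` and `.logits`) and two hypothetical dataloaders: `retain_batch` drawn from the domain to keep (e.g. medical text) and `general_batch` drawn from broad text standing in for "everything else". The forget term here pushes predictions toward the uniform distribution, one common choice for erasing knowledge; the paper's loss may differ.

```python
# Sketch of an "exclusive unlearning" objective under the assumptions above.
import torch
import torch.nn.functional as F

def exclusive_unlearning_step(model, retain_batch, general_batch,
                              optimizer, forget_weight=0.5):
    optimizer.zero_grad()

    # Retain term: ordinary language-modeling loss on the kept domain,
    # so in-domain capability is preserved.
    retain_loss = model(retain_batch["input_ids"],
                        labels=retain_batch["input_ids"]).loss

    # Forget term: cross-entropy against the uniform distribution over the
    # vocabulary, which pushes next-token predictions on all other text
    # toward "know nothing" rather than targeting harms one by one.
    logits = model(general_batch["input_ids"]).logits
    forget_loss = (-F.log_softmax(logits, dim=-1)).mean()

    loss = retain_loss + forget_weight * forget_loss
    loss.backward()
    optimizer.step()
    return retain_loss.item(), forget_loss.item()
```

Note the asymmetry with conventional unlearning: the forget set is everything by default, and only the retain set is curated, which is why coverage gaps favor safety rather than the attacker.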