Rather than enumerating harmful content to remove, you can build safer models by keeping only the knowledge domains you need and forgetting everything else; this inverted approach holds up better against diverse harms and jailbreaks.
This paper introduces Exclusive Unlearning, a technique that makes language models safer by forgetting everything outside the specific domains you choose to keep. Instead of hunting down harmful content one piece at a time, the model retains only what is useful (say, medical knowledge) and discards the rest, making it resistant to jailbreak attempts: knowledge that has been removed cannot be elicited by clever prompting.
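To make the inversion concrete, here is a minimal training-step sketch, not the paper's exact method. It assumes a Hugging Face-style causal LM (outputs expose `.loss` and `.logits`) and two hypothetical dataloaders: `retain_batch` drawn from the domain to keep (e.g. medical text) and `general_batch` drawn from broad text standing in for "everything else". The forget term here pushes predictions toward the uniform distribution, one common choice for erasing knowledge; the paper's loss may differ.

```python
# Sketch of an "exclusive unlearning" objective under the assumptions above.
import torch
import torch.nn.functional as F

def exclusive_unlearning_step(model, retain_batch, general_batch,
                              optimizer, forget_weight=0.5):
    optimizer.zero_grad()

    # Retain term: ordinary language-modeling loss on the kept domain,
    # so in-domain capability is preserved.
    retain_loss = model(retain_batch["input_ids"],
                        labels=retain_batch["input_ids"]).loss

    # Forget term: cross-entropy against the uniform distribution over the
    # vocabulary, which pushes next-token predictions on all other text
    # toward "know nothing" rather than targeting harms one by one.
    logits = model(general_batch["input_ids"]).logits
    forget_loss = (-F.log_softmax(logits, dim=-1)).mean()

    loss = retain_loss + forget_weight * forget_loss
    loss.backward()
    optimizer.step()
    return retain_loss.item(), forget_loss.item()
```

Note the asymmetry with conventional unlearning: the forget set is everything by default, and only the retain set is curated, which is why coverage gaps favor safety rather than the attacker.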