Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation

Shayan Talaei, Abhinav Chinta, Devvrit Khatri, Amin Karbasi, Azalia Mirhoseini et al. | July 1, 2026 | arXiv

Key Takeaway

D2D reveals stealth biases in deployed LLMs by concentrating the distributional shift between a suspect model and its base version into a small adapter. This makes hidden preferences visible in generated text and enables auditing of models where bias inspection would otherwise be impossible.

Summary

This paper introduces Distill to Detect (D2D), a method for uncovering hidden biases in language models: biases that favor certain entities or viewpoints only on specific topics, while the model appears normal elsewhere. The approach distills the differences between a suspect model and its base version into a compact adapter, which amplifies the hidden bias signal into detectable text patterns.
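The distillation idea can be illustrated with a toy sketch. This is not the paper's implementation: the summary does not specify the adapter architecture or the loss, so the code below assumes the simplest possible stand-in, an additive offset on the base model's next-token logits, fitted by minimizing the KL divergence to the suspect model's soft labels. The vocabulary, the logit values, and the function names (`distill_adapter`, `kl_divergence`) are all hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q) for two discrete distributions with full support.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_adapter(base_logits, suspect_logits, steps=500, lr=0.5):
    # ASSUMPTION: the "adapter" is just an additive logit offset.
    # It is trained so that softmax(base + adapter) matches the
    # suspect model's output distribution (its soft labels).
    target = softmax(suspect_logits)
    adapter = [0.0] * len(base_logits)
    for _ in range(steps):
        student = softmax([b + a for b, a in zip(base_logits, adapter)])
        # Gradient of KL(target || student) w.r.t. the student logits
        # is (student - target); take one gradient-descent step.
        adapter = [a - lr * (s - t) for a, s, t in zip(adapter, student, target)]
    return adapter

# Hypothetical toy vocabulary and logits: the suspect model quietly
# up-weights "BrandA" relative to the base model.
vocab = ["BrandA", "BrandB", "BrandC", "other"]
base_logits = [1.0, 1.0, 1.0, 2.0]
suspect_logits = [1.6, 1.0, 1.0, 2.0]

adapter = distill_adapter(base_logits, suspect_logits)

target = softmax(suspect_logits)
kl_before = kl_divergence(target, softmax(base_logits))
kl_after = kl_divergence(
    target, softmax([b + a for b, a in zip(base_logits, adapter)])
)

# The token with the largest learned offset is the one the suspect
# model favors relative to the base model.
flagged = vocab[max(range(len(vocab)), key=lambda i: adapter[i])]
print(flagged, kl_before, kl_after)
```

In this toy setting the whole difference between the two models ends up in the offset vector, so inspecting the adapter alone points to the favored token. A real adapter would act on model internals rather than on a single logit vector, but the principle of isolating the shift in a small trainable component is the same.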

safety evaluation alignment

Key Terms

kv-cache prefix-tuning soft-labels knowledge-distillation fisher-information-matrix