D2D reveals stealth biases in deployed LLMs by concentrating distributional shifts into a small adapter. This makes hidden preferences visible in generated text and enables auditing of models where bias inspection would otherwise be impossible.
This paper introduces Distill to Detect (D2D), a method for uncovering hidden biases in language models that favor certain entities or viewpoints only on specific topics while appearing normal elsewhere. The approach distills the differences between a suspect model and its base version into a compact adapter, amplifying hidden bias signals into detectable text patterns.
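The distill-then-amplify idea can be illustrated with a toy sketch. It is not the paper's implementation: the two "models" are hypothetical hand-written logit tables, the "adapter" is just one logit-offset vector per topic, and the `scale` factor used to amplify the adapter is an assumption made for illustration. The sketch only shows how a small, topic-specific shift can be isolated and then exaggerated until it is obvious.

```python
import math

VOCAB = ["BrandA", "BrandB", "BrandC", "other"]  # hypothetical toy vocabulary


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits per prompt topic (stand-ins for real models).
BASE = {
    "laptops": [1.0, 1.0, 1.0, 2.0],
    "weather": [0.0, 0.0, 0.0, 3.0],
}
# The suspect model nudges BrandA slightly, and only on one topic.
SUSPECT = {
    "laptops": [1.4, 1.0, 1.0, 2.0],
    "weather": [0.0, 0.0, 0.0, 3.0],
}


def distill_adapter(base, suspect, steps=500, lr=0.5):
    """Fit per-topic logit offsets so that base + offsets matches the suspect.

    Assumption: plain gradient descent on cross-entropy against the suspect's
    output distribution; the gradient w.r.t. the offsets is (pred - target).
    """
    adapter = {topic: [0.0] * len(VOCAB) for topic in base}
    for topic in base:
        target = softmax(suspect[topic])
        for _ in range(steps):
            shifted = [b + d for b, d in zip(base[topic], adapter[topic])]
            pred = softmax(shifted)
            adapter[topic] = [
                d - lr * (p - q) for d, p, q in zip(adapter[topic], pred, target)
            ]
    return adapter


def amplified_probs(base, adapter, topic, scale):
    """Apply the adapter with an exaggeration factor (an assumed knob)."""
    return softmax([b + scale * d for b, d in zip(base[topic], adapter[topic])])


adapter = distill_adapter(BASE, SUSPECT)
suspect_p = softmax(SUSPECT["laptops"])[0]
amp_p = amplified_probs(BASE, adapter, "laptops", scale=10.0)[0]

print("P(BrandA | laptops), suspect model:   ", round(suspect_p, 3))
print("P(BrandA | laptops), amplified adapter:", round(amp_p, 3))
print("adapter offsets on unrelated topic:    ", adapter["weather"])
```

In this toy setting the suspect's preference is a modest probability shift that would be hard to spot in sampled text, while the amplified adapter makes the favored token dominate on the affected topic and leaves the unrelated topic unchanged. A real audit would presumably train a parameter-efficient adapter on outputs from the actual models rather than on logit tables.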