The Distillation Game: Adaptive Attacks & Efficient Defenses

Youssef Allouah, Mahdi Haghifam, Sanmi Koyejo, Reza Shokri | May 21, 2026 | arXiv

Key Takeaway

Distillation defenses must be evaluated against adaptive attackers who strategically choose which outputs to learn from, not just passive ones. Simple forward-pass defenses such as Product-of-Experts (PoE) can match expensive defenses while preserving reasoning quality.

Summary

This paper studies a trade-off faced by AI model providers: making a model more useful through better outputs also makes it easier to copy through distillation attacks. The authors develop a game-theoretic framework to analyze this trade-off and propose Product-of-Experts (PoE), a lightweight defense that combines the teacher model with a proxy student during generation.
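
To make the idea of combining a teacher with a proxy student during generation concrete, here is a minimal sketch of a generic product-of-experts step over next-token logits. It is an illustration under stated assumptions, not the paper's exact formulation: the function name `poe_next_token_probs`, the weight `beta`, and its sign convention are all hypothetical. It relies only on the standard fact that multiplying two distributions (one raised to a power) corresponds to adding their weighted logits and renormalizing.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def poe_next_token_probs(teacher_logits, proxy_logits, beta):
    """Generic product-of-experts combination (hypothetical helper).

    Computes p(token) proportional to p_teacher(token) * p_proxy(token)**beta.
    In logit space this is a weighted sum followed by renormalization.
    `beta` is an assumed knob: beta = 0 recovers the teacher, and a negative
    beta shifts probability away from tokens the proxy student already favors.
    """
    if len(teacher_logits) != len(proxy_logits):
        raise ValueError("teacher and proxy must share one vocabulary")
    combined = [t + beta * p for t, p in zip(teacher_logits, proxy_logits)]
    return softmax(combined)


# Toy three-token vocabulary; the numbers are illustrative only.
teacher = [2.0, 1.0, 0.0]
proxy = [0.0, 1.0, 2.0]
teacher_only = poe_next_token_probs(teacher, proxy, beta=0.0)
defended = poe_next_token_probs(teacher, proxy, beta=-1.0)
```

The combination needs only one extra forward pass of the proxy per generated token, which is the sense in which such a defense is lightweight. How the weight should be chosen, and in which direction, depends on details of the paper that this summary does not state.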

safety evaluation efficiency

Key Terms

distillation adversarial-evaluation minimax-algorithm product-of-experts