Surrogate Fidelity: When Can Open LLMs Explain Closed Ones?

Philippe Chlenski, Zachariah Carmichael, Ayush Warikoo, Chia-Tse Shao, Yingxiao Ye et al. | June 30, 2026 | arXiv

Key Takeaway

Open models are poor surrogates for mechanistic understanding of closed models: prediction-level agreement doesn't guarantee attribution agreement, and white-box signals don't reliably transfer between models.

Summary

This paper investigates when open-source language models can serve as proxies for understanding closed commercial models. The researchers test whether measurements taken on open models (such as attention patterns) reliably explain closed models' behavior at three levels: prediction, attribution, and representation. They find that models that agree on answers often disagree on the reasoning behind them.
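The gap between prediction-level and attribution-level agreement can be illustrated with a toy sketch. This is not the paper's method: the two bag-of-words "models", their weights, and the helper names (`make_toy_model`, `ablation_attributions`, `spearman`) are all hypothetical stand-ins. The sketch scores each token by how much the log-odds of the positive class drop when that token is ablated, then compares the two models' rankings with a Spearman correlation.

```python
import math
from typing import Callable, Dict, List, Sequence

# A "model" here is any black box mapping tokens -> P(positive class).
Model = Callable[[Sequence[str]], float]


def make_toy_model(weights: Dict[str, float], bias: float = 0.0) -> Model:
    """Hypothetical stand-in for an LLM: logistic bag-of-words scorer."""
    def model(tokens: Sequence[str]) -> float:
        z = bias + sum(weights.get(t, 0.0) for t in tokens)
        return 1.0 / (1.0 + math.exp(-z))
    return model


def log_odds(p: float, eps: float = 1e-9) -> float:
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return math.log(p / (1.0 - p))


def ablation_attributions(model: Model, tokens: Sequence[str]) -> List[float]:
    """Input ablation: log-odds drop when each token is removed in turn."""
    base = log_odds(model(tokens))
    return [
        base - log_odds(model(list(tokens[:i]) + list(tokens[i + 1:])))
        for i in range(len(tokens))
    ]


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx > 0 and sy > 0 else float("nan")


# Two hypothetical models that reach the same label for different reasons.
model_a = make_toy_model({"great": 2.0, "plot": 0.1, "but": 0.0, "slow": -0.5})
model_b = make_toy_model({"great": -0.2, "plot": 0.1, "but": 2.0, "slow": 0.5})
tokens = ["great", "plot", "but", "slow"]

same_label = (model_a(tokens) > 0.5) == (model_b(tokens) > 0.5)
rho = spearman(
    ablation_attributions(model_a, tokens),
    ablation_attributions(model_b, tokens),
)
print(f"same label: {same_label}, attribution rank correlation: {rho:.2f}")
```

In this toy setup both models predict the positive class, yet their token rankings are negatively correlated, so agreement on the answer says little about agreement on which inputs produced it.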

evaluation alignment

Key Terms

mechanistic-interpretability attribution surrogate-model log-odds input-ablation