Open models are poor surrogates for mechanistic understanding of closed models: prediction-level agreement doesn't guarantee attribution agreement, and white-box signals don't reliably transfer between models.
This paper investigates when open-source language models can serve as proxies for understanding closed commercial models. The researchers test whether measurements from open models (like attention patterns) reliably explain closed models' behavior across prediction, attribution, and representation levels, finding that models agreeing on answers often disagree on reasoning.