SAEs don't cleanly capture continuous concept structures: they fragment them across many features in ways that obscure the underlying geometric relationships, suggesting interpretability research needs to look for groups of related features rather than individual directions.
Sparse autoencoders (SAEs) are popular tools for finding interpretable features in AI models, but this paper shows they struggle to capture concepts organized as continuous geometric structures (manifolds).
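To make the fragmentation concrete, here is a minimal sketch (not the paper's code; all names, dimensions, and hyperparameters are illustrative assumptions): train a small ReLU SAE with an L1 sparsity penalty on points sampled from a circle embedded in a higher-dimensional space, then check which angles each learned feature fires on.

```python
# A minimal sketch, not the paper's code: all names, dimensions, and
# hyperparameters here are illustrative assumptions.
import torch

torch.manual_seed(0)

# Toy "continuous concept": points on a circle embedded in 20-D space.
n, d, k = 4096, 20, 32                    # samples, ambient dim, SAE width
theta = torch.rand(n) * 2 * torch.pi     # the single underlying degree of freedom
basis, _ = torch.linalg.qr(torch.randn(d, 2))  # random 2-D plane in R^d
x = torch.stack([theta.cos(), theta.sin()], dim=1) @ basis.T

# A standard ReLU SAE trained with reconstruction loss + an L1 sparsity penalty.
enc = torch.nn.Linear(d, k)
dec = torch.nn.Linear(k, d)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

for step in range(2000):
    feats = torch.relu(enc(x))
    loss = ((dec(feats) - x) ** 2).mean() + 3e-3 * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Inspect fragmentation: in this toy setting, each feature tends to fire
# only on a narrow arc of angles, so no single feature captures the
# circle's continuous structure.
with torch.no_grad():
    feats = torch.relu(enc(x))
    for j in range(k):
        arc = theta[feats[:, j] > 0.1]
        if len(arc) > 0:
            print(f"feature {j:2d}: {len(arc):4d} pts, "
                  f"angles ~[{arc.min().item():.2f}, {arc.max().item():.2f}]")
```

In this toy setting the sparsity pressure typically splits the circle into many locally active features rather than a few that jointly encode the angle, which is the fragmentation the summary describes: the geometry is still there, but only visible by looking at the group of features together.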