When scaling sparse autoencoders for interpretability, enforcing cross-sample consistency prevents features from fragmenting or developing exceptions, making the learned representations more reliable for understanding language model behavior.
This paper identifies and fixes two major problems in sparse autoencoders (SAEs) used to interpret language models: feature splitting, where a single concept fragments across multiple latents, and feature absorption, where a general feature develops arbitrary exceptions.
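To make the two failure modes concrete, here is a minimal toy sketch in Python. It is not the paper's method or its metrics. The concept ("token starts with s"), the token sets, the latent names, and both helper functions are hypothetical, chosen only to show what an absorbed exception and a split concept look like when each latent is described by the set of tokens it fires on.

```python
# Toy illustration only: every name and token set below is hypothetical.
# Each latent is represented by the set of tokens on which it fires.

def missed_members(fires_on, members):
    """Concept members on which a latent fails to fire (its exceptions)."""
    return sorted(set(members) - set(fires_on))

def partial_latents(latents, members):
    """Names of latents that fire on some, but not all, concept members."""
    members = set(members)
    return sorted(
        name
        for name, fires_on in latents.items()
        if members & set(fires_on) and not members <= set(fires_on)
    )

# The concept we would like one latent to track.
starts_with_s = {"sun", "sand", "short", "snake"}

# Absorption: the general latent skips "short" because a more specific
# latent has taken over that token.
absorbed = {
    "starts_with_s": {"sun", "sand", "snake"},
    "short": {"short"},
}
absorption_exceptions = missed_members(absorbed["starts_with_s"], starts_with_s)

# Splitting: no single latent covers the concept; two latents each cover part.
split = {
    "latent_a": {"sun", "sand"},
    "latent_b": {"short", "snake"},
}
split_fragments = partial_latents(split, starts_with_s)

print(absorption_exceptions)  # ['short']
print(split_fragments)        # ['latent_a', 'latent_b']
```

In the absorption case the general latent still exists but is silently wrong on one token. In the splitting case no latent can be read as the concept on its own. Real SAE latents have continuous activations, so any practical check would need a firing threshold, which this sketch leaves out.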