Adding soft sparsity regularizers to Top-k sparse autoencoders makes interpretable features more robust and concentrated, without the drawbacks of earlier penalty-based approaches—hard and soft sparsity work better together.
This paper improves sparse autoencoders (SAEs) for interpreting vision models by adding sparsity regularizers to the Top-k SAE architecture. The researchers introduce two penalty methods that work alongside Top-k's hard sparsity constraint to make learned features more interpretable (monosemantic) without hurting reconstruction quality.