Penalizing radial expansion of neural network activations forces the network to learn compact, structured representations and speeds up generalization on algorithmic tasks: a simple geometric insight with practical training benefits.
Neural networks memorize before generalizing on algorithmic tasks because their hidden representations inflate radially during training. This paper proposes a geometric penalty that constrains activations to a hypersphere, forcing the network to learn structured circuits faster. The reported gains are a 6x acceleration of grokking on arithmetic tasks and a halving of training time for addition.
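To make the idea of a hypersphere constraint concrete, here is a minimal sketch of one way such a penalty could look. The summary does not give the paper's exact formula, so everything here is an assumption for illustration: the names `radial_penalty` and `total_loss`, the two-sided squared form, and the `target_radius` and `weight` parameters are hypothetical, not taken from the paper.

```python
import math
from typing import Sequence


def radial_penalty(
    activations: Sequence[Sequence[float]],
    target_radius: float = 1.0,
) -> float:
    """Mean squared deviation of each activation vector's L2 norm
    from target_radius (assumed form, not the paper's definition)."""
    if not activations:
        return 0.0
    total = 0.0
    for vec in activations:
        # L2 norm = radial distance of this activation from the origin.
        norm = math.sqrt(sum(x * x for x in vec))
        # Zero when the vector lies on the hypersphere, and growing
        # quadratically as it drifts away from it.
        total += (norm - target_radius) ** 2
    return total / len(activations)


def total_loss(
    task_loss: float,
    activations: Sequence[Sequence[float]],
    weight: float = 0.1,
    target_radius: float = 1.0,
) -> float:
    """Task loss plus the weighted radial penalty (hypothetical combination)."""
    return task_loss + weight * radial_penalty(activations, target_radius)
```

In this sketch, activations that already sit on the sphere contribute nothing, while a vector such as `[3.0, 4.0]` (norm 5) with `target_radius=1.0` contributes `(5 - 1)^2 = 16`. A one-sided variant that only penalizes norms above the target would match the phrase "penalizing radial expansion" more literally; which form the paper uses is not stated in this summary.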