Autoencoders with geometrically aligned latent spaces (hyperspheres rather than Gaussian distributions) preserve 3D structure and physics better than standard approaches, which matters for building world models that understand real 3D scenes.
This paper proposes S²VAE, an autoencoder that replaces the traditional Gaussian latent representation with a hyperspherical one to better preserve 3D geometry and camera motion in visual world models.
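
To make the contrast with a Gaussian latent concrete, here is a minimal sketch of what placing a latent vector on a unit hypersphere can look like. The summary does not say how S²VAE constructs its latents, so this is not the paper's method: plain L2 normalization is used as a stand-in, and the function names, the 8-dimensional size, and the random "encoder output" are all hypothetical.

```python
import math
import random


def project_to_hypersphere(z, eps=1e-12):
    """L2-normalize a latent vector so it lies on the unit hypersphere.

    Assumption: simple normalization stands in for whatever the paper does.
    """
    norm = math.sqrt(sum(x * x for x in z))
    return [x / max(norm, eps) for x in z]


def cosine_similarity(a, b):
    """For unit vectors, the dot product equals the cosine of their angle."""
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)

# Hypothetical encoder output: an unconstrained 8-dimensional vector.
raw = [rng.gauss(0.0, 1.0) for _ in range(8)]

# A Gaussian latent can have any length; the spherical one always has length 1,
# so only its direction carries information.
z_sphere = project_to_hypersphere(raw)
unit_norm = math.sqrt(sum(x * x for x in z_sphere))
```

Because every projected vector has the same length, distances between latents reduce to angles, which `cosine_similarity` measures directly.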