Beyond Gaussian Bottlenecks: Topologically Aligned Encoding of Vision-Transformer Feature Spaces

Andrew Bond, Ilkin Umut Melanlioglu, Erkut Erdem, Aykut Erdem | April 30, 2026 | arXiv

Key Takeaway

Autoencoders with geometrically aligned latent spaces (hyperspheres instead of Gaussian distributions) preserve 3D structure and physics better than standard approaches, which matters for building world models that understand real 3D scenes.

Summary

This paper proposes S²VAE, an autoencoder that uses hyperspherical latent representations (latents constrained to the surface of a unit sphere) instead of traditional Gaussian ones to better preserve 3D geometry and camera motion in visual world models.
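
To make the contrast between the two kinds of latent concrete, here is a minimal sketch in dependency-free Python. It is not the S²VAE implementation, and the summary does not say how the paper parameterizes its latents. It only illustrates the general idea: a Gaussian latent is sampled as mean plus scaled noise and can land anywhere in space, while a hyperspherical latent always has unit norm. The spherical sampler follows the commonly described recipe for the power spherical distribution (a Beta draw for the component along the mean direction, a uniform tangent direction, then a Householder reflection). The function names and the choice of `kappa` are hypothetical and exist only for this illustration.

```python
import math
import random


def normalize(v):
    """Project a vector onto the unit hypersphere (L2 normalization)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sample_gaussian(mean, log_var, rng):
    """Standard Gaussian reparameterization: z = mean + sigma * eps."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def sample_power_spherical(mu, kappa, rng):
    """Draw one unit-norm sample concentrated around the direction mu.

    mu must be a unit vector of dimension d >= 2; kappa >= 0 is the
    concentration (larger kappa keeps samples closer to mu).
    """
    d = len(mu)
    # Component of the sample along the mean direction, t in [-1, 1].
    z = rng.betavariate((d - 1) / 2.0 + kappa, (d - 1) / 2.0)
    t = 2.0 * z - 1.0
    # Uniformly random direction in the remaining d - 1 coordinates.
    v = normalize([rng.gauss(0.0, 1.0) for _ in range(d - 1)])
    scale = math.sqrt(max(0.0, 1.0 - t * t))
    y = [t] + [scale * vi for vi in v]
    # Householder reflection that maps e1 = (1, 0, ..., 0) onto mu.
    u = [(1.0 if i == 0 else 0.0) - mu[i] for i in range(d)]
    u_norm = math.sqrt(sum(x * x for x in u))
    if u_norm < 1e-12:  # mu already equals e1, no reflection needed
        return y
    u = [x / u_norm for x in u]
    dot = sum(ui * yi for ui, yi in zip(u, y))
    return [yi - 2.0 * dot * ui for yi, ui in zip(y, u)]


if __name__ == "__main__":
    rng = random.Random(0)
    mu = normalize([1.0, 2.0, 3.0, 4.0])
    z_gauss = sample_gaussian(mu, [0.0] * 4, rng)
    z_sphere = sample_power_spherical(mu, 50.0, rng)
    print("Gaussian latent norm: ", math.sqrt(sum(x * x for x in z_gauss)))
    print("Spherical latent norm:", math.sqrt(sum(x * x for x in z_sphere)))
```

In this sketch the spherical sample keeps unit norm regardless of `kappa`, so only its direction carries information, whereas the norm of the Gaussian sample varies from draw to draw.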

architecture multimodal efficiency

Key Terms

variational-autoencoder hyperspherical-structure power-spherical-distribution world-model latent-bottleneck