Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

Shuhong Zheng, Michael Oechsle, Erik Sandström, Marie-Julie Rakotosaona, Federico Tombari et al. | May 22, 2026 | arXiv

Key Takeaway

By selectively dropping redundant image patches across frames and within frames using attention entropy, you can cut the computation of 3D reconstruction transformers by a reported 85% without sacrificing quality.

Summary

This paper tackles the computational bottleneck in visual geometry transformers, models that reconstruct 3D scenes from multiple images. The authors propose a token selection strategy that limits the image patches the model attends to, cutting computation by 85% while maintaining or improving accuracy.
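
As a rough illustration of entropy-based token selection, the sketch below scores each query token by the Shannon entropy of its attention row and keeps a fixed fraction of the tokens. It is not the authors' method: the function names `attention_entropy` and `select_tokens`, the `keep_ratio` parameter, and the rule of keeping the lowest-entropy tokens are all assumptions made for illustration, since the summary does not say how entropy is turned into a keep-or-drop decision.

```python
import numpy as np


def attention_entropy(attn):
    """Shannon entropy (in nats) of each row of a row-stochastic attention matrix."""
    eps = 1e-12  # avoids log(0) for exactly-zero attention weights
    return -np.sum(attn * np.log(attn + eps), axis=-1)


def select_tokens(attn, keep_ratio=0.5):
    """Return sorted indices of the tokens to keep.

    Assumption for illustration only: tokens whose attention rows have the
    lowest entropy (most peaked attention) are kept, the rest are dropped.
    """
    n = attn.shape[0]
    k = max(1, int(round(keep_ratio * n)))  # always keep at least one token
    entropy = attention_entropy(attn)
    keep = np.argsort(entropy, kind="stable")[:k]
    return np.sort(keep)


# Toy 4-token attention matrix; each row sums to 1.
attn = np.array([
    [1.0, 0.0, 0.0, 0.0],      # fully peaked row
    [0.25, 0.25, 0.25, 0.25],  # uniform row (maximum entropy)
    [0.5, 0.5, 0.0, 0.0],
    [0.7, 0.1, 0.1, 0.1],
])
kept = select_tokens(attn, keep_ratio=0.5)
```

In this toy case the fully peaked row and the two-way split row have the lowest entropy, so tokens 0 and 2 are kept. The `keep_ratio` value is arbitrary and is not meant to reproduce the 85% figure quoted above, which refers to computation rather than to a token count.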

efficiency architecture evaluation

Key Terms

attention token-sparsification key-value-caches entropy 3d-scene-reconstruction