LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei, Yangzhou Liu et al. | May 26, 2026 | arXiv

Key Takeaway

Decoding bounding boxes as complete geometric units instead of individual tokens dramatically speeds up inference while maintaining or improving localization accuracy.

Summary

LocateAnything replaces slow token-by-token box decoding with Parallel Box Decoding, which generates entire bounding boxes at once. Combined with a 138-million-sample dataset, this approach makes visual grounding and detection faster while improving accuracy on standard benchmarks.
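
To make the speed argument concrete, the toy sketch below counts sequential decoding steps under two simplifying assumptions that are not taken from the paper: each box is four quantized coordinates, and the token-by-token baseline spends one decoding step per coordinate. The functions `toy_next_coord` and `toy_next_box` are hypothetical stand-ins for a model's decoding step, not the LocateAnything implementation; they simply replay fixed target boxes so that only the step counts differ.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), assumed quantized coordinates


def decode_token_by_token(
    next_coord: Callable[[List[int]], int], num_boxes: int
) -> Tuple[List[Box], int]:
    """Baseline sketch: one sequential step per coordinate token."""
    coords: List[int] = []
    steps = 0
    for _ in range(num_boxes * 4):
        coords.append(next_coord(coords))  # each call stands for one decoding step
        steps += 1
    # Regroup the flat coordinate stream into boxes of four values.
    boxes = [tuple(coords[i:i + 4]) for i in range(0, len(coords), 4)]
    return boxes, steps


def decode_box_parallel(
    next_box: Callable[[List[Box]], Box], num_boxes: int
) -> Tuple[List[Box], int]:
    """Box-level sketch: one sequential step emits all four coordinates."""
    boxes: List[Box] = []
    steps = 0
    for _ in range(num_boxes):
        boxes.append(next_box(boxes))  # one step yields a whole box
        steps += 1
    return boxes, steps


# Hypothetical targets that the toy "model" replays.
TARGET: List[Box] = [(10, 20, 110, 220), (300, 40, 380, 160), (5, 5, 50, 60)]


def toy_next_coord(prefix: List[int]) -> int:
    flat = [c for box in TARGET for c in box]
    return flat[len(prefix)]


def toy_next_box(prefix: List[Box]) -> Box:
    return TARGET[len(prefix)]


seq_boxes, seq_steps = decode_token_by_token(toy_next_coord, len(TARGET))
par_boxes, par_steps = decode_box_parallel(toy_next_box, len(TARGET))
print(seq_steps, par_steps)  # 12 3
```

Under these assumptions the number of sequential steps drops from 4N to N for N boxes. The speedup of a real system depends on its tokenization and architecture, which this sketch does not model.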

efficiency multimodal architecture

Key Terms

vision-language-models, visual-grounding, bounding-box, autoregressive-decoding, inference-throughput