Decoding bounding boxes as complete geometric units instead of individual tokens dramatically speeds up inference while maintaining or improving localization accuracy.
LocateAnything replaces slow token-by-token box decoding with Parallel Box Decoding, which generates entire bounding boxes at once. Combined with a 138-million-sample dataset, this approach makes visual grounding and detection faster while improving accuracy on standard benchmarks.