OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance

Haoxi Zeng, Qiankun Liu, Yi Bin, Haiyue Zhang, Yujuan Ding et al. | April 9, 2026 | arXiv

Key Takeaway

By aligning DINO's semantic features with SAM's structural priors through specialized encoder-decoder modules, you can achieve both semantic generalization and precise edge detection for segmentation tasks without predefined categories.

Summary

This paper tackles open-vocabulary segmentation, that is, identifying and outlining objects in images even when their categories do not appear in the training set, by combining two foundation models: DINO for semantic understanding and SAM for precise edge detection.
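To make the "open-vocabulary" part concrete, here is a minimal sketch of one common way a segmenter can label pixels without a fixed category list: compare dense image features against text embeddings of arbitrary class names by cosine similarity and take the best match per location. This is not the paper's method. The function names (`l2_normalize`, `open_vocab_segment`) and the toy arrays are hypothetical; in a real pipeline the patch features would presumably come from a vision encoder such as DINO and the text embeddings from a language encoder.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def open_vocab_segment(patch_feats, text_embeds):
    """Assign each spatial location the index of its most similar text embedding.

    patch_feats: (H, W, D) dense image features (toy stand-in here).
    text_embeds: (K, D) one embedding per free-form class name.
    Returns an (H, W) integer label map with values in [0, K).
    """
    p = l2_normalize(patch_feats)
    t = l2_normalize(text_embeds)
    sims = p @ t.T  # cosine similarities, shape (H, W, K)
    return sims.argmax(axis=-1)


# Toy data (assumption): two class names embedded as axis-aligned vectors.
text_embeds = np.array([[1.0, 0.0, 0.0],   # hypothetical embedding for class 0
                        [0.0, 1.0, 0.0]])  # hypothetical embedding for class 1
patch_feats = np.zeros((2, 2, 3))
patch_feats[0, :, 0] = 1.0  # top row points toward class 0
patch_feats[1, :, 1] = 2.0  # bottom row points toward class 1

labels = open_vocab_segment(patch_feats, text_embeds)
print(labels)
```

Because the class list is just the rows of `text_embeds`, adding a new category at inference time only requires embedding one more name; no retraining is implied by this sketch. A coarse label map like this carries no boundary information of its own, which is the gap the summary attributes to SAM's structural priors.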

multimodal architecture evaluation

Key Terms

open-vocabulary-detection vision-foundation-models structural-alignment pseudo-masks