By aligning DINO's semantic features with SAM's structural priors through specialized encoder-decoder modules, you can achieve both semantic generalization and precise edge detection for segmentation tasks without predefined categories.
This paper tackles open-vocabulary segmentation (identifying and outlining objects in images even when their categories are absent from the training set) by combining two foundation models: DINO for semantic understanding and SAM for precise edge detection.
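To make the division of labour concrete, here is a minimal sketch of one common way to pair a semantic feature map with class-agnostic masks. It is not the paper's encoder-decoder alignment. It swaps in a much simpler technique: average the features inside each mask, then pick the class with the highest cosine similarity. The function name `label_masks`, the array shapes, and the toy data are all assumptions made for illustration. The random-free toy arrays stand in for real DINO features, SAM masks, and class-name embeddings.

```python
import numpy as np

def label_masks(features, masks, class_embeddings):
    """Assign each class-agnostic mask the index of its best-matching class.

    features:         (H, W, D) per-pixel semantic features (stand-in for DINO output)
    masks:            (M, H, W) boolean masks (stand-in for SAM output)
    class_embeddings: (C, D) one embedding per candidate class name (hypothetical)
    """
    # L2-normalise class embeddings so dot products become cosine similarities.
    classes = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    labels = []
    for mask in masks:
        # Average the semantic features over the pixels inside this mask.
        pooled = features[mask].mean(axis=0)
        pooled = pooled / np.linalg.norm(pooled)
        # Pick the class whose embedding is most similar to the pooled feature.
        labels.append(int(np.argmax(classes @ pooled)))
    return labels

# Toy 4x4 image: left half carries feature [1, 0], right half carries [0, 1].
features = np.zeros((4, 4, 2))
features[:, :2] = [1.0, 0.0]
features[:, 2:] = [0.0, 1.0]

# Two masks with exact boundaries, one per half of the image.
masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, :, :2] = True
masks[1, :, 2:] = True

# Two candidate classes; class 0 resembles the left half, class 1 the right.
class_embeddings = np.array([[0.9, 0.1], [0.2, 0.8]])

print(label_masks(features, masks, class_embeddings))  # [0, 1]
```

In this sketch the masks alone decide where the boundaries fall, while the features alone decide which label each region receives. Because the class list is just an array of embeddings, it can be changed at inference time without retraining. The sketch assumes every mask is non-empty, since an empty mask would produce an undefined average.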