A model that accepts flexible user inputs (like text descriptions, points, or bounding boxes) to guide what it should identify or process in an image.
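
As a rough illustration of that interface, the sketch below models the three prompt kinds as small Python types and passes them to a single entry point. Everything here is hypothetical and for illustration only: the names `PointPrompt`, `BoxPrompt`, `TextPrompt`, and `select_region` are assumptions, the "image" is a toy grid of label strings, and the lookup logic merely stands in for a trained model.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class PointPrompt:
    """A single clicked pixel (hypothetical type)."""
    x: int
    y: int


@dataclass(frozen=True)
class BoxPrompt:
    """A bounding box with inclusive corners (hypothetical type)."""
    x0: int
    y0: int
    x1: int
    y1: int


@dataclass(frozen=True)
class TextPrompt:
    """A free-text description (hypothetical type)."""
    text: str


Prompt = Union[PointPrompt, BoxPrompt, TextPrompt]

# Toy "image": each pixel holds a label string instead of color values.
IMAGE = [
    ["sky", "sky", "sky"],
    ["cat", "cat", "sky"],
    ["cat", "cat", "grass"],
]


def select_region(image: List[List[str]], prompt: Prompt) -> List[List[bool]]:
    """Toy stand-in for a promptable model: return a boolean mask.

    A real model would predict the mask; here it is looked up from labels.
    """
    if isinstance(prompt, PointPrompt):
        # Select every pixel sharing the label of the clicked pixel.
        target = image[prompt.y][prompt.x]
        return [[px == target for px in row] for row in image]
    if isinstance(prompt, BoxPrompt):
        # Select every pixel inside the box.
        return [
            [prompt.x0 <= x <= prompt.x1 and prompt.y0 <= y <= prompt.y1
             for x in range(len(row))]
            for y, row in enumerate(image)
        ]
    if isinstance(prompt, TextPrompt):
        # Select every pixel whose label matches the description.
        return [[px == prompt.text for px in row] for row in image]
    raise TypeError(f"unsupported prompt type: {type(prompt).__name__}")


cat_by_text = select_region(IMAGE, TextPrompt("cat"))
cat_by_point = select_region(IMAGE, PointPrompt(x=0, y=1))
cat_by_box = select_region(IMAGE, BoxPrompt(x0=0, y0=1, x1=1, y1=2))
```

In this toy setup, all three prompts pick out the same region, which is the point of the definition: the task stays fixed while the form of the user's guidance varies.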