You can now guide vision models with text prompts to focus on non-obvious visual concepts while maintaining strong performance on generic vision tasks, without needing separate language-centric models.
This paper introduces steerable visual representations that can be guided by natural language to focus on specific objects or concepts in images.