Steerable Visual Representations

Jona Ruthardt, Manu Gaur, Deva Ramanan, Makarand Tapaswi, Yuki M. Asano | April 2, 2026 | arXiv

Key Takeaway

You can now guide vision models with text prompts to focus on non-obvious visual concepts while maintaining strong performance on generic vision tasks, without needing separate language-centric models.

Summary

This paper introduces steerable visual representations that can be guided by natural language to focus on specific objects or concepts in images.
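
This summary does not describe the paper's actual mechanism, so the following is only a generic illustration of how a text prompt can steer image features: patch tokens act as queries and attend to text tokens through cross-attention. All names, shapes, and the random weights here are assumptions made for illustration, not the authors' architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(patches, text, w_q, w_k, w_v):
    """Hypothetical helper: image patch tokens (queries) attend to
    text tokens (keys/values); the result is added back residually."""
    q = patches @ w_q                          # (P, d)
    k = text @ w_k                             # (T, d)
    v = text @ w_v                             # (T, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (P, T)
    attn = softmax(scores, axis=-1)            # each row sums to 1
    return patches + attn @ v, attn

# Toy inputs: 16 patch tokens and 4 text tokens, both of width 32 (assumed sizes).
rng = np.random.default_rng(0)
P, T, d = 16, 4, 32
patches = rng.normal(size=(P, d))
text = rng.normal(size=(T, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

steered, attn = cross_attend(patches, text, w_q, w_k, w_v)
print(steered.shape, attn.shape)
```

In this sketch, changing the text tokens changes the attention weights and therefore the patch features, which is the sense in which the representation is "steerable". A real model would use learned projections, multiple heads, and normalization layers.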

multimodal architecture evaluation

Key Terms

vision-transformer, early-fusion, cross-attention, zero-shot-generalization