When training with noisy labels, anchoring text prompts to visual evidence makes them more robust: visual information is inherently more reliable than potentially incorrect labels, so using it to guide prompt updates reduces memorization of mislabeled samples.
VisPrompt is a lightweight framework that makes prompt learning for vision-language models robust to mislabeled training data. It stabilizes prompt learning by injecting image semantics into the text prompts through a cross-modal attention mechanism, while adaptively controlling how much visual information each sample contributes.
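The core mechanism can be sketched as follows: learnable prompt tokens attend over image patch features, and the resulting visual context is mixed back into the prompts through a per-sample gate. This is a minimal NumPy sketch under stated assumptions; the function names, weight shapes, and the idea of deriving the gate from prediction confidence are illustrative, not VisPrompt's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def visual_guided_prompt(prompt, patches, Wq, Wk, Wv, gate):
    """Inject image semantics into learnable text prompt tokens.

    prompt:  (P, d) learnable prompt token embeddings
    patches: (N, d) image patch features from the vision encoder
    gate:    scalar in [0, 1]; how much visual context this sample receives
    (Hypothetical interface; the real framework's API is not specified here.)
    """
    q = prompt @ Wq                                  # prompts attend to ...
    k = patches @ Wk                                 # ... image patches
    v = patches @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (P, N) attention weights
    visual_ctx = attn @ v                            # (P, d) visual summary
    return prompt + gate * visual_ctx                # gated residual injection

# Toy usage with random features (assumed shapes, for illustration only).
rng = np.random.default_rng(0)
d, P, N = 8, 4, 16
prompt = rng.normal(size=(P, d))
patches = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = visual_guided_prompt(prompt, patches, Wq, Wk, Wv, gate=0.5)
print(out.shape)  # (4, 8): prompts keep their shape, enriched with image info
```

With `gate=0` the prompts pass through unchanged, so the gate gives a continuous dial between purely textual prompts and strongly image-anchored ones, which is what allows per-sample control over the visual contribution.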