Representation geometry shapes task performance in vision-language modeling for CT enterography

Cristian Minoccheri, Emily Wittrup, Kayvan Najarian, Ryan Stidham | April 14, 2026 | arXiv

Key Takeaway

For medical imaging with vision-language models, representation geometry matters: how slice information is aggregated and how tissue properties are encoded have a bigger impact on performance than simply adding more spatial coverage.

Summary

This study explores how to best represent CT scan slices in vision-language models for diagnosing inflammatory bowel disease. The researchers find that different ways of combining slice embeddings work better for different tasks: simple averaging helps disease classification, while attention-based pooling improves image-text matching.
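The two aggregation strategies named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The array shapes, the random embeddings, and the single query vector (which stands in for a learned parameter) are all assumptions made for illustration, and the summary does not specify the exact form of attention pooling used in the study.

```python
import numpy as np


def mean_pool(slice_embeddings):
    """Average slice embeddings of shape (num_slices, dim) into one (dim,) vector."""
    return slice_embeddings.mean(axis=0)


def attention_pool(slice_embeddings, query):
    """Combine slices using softmax weights from a scaled dot product with a query.

    `query` is a hypothetical stand-in for a learned parameter of shape (dim,).
    Returns the pooled (dim,) vector and the per-slice weights.
    """
    dim = slice_embeddings.shape[1]
    scores = slice_embeddings @ query / np.sqrt(dim)
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ slice_embeddings, weights


# Hypothetical study: 8 slices, each embedded in 16 dimensions.
rng = np.random.default_rng(0)
slices = rng.normal(size=(8, 16))
query = rng.normal(size=16)

study_mean = mean_pool(slices)
study_attn, attn_weights = attention_pool(slices, query)
```

Mean pooling treats every slice equally, while attention pooling lets some slices dominate the study-level vector. With an all-zero query the attention weights become uniform and the two functions return the same vector, which makes the relationship between them easy to check.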

multimodal evaluation

Key Terms

vision-language-model mean-pooling attention-pooling rag lora