For medical imaging with vision-language models, representation geometry matters more than you might expect: how you aggregate information and encode tissue properties has a bigger impact on performance than simply adding more spatial coverage.
This study explores how to best represent CT scan slices in vision-language models for diagnosing inflammatory bowel disease. The researchers find that different ways of combining slice embeddings work better for different tasks: simple averaging helps disease classification, while attention-based pooling improves image-text matching.
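
To make the contrast between the two aggregation strategies concrete, here is a minimal NumPy sketch. It is not the study's implementation. The function names, the array shapes, and the single `query` vector are assumptions made for illustration; in a real model the query would typically be learned, and the attention mechanism could be more elaborate.

```python
import numpy as np


def mean_pool(slice_embeddings: np.ndarray) -> np.ndarray:
    """Average slice embeddings of shape (num_slices, dim) into one (dim,) vector."""
    return slice_embeddings.mean(axis=0)


def attention_pool(slice_embeddings: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Weighted sum of slices; weights are a softmax over scaled dot products
    between each slice embedding and a query vector (hypothetical, normally learned)."""
    dim = slice_embeddings.shape[1]
    scores = slice_embeddings @ query / np.sqrt(dim)  # one score per slice
    scores = scores - scores.max()                    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax, sums to 1
    return weights @ slice_embeddings


# Toy data: 8 slices, each with a 16-dimensional embedding (illustrative sizes).
rng = np.random.default_rng(0)
slices = rng.normal(size=(8, 16))
query = rng.normal(size=16)

scan_mean = mean_pool(slices)
scan_attn = attention_pool(slices, query)
```

Mean pooling treats every slice equally, while attention pooling lets slices that align with the query dominate the scan-level vector. With an all-zero query the attention weights become uniform and the two functions return the same result.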