The size of the images the model can process, measured in pixels (e.g. 224px, usually meaning 224×224); lower resolution means faster processing but less visual detail captured.
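To make the speed/detail trade-off concrete, here is a minimal sketch that assumes a ViT-style encoder which splits a square image into non-overlapping patches. The function name `patch_token_count` and the 14px patch size are hypothetical and chosen for illustration only; actual models may use a different patch size or a different way of handling images.

```python
def patch_token_count(resolution_px: int, patch_px: int = 14) -> int:
    """Count the non-overlapping square patches in a square image.

    Assumption: the image is resolution_px x resolution_px and the encoder
    turns each patch_px x patch_px patch into one token.
    """
    if resolution_px <= 0 or patch_px <= 0:
        raise ValueError("resolution_px and patch_px must be positive")
    per_side = resolution_px // patch_px  # whole patches that fit along one side
    return per_side * per_side


if __name__ == "__main__":
    for res in (224, 336, 448):
        print(f"{res}px -> {patch_token_count(res)} patch tokens")
```

Under these assumptions, doubling the resolution from 224px to 448px quadruples the number of patches the model has to process, which is one reason lower resolutions run faster while capturing less detail.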
Quality of vision, audio, and image understanding (distinct from modality support: a model can accept an input type without understanding it well).