A standardized test dataset used to evaluate AI models on tasks that combine multiple types of input, such as images and text.
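
As a rough illustration of the definition above, the sketch below models a tiny benchmark as a list of items that pair an image reference with a text question, and scores a model by exact-match accuracy. Every name here (`Example`, `accuracy`, `always_red`, the file paths) is hypothetical and chosen for illustration only. Real benchmarks typically use richer data formats and task-specific metrics.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    """One hypothetical benchmark item: two input types plus a reference answer."""
    image_path: str  # placeholder path; the file is never opened in this sketch
    question: str
    answer: str


def accuracy(model: Callable[[str, str], str], examples: Sequence[Example]) -> float:
    """Fraction of examples where the model's answer matches the reference.

    Matching is case-insensitive and ignores surrounding whitespace.
    """
    if not examples:
        raise ValueError("benchmark has no examples")
    correct = sum(
        model(ex.image_path, ex.question).strip().lower() == ex.answer.strip().lower()
        for ex in examples
    )
    return correct / len(examples)


# Toy data (assumed, not from any real benchmark).
BENCHMARK = [
    Example("img/apple.png", "What color is the fruit?", "red"),
    Example("img/sky.png", "What color is the sky?", "blue"),
]


def always_red(image_path: str, question: str) -> str:
    """Stand-in for a model: ignores both inputs and always answers 'red'."""
    return "red"


print(accuracy(always_red, BENCHMARK))  # 0.5
```

Because every model is scored on the same fixed items with the same metric, results from different models can be compared directly, which is the sense in which such a dataset is standardized.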