Using multimodal large language models to evaluate outputs by assessing both visual and semantic correctness with rubrics.