Evaluation criteria that assess the meaning and correctness of agent outputs, going beyond surface-level metrics such as exact string match or token overlap.
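As a rough illustration of the distinction, the sketch below contrasts a surface-level check with a meaning-level check on numeric answers. The function names (`exact_match`, `parse_number`, `semantically_equal`) and the choice of numeric equivalence as the criterion are assumptions made for this example, not part of any particular evaluation framework.

```python
from fractions import Fraction


def exact_match(output: str, reference: str) -> bool:
    """Surface-level check: strings must be identical after trimming."""
    return output.strip() == reference.strip()


def parse_number(text: str) -> Fraction:
    """Parse strings like '0.5', '1/2', or '50%' into an exact Fraction."""
    text = text.strip()
    if text.endswith("%"):
        return Fraction(text[:-1].strip()) / 100
    return Fraction(text)


def semantically_equal(output: str, reference: str) -> bool:
    """Meaning-level check (hypothetical): both strings denote the same number."""
    try:
        return parse_number(output) == parse_number(reference)
    except (ValueError, ZeroDivisionError):
        # Unparseable output is treated as incorrect rather than raising.
        return False


if __name__ == "__main__":
    for candidate in ["1/2", "0.5", "50%", "0.6"]:
        print(candidate, exact_match(candidate, "1/2"), semantically_equal(candidate, "1/2"))
```

Here an agent that answers `0.5` or `50%` when the reference is `1/2` fails the surface-level check but passes the meaning-level one. Real semantic criteria would likely cover far more than numbers; this only shows the shape of the idea.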