Current LLM evaluation metrics fail to catch semantic contradictions, potentially hiding serious errors. MATCHA solves this by explicitly measuring both agreement with correct answers and distance from contradictory statements.
MATCHA is a new evaluation metric for LLMs that fixes a critical flaw in popular metrics like ROUGE and BERTScore: they can score a text that contradicts the reference about as highly as one that agrees with it. MATCHA takes a dual approach, rewarding similarity to correct answers while penalizing similarity to contradictory ones, and significantly outperforms existing metrics across question answering, summarization, and other tasks.
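
The summary above does not state MATCHA's actual formula, so the following is only a minimal sketch of the dual idea, not the authors' method. It assumes a toy bag-of-words cosine similarity as a stand-in for a real text-similarity model, and the names `cosine_similarity`, `dual_score`, and `penalty_weight` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity (assumed stand-in for a real similarity model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dual_score(candidate: str, reference: str, contradiction: str,
               penalty_weight: float = 1.0) -> float:
    """Hypothetical dual score: reward closeness to the reference,
    penalize closeness to a known contradictory statement."""
    reward = cosine_similarity(candidate, reference)
    penalty = cosine_similarity(candidate, contradiction)
    return reward - penalty_weight * penalty


reference = "the drug is safe for children"
contradiction = "the drug is not safe for children"

# A surface-overlap similarity rates the contradiction almost as highly as a match.
print(cosine_similarity(reference, contradiction))

# The dual score separates a candidate that agrees from one that contradicts.
print(dual_score(reference, reference, contradiction))      # positive
print(dual_score(contradiction, reference, contradiction))  # negative
```

In this sketch a single inserted "not" barely lowers the plain similarity, which is the failure mode described above, while subtracting the similarity to the contradictory statement flips the sign of the score for the contradicting candidate.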