Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement

Sangwoo Cho, Kushal Chawla, Pengshan Cai, Zefang Liu, Chenyang Zhu et al. | June 25, 2026 | arXiv

Key Takeaway

Instead of asking an LLM for a single opaque score, ask it multiple specific binary questions about output quality, then aggregate the answers—this gives you both better evaluation accuracy and actionable feedback for improvement.

Summary

BINEVAL breaks LLM evaluation down into simple yes/no questions about specific criteria, then combines the answers into interpretable scores. This makes evaluation transparent, debuggable, and useful for improving prompts. BINEVAL matches or beats existing LLM judges while being easier to understand and fix.
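
The decompose-then-aggregate idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `BinaryCheck` type, the `aggregate` function, the example questions, and the unweighted-mean aggregation are all assumptions made for illustration, since the summary does not say how BINEVAL words its questions or combines the answers. The yes/no answers are hard-coded here; in practice they would come from an LLM.

```python
from dataclasses import dataclass


@dataclass
class BinaryCheck:
    """One yes/no question about a single quality criterion (hypothetical type)."""
    criterion: str
    question: str
    answer: bool  # True = "yes"; in practice this would be an LLM's answer


def aggregate(checks: list[BinaryCheck]) -> tuple[float, list[str]]:
    """Combine binary answers into a score plus the list of failed criteria.

    Assumption: the score is the unweighted fraction of "yes" answers.
    The paper may use a different aggregation.
    """
    if not checks:
        raise ValueError("at least one binary check is required")
    score = sum(c.answer for c in checks) / len(checks)
    failed = [c.criterion for c in checks if not c.answer]
    return score, failed


# Hard-coded answers standing in for LLM responses about one summary.
checks = [
    BinaryCheck("factual-consistency", "Is every claim supported by the source?", True),
    BinaryCheck("coverage", "Does the summary mention the main finding?", True),
    BinaryCheck("conciseness", "Is the summary free of repeated content?", False),
]

score, failed = aggregate(checks)
print(f"score={score:.2f}, failed={failed}")
```

The failed-criteria list is what makes the result actionable: rather than a single opaque number, it names the question the output did not pass, which points to what the prompt needs to address.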

evaluation reasoning

Key Terms

llm-judge, prompt-engineering, factual-consistency, decomposition