Instead of asking an LLM for a single opaque score, ask it several specific binary questions about output quality, then aggregate the answers. This gives you both better evaluation accuracy and actionable feedback for improvement.
BINEVAL breaks LLM evaluation down into simple yes/no questions about specific criteria, then combines the answers into interpretable scores. This makes evaluation transparent, debuggable, and useful for improving prompts. The approach matches or beats existing LLM judges while being easier to understand and fix.
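
To make the idea concrete, here is a minimal Python sketch of the ask-then-aggregate pattern. It is not BINEVAL's actual implementation: the names `BinaryQuestion`, `evaluate`, and `keyword_judge` are hypothetical, the weighted fraction of "yes" answers is an assumed aggregation rule (the text above does not say how answers are combined), and the judge is a deterministic stand-in so the example runs without any LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class BinaryQuestion:
    criterion: str       # short label used in the feedback, e.g. "concise"
    text: str            # the yes/no question a real judge would be asked
    weight: float = 1.0  # assumed: relative importance of this criterion


# A judge maps (output, question) to True ("yes") or False ("no").
Judge = Callable[[str, BinaryQuestion], bool]


def evaluate(
    output: str, questions: List[BinaryQuestion], judge: Judge
) -> Tuple[float, List[str]]:
    """Return (weighted fraction of 'yes' answers, criteria answered 'no')."""
    total = sum(q.weight for q in questions)
    if total <= 0:
        raise ValueError("questions must have positive total weight")
    answers = [(q, judge(output, q)) for q in questions]
    score = sum(q.weight for q, yes in answers if yes) / total
    failed = [q.criterion for q, yes in answers if not yes]
    return score, failed


def keyword_judge(output: str, question: BinaryQuestion) -> bool:
    """Hypothetical stand-in for an LLM call: fixed rules keyed by criterion."""
    rules = {
        "mentions_source": lambda s: "according to" in s.lower(),
        "concise": lambda s: len(s.split()) <= 20,
        "no_hedging": lambda s: "maybe" not in s.lower(),
    }
    return rules[question.criterion](output)


QUESTIONS = [
    BinaryQuestion("mentions_source", "Does the answer cite a source?"),
    BinaryQuestion("concise", "Is the answer 20 words or fewer?"),
    BinaryQuestion("no_hedging", "Does the answer avoid the word 'maybe'?"),
]

score, failed = evaluate("Maybe the capital is Paris.", QUESTIONS, keyword_judge)
print(f"score={score:.2f}, failed={failed}")
```

In this sketch the list of failed criteria is what makes the score actionable: each "no" points at one specific thing to fix in the prompt or the output. In practice `keyword_judge` would be replaced by a function that sends `question.text` and the output to an LLM and parses a yes/no reply.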