LLMs often express high confidence in wrong answers, and standard evaluation metrics miss this problem. BAS provides a decision-focused alternative that rewards models for knowing when to say 'I don't know' instead of guessing confidently.
This paper introduces BAS (Behavioral Alignment Score), a new metric for measuring whether LLMs' confidence levels are actually useful for deciding when to abstain from answering. Unlike standard metrics that treat all errors equally, BAS penalizes overconfident wrong answers more heavily, reflecting real-world decision-making where false confidence is costlier than admitting uncertainty.
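This summary does not give the BAS formula, so the following is only an illustrative sketch of the general idea, not the paper's definition. It assumes a hypothetical answer-or-abstain rule: the model answers when its confidence reaches a risk threshold t, earns +1 for a correct answer, loses t / (1 - t) for a wrong one, and scores 0 when it abstains. Averaging over a grid of thresholds (also an assumption) makes a wrong answer cost more the higher its stated confidence. The names `utility_at_threshold`, `averaged_utility`, and `THRESHOLDS` are invented for this example.

```python
# Illustrative sketch only: an asymmetric answer-or-abstain utility.
# This is NOT the BAS formula from the paper; the payoff values and the
# threshold grid below are assumptions chosen for illustration.

def utility_at_threshold(confidence, correct, threshold):
    """Payoff for one answer when the model abstains below `threshold`."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    if confidence < threshold:
        return 0.0  # abstain: no reward, no penalty
    if correct:
        return 1.0  # answered and right
    # Answered and wrong: the penalty grows as the threshold gets stricter.
    return -threshold / (1.0 - threshold)


def averaged_utility(confidence, correct, thresholds):
    """Mean payoff of one answer across a set of risk thresholds."""
    if not thresholds:
        raise ValueError("thresholds must be non-empty")
    total = sum(utility_at_threshold(confidence, correct, t) for t in thresholds)
    return total / len(thresholds)


# Hypothetical grid of risk thresholds: 0.1, 0.2, ..., 0.9.
THRESHOLDS = [k / 10 for k in range(1, 10)]

if __name__ == "__main__":
    for conf in (0.05, 0.6, 0.9):
        wrong = averaged_utility(conf, False, THRESHOLDS)
        right = averaged_utility(conf, True, THRESHOLDS)
        print(f"confidence={conf:.2f}  wrong={wrong:.3f}  right={right:.3f}")
```

Under these assumptions, a wrong answer given at confidence 0.9 scores lower than the same wrong answer given at 0.6, while a wrong answer at confidence 0.05 costs nothing because the model abstains at every threshold on the grid. Plain accuracy would score all three identically.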