The difference between how well models assess their own confidence and how well humans evaluate belief certainty against evidence.