Creating evaluation criteria by comparing teacher and model responses, then using the gaps between them to identify what distinguishes good outputs from bad ones.
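
As a rough illustration of the idea, the sketch below counts how often a teacher response satisfies a candidate check that the paired model response fails; the checks that fail most often are candidates for evaluation criteria. Everything here is an assumption made for illustration: the check names, the string heuristics, the sample pairs, and the `gap_criteria` helper are hypothetical, and a real pipeline might derive the candidate checks with a language model rather than hand-written predicates.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical candidate checks: each maps a response string to True/False.
Check = Callable[[str], bool]

CANDIDATE_CHECKS: Dict[str, Check] = {
    "states_units": lambda r: any(u in r for u in (" km", " kg", " ms")),
    "shows_steps": lambda r: "step" in r.lower(),
    "gives_final_answer": lambda r: "answer:" in r.lower(),
}


def gap_criteria(
    pairs: Iterable[Tuple[str, str]],
    checks: Dict[str, Check] = CANDIDATE_CHECKS,
) -> List[Tuple[str, int]]:
    """Rank checks by how often the teacher passes while the model fails."""
    gaps: Counter = Counter()
    for teacher, model in pairs:
        for name, check in checks.items():
            # A "gap" is a property the teacher response has and the model lacks.
            if check(teacher) and not check(model):
                gaps[name] += 1
    return gaps.most_common()


# Illustrative (made-up) teacher/model response pairs.
pairs = [
    ("Step 1: convert. Answer: 5 km", "It is 5"),
    ("Step 1: add. Answer: 12 kg", "Answer: 12"),
]
ranked = gap_criteria(pairs)
for name, count in ranked:
    print(f"{name}: teacher passes, model fails in {count} pair(s)")
```

Checks with the highest gap counts are the ones that most consistently separate teacher from model responses, so they are reasonable first drafts of criteria to review by hand.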