Language-based feedback preserves more information than scalar signals when learning from imperfect data, enabling policies to understand not just what went wrong but why and how to fix it.
This paper proposes using natural language critiques as structured supervision signals for learning from suboptimal demonstrations. Instead of compressing feedback into a scalar score, the method generates language labels that describe task progress, failures, and corrections, and then trains policies directly on these labels.
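To make the contrast with scalar feedback concrete, the following is a minimal sketch of what a language-labeled demonstration step might look like. Everything in it is an assumption made for illustration: the `Critique` and `LabeledStep` classes, the `scalar_score` baseline, the `to_training_example` helper, and the example critiques are hypothetical and are not taken from the paper. How the critique text actually enters the training objective is not specified here, so the sketch stops at assembling the training tuple.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Critique:
    """Hypothetical language label with the three parts named in the text."""
    progress: str
    failure: str      # empty string means no failure was noted
    correction: str   # empty string means no correction was suggested

    def as_text(self) -> str:
        # Flatten the three fields into one string a text encoder could consume.
        return (f"Progress: {self.progress}. "
                f"Failure: {self.failure or 'none'}. "
                f"Correction: {self.correction or 'none'}.")


@dataclass
class LabeledStep:
    """One step of a (possibly suboptimal) demonstration plus its critique."""
    observation: List[float]
    action: List[float]
    critique: Critique


def scalar_score(critique: Critique) -> float:
    """Assumed lossy baseline: 1.0 if no failure was noted, else 0.0."""
    return 0.0 if critique.failure else 1.0


def to_training_example(step: LabeledStep) -> Tuple[List[float], str, List[float]]:
    """Return (observation, critique text, demonstrated action)."""
    return step.observation, step.critique.as_text(), step.action


# Two made-up failures that a scalar score cannot tell apart.
early_close = LabeledStep(
    observation=[0.1, 0.4],
    action=[0.0, 1.0],
    critique=Critique("reached the cup", "gripper closed too early",
                      "close only after contact"),
)
wrong_object = LabeledStep(
    observation=[0.3, 0.2],
    action=[1.0, 0.0],
    critique=Critique("approached the table", "moved toward the wrong object",
                      "target the cup on the left"),
)

if __name__ == "__main__":
    for step in (early_close, wrong_object):
        print(scalar_score(step.critique), "|", step.critique.as_text())
```

Both steps collapse to the same scalar under this assumed baseline, while the text labels still distinguish the failure and the suggested fix, which is the information the summary argues a policy can learn from.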