Task accuracy and conversational awareness are separate capabilities: a model can answer a question correctly without understanding how users naturally respond to that answer, revealing a blind spot in current LLM evaluation. This paper probes that gap by asking models to generate the next user message following an assistant response, a task that requires modeling interaction flow rather than just producing correct outputs.
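
To make the probe concrete, here is a minimal sketch of what a next-user-turn elicitation might look like, assuming an OpenAI-compatible chat API. The model name, prompt wording, and the `probe_next_user_turn` helper are illustrative assumptions, not the paper's actual protocol.

```python
# Minimal sketch of a next-user-turn probe (illustrative, not the
# paper's actual setup). Assumes an OpenAI-compatible chat API and
# an API key in the environment.
from openai import OpenAI

client = OpenAI()

def probe_next_user_turn(
    user_msg: str,
    assistant_msg: str,
    model: str = "gpt-4o-mini",  # hypothetical choice of model
) -> str:
    """Ask the model to continue the conversation as the *user*,
    given one completed user/assistant exchange."""
    instruction = (
        "Below is one turn of a conversation between a user and an "
        "assistant. Write the most plausible next message FROM THE USER. "
        "Reply with the user's message only."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": instruction},
            {
                "role": "user",
                "content": (
                    f"User: {user_msg}\n"
                    f"Assistant: {assistant_msg}\n"
                    "Next user message:"
                ),
            },
        ],
    )
    return response.choices[0].message.content

# A correct answer typically draws a follow-up question or a topic
# shift, not a restatement of the answer itself.
print(probe_next_user_turn(
    "What's the capital of Australia?",
    "The capital of Australia is Canberra.",
))
```

The point of such a probe is that it cannot be answered from task knowledge alone: the model must predict how a real user would react to the assistant's message, which is exactly the conversational awareness the paper argues standard benchmarks fail to measure.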