LLMs can maintain surface-level syntax when following grammars but fail at deeper semantic interpretation, especially with complex nested structures. This is a critical limitation for building reliable AI agents that must follow formal specifications.
This paper tests whether large language models can correctly interpret and follow context-free grammars (formal rules for generating structured output). The researchers introduce RoboGrid, a testing framework that checks whether LLMs produce syntactically correct and semantically meaningful outputs when given novel grammars.
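The syntactic side of such a conformance check can be sketched with a hand-written recursive-descent recognizer for a small context-free grammar. The grammar, token set, and function names below are illustrative assumptions in the spirit of a robot-command language, not RoboGrid's actual grammar or implementation:

```python
import re

# Hypothetical mini-grammar (an assumption for illustration):
#   prog := cmd (';' cmd)*
#   cmd  := 'move' '(' dir ')' | 'repeat' '(' INT ',' prog ')'
#   dir  := 'north' | 'south' | 'east' | 'west'
TOKEN = re.compile(r"\s*(move|repeat|north|south|east|west|\d+|[();,])")

def tokenize(text):
    """Split text into grammar tokens; return None on a lexing error."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            return None  # character outside the grammar's alphabet
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_cmd(toks, i):
    """Try to parse one cmd at index i; return index past it, or None."""
    if i is None or i >= len(toks):
        return None
    if toks[i] == "move":
        if (toks[i + 1:i + 2] == ["("]
                and toks[i + 2:i + 3] and toks[i + 2] in {"north", "south", "east", "west"}
                and toks[i + 3:i + 4] == [")"]):
            return i + 4
        return None
    if toks[i] == "repeat":
        if (toks[i + 1:i + 2] == ["("]
                and toks[i + 2:i + 3] and toks[i + 2].isdigit()
                and toks[i + 3:i + 4] == [","]):
            j = parse_prog(toks, i + 4)  # nested program: recursion handles depth
            if j is not None and toks[j:j + 1] == [")"]:
                return j + 1
        return None
    return None

def parse_prog(toks, i):
    """Parse cmd (';' cmd)* starting at i; return index past it, or None."""
    i = parse_cmd(toks, i)
    while i is not None and i < len(toks) and toks[i] == ";":
        i = parse_cmd(toks, i + 1)
    return i

def conforms(text):
    """Return True iff the whole string derives from the start symbol."""
    tokens = tokenize(text)
    if tokens is None:
        return False
    return parse_prog(tokens, 0) == len(tokens)  # all input must be consumed
```

Running an LLM's output through `conforms` catches syntax violations, including unbalanced nesting in `repeat(...)`; checking semantic correctness (does the program do what was asked?) requires a separate evaluation step, which is the harder part the paper highlights.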