LLMs can partially infer rules from demonstrations, but they struggle with procedural rules and often fail during multi-step execution, which suggests that rule induction remains a significant limitation for current models.
This paper introduces HERO'S JOURNEY, a benchmark that tests whether large language models can learn hidden rules from examples and then follow those rules to complete multi-step tasks. The benchmark comprises eight tasks with varying rule types. The authors find that current LLMs struggle with this kind of rule learning, especially when the rules describe procedures rather than simple attributes.