HERO'S JOURNEY: Testing Complex Rule Induction with Text Games

Anshun Asher Zheng, Kanishka Misra, David I. Beaver, Junyi Jessy Li | June 1, 2026 | arXiv

Key Takeaway

LLMs can partially infer rules from demonstrations, but they struggle with procedural rules and often fail during multi-step execution, suggesting that rule induction remains a significant limitation of current models.

Summary

This paper introduces HERO'S JOURNEY, a benchmark that tests whether large language models can learn hidden rules from examples and then follow those rules to complete multi-step tasks. The benchmark includes eight different tasks with varying rule types and finds that current LLMs struggle with this kind of rule learning, especially when rules involve procedures rather than simple attributes.
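To make the setup concrete, here is a minimal toy sketch of the two phases the summary describes: inducing a hidden rule from demonstrations, then applying it across several steps. It is not taken from the benchmark. The item attributes, the candidate rules, and the names `induce` and `run_episode` are all hypothetical, and the hypothesis-filtering approach is just one simple way to illustrate rule induction, not how an LLM or the paper's evaluation works.

```python
from typing import Callable, Dict, List, Tuple

Item = Dict[str, str]
Rule = Callable[[Item], bool]

# Hypothetical candidate rules over hypothetical item attributes.
CANDIDATES: Dict[str, Rule] = {
    "is_red": lambda item: item["color"] == "red",
    "is_metal": lambda item: item["material"] == "metal",
    "is_small": lambda item: item["size"] == "small",
}

# Demonstrations: (item, whether using it succeeded under the hidden rule).
DEMOS: List[Tuple[Item, bool]] = [
    ({"color": "red", "material": "metal", "size": "small"}, True),
    ({"color": "red", "material": "wood", "size": "large"}, False),
    ({"color": "blue", "material": "metal", "size": "large"}, True),
]

def induce(demos: List[Tuple[Item, bool]]) -> List[str]:
    """Return the names of candidate rules consistent with every demonstration."""
    return [
        name
        for name, rule in CANDIDATES.items()
        if all(rule(item) == label for item, label in demos)
    ]

def run_episode(agent_rule: str, hidden_rule: str, steps: List[List[Item]]) -> bool:
    """Multi-step execution: at each step the agent picks the first option its
    rule accepts. A single pick that violates the hidden rule fails the episode."""
    agent, hidden = CANDIDATES[agent_rule], CANDIDATES[hidden_rule]
    for options in steps:
        choice = next((o for o in options if agent(o)), None)
        if choice is None or not hidden(choice):
            return False
    return True

# A two-step episode; the hidden rule here is assumed to be "is_metal".
STEPS: List[List[Item]] = [
    [
        {"color": "red", "material": "wood", "size": "small"},
        {"color": "blue", "material": "metal", "size": "small"},
    ],
    [{"color": "red", "material": "metal", "size": "large"}],
]

if __name__ == "__main__":
    print(induce(DEMOS))
    print(run_episode("is_metal", "is_metal", STEPS))
    print(run_episode("is_red", "is_metal", STEPS))
```

In this toy, an agent that induces the wrong rule fails at the first step, which mirrors the summary's point that an induction error compounds during multi-step execution.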

evaluation reasoning

Key Terms

rule-induction in-context-learning multi-step-execution