LLMs fail to execute multi-step procedures faithfully: accuracy collapses as procedure length increases. As a result, strong benchmark performance can mask critical weaknesses in step-by-step instruction following.
This paper tests whether large language models actually follow step-by-step procedures correctly, rather than merely producing the right final answer. The authors built a benchmark in which models must execute arithmetic algorithms of varying length and complexity.
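The core idea of such a benchmark can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual benchmark: it generates a hypothetical chain of simple arithmetic steps, computes the ground-truth trace of intermediate values, and scores a model's reported trace step by step rather than only on the final answer. All function names (`make_procedure`, `execute`, `step_accuracy`) are assumptions for illustration.

```python
import random

def make_procedure(n_steps, seed=0):
    # Hypothetical generator: a chain of simple arithmetic steps.
    rng = random.Random(seed)
    return [(rng.choice("+-*"), rng.randint(1, 9)) for _ in range(n_steps)]

def execute(start, steps):
    # Ground-truth executor: apply each step, recording every intermediate value.
    value, trace = start, [start]
    for op, operand in steps:
        if op == "+":
            value += operand
        elif op == "-":
            value -= operand
        else:
            value *= operand
        trace.append(value)
    return trace

def step_accuracy(model_trace, gold_trace):
    # Fraction of intermediate values the model reproduced exactly;
    # this exposes mid-procedure errors that final-answer scoring hides.
    matched = sum(m == g for m, g in zip(model_trace, gold_trace))
    return matched / len(gold_trace)

steps = make_procedure(5, seed=42)
gold = execute(3, steps)
# A model's self-reported trace would be scored against `gold`:
# a trace that diverges at step 3 scores lower even if it ends correctly.
```

Scoring per step, rather than per answer, is what lets procedure length become the independent variable: as `n_steps` grows, step accuracy can be plotted against length to reveal the collapse the paper reports.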