LLMs fail to execute multi-step procedures faithfully: accuracy collapses as procedure length increases. As a result, strong benchmark performance can mask critical weaknesses in step-by-step instruction following.
This paper tests whether large language models actually follow step-by-step procedures correctly, rather than merely producing the right final answer. The authors built a benchmark in which models must execute arithmetic algorithms of varying length and complexity.
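The core idea of such a benchmark can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual benchmark: it generates a hypothetical chain of simple arithmetic steps, computes the ground-truth trace of intermediate values, and scores a model's reported trace step by step rather than only on the final answer. All function names (`make_procedure`, `execute`, `step_accuracy`) are assumptions for illustration.

```python
import random

def make_procedure(n_steps, seed=0):
    # Hypothetical generator: a chain of simple arithmetic steps.
    rng = random.Random(seed)
    return [(rng.choice("+-*"), rng.randint(1, 9)) for _ in range(n_steps)]

def execute(start, steps):
    # Ground-truth executor: apply each step, recording every intermediate value.
    value, trace = start, [start]
    for op, operand in steps:
        if op == "+":
            value += operand
        elif op == "-":
            value -= operand
        else:
            value *= operand
        trace.append(value)
    return trace

def step_accuracy(model_trace, gold_trace):
    # Fraction of intermediate values the model reproduced exactly;
    # this exposes mid-procedure errors that final-answer scoring hides.
    matched = sum(m == g for m, g in zip(model_trace, gold_trace))
    return matched / len(gold_trace)

steps = make_procedure(5, seed=42)
gold = execute(3, steps)
# A model's self-reported trace would be scored against `gold`:
# a trace that diverges at step 3 scores lower even if it ends correctly.
```

Scoring per step, rather than per answer, is what lets procedure length become the independent variable: as `n_steps` grows, step accuracy can be plotted against length to reveal the collapse the paper reports.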