The Illusion of Procedural Reasoning: Measuring Long-Horizon FSM Execution in LLMs
Large language models (LLMs) have achieved remarkable results on tasks framed as reasoning problems, yet their true ability to perform procedural reasoning (executing multi-step, rule-based computations) remains unclear. Unlike algorithmic systems, which can deterministically execute long-horizon symbolic procedures, LLMs often degrade under extended reasoning chains, yet no controlled, interpretable benchmark exists to isolate and measure this collapse. We introduce Finite-State Machine (FSM) Execution as a minimal, fully interpretable framework for evaluating the procedural reasoning capacity of LLMs. In our setup, the model is given an explicit FSM definition and must execute it step by step on a sequence of input actions, maintaining state consistency across multiple turns. The task requires no world knowledge, only faithful application of deterministic transition rules, making it a direct probe of the model’s internal procedural fidelity. We measure both Turn Accuracy (local correctness) and Task Accuracy (global long-horizon correctness) to disentangle immediate computation from cumulative state maintenance. Empirical results reveal systematic degradation as the task horizon or branching complexity increases. Models perform significantly worse when rule retrieval involves high branching factors (many actions per state) than when the memory span is long (many states, few actions). Larger models show improved local accuracy but remain brittle under multi-step reasoning unless explicitly prompted to externalize intermediate steps. These findings expose a consistent illusion of procedural reasoning: LLMs can mimic algorithmic behavior for short traces but fail to sustain coherent execution as procedural depth grows. FSM-based evaluation offers a transparent, complexity-controlled probe for diagnosing this failure mode and guiding the design of inductive biases, memory mechanisms, and reasoning scaffolds that enable genuine long-horizon procedural competence.
By grounding “reasoning” in measurable execution fidelity rather than surface correctness, this work helps establish a rigorous experimental foundation for understanding and improving the algorithmic reliability of LLMs.
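The evaluation logic described above can be sketched concretely. The following is a minimal illustration, not the paper's actual benchmark code: the toy FSM, the trace format, and the function names are assumptions chosen only to show how Turn Accuracy and Task Accuracy are computed from a ground-truth execution and a hypothetical model trace.

```python
# Minimal sketch of FSM-execution evaluation.
# The FSM, action sequence, and metric implementations are illustrative
# assumptions, not the paper's released code.

def run_fsm(transitions, start, actions):
    """Ground-truth execution: deterministically apply transition rules step by step."""
    state = start
    trace = []
    for action in actions:
        state = transitions[(state, action)]
        trace.append(state)
    return trace

def score(gold_trace, model_trace):
    """Turn Accuracy: fraction of steps whose predicted state is correct (local).
    Task Accuracy: 1.0 only if the entire trace matches (global long-horizon)."""
    turn_acc = sum(g == m for g, m in zip(gold_trace, model_trace)) / len(gold_trace)
    task_acc = float(gold_trace == model_trace)
    return turn_acc, task_acc

# Toy 2-state FSM with branching factor 2 (two actions per state).
T = {("A", "x"): "B", ("A", "y"): "A",
     ("B", "x"): "A", ("B", "y"): "B"}

gold = run_fsm(T, "A", ["x", "y", "x", "x"])  # ground truth: ["B", "B", "A", "B"]
model = ["B", "B", "B", "B"]                   # hypothetical model output with one slip
print(score(gold, model))                      # -> (0.75, 0.0)
```

The gap between the two metrics is the point of the design: a single mid-trace error leaves Turn Accuracy high while driving Task Accuracy to zero, separating local computation from cumulative state maintenance.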