Procedural Refinement by LLM-driven Algorithmic Debugging for ARC-AGI-2
In complex code-generation tasks, conversation-based LLM code repair exhibits limited ability to recover from first-pass programming errors, as such code revisions are usually driven by LLMs’ “plausible reasoning” rather than a formal, algorithmic debugging procedure. However, a formal foundation for such debugging exists in Ehud Shapiro’s theory of algorithmic program debugging (APD), which frames program repair as an explicit, stepwise procedural refinement process. In this paper, we propose a neuro-symbolic procedural refinement approach, Abduction-Based Procedural Refinement (ABPR), which couples an LLM with a meta-interpreter that materialises program execution into compact, declarative tree-structured traces, following the principles of APD. We evaluate ABPR on ARC-AGI-2, a benchmark requiring strong abstraction and debugging capabilities, and adopt Prolog as the target language because its declarative semantics are well-suited to algorithmic program debugging. Our experiments show that ABPR paired with Gemini-3-Flash achieves a Pass@2 score of 56.67% even in a language in which contemporary LLMs typically underperform. These results point towards a more auditable paradigm for program repair that integrates LLMs with classical formal methods.
💡 Research Summary
The paper addresses a fundamental limitation of current large language model (LLM)–based code generation: while LLMs can produce syntactically plausible programs, they lack a systematic, formally grounded method for correcting the inevitable errors that arise on complex tasks. The authors revive Ehud Shapiro’s Algorithmic Program Debugging (APD) framework—a classic symbolic AI approach that treats debugging as a structured search over a “debugging tree” representing the logical structure of program execution. They then integrate APD with modern neural models to create a neuro‑symbolic system called Abduction‑Based Procedural Refinement (ABPR).
Core ideas
- Declarative execution traces – A lightweight Prolog meta‑interpreter reifies the execution of a candidate program into a tree‑structured trace (tree(Goal, SubTraces)). Each node records the resolved goal, its sub‑derivations, variable bindings, and success/failure status. This trace provides concrete semantic evidence that can be inspected by an LLM.
- LLM as oracle and repairer – The LLM plays two roles. First, as an “oracle” it traverses the trace, identifies a set of suspicious nodes (the bug candidate set N*), and explains why they are inconsistent with the intended specification. Second, as a “repairer” it proposes minimal abductive modifications (e.g., swapping an operator, adding/removing a literal) to the identified nodes. The process mirrors classic abductive reasoning but is driven by stochastic sampling from the LLM’s conditional distribution.
- Iterative refinement loop – ABPR operates as a state machine (Algorithm 1). Starting from an initial hypothesis H₀ generated by prompting the LLM on bottom‑clause ILP‑style examples, the system repeatedly: (a) executes the current hypothesis to obtain a trace, (b) localises bugs via the oracle, (c) generates a refined hypothesis Hₜ₊₁, and (d) checks consistency against all training examples using a Prolog validator. A history buffer retains the top‑k most promising hypotheses, and the loop terminates when a hypothesis passes all examples or a time budget is exhausted.
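The loop above can be sketched in a few lines of Python. This is an illustrative skeleton under stated assumptions, not the authors' implementation: the names `Trace`, `suspicious_nodes`, `refine`, and the `run`/`repair`/`validate` callbacks are hypothetical stand-ins for the meta-interpreter, the LLM oracle/repairer, and the Prolog validator, and the simple failed-node heuristic replaces the LLM's explained bug localisation.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Trace:
    """Tree-structured execution trace, mirroring tree(Goal, SubTraces)."""
    goal: str
    succeeded: bool
    subtraces: List["Trace"] = field(default_factory=list)

def suspicious_nodes(trace: Trace) -> List[Trace]:
    """Collect failed nodes bottom-up: a crude stand-in for the LLM oracle,
    which in ABPR also explains *why* a node conflicts with the spec."""
    found: List[Trace] = []
    for sub in trace.subtraces:
        found.extend(suspicious_nodes(sub))
    if not trace.succeeded:
        found.append(trace)
    return found

def refine(h0, run, repair, validate, examples, max_iters=10, k=3):
    """ABPR-style loop: (a) execute -> (b) localise -> (c) repair -> (d) validate.
    `run(h, ex)` yields a Trace; `repair(h, n_star)` proposes a new hypothesis;
    `validate(h, ex)` is the per-example consistency check."""
    history = [h0]  # top-k buffer of the most promising hypotheses
    for _ in range(max_iters):
        h = history[0]
        if all(validate(h, ex) for ex in examples):
            return h  # consistent with every training example
        trace = run(h, examples[0])          # (a) materialise the trace
        n_star = suspicious_nodes(trace)     # (b) bug candidate set N*
        candidate = repair(h, n_star)        # (c) abductive modification
        history = sorted(history + [candidate],
                         key=lambda c: -sum(validate(c, ex) for ex in examples))[:k]
    return None  # time budget exhausted
```

A toy instantiation (e.g., a hypothesis that is a single numeric parameter, repaired by incrementing) converges in a couple of iterations; in ABPR the same skeleton is driven by an LLM sampling repairs over Prolog clauses.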
Experimental setting
The authors evaluate ABPR on ARC‑AGI‑2, a benchmark that requires synthesising a hidden program that maps input grids to output grids from only a few examples. The task demands abstraction, systematic generalisation, and algorithmic reasoning, and it provides no intermediate supervision—making it a stringent test for self‑correction mechanisms. Prolog is chosen as the target language because its declarative semantics align naturally with APD’s logical view of execution.
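To make the setting concrete, an ARC-style task supplies a few input→output grid pairs and asks the solver to induce the hidden mapping; a hypothesis is accepted only if it reproduces every training pair exactly, which is the all-examples consistency check the Prolog validator performs in ABPR. The grids and rule below are invented for illustration and are not drawn from ARC-AGI-2.

```python
# A toy ARC-style task: two training pairs whose hidden rule is "transpose".
train = [
    ([[0, 1], [2, 0]], [[0, 2], [1, 0]]),
    ([[3, 0, 0], [0, 0, 4]], [[3, 0], [0, 0], [0, 4]]),
]

def candidate(grid):
    """A candidate program: transpose the grid."""
    return [list(row) for row in zip(*grid)]

# Exact-match validation against every training example.
assert all(candidate(inp) == out for inp, out in train)
```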
Results
When paired with Gemini‑3‑Flash, ABPR achieves a Pass@2 score of 56.67%, substantially outperforming existing LLM‑only baselines (≈31%). An ablation study shows that removing the declarative trace drops performance back to baseline levels, confirming that the trace is the primary driver of improvement. The authors also report that the stochastic nature of LLM sampling does not destabilise the loop; instead, it enables exploration of high‑likelihood regions of the refinement space while the trace keeps the search focused.
Analysis and implications
The work demonstrates that auditability—the ability to trace a correction back to a concrete logical fault—is achievable when LLMs are coupled with formal symbolic artifacts. By converting the “plausible reasoning” of LLMs into a series of low‑entropy, verifiable sub‑problems, ABPR mitigates the circularity that plagues purely conversational self‑repair. Moreover, the approach is language‑agnostic in principle: any language that can be interpreted declaratively (e.g., Datalog, Answer Set Programming) could be equipped with a suitable meta‑interpreter to generate analogous traces.
Limitations and future directions
The current system relies on a single LLM both for bug localisation and for proposing fixes; errors in the oracle can propagate, limiting robustness. The authors suggest integrating confidence estimation, ensemble LLMs, or hybridising with traditional theorem provers to improve reliability. Extending the meta‑interpreter to richer language features (e.g., constraints, higher‑order predicates) and evaluating on other symbolic benchmarks are natural next steps.
Conclusion
ABPR bridges a decades‑old formal debugging theory with state‑of‑the‑art neural generation, delivering a concrete, auditable, and empirically effective method for procedural program refinement. The results on ARC‑AGI‑2 illustrate that even in domains where LLMs traditionally underperform (declarative logic programming), a neuro‑symbolic loop grounded in APD can achieve competitive performance, opening a promising pathway for future LLM‑augmented software engineering tools.