Enhancing Mathematical Problem Solving in LLMs through Execution-Driven Reasoning Augmentation
Mathematical problem solving is a fundamental benchmark for assessing the reasoning capabilities of artificial intelligence and a gateway to applications in education, science, and engineering where reliable symbolic reasoning is essential. Although recent advances in multi-agent LLM-based systems have enhanced mathematical reasoning, these systems still lack a reliably revisable representation of the reasoning process. Existing agents either operate in rigid sequential pipelines that cannot correct earlier steps or rely on heuristic self-evaluation that can fail to identify and fix errors. In addition, programmatic context can distract language models and degrade accuracy. To address these gaps, we introduce Iteratively Improved Program Construction (IIPC), a reasoning method that iteratively refines programmatic reasoning chains and combines execution feedback with the native Chain-of-Thought abilities of the base LLM to maintain high-level contextual focus. IIPC surpasses competing approaches on the majority of reasoning benchmarks across multiple base LLMs. All code and implementations are released as open source.
💡 Research Summary
The paper introduces Iteratively Improved Program Construction (IIPC), a novel reasoning framework designed to boost large language models’ (LLMs) ability to solve mathematical problems. Existing multi‑agent systems such as Cumulative Reasoning (CR) and Multi‑agent Condition Mining (MACM) suffer from rigid, forward‑only pipelines that cannot revise earlier steps, leading to cascading errors. Purely code‑based agents (e.g., Program of Thoughts, PAL) rely heavily on execution results; when execution succeeds but the underlying logic is flawed, the model may still produce an incorrect answer. IIPC tackles both issues by treating a generated program as an explicit, mutable representation of the entire reasoning chain and by iteratively refining that program using deterministic execution feedback.
Core Architecture
- Initial Proposition Extraction (f_init) – From the problem statement x, the model extracts a set of key propositions s that capture all required information.
- Program Generation (f_prog) – Using x and s, the model produces an initial Python program p₁. The program is constrained to a safe subset of libraries (numpy, math, sympy, scipy, scikit‑spatial), forbids list comprehensions and recursion, and includes verbose comments and print statements for easy debugging.
- Execution (E) – The program is run in an interpreter, yielding output o₁, which may be a correct result, an error message, or both.
- Error Memory (M_t) – Whenever an error or logical inconsistency is detected, a concise descriptor m_t is stored in a persistent memory. This memory guides future revisions and prevents the same mistake from being repeated.
- Refinement Loop – Two complementary pathways are applied:
  - Error Correction (f_err) – If o_t contains an error, only the offending code segment is edited, and a new error descriptor is added to M_t.
  - Process Validation (f_val) – If execution succeeds, the program's logical consistency is checked; if inconsistencies are found, a refinement function (f_ref) modifies the program.

  The loop allows up to two process validations and two error corrections per problem, storing each successful program in an array P_t. The most recent working program p* is later used for answer generation.
- Parallel Chain‑of‑Thought (f_cot) – Independently of the code, the model generates a pure‑text CoT reasoning trace c. This preserves high‑level linguistic reasoning that is not polluted by potentially noisy program outputs.
- Final Integration (f_comb) – The latest program p*, its execution result o*, and the CoT c are concatenated via a structured prompt and fed back to the LLM, which produces the final answer y. This step ensures that both symbolic evidence (from the program) and natural‑language reasoning contribute to the decision, reducing over‑reliance on any single modality.
Experimental Setup
The authors evaluate IIPC on two large math benchmarks:
- MATH – A diverse set of problems spanning five difficulty levels and multiple topics (1483 problems selected for balanced coverage).
- AIME – 933 competition‑level problems (1983‑2024) that require creative, high‑level reasoning.
Five state‑of‑the‑art LLMs are tested: GPT‑4o mini, Gemini 2.0 Flash, Mistral Small 3.2 24B, Gemma 3 27B, and Llama 4 Maverick. IIPC’s performance is compared against CR, MACM, and PoT prompting. All agents are constrained to a single‑run setting (no voting/ensemble) to keep token usage comparable; MACM is adapted accordingly, which may disadvantage it but yields a fair head‑to‑head comparison.
Results
- Across most difficulty bins and topics, IIPC outperforms the baselines by 3–7 percentage points.
- On the high‑difficulty AIME set, the execution‑guided refinement yields the most pronounced gains, confirming that iterative program correction is especially valuable for complex reasoning.
- Ablation studies show:
- Removing the program‑refinement branch (using only CoT) drops accuracy dramatically, indicating the necessity of symbolic execution.
- Disabling the error memory leads to repeated mistakes and a ~40% increase in error recurrence.
- Raising decoding temperature increases token consumption without meaningful accuracy improvement.
The authors also report that IIPC’s dual‑branch design reduces overall token usage by roughly 12% compared with pure code‑first approaches, because the final integration step is the only point where program output and language reasoning are merged.
Open‑Source Release
All code, the full reasoning‑trace corpus (problem statements, initial propositions, generated programs, execution logs, reflections, and final answers), and evaluation scripts are released under an open‑source license. This enables reproducibility and provides a valuable benchmark for future research on program‑centric reasoning.
Conclusions and Future Directions
IIPC demonstrates that treating programs as mutable, inspectable representations of reasoning, coupled with deterministic execution feedback and a parallel natural‑language CoT stream, can overcome the two major shortcomings of prior systems: lack of a revisable global reasoning state and over‑dependence on execution signals. The framework yields consistent improvements across diverse LLM architectures and scales to challenging competition‑level math problems. Future work may explore: (1) extending the library set to cover more advanced domains (e.g., differential geometry, statistical modeling), (2) integrating other symbolic tools (computer algebra systems, graph databases) into the refinement loop, and (3) meta‑learning the error‑memory updates so that the model can generalize correction strategies across tasks. Overall, IIPC offers a practical, scalable, and transparent path toward more reliable symbolic reasoning in large language models.