Figure 1: Overview of the Hermes framework. Hermes is a Lean4-driven, multi-modular reasoning agent integrating LLM reasoning with formal verification for reliable mathematical problem solving. It comprises four modules: an LLM that generates reasoning steps, a translator that formalizes these steps into Lean code, a prover that symbolically verifies their correctness, and a feedback module that returns verification signals for subsequent reasoning. This design enables iterative reasoning with improved correctness and efficiency.
In recent years, Large Language Models (LLMs) have achieved remarkable proficiency in mathematical reasoning [1][2][3][4], with some systems even demonstrating the potential to solve Olympiad-level problems [5]. A key advancement driving this progress is the Chain-of-Thought (CoT) approach, which enables LLMs to plan step-by-step reasoning, decompose complex problems into sub-goals, generate intermediate reasoning steps, and iteratively assess and correct them. However, long CoTs remain susceptible to logical leaps, subtle errors, and hallucinations, stemming from incomplete domain knowledge, imprecise reasoning, or the accumulation of small mistakes over multiple steps [6][7][8]. These issues can lead to unstable reasoning and degraded performance; moreover, when uncertain, LLMs may produce overly long or repetitive reasoning traces, increasing token usage without necessarily improving correctness.
To address these limitations, researchers have proposed Process Reward Models (PRMs) and Outcome Reward Models (ORMs) [9][10], which aim to guide LLM reasoning toward correctness by scoring intermediate reasoning steps or final solutions. Specifically, PRMs assign a score to each reasoning step, rewarding local correctness, while ORMs provide a single scalar reflecting overall solution quality. While these models can improve reasoning performance, they function largely as black-box evaluators that assign numerical scores without explaining why a reasoning trajectory is valid or flawed, offering limited interpretability and no explicit verification of mathematical correctness [11]. Moreover, their training requires substantial human curation [6], and automated supervision methods introduce noise by inferring step correctness labels from final answers [12], leading to misalignment with true stepwise correctness. Ultimately, because both PRMs and ORMs rely on LLMs as their evaluative backbone, their reward signals inherit the stochasticity, biases, and instability of LLM-based judgment [13].
In parallel to these efforts, another line of work has focused on formal theorem proving, which relies on proof assistants with trusted kernels such as Lean4 [14], Coq [15], and Isabelle [16]. These systems enforce rigorous, machine-checkable formal reasoning, in contrast to the informal reasoning of traditional mathematics expressed in natural language. Recently, formal language-based systems such as AlphaProof and AlphaGeometry [17][18][19] have achieved remarkable success on International Mathematical Olympiad (IMO) problems, rivaling top human performers. Their principal strength lies in complete verifiability and strict immunity to hallucinations, as formal verification is embedded directly into the proof search process, ensuring that every inference is rigorously justified.
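To make "machine-checkable" concrete, the snippet below shows a tiny Lean4 proof; it is an illustrative sketch assuming a standard Mathlib setup, not code from any of the cited systems. The kernel accepts the theorem only because the tactic proof fully justifies it; a false claim would simply fail to compile.

```lean
import Mathlib

-- Informal claim: "for every real number x, x² + 1 is positive."
-- The Lean4 kernel certifies this statement only because the `positivity`
-- tactic constructs a complete, formally justified proof; a false or
-- hallucinated claim would fail to type-check and thus be rejected.
theorem step_is_checked (x : ℝ) : 0 < x ^ 2 + 1 := by
  positivity
```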
Inspired by the strengths of both paradigms, in this work we bridge the gap between formal and informal mathematics by combining the verifiability of formal reasoning with the flexibility and expressiveness of LLM-based informal reasoning. We introduce Hermes, a scalable, multi-modular, tool-augmented agent (abbreviated from “Hybrid AgEnt for Reasoning in Mathematics with NEuro-Symbolic Lean4 verification”), designed to integrate formal verification into the LLM reasoning process. Hermes leverages modern LLMs’ tool-calling capability to verify individual reasoning steps during inference and builds a memory block that ensures continuity of proof claims. For each critical proof step, the agent translates the natural-language statement into a Lean goal, verifies translation consistency through back-translation, and invokes a prover module to attempt a proof or counter-proof. The resulting formal signal is then fed back into the LLM to inform its next reasoning step (a simplified sketch of this verification loop is given after the contribution list below). An overview of our agentic framework is illustrated in Figure 1. We show that incorporating Hermes significantly enhances LLM accuracy across mathematical reasoning benchmarks of varying difficulty. It reduces token usage compared to traditional score-based methods and provides interpretable, step-level correctness feedback, offering transparency into how reasoning paths evolve and why certain trajectories lead to valid conclusions while others result in hallucinations. Our contributions are as follows:
• We develop the first tool-based Lean4 reasoning agent that verifies feasible intermediate proof steps during inference, providing LLMs with symbolic-engine-backed correctness signals for mathematical reasoning.
• We introduce a Lean4-powered memory block that accumulates and validates intermediate claims in context, ensuring cross-step consistency and reducing the propagation of errors in long reasoning chains.
• Comprehensive experiments evaluate the effectiveness and efficiency of Hermes against eight baseline methods across four benchmarks. Integrating Hermes consistently improves performance across all settings, yielding an average accuracy gain of 14%, and when using DeepSeek-V3.1 as the base model, it ach
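For concreteness, the following Python sketch outlines the reasoning–translation–verification–feedback loop and the Lean4-backed memory block described above. It is a minimal illustration under assumed interfaces: names such as `hermes_loop`, `llm_step`, `nl_to_lean`, `lean_to_nl`, `same_meaning`, and `lean_prove` are hypothetical placeholders, not the actual Hermes implementation.

```python
# Minimal sketch of the Hermes verification loop (hypothetical interfaces,
# injected as callables; not the actual Hermes API).

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class StepResult:
    text: str                       # informal reasoning step in natural language
    is_final_answer: bool = False   # True when the LLM emits its final answer

@dataclass
class Memory:
    """Lean4-backed memory block: verified claims carried across steps."""
    verified_claims: list[str] = field(default_factory=list)

def hermes_loop(
    problem: str,
    llm_step: Callable[[str, list[str], str], StepResult],  # reasoning LLM
    nl_to_lean: Callable[[str, list[str]], str],            # translator (NL -> Lean goal)
    lean_to_nl: Callable[[str], str],                       # back-translator (Lean -> NL)
    same_meaning: Callable[[str, str], bool],               # translation-consistency check
    lean_prove: Callable[[str], bool],                      # prover module
    max_steps: int = 20,
) -> str:
    memory, feedback = Memory(), ""
    for _ in range(max_steps):
        # 1. The LLM proposes the next step, conditioned on the problem,
        #    previously verified claims, and the latest verification feedback.
        step = llm_step(problem, memory.verified_claims, feedback)
        if step.is_final_answer:
            return step.text
        # 2. The translator formalizes the step into a Lean goal and checks
        #    translation consistency via back-translation.
        goal = nl_to_lean(step.text, memory.verified_claims)
        if not same_meaning(step.text, lean_to_nl(goal)):
            feedback = "Translation inconsistent; please restate the step."
            continue
        # 3. The prover attempts a proof (or counter-proof) of the Lean goal,
        #    and 4. the resulting formal signal guides the next reasoning step.
        if lean_prove(goal):
            memory.verified_claims.append(goal)
            feedback = f"Verified: {step.text}"
        else:
            feedback = f"Could not verify: {step.text}. Revise this step."
    return "No verified answer within the step budget."
```

Injecting the components as callables mirrors the modular design in Figure 1: the reasoning LLM, translator, prover, and feedback path can each be swapped independently.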