Towards Real-World Industrial-Scale Verification: LLM-Driven Theorem Proving on seL4

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Formal methods (FM) are reliable but costly to apply, often requiring years of expert effort in industrial-scale projects such as seL4, especially for theorem proving. Recent advances in large language models (LLMs) have made automated theorem proving increasingly feasible. However, most prior work focuses on mathematics-oriented benchmarks such as miniF2F, with limited evaluation on real-world verification projects. The few studies that consider industrial-scale verification mostly rely on closed-source models with hundreds of billions of parameters, which cannot be locally deployed and incur substantial usage costs. In this paper, we propose AutoReal, an LLM-driven theorem proving method for real-world industrial-scale systems with support for lightweight local deployment. We evaluate AutoReal on the seL4-Isabelle verification project as a representative and challenging case study. AutoReal incorporates two key improvements: (1) chain-of-thought (CoT)-based proof training, which teaches the LLM the reasoning behind proof steps and enables step-wise explanations alongside proofs, and (2) context augmentation, which leverages proof context from the project to enhance LLM-driven proving. Based on the AutoReal methodology, we fine-tune a base model to obtain AutoReal-Prover, a compact 7B-scale prover for industrial-scale theorem proving. AutoReal-Prover achieves a 51.67% proof success rate on 660 theorems from seL4-designated Important Theories across all 10 seL4 proof categories, substantially outperforming prior attempts on seL4 (27.06%). To evaluate generalization, we further apply AutoReal-Prover to three security-related projects from the Archive of Formal Proofs (AFP), covering all 451 theorems and achieving a proof success rate of 53.88%. Overall, this work advances the application of LLM-driven theorem proving in real-world industrial-scale verification.


💡 Research Summary

The paper introduces AutoReal, a novel framework that leverages large language models (LLMs) to automate theorem proving in industrial‑scale formal verification, using the seL4 microkernel verification project as a primary case study. The authors identify two major obstacles that have limited the practical adoption of LLM‑driven theorem provers: (1) existing research focuses on mathematics‑centric benchmarks such as miniF2F, which do not reflect the complexity of real‑world verification tasks that involve long proof chains, rich auxiliary lemmas, and evolving proof contexts; (2) most successful prior attempts rely on closed‑source, hundred‑billion‑parameter models (e.g., GPT‑4) that cannot be deployed locally and incur high API costs, making them unsuitable for industrial environments that demand lightweight, on‑premise solutions.

To overcome these challenges, AutoReal incorporates two key methodological innovations. First, a chain‑of‑thought (CoT) based proof training pipeline is introduced. The authors extract proof traces from the seL4 Isabelle development (via the FVELER tool) and construct a step‑level CoT dataset containing roughly 200 k instances. Each instance pairs an Isabelle proof command with a natural‑language explanation of its effect on the proof state, together with the pre‑ and post‑step proof states and the surrounding lemma. This fine‑grained data teaches the LLM not only which commands to issue but also why each command advances the proof, enabling the model to generate step‑aligned rationales alongside the proof script. Such explanations improve transparency and facilitate human verification and debugging.
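The shape of such a step-level CoT instance can be sketched as follows. This is an illustrative assumption about the data layout; the field names, the `to_prompt` rendering, and the example lemma are not taken from the released dataset:

```python
from dataclasses import dataclass

# Hypothetical shape of one step-level CoT training instance; field names
# are illustrative, not taken from the AutoReal release.
@dataclass
class CoTStep:
    lemma_statement: str   # the enclosing lemma being proved
    pre_state: str         # Isabelle proof state before the step
    command: str           # the proof command issued at this step
    explanation: str       # natural-language rationale for the command
    post_state: str        # proof state after the step

example = CoTStep(
    lemma_statement='lemma add_0_right: "n + (0::nat) = n"',
    pre_state="goal (1 subgoal):\n 1. n + 0 = n",
    command="by (induct n) auto",
    explanation="Induct on n; base and step cases both close with auto.",
    post_state="No subgoals!",
)

def to_prompt(step: CoTStep) -> str:
    """Render one instance as a single supervised training string."""
    prompt = f"Lemma:\n{step.lemma_statement}\n\nState:\n{step.pre_state}\n"
    target = f"Thought: {step.explanation}\nStep: {step.command}"
    return prompt + "\n" + target

print(to_prompt(example))
```

Pairing each command with its rationale and surrounding states is what lets the fine-tuned model emit step-aligned explanations rather than bare proof scripts.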

Second, AutoReal augments the LLM’s input with project‑specific proof context. When attempting to prove a target theorem, the system automatically gathers all relevant auxiliary lemmas, definitions, and previously proved theorems that constitute the “proof context” for that theorem. By feeding this context into the prompt, the model can reason under the same assumptions a human verifier would use, thereby handling the highly inter‑dependent nature of industrial proofs.
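A minimal sketch of this context-gathering step is shown below, assuming a simple token-overlap heuristic for selecting relevant facts; the paper does not specify its retrieval mechanism, so both the heuristic and all names here are assumptions:

```python
# Illustrative context augmentation: rank available auxiliary facts by
# token overlap with the target theorem and prepend the best ones to the
# prover prompt. The overlap heuristic is an assumption, not the paper's
# actual selection algorithm.
def gather_context(theorem: str, available_facts: dict[str, str],
                   limit: int = 8) -> list[str]:
    tokens = set(theorem.replace("(", " ").replace(")", " ").split())
    scored = []
    for name, statement in available_facts.items():
        overlap = len(tokens & set(statement.split()))
        if overlap:
            scored.append((overlap, name, statement))
    scored.sort(reverse=True)  # most-overlapping facts first
    return [f"{name}: {stmt}" for _, name, stmt in scored[:limit]]

def build_prompt(theorem: str, facts: dict[str, str]) -> str:
    context = "\n".join(gather_context(theorem, facts))
    return (f"Relevant facts:\n{context}\n\n"
            f"Prove the theorem:\n{theorem}\n")

facts = {
    "word_size_def": "word_size = 32",
    "valid_cap_def": "valid_cap c s = ...",
}
print(build_prompt("lemma: valid_cap c s ==> wellformed_cap c", facts))
```

The key point is architectural: the model sees the same auxiliary lemmas and definitions a human verifier would have in scope, which is what makes highly inter-dependent industrial proofs tractable.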

The concrete model, AutoReal‑Prover, is obtained by fine‑tuning the open‑source 7B‑parameter Qwen2.5‑Coder model on the CoT dataset. This choice deliberately balances performance with deployability: a 7B model can run on a single modern GPU, allowing local, cost‑effective deployment in secure industrial settings. The model is released as open source, together with the CoT dataset, to encourage reproducibility and further research.
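A back-of-envelope estimate makes the single-GPU claim concrete. The bytes-per-parameter figures and the overhead fraction below are rough standard assumptions, not numbers from the paper:

```python
# Rough VRAM estimate for serving a 7B-parameter model: weights plus a
# fixed fraction for KV cache and activations. The figures are standard
# back-of-envelope assumptions, not measurements from the paper.
def vram_gib(params_billion: float, bytes_per_param: float,
             overhead_frac: float = 0.2) -> float:
    weights_bytes = params_billion * 1e9 * bytes_per_param
    return weights_bytes * (1 + overhead_frac) / 2**30

for label, bpp in [("fp16/bf16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    print(f"{label:>9}: ~{vram_gib(7.0, bpp):.1f} GiB")
```

At fp16 the weights alone are about 14 GB, so a 7B model fits comfortably on a single 24 GB GPU, and quantized variants fit on far smaller cards, which is the deployability argument the authors make.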

Evaluation is performed on two fronts. On seL4, the authors select 660 theorems from the “Important Theories” collection, covering all ten proof categories (e.g., access control, assembly refinement). AutoReal‑Prover achieves a 51.67 % proof success rate, nearly double the previously reported best result (27.06 %) obtained by Selene using GPT‑4. Success is defined strictly: the generated Isabelle script must be accepted without errors, solve all subgoals, and contain no placeholder commands such as “sorry” or “oops”. The system also produces natural‑language explanations for each proof step, which were shown to aid human inspection (see Appendix Fig. 10).
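The strict success criterion described above can be sketched as a simple predicate; the function and constant names here are illustrative, and in practice the acceptance signal would come from running the Isabelle checker:

```python
import re

# Sketch of the strict success criterion: a generated Isabelle script
# counts as a success only if the checker accepted it, no subgoals
# remain, and it contains no placeholder commands.
PLACEHOLDERS = ("sorry", "oops")

def is_placeholder_free(script: str) -> bool:
    # Match whole words so identifiers like "sorry_lemma" are not flagged.
    return not any(re.search(rf"\b{p}\b", script) for p in PLACEHOLDERS)

def is_success(script: str, checker_accepted: bool,
               remaining_subgoals: int) -> bool:
    return (checker_accepted
            and remaining_subgoals == 0
            and is_placeholder_free(script))

print(is_success('lemma x: "True" by simp', True, 0))  # expect True
print(is_success('lemma x: "P" sorry', True, 0))       # expect False
```

Ruling out "sorry" and "oops" matters because Isabelle accepts scripts containing these commands even though they leave the goal unproved, so a naive "no errors" check would overcount successes.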

To assess generalization, the authors apply the same model to three security‑focused AFP projects (CRYSTALS‑Kyber_Security, RSA‑PSS, Elliptic_Curves_Group_Law), covering all 451 theorems in those libraries. AutoReal‑Prover attains a 53.88 % success rate, demonstrating that the CoT‑trained, context‑augmented approach transfers beyond seL4 to other domains.

The contributions of the work are fourfold: (1) a practical LLM‑driven theorem proving method tailored for industrial verification, integrating CoT‑based proof training and context augmentation; (2) a lightweight 7B model that can be deployed locally, addressing cost and security constraints of real‑world projects; (3) a substantial empirical improvement on seL4 and AFP benchmarks, together with step‑wise explanations that enhance transparency; (4) the release of both the fine‑tuned model and the step‑level CoT dataset, providing a foundation for future research on LLM‑assisted formal methods.

The authors discuss several avenues for future work: scaling to larger models or multimodal representations (e.g., graph‑structured proof artifacts), incorporating automated failure analysis and repair loops, extending the approach to other proof assistants such as Coq or Lean, and designing richer human‑LLM collaborative interfaces. Overall, AutoReal demonstrates that LLM‑driven automated theorem proving can move from academic benchmarks to real‑world, high‑assurance verification tasks, offering a promising path toward reducing the massive human effort traditionally required for industrial formal verification.

