M2F: Automated Formalization of Mathematical Literature at Scale

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Automated formalization of mathematics enables mechanical verification but remains limited to isolated theorems and short snippets. Scaling to textbooks and research papers is largely unaddressed, as it requires managing cross-file dependencies, resolving imports, and ensuring that entire projects compile end-to-end. We present M2F (Math-to-Formal), the first agentic framework for end-to-end, project-scale autoformalization in Lean. The framework operates in two stages. The statement compilation stage splits the document into atomic blocks, orders them via inferred dependencies, and repairs declaration skeletons until the project compiles, allowing placeholders in proofs. The proof repair stage closes these holes under fixed signatures using goal-conditioned local edits. Throughout both stages, M2F keeps the verifier in the loop, committing edits only when toolchain feedback confirms improvement. In approximately three weeks, M2F converts long-form mathematical sources into a project-scale Lean library of 153,853 lines from 479 pages of textbooks on real analysis and convex analysis, fully formalized as Lean declarations with accompanying proofs. This represents textbook-scale formalization at a pace that would typically require months or years of expert effort. On FATE-H, we achieve 96% proof success (vs. 80% for a strong baseline). Together, these results demonstrate that practical, large-scale automated formalization of mathematical literature is within reach. The full generated Lean code from our runs is available at https://github.com/optsuite/ReasBook.git.


💡 Research Summary

M2F (Math‑to‑Formal) tackles the long‑standing challenge of automatically formalizing entire textbooks and research papers in the Lean proof assistant. Existing work either focuses on proving already‑formalized goals (neural theorem proving) or translating short informal snippets into Lean code, but neither addresses the “project‑scale compilation” problem: ensuring that a large collection of files with imports, namespaces, and dependencies can be type‑checked as a single coherent Lean project.

The authors formulate this task as knowledge compilation under a fixed Lean environment E (a specific Lean version and pinned mathlib revision). The input is a LaTeX document (or a set of sections) that is first normalized into an ordered sequence of JSON items. Each item is linked back to its original character spans, enabling provenance tracking. The output is a Lean project P together with a map from each generated declaration to the source spans.
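To make the normalization concrete, a single item might carry fields like the following. This is an illustrative sketch only: the field names and values are our own assumptions for exposition, not the paper's actual JSON schema.

```python
# Illustrative normalized item (all field names are assumptions for
# exposition, not the paper's actual JSON schema).
item = {
    "id": "thm-3.2",
    "kind": "theorem",                    # e.g. definition / lemma / theorem
    "latex": r"\begin{theorem} ... \end{theorem}",
    # Provenance: character span in the original LaTeX source.
    "source_span": {"file": "chapter3.tex", "start": 10214, "end": 10642},
    "depends_on": ["def-3.1"],            # inferred dependency edges
}
```

The `source_span` field is what supports provenance tracking: every generated Lean declaration can be traced back to the characters of the LaTeX document it came from.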

M2F’s pipeline consists of two stages, both driven by a verifier‑certified refinement primitive called VeriRefine. VeriRefine proposes a bounded edit (a “patch”) to a file, re‑runs the Lean verifier on that file, and accepts the edit only if the verifier reports a strict improvement according to a lexicographic objective. In Stage 1 (statement compilation) the system splits the document into atomic blocks, asks a large language model (LLM) to generate declaration skeletons (which may contain sorry placeholders), and then iteratively repairs compilation errors. Errors are localized, and a RepairPatch operator produces targeted fixes (e.g., adding missing imports, correcting namespace mismatches). The acceptance criterion is based solely on the reduction of error‑level diagnostics; warnings are ignored. This loop continues until the whole project type‑checks, albeit with sorrys still present in proof bodies.
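For example, a Stage 1 declaration skeleton compiles with its proof body left as a placeholder. The statement below is a toy illustration of this pattern, not taken from the generated library:

```lean
import Mathlib

/-- Toy skeleton: the signature is fixed by Stage 1; the proof is a hole. -/
theorem sq_mono_of_nonneg {a b : ℝ} (ha : 0 ≤ a) (hab : a ≤ b) :
    a ^ 2 ≤ b ^ 2 := by
  sorry
```

Because `sorry` is accepted by the Lean elaborator (with a warning, not an error), an entire project of such skeletons can type-check before any proof work begins, which is exactly what Stage 1's error-only acceptance criterion exploits.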

Stage 2 (proof repair) freezes the signatures of all declarations and focuses on eliminating the remaining sorrys. For each proof hole the system extracts the goal type and local context via a GoalState query, then runs a goal‑conditioned proof planner (essentially an automated prover) to propose candidate proof scripts. If a candidate fails, an error‑fixing operator is invoked, and the loop repeats. The secondary objective in this stage is the count of sorrys outside comments; a patch is accepted only if this count strictly decreases while preserving zero compilation errors.
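The accept/revert discipline shared by both stages can be sketched as follows. This is a minimal reconstruction under our own assumptions (a stubbed verifier and patch proposer), not the paper's implementation; the lexicographic key mirrors the described objective, with compilation errors ranked before remaining `sorry`s.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Diagnostics:
    errors: int    # compiler error count (primary objective)
    sorries: int   # `sorry` placeholders outside comments (secondary)

    def key(self):
        # Lexicographic objective: eliminate errors first, then sorries.
        return (self.errors, self.sorries)

def veri_refine(source: str, propose_patch, verify, max_rounds: int = 8) -> str:
    """VeriRefine-style loop (sketch): accept a proposed patch only if the
    verifier reports strict lexicographic improvement; otherwise revert."""
    best = verify(source)
    for _ in range(max_rounds):
        candidate = propose_patch(source)
        diags = verify(candidate)
        if diags.key() < best.key():        # strict improvement required
            source, best = candidate, diags
        if best.errors == 0 and best.sorries == 0:
            break
    return source
```

In the real system the `verify` step is a call to the Lean toolchain on the edited file, and `propose_patch` is the LLM-driven RepairPatch or proof-planner operator; the revert-on-no-improvement rule is what guarantees monotonic progress.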

A key methodological contribution is the verifier‑normalized compute budget: the authors count the number of VerifyFile calls rather than wall‑clock time, providing a hardware‑independent measure of computational effort.
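Such a budget can be implemented as a thin counting wrapper around the verifier. The sketch below is our own illustration; the `VerifyFile` interface it assumes is not taken from the paper's code.

```python
class BudgetedVerifier:
    """Counts verifier invocations as the unit of compute budget.

    A hypothetical wrapper: counting VerifyFile calls rather than
    wall-clock time yields a hardware-independent measure of effort.
    """

    def __init__(self, verify_file):
        self._verify_file = verify_file  # underlying per-file Lean check
        self.calls = 0                   # budget consumed so far

    def __call__(self, path):
        self.calls += 1
        return self._verify_file(path)

# Usage: wrap any verifier and read `calls` to report budget.
verifier = BudgetedVerifier(lambda path: {"errors": 0})
verifier("Chapter1.lean")
verifier("Chapter2.lean")
```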

The experimental evaluation is extensive. The authors processed 479 pages of textbook material (312 pages of real analysis, 140 pages of convex analysis) and 27 pages of research exposition, producing a Lean library of 153,853 lines spread across 241 files and 4,116 declarations—all within roughly three weeks of compute time. Stage 1 succeeded in making the entire project compile with placeholders; Stage 2 eliminated the placeholders on the compiled project, achieving a Proof Success Rate (PSR) of 100% under matched‑statement conditions. On the public benchmark FATE‑H, which supplies fully formalized Lean statements, M2F’s proof repair achieved 96% PSR, outperforming the strong baseline Seed‑Prover 1.5 (80%).

The paper’s contributions are threefold: (1) formalizing large‑scale auto‑formalization as a knowledge‑compilation problem with provenance; (2) introducing VeriRefine, a verifier‑in‑the‑loop accept/revert primitive that guarantees monotonic progress; and (3) demonstrating that the proof‑repair stage of M2F serves as a state‑of‑the‑art automated prover under matched statements.

Limitations include dependence on Lean 4 (the approach is not immediately portable to other proof assistants), reliance on LLM‑generated skeletons whose quality can still lag behind expert authors, and occasional difficulty with highly intricate definitions. Future work is outlined as extending the framework to multiple proof assistants, improving the LLM‑to‑Lean translation pipeline, integrating human‑in‑the‑loop feedback for higher fidelity, and automating the incorporation of generated libraries into existing mathlib ecosystems.

Overall, M2F demonstrates that end‑to‑end, textbook‑scale formalization is feasible with current AI and verification technology, provided that the system continuously consults the proof assistant’s own diagnostics to guide and validate every incremental change. This work paves the way for large‑scale, machine‑generated formal mathematics libraries that could accelerate both formal verification research and AI‑driven mathematical discovery.

