Learning-Assisted Automated Reasoning with Flyspeck

The considerable mathematical knowledge encoded by the Flyspeck project is combined with external automated theorem provers (ATPs) and machine-learning premise selection methods trained on the proofs, producing an AI system capable of answering a wide range of mathematical queries automatically. The performance of this architecture is evaluated in a bootstrapping scenario emulating the development of Flyspeck from axioms to the last theorem, each time using only the previous theorems and proofs. It is shown that 39% of the 14185 theorems could be proved in a push-button mode (without any high-level advice and user interaction) in 30 seconds of real time on a fourteen-CPU workstation. The necessary work involves: (i) an implementation of sound translations of the HOL Light logic to ATP formalisms: untyped first-order, polymorphic typed first-order, and typed higher-order, (ii) export of the dependency information from HOL Light and ATP proofs for the machine learners, and (iii) choice of suitable representations and methods for learning from previous proofs, and their integration as advisors with HOL Light. This work is described and discussed here, and an initial analysis of the body of proofs that were found fully automatically is provided.


💡 Research Summary

The paper presents a complete pipeline that couples the extensive formal mathematics of the Flyspeck project with external automated theorem provers (ATPs) and machine-learning based premise selection, creating a system capable of proving a large fraction of Flyspeck theorems automatically. The authors first implement sound translations from HOL Light's higher-order logic into three ATP-compatible formalisms: untyped first-order (FOF), polymorphic typed first-order (TFF), and typed higher-order (THF). Soundness here means that any proof the ATP finds for the translated problem guarantees provability of the original HOL Light statement; the three encodings expose each problem to ATPs that excel at different fragments of logic.
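To make the first-order encoding concrete, here is a minimal sketch of the standard "apply-operator" trick used when mapping curried higher-order terms into untyped first-order syntax: every application `f x` becomes `ap(f, x)`, so partially applied functions stay representable as ordinary first-order terms. The term representation and the name `ap` are illustrative assumptions, not HOL Light's actual internals, and a real translation additionally handles lambdas, types, and formula structure.

```python
# Illustrative sketch: encode curried higher-order applications as untyped
# first-order terms via an explicit binary apply operator "ap".
# Terms are modeled as strings (constants/variables) or pairs (fun, arg).

def to_fof(term):
    """Render a curried HO term tree as an untyped FO term using 'ap'."""
    if isinstance(term, str):      # constant or variable: emit as-is
        return term
    fun, arg = term                # application node: (function, argument)
    return f"ap({to_fof(fun)},{to_fof(arg)})"

# (f x) y becomes ap(ap(f,x),y)
print(to_fof((("f", "x"), "y")))
```

This keeps the translation sound for the applicative fragment: distinct partial applications remain distinct first-order terms, at the cost of longer terms for the ATP to process.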

Next, they extract detailed dependency information from HOL Light proofs, recording which earlier theorems, definitions, and axioms each proof uses. This metadata, together with ATP-generated proof traces, forms the training data for premise-selection models. Several learning algorithms are evaluated, including distance-weighted k-nearest neighbours and sparse naive Bayes; k-NN proves particularly effective because it copes well with the sparse, high-dimensional feature vectors derived from symbol occurrences, type signatures, and structural term patterns.
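The premise-selection idea can be sketched as a similarity-weighted nearest-neighbour vote over symbol features: a new conjecture inherits the dependencies of the previously proved theorems it most resembles. The feature weighting (IDF) and voting scheme below are simplified assumptions for illustration, not the paper's exact implementation.

```python
import math
from collections import Counter

def idf_weights(feature_sets):
    """Weight rare symbols higher, as they are more discriminative."""
    n = len(feature_sets)
    df = Counter(f for fs in feature_sets for f in set(fs))
    return {f: math.log(n / df[f]) for f in df}

def similarity(a, b, w):
    """Weighted overlap of two symbol-feature sets."""
    return sum(w.get(f, 0.0) for f in a & b)

def select_premises(conjecture_feats, theorems, k=2, top_n=3):
    """theorems: list of (name, feature_set, dependency_list).
    Returns the top_n premises voted for by the k nearest neighbours."""
    w = idf_weights([fs for _, fs, _ in theorems])
    neighbours = sorted(theorems,
                        key=lambda t: similarity(conjecture_feats, t[1], w),
                        reverse=True)[:k]
    votes = Counter()
    for name, fs, deps in neighbours:
        s = similarity(conjecture_feats, fs, w)
        for d in deps + [name]:    # a neighbour itself is also a candidate
            votes[d] += s
    return [p for p, _ in votes.most_common(top_n)]
```

For example, a conjecture mentioning `+` would rank the dependencies of earlier addition lemmas above those of unrelated multiplication lemmas, because the shared rare symbol carries a high IDF weight.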

The trained models act as advisors inside HOL Light: given a new conjecture, the advisor predicts a ranked list of the most relevant previously proved statements. The top‑N premises are then supplied to the ATP, which is given a strict 30‑second wall‑clock limit on a 14‑CPU workstation. The system operates in a “push‑button” mode, requiring no human‑provided hints or strategy tuning.
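The advisor-to-ATP handoff described above might look roughly like the following sketch: the ranked premises and the conjecture are assembled into a TPTP FOF problem and piped to an external prover under a wall-clock limit. The prover binary name (`eprover`), its flags, and the formula strings are assumptions for illustration; real premises would first pass through the translation layer.

```python
import subprocess

def tptp_problem(conjecture, premises):
    """Assemble premises and a conjecture into a TPTP FOF problem string."""
    lines = [f"fof(prem{i},axiom,{p})." for i, p in enumerate(premises)]
    lines.append(f"fof(goal,conjecture,{conjecture}).")
    return "\n".join(lines)

def try_prove(conjecture, premises, timeout=30):
    """Run an external ATP (here assumed to be E) on the problem."""
    problem = tptp_problem(conjecture, premises)
    try:
        out = subprocess.run(
            ["eprover", "--auto", f"--cpu-limit={timeout}"],
            input=problem, text=True, capture_output=True,
            timeout=timeout + 5)       # small grace period over the CPU limit
        return "SZS status Theorem" in out.stdout
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
```

Because only the top-N ranked premises are included, the ATP's search space stays small enough for the 30-second budget to be meaningful.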

To evaluate the architecture, the authors simulate the historical development of Flyspeck in a bootstrapping scenario. Starting only from axioms and definitions, each newly proved theorem is added to the knowledge base and becomes available for subsequent premise selection. Over the entire corpus of 14,185 theorems, the system automatically proves 5,532 (≈39%) within the 30-second budget. Successful proofs are concentrated in algebraic and geometric domains where the relevant premises are few and highly interconnected; failures are mostly due to insufficient premise relevance, explosion of the ATP search space, or loss of information during the higher-order to first-order translation.
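The bootstrapping protocol can be rendered as a toy loop: theorems are replayed in their original order, and each proof attempt may see only statements that precede it in the development. The `attempt` callback here is a stub standing in for the full premise-selection + ATP pipeline; every theorem joins the knowledge base afterwards regardless of the attempt's outcome, since its human-written proof exists in Flyspeck.

```python
def bootstrap(theorems, attempt):
    """Replay theorems in order; attempt(name, known) -> bool.

    'known' is the list of all earlier theorems (the knowledge base),
    mirroring the paper's rule that only previous theorems and proofs
    may be used when attempting each new statement.
    """
    known, proved = [], []
    for name in theorems:
        if attempt(name, list(known)):   # only earlier facts are visible
            proved.append(name)
        known.append(name)               # the human proof is available next
    return proved
```

The success rate reported in the paper is then simply `len(proved) / len(theorems)` over the full 14,185-theorem replay.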

An in-depth analysis of the automatically proved theorems shows common patterns: they tend to rely on a small, tightly linked set of premises, and many share identical premise subsets, confirming the effectiveness of the learned relevance model. The paper also discusses limitations of the current approach, such as the handling of deeply nested higher-order constructs, and outlines future work: richer feature engineering, ensemble learning for premise selection, tighter integration of parallel ATP strategies, and improved higher-order encodings.

Overall, this work demonstrates that large‑scale formal mathematics can be substantially automated by combining sound logical translations, data‑driven premise selection, and state‑of‑the‑art ATPs. It provides the first empirical evidence that a fully automated “push‑button” system can meaningfully assist in the development of a major formalization effort, opening the door to broader AI‑driven exploration of mathematical knowledge bases.