s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs
Authors: Balaji Rao, John Harrison, Soonho Kong, Juneyoung Lee, Carlo Lipizzi
Under review as a Workshop paper at AIPV 2026

S2N-BIGNUM-BENCH: A PRACTICAL BENCHMARK FOR EVALUATING LOW-LEVEL CODE REASONING OF LLMS

Balaji Rao* (Stevens Institute of Technology, Hoboken, NJ 07030, brao@stevens.edu)
John Harrison† (Amazon Web Services, Seattle, WA 98101, jargh@amazon.com)
Soonho Kong† (Amazon Web Services, Seattle, WA 98101, soonho@amazon.com)
Juneyoung Lee† (Amazon Web Services, Seattle, WA 98101, lebjuney@amazon.com)
Carlo Lipizzi (Stevens Institute of Technology, Hoboken, NJ 07030, clipizzi@stevens.edu)

ABSTRACT

Neurosymbolic approaches leveraging Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS to provide fast assembly routines for cryptography, and its correctness is established by formal verification. Formally verifying this library was a significant achievement for the Automated Reasoning Group. It involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition is correct. In the case of s2n-bignum, both tasks were carried out by human experts. In s2n-bignum-bench, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof-check timeout. To our knowledge, s2n-bignum-bench is the first public benchmark focused on machine-checkable proof synthesis for industrial low-level cryptographic assembly routines in HOL Light.
This benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics. The code to set up and use the benchmark is available here: s2n-bignum-bench.

1 INTRODUCTION

Formal theorem proving with Large Language Models (LLMs) and interactive theorem provers has become a central testbed for LLM reasoning, but existing benchmarks emphasize competition-style mathematical problems. Solving complex math problems requires a rigorous framework of steps and logical proofs, and success on such tasks evidences structured reasoning [1]. However, excellence on math-centric benchmarks does not automatically transfer to systems with practical engineering consequences. Therefore, the design of diverse and high-quality benchmarks is a key challenge in this research area.

To complement existing benchmarks, we propose s2n-bignum-bench, a machine-checkable benchmark distilled from the s2n-bignum cryptographic library, focusing on verified low-level code. The benchmark tests whether LLMs can synthesize machine-checkable proofs about real low-level implementations rather than only competition-style mathematics.

* Extended work done after completing internship at AWS; Corresponding author
† Work independent from role at AWS

Our contributions are fourfold. First, we package 2,284¹ proof obligations from the production-grade s2n-bignum cryptography library as isolated context–query tasks with stable per-problem identifiers and standalone artifacts. Second, we provide an end-to-end pipeline to build the benchmark, retrieve selected problems, and run fully offline evaluation using the shipped artifacts. Third, we include integrity mechanisms that detect unsound or invalid submissions, including checks for newly introduced axioms, forbidden placeholders such as CHEAT_TAC, and parser-level validation of submitted proof expressions.
We also provide a contamination-mitigation mechanism based on type-annotation obfuscation. Fourth, because the benchmark is grounded in a deployed cryptographic codebase, it measures ISA-aware, bit-precise reasoning that is closer to real verification workflows than competition mathematics.

2 BACKGROUND AND RELATED WORK

2.1 THEOREM PROVING

Interactive theorem provers (ITPs) following the LCF approach all have their derivations ultimately checked by a small, trusted kernel that produces values of type thm (the "theorem" type) [8]. Examples of ITPs include HOL Light, Lean4, Isabelle/HOL, and the Rocq Prover. HOL Light is a minimalist proof system for higher-order logic (HOL) implemented in OCaml, with a very small trusted kernel and an emphasis on clarity [6]. Its proof scripts are OCaml programs, and the proof system uses the OCaml toplevel (REPL) for interactivity. It relies on tactics to discharge goals.

LLMs have become a central part of neural theorem proving (NTP), where the model proposes proof steps while an interactive theorem prover (ITP) acts as the verifier. Unlike answer-only benchmarks (e.g., GSM8K [4], CRUXEval [5]), NTP demands structured, verifiable reasoning.

2.2 THEOREM PROVING BENCHMARKS

MiniF2F is the most widely adopted cross-system benchmark, with 488 formalized Olympiad-level mathematics problems translated across multiple proof systems including Lean4, Metamath, Isabelle, and HOL Light [22]; it was saturated in 2025 by Seed Prover [3]. PutnamBench introduces competition mathematics from the William Lowell Putnam Mathematical Competition, featuring 1,724 hand-constructed formalizations of 672 theorems across Lean4, Isabelle, and Rocq [16], with a 99.4% solve rate reported by Aleph [17].

Recent benchmarks have expanded toward verification conditions and repository-scale software verification.
In particular, NTP4VC studies theorem proving over verification conditions extracted from real systems code, while VeriSoftBench evaluates proof synthesis over repository-scale Lean verification tasks [20; 19]. miniCTX and VeriBench-FTP are designed to test the use of context as well as theorem-level, context-level, and project-level generalization across several mathematical as well as code domains [9; 2]. These works represent a paradigm shift toward realistic theorem-proving scenarios in which models must leverage extensive context from real Lean projects. Other relevant works that have moved toward verification-oriented proving and code-centered formal reasoning include miniCodeProps, CLEVER, and VERINA [11; 15; 21]. Works like SorryDB emphasize the need for dynamically updating benchmarks [10]. Our work is complementary to these efforts: we focus specifically on HOL Light proof synthesis for industrial low-level cryptographic assembly with shipped object-code artifacts and trusted ISA semantics.

3 MOTIVATION

Recent work has extended neural theorem-proving evaluation beyond competition mathematics toward software verification and repository-scale proof synthesis. We aim to supplement these works by targeting an underrepresented ecosystem: mechanized proofs about industrial low-level cryptographic assembly in HOL Light. While some benchmarks have been effective at evaluating reasoning models, they do not test whether models can construct machine-checkable proofs about real implementations. The s2n-bignum proofs require a form of reasoning that is qualitatively distinct from both abstract mathematics and higher-level verification-condition proving. Each proof must

¹ As of this submission; extracted from s2n-bignum V1.0, pinned to commit 9912d17... at s2n-bignum
show that starting from a precondition on registers and memory, a specific sequence of decoded ARM or x86 instructions produces a final state satisfying a mathematical postcondition. Proving this involves decomposing the program at specific program-counter offsets, symbolically executing each segment by rewriting through ISA-specific decode and execute semantics, and simplifying the resulting symbolic state terms at each step. Math-centric proving does not generally involve architectural state, aliasing, or endianness, and competence in abstract mathematics does not, by itself, establish capability for low-level code reasoning.

The correctness of cryptographic libraries and systems code has immediate security and reliability consequences; this style of low-level implementation reasoning remains underrepresented in theorem-proving benchmarks. The s2n-bignum library contains hand-tuned big-integer assembly subroutines (x86/ARM) accompanied by HOL Light proofs that the object code meets functional correctness specifications under a trusted ISA model. Building a benchmark from this corpus lets us evaluate an NTP system's ability to construct a proof that real assembly satisfies its specification.
Benchmark                | Formal system | Primary focus                            | Problems (#) | Task setting
miniF2F [22]             | Multiple      | Olympiad mathematics (formal)            | 488          | Formal theorem proving
PutnamBench [16]         | Multiple      | Competition mathematics (formal)         | 672          | Formal theorem proving
NTP4VC [20]              | Multiple      | Verification conditions from real code   | 600          | Real-world VC proving
VeriBench-FTP [2]        | Lean4         | Code-verification artifacts              | 857          | Proofs from verification artifacts
miniCTX/miniCTX-v2 [9]   | Lean4         | Context-dependent proving                | 762          | Context / project generalization
VeriSoftBench [19]       | Lean4         | Repository-scale software verification   | 500          | Repository-scale verification proving
s2n-bignum-bench         | HOL Light     | Verified cryptographic assembly programs | 2,284        | Proof synthesis over machine code

Table 1: s2n-bignum-bench relative to representative theorem-proving and verification benchmarks

4 S2N-BIGNUM-BENCH CONSTRUCTION

Our problems are derived from the open-source s2n-bignum repository, an AWS cryptographic library. Each problem in s2n-bignum-bench is a HOL Light context–query task. We inline the relevant OCaml modules, locate top-level theorem bindings of the form let THM = prove(goal, proof), and extract the goal as the query. The accompanying context is a self-contained OCaml/HOL Light setup that loads the required definitions, constants, and previously proved results needed to reproduce the original proving environment, while replacing each original proof body with the placeholder CHEAT_TAC.² With this, we isolate the task of synthesizing a new, machine-checkable proof under the same interfaces and imports as the source project.

To distinguish between different problems, we introduce the notion of a problem identifier, since the same theorem name may appear across different proof files or multiple times within the same file. A problem identifier has the form arch.filename.thm.N, for example arm.bignum_montsqr_p256.lemma1.0, where N is an occurrence index of a lemma.
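As a small illustration of the identifier scheme above, the four dot-separated fields can be parsed mechanically. The helper below is a hypothetical sketch, not part of the released tooling:

```python
# Hypothetical sketch: parsing a problem identifier of the form
# arch.filename.thm.N used by s2n-bignum-bench. Not part of the
# released tooling.
from dataclasses import dataclass

@dataclass
class ProblemId:
    arch: str      # "arm" or "x86"
    filename: str  # source proof file, e.g. "bignum_montsqr_p256"
    thm: str       # theorem/lemma name
    index: int     # occurrence index, disambiguating repeated names

def parse_problem_id(s: str) -> ProblemId:
    arch, filename, thm, n = s.split(".")
    return ProblemId(arch, filename, thm, int(n))

pid = parse_problem_id("arm.bignum_montsqr_p256.lemma1.0")
assert (pid.arch, pid.thm, pid.index) == ("arm", "lemma1", 0)
```

The occurrence index is what keeps identifiers stable when the same lemma name recurs within one proof file.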
In this work, we also include artifacts to extract selected problems into a directory with (i) setup.ml and (ii) query.txt. This allows reproducible, standalone attempts per problem. Using lightweight heuristics over theorem names and goal forms, we partition the benchmark into four categories: bit-vector lemmas (311), program-state lemmas (552), functional correctness (859, comprising 437 ARM and 422 x86 problems), and generic (562) for auxiliary facts not captured by the preceding categories.

HOL Light proofs are typically developed interactively through the OCaml REPL. Existing tooling, such as hol_server and the VSCode extension for HOL Light, can be used to provide an interactive development environment on top of the released benchmark artifacts [13].

5 EVALUATION

5.1 ANSWER SUBMISSION AND GRADING

Challengers submit a proof expression along with the name of the problem attempted. We first perform a syntax and type pre-check by compiling a generated .synchk.ml file that pastes the submitted proof expression into the benchmark context. This catches malformed tactic expressions and other immediate parser or type errors before full evaluation.

Each submitted proof attempt yields exactly one verdict per problem: OK, FAIL, CHEATING, TIMEOUT, or ERROR. Results are aggregated into a CSV file, and the primary task metric is binary success at kernel-checked proof completion. To make model comparisons meaningful, an official evaluation configuration should fix the proof-check timeout, hardware setting, and submission budget. We also provide support for user-configurable timeouts for exploratory use.³

² CHEAT_TAC is a placeholder tactic in HOL Light, analogous to sorry in Lean or Isabelle. Any submission that uses it is rejected as cheating.
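The pre-check described above amounts to wrapping the submitted expression in a scaffold whose compilation exercises the OCaml/HOL Light parser and type checker. The sketch below is an assumption about the shape of that scaffold (the file naming and template are hypothetical, not the benchmark's actual code):

```python
# Hypothetical sketch of the .synchk.ml pre-check scaffold. Binding the
# submitted expression at type `tactic` (without calling `prove`) is
# enough for parse and type errors to surface at compile time.
def make_synchk(problem_id: str, proof_expr: str) -> str:
    return (
        f"(* syntax/type pre-check for {problem_id} *)\n"
        f"let _check : tactic =\n"
        f"  {proof_expr};;\n"
    )

scaffold = make_synchk("arm.bignum_montsqr_p256.lemma1.0",
                       "REPEAT GEN_TAC THEN WORD_BITWISE_TAC")
assert "tactic" in scaffold and scaffold.endswith(";;\n")
```

Only submissions that compile under such a scaffold proceed to full proof execution.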
Auxiliary lemmas may be defined inside the submitted proof expression, provided that the overall expression evaluates to a tactic. The model is not given access to the original proof bodies or the tactics used to prove other theorems; this restriction is important for preserving the validity of the benchmark. Note, however, that the challenger may provide the LLM with the relevant machine-code context needed to understand the specification being proved.

5.2 INITIAL BASELINE EXPERIMENTS

As a preliminary baseline, we evaluate GPT-5.3-Codex [14] through codex-cli under the configuration described in Appendix A. The model achieves a binary proof-completion rate of 4.4% in medium-effort mode and 5.3% in high-effort mode over the full benchmark. We treat this as an initial baseline rather than an exhaustive estimate of current model capability.³

5.3 INTEGRITY AND CONTAMINATION DEFENSES

To mitigate contamination from memorized theorem statements, we implement an obfuscation mechanism that makes type annotations more explicit. We set the printing of HOL Light subterm types to the most verbose mode and reprint the queries. However, this works for only ≈70% of the problem set because HOL Light's printer and parser are not fully Pollack-consistent [18]. For such queries, we use their original representations without obfuscation.⁴

To ensure that a challenger did not introduce forbidden tactics like CHEAT_TAC or functions like new_axiom, our evaluation checks the output of the axioms() function in HOL Light, which returns the list of theorems axiomatized so far. If the challenger's answer did not use any forbidden functions, the result of axioms() must be identical before and after the solution; if the two differ, our benchmark script marks the result as CHEATING.
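The axiom check above reduces to comparing two snapshots of the axiom list. The following is a minimal sketch of that comparison, with the HOL Light session simulated by plain lists (the real harness reads the output of axioms() before and after running the submission):

```python
# Hypothetical sketch of the axiom-introduction check. Any growth of the
# axiom list -- e.g. via new_axiom, or via CHEAT_TAC, which axiomatizes
# its goal internally -- is flagged as cheating.
def axiom_check(axioms_before: list[str], axioms_after: list[str]) -> str:
    return "CHEATING" if axioms_after != axioms_before else "OK"

assert axiom_check([], []) == "OK"
assert axiom_check([], ["|- p \\/ ~p"]) == "CHEATING"
```

Because the comparison happens outside the submitted expression, a submission cannot hide an axiom it introduced during evaluation.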
A separate class of attacks attempts to introduce a syntactically complex solution resembling "SQL injection". To prevent this, we invoke an OCaml parser on each solution and check that it contains exactly one valid expression. Submissions that fail this check are rejected before proof evaluation.

6 CONCLUSION

We introduce s2n-bignum-bench, a benchmark for machine-checkable proof synthesis over a deployed corpus of verified low-level cryptographic assembly proofs in HOL Light. The benchmark targets a capability that remains underrepresented in current evaluation: constructing sound proofs about real low-level implementations under trusted ISA semantics. By releasing isolated problem artifacts, an offline evaluation harness, and integrity checks against unsound submissions, we aim to provide a reproducible testbed for future work on theorem proving and verification-oriented reasoning beyond competition mathematics. Although the current release focuses on functional correctness, recent HOL Light developments around s2n-bignum also formalize relational properties, including constant-time discipline and equivalence between optimized and verification-friendly routines, suggesting a natural path toward future benchmark extensions beyond extensional correctness [12; 7].

³ Detailed explanation in Appendix.
⁴ We are communicating with HOL Light's maintainers to fix this.

REFERENCES

[1] Achim, T. and V. Tenev (2023). Harmonic: Building mathematical superintelligence.
[2] Barkallah, S., S. Daruru, B. Miranda, L. Aniva, A. Nie, and S. Koyejo (2025). VeriBench-FTP: A formal theorem proving benchmark in Lean 4 for code verification. In The 5th Workshop on Mathematical Reasoning and AI at NeurIPS 2025.
[3] Chen, L., J. Gu, L. Huang, W. Huang, Z. Jiang, A. Jie, X. Jin, X. Jin, C. Li, K. Ma, C. Ren, J. Shen, W. Shi, T. Sun, H. Sun, J. Wang, S. Wang, Z. Wang, C. Wei, S. Wei, Y.
Wu, Y. Wu, Y. Xia, H. Xin, F. Yang, H. Ying, H. Yuan, Z. Yuan, T. Zhan, C. Zhang, Y. Zhang, G. Zhang, T. Zhao, J. Zhao, Y. Zhou, and T. H. Zhu (2025). Seed-Prover: Deep and broad reasoning for automated theorem proving.
[4] Cobbe, K., V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
[5] Gu, A., B. Rozière, H. Leather, A. Solar-Lezama, G. Synnaeve, and S. I. Wang (2024). CRUXEval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065.
[6] Harrison, J. (2009). HOL Light: An overview. In International Conference on Theorem Proving in Higher Order Logics, pp. 60–66. Springer.
[7] Harrison, J. (2026). Soundness of s2n-bignum formal verification.
[8] Harrison, J., J. Urban, and F. Wiedijk (2014). History of interactive theorem proving. In Handbook of the History of Logic, Volume 9, pp. 135–214. Elsevier.
[9] Hu, J., T. Zhu, and S. Welleck (2024). miniCTX: Neural theorem proving with (long-)contexts. arXiv preprint arXiv:2408.03350.
[10] Letson, A., L. Sarra, A. Poiroux, O. Dressler, P. Lezeau, D. Aranha, F. Pu, A. Hill, M. C. Hidalgo, J. Berman, et al. (2026). SorryDB: Can AI provers complete real-world Lean theorems? arXiv preprint arXiv:2603.02668.
[11] Lohn, E. and S. Welleck (2024). miniCodeProps: A minimal benchmark for proving code properties. arXiv preprint arXiv:2406.11915.
[12] Mazzucato, D., A. Mohamed, J. Lee, C. Barrett, J. Grundy, J. Harrison, and C. S. Păsăreanu (2025). Relational Hoare logic for realistically modelled machine code. In International Conference on Computer Aided Verification, pp. 389–413. Springer.
[13] monadius (2026). HOL Light extension for VS Code.
[14] OpenAI (2026). GPT-5.3-Codex system card.
[15] Thakur, A., J. Lee, G. Tsoukalas, M. Sistla, M. Zhao, S. Zetzsche, G. Durrett, Y.
Yue, and S. Chaudhuri (2025). CLEVER: A curated benchmark for formally verified code generation. arXiv preprint arXiv:2505.13938.
[16] Tsoukalas, G., J. Lee, J. Jennings, J. Xin, M. Ding, M. Jennings, A. Thakur, and S. Chaudhuri (2024). PutnamBench: Evaluating neural theorem-provers on the Putnam Mathematical Competition.
[17] Vlad Isenbaev, B. H. (2026). Logical Intelligence's Aleph solves PutnamBench.
[18] Wiedijk, F. (2012). Pollack-inconsistency. Electronic Notes in Theoretical Computer Science 285, 85–100.
[19] Xin, Y., Q. Chen, G. Durrett, and I. Dillig (2026). VeriSoftBench: Repository-scale formal verification benchmarks for Lean. arXiv preprint.
[20] Xu, Q., X. Luan, R. Wang, J. O. J. Leang, P. Wang, H. Li, W. Li, and C. Watt (2026). Neural theorem proving for verification conditions: A real-world benchmark. In The Fourteenth International Conference on Learning Representations.
[21] Ye, Z., Z. Yan, J. He, T. Kasriel, K. Yang, and D. Song (2025). VERINA: Benchmarking verifiable code generation. arXiv preprint arXiv:2505.23135.
[22] Zheng, K., J. M. Han, and S. Polu (2022). miniF2F: A cross-system benchmark for formal Olympiad-level mathematics.

A A GUIDE FOR CHALLENGERS

The official repository walks a user through setting up the benchmark, and the experimental protocol overview described here is a good place to begin exploring this problem set.

A.1 EXPERIMENTAL PROTOCOL OVERVIEW

We conducted preliminary pass@1 experiments to evaluate whether current language models can synthesize HOL Light tactic proofs for s2n-bignum-bench. The experiments follow the benchmark workflow described in the repository documentation: s2n-bignum-bench. Our primary reported baseline uses a closed-source reasoning model accessed through the Codex CLI.
The zero-shot prompt template and evaluation scripts are available in the official repository.

A.2 BENCHMARK PREPARATION AND PROBLEM RETRIEVAL

Following the repository workflow, we first build the benchmark and extract theorem metadata from the pinned HOL Light and s2n-bignum sources. This produces a corpus of 2,284 problems. Each problem consists of:

• a HOL Light boolean term stored in query.txt, and
• a corresponding setup.ml file containing the HOL Light session preamble needed to establish the proof context.

For inference, problems were (and can be) retrieved in two equivalent formats:

• as a flat CSV using retrieve-problem.py ... --csv-only, and
• as a directory tree of per-problem query.txt files under a problems directory.

A.3 PROMPTING AND INFERENCE PIPELINES

Prompt template. All runs use the same zero-shot prompt template, using the repository prompt:

    You are an expert in HOL Light. I am going to give a HOL Light boolean term.
    Please write a HOL Light proof of it in a THEN form.
    Do not use CHEAT_TAC or new_axiom.
    If the input is
    ‘x * (y + z) = x * z + x * y‘
    A possible answer is
    REWRITE_TAC[LEFT_ADD_DISTRIB] THEN
    GEN_REWRITE_TAC LAND_CONV [ADD_SYM] THEN
    REFL_TAC
    Please include the proof in your proof but not other natural statements,
    so that I can easily evaluate your answer.
    {query.txt}

The prompt includes a single worked example and instructs the model to output only a tactic expression, with no surrounding explanation. This matches the expected input format of the evaluation pipeline. At this point, we have built the benchmark and have all the required components to evaluate the answers generated by the LLM.

A.4 EXAMPLE PROBLEM

The problem x86.sha3_keccak_f1600.WORD_NEG_EL_DEMORGAN from the benchmark asks the prover to establish a De Morgan identity over machine words:

‘!(p:N word) (q:N word).
(word_or p (word_not q)) = word_not(word_and (word_not p) q)‘

The ground-truth proof discharges this in two tactics:

    REPEAT GEN_TAC THEN WORD_BITWISE_TAC

An alternative accepted proof does it through rewrites:

    REWRITE_TAC[WORD_NOT_AND] THEN REWRITE_TAC[WORD_NOT_NOT] THEN REFL_TAC

This illustrates that multiple distinct tactic sequences can solve the same goal; our benchmark simply accepts any correct proof expression.

A.5 CODEX CLI BASELINE

Our main preliminary baseline uses GPT-5.3-Codex through codex-cli. For each problem, we substitute the goal term into the prompt and invoke the model in a restricted read-only sandbox with shell access, databases, and web search disabled. Responses are written to a CSV together with problem identifiers and auxiliary metadata. We also parse the CLI event stream to detect unexpected tool-related events; these are logged for audit purposes but are not used for evaluation.

A.6 CONSTRUCTION OF THE PROBLEM TIMEOUT MAPPING

The benchmark evaluator executes each submitted proof attempt with a timeout in order to prevent runaway tactics from consuming unbounded compute. We support two timeout mechanisms: category-level defaults in timeouts.json and per-problem overrides in timeout-map.json.

In practice, proof-check times in s2n-bignum-bench vary by several orders of magnitude, from milliseconds or a few seconds for small HOL Light lemmas to multiple hours for the largest routine-correctness theorems. A single blanket timeout therefore creates an undesirable tradeoff: set too low, it incorrectly penalizes legitimate but expensive proofs; set too high, failed submissions may waste hours before timing out, which can also cause out-of-memory issues and knock-on failures. To address this, we supply a heuristically derived per-problem timeout map based on repeated profiling of the ground-truth proofs.
To construct this map, we profile the original human-written proofs using the same benchmark evaluation harness that is later used for submitted answers. During profiling, category-level timeouts are temporarily set to a very large value so that the ground-truth proofs are not prematurely terminated. The evaluator then assembles, compiles, and executes the benchmark proof files exactly as in normal assessment, while recording per-problem wall-clock time for the internal prove(goal, tactic) call. We refer to this measurement as prove_secs; it excludes compilation overhead and captures only proof execution time inside HOL Light.

Because proof execution times vary across runs due to operating-system scheduling, garbage collection, memory pressure, and parallel-execution effects, we repeat this profiling procedure three times over the full benchmark. Across all runs, every ground-truth proof completes successfully. The repeated runs reveal substantial variance for some heavy proofs, including large absolute spreads for the most expensive routine-correctness theorems.

For each problem, we collect the profiled prove_secs values from all profiling runs and compute summary statistics including the maximum runtime and an empirical high-percentile runtime. We then divide problems into two broad classes:

• Light problems: proofs whose profiled runtimes remain comfortably below the heavy-proof regime.
• Heavy problems: proofs whose observed runtimes indicate substantially higher cost or variance.

Light and heavy problems are assigned timeouts using different multiplicative safety margins. Intuitively, heavy problems receive more generous scaling because they exhibit larger absolute runtime variance. In both cases, the assigned timeout is bounded below by a fixed minimum floor (120 seconds) and above by a global cap (10,800 seconds).
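The timeout-assignment rule described above can be sketched as follows. The light/heavy threshold and the two multiplicative margins below are illustrative assumptions; only the 120-second floor and 10,800-second cap come from the text:

```python
# Hypothetical sketch of the per-problem timeout heuristic of A.6.
FLOOR_SECS = 120.0        # minimum floor (from the paper)
CAP_SECS = 10_800.0       # global cap (from the paper)
HEAVY_THRESHOLD = 300.0   # assumed boundary of the heavy-proof regime
LIGHT_MARGIN = 3.0        # assumed multiplicative margin, light proofs
HEAVY_MARGIN = 6.0        # assumed (more generous) margin, heavy proofs

def assign_timeout(prove_secs_runs: list[float]) -> float:
    """Map repeated prove_secs profiles to one conservative timeout."""
    worst = max(prove_secs_runs)  # summary statistic: maximum runtime
    margin = HEAVY_MARGIN if worst >= HEAVY_THRESHOLD else LIGHT_MARGIN
    return min(max(worst * margin, FLOOR_SECS), CAP_SECS)

assert assign_timeout([0.01, 0.02, 0.01]) == 120.0   # floor applies
assert assign_timeout([4000.0, 5200.0]) == 10_800.0  # cap applies
```

Using the maximum over repeated runs (rather than the mean) is what makes the mapping conservative against run-to-run variance.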
This yields a timeout map that is conservative enough to accommodate runtime variation while still avoiding the extreme inefficiency of a blanket timeout. The resulting timeout-map.json contains one timeout entry per benchmark problem. During evaluation, the benchmark first checks whether a submitted problem identifier appears in this map; if so, the corresponding per-problem timeout is used. If not, the evaluator falls back to the category-level default from timeouts.json. This fallback is primarily intended for exploratory use or for newly added benchmark problems that have not yet been profiled.

A.7 EVALUATION PIPELINE

The inference pipeline is left to the challenger to iterate on, whether with their own models or with API-endpoint-based prompting and more intricate techniques. Ultimately, the same artifact is expected for evaluation: a directory of per-problem answer.txt files. Evaluation proceeds in three stages:

(1) Syntax checking and assembly. Each candidate answer is first validated against the benchmark problem identifiers and then syntax-checked by compiling it in context as a HOL Light tactic expression. Concretely, the answer is wrapped into a small OCaml/HOL Light scaffold and checked with the benchmark syntax-checking script. Problems that fail this stage are excluded from further execution.

(2) Proof execution. Answers that pass syntax checking are assembled into benchmark evaluation files and executed in HOL Light using the repository evaluation harness. Each proof attempt is run with:

• a per-problem timeout,
• pre/post axiom-count comparison to detect unsound submissions, and
• standard compile-and-run logging.

(3) Verdict collection. Each problem receives exactly one verdict: OK, FAIL, CHEATING, TIMEOUT, or ERROR.

Our primary reported baselines use GPT-5.3-Codex in medium-effort and high-effort modes under the evaluation configuration described above with zero-shot, query-only assessment.
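The three stages above can be condensed into a single grading function. In this sketch the real syntax checker and HOL Light runner are stand-in callables, and the mapping of a syntax-check failure to a verdict is an assumption (the paper only says such answers are excluded from execution):

```python
# Hypothetical sketch of the three-stage verdict logic of A.7.
VERDICTS = {"OK", "FAIL", "CHEATING", "TIMEOUT", "ERROR"}

def grade(answer: str, syntax_ok, run_proof) -> str:
    # Stage 1: syntax/type pre-check; failures never reach HOL Light.
    if not syntax_ok(answer):
        return "ERROR"
    # Stage 2: execute under the per-problem timeout; run_proof is
    # assumed to report one of the outcomes below.
    outcome = run_proof(answer)
    # Stage 3: exactly one verdict per problem.
    return {"proved": "OK", "failed": "FAIL",
            "timeout": "TIMEOUT", "axioms_changed": "CHEATING"}[outcome]

v = grade("REFL_TAC", lambda a: True, lambda a: "proved")
assert v == "OK" and v in VERDICTS
```

The key invariant, matching the paper, is that every attempted problem ends in exactly one of the five verdicts.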
Each mode produces one answer per problem. Out of the full set of 2,284 benchmark problems, the medium-effort run produced 743 answers that passed syntax checking and reached proof execution, while the high-effort run produced 766; the remaining answers were excluded at the syntax-check stage because they did not compile as valid OCaml/HOL Light tactic expressions. Across the full benchmark, the medium-effort run solved 101/2,284 problems (4.4%) and the high-effort run solved 121/2,284 (5.3%), a net gain of +20 proofs. Gains concentrate in the program-state category (+12 problems) and generic (+7 problems). We also host a leaderboard for other research groups to make their own attempts on this problem set.

                 --------- Medium Effort ---------   ---------- High Effort ----------
Category         Total  Eval   OK  FAIL TO/ERR OK/Tot   Total  Eval   OK  FAIL TO/ERR OK/Tot
generic            562   330   59   239     32  10.5%     562   349   66   246     37  11.7%
bit vector         311   142   26   109      7   8.4%     311   140   27   107      6   8.7%
program state      552   235   16   211      8   2.9%     552   229   28   193      8   5.1%
fc arm             437    33    0    33      0   0.0%     437    42    0    42      0   0.0%
fc x86             422     3    0     3      0   0.0%     422     6    0     6      0   0.0%
Total            2,284   743  101   595     47   4.4%   2,284   766  121   594     51   5.3%

TO/ERR combines TIMEOUT and ERROR verdicts. In aggregate, the medium-effort run produced 44 timeouts and 3 errors; the high-effort run produced 47 timeouts and 4 errors.