Beyond the Answer: Decoding the Behavior of LLMs as Scientific Reasoners

Accepted at the Post-A GI Science and Society W orkshop at ICLR 2026 B E Y O N D T H E A N S W E R : D E C O D I N G T H E B E H A V I O R O F L L M S A S S C I E N T I FI C R E A S O N E R S Rohan Pandey ∗ Eric Y e ∗ Univ ersity of W ashington { rpande, ericy4 } @uw.edu Michael Li ∗ Carnegie Mellon Uni versity ml7@andrew.cmu.edu A B S T R AC T As Large Language Models (LLMs) achie ve increasingly sophisticated perfor - mance on complex reasoning tasks, current architectures serv e as critical proxies for the internal heuristics of frontier models. Characterizing emergent reasoning is vital for long-term interpretability and safety . Furthermore, understanding ho w prompting modulates these processes is essential, as natural language will likely be the primary interface for interacting with A GI systems. In this work, we use a custom variant of Genetic Pareto (GEP A) to systematically optimize prompts for scientific reasoning tasks, and analyze how prompting can af fect reasoning behavior . W e in vestigate the structural patterns and logical heuristics inherent in GEP A-optimized prompts, and ev aluate their transferability and brittleness. Our findings rev eal that gains in scientific reasoning often correspond to model-specific heuristics that fail to generalize across systems, which we call “local” logic. By framing prompt optimization as a tool for model interpretability , we ar gue that mapping these preferred reasoning structures for LLMs is an important prerequisite for effecti vely collaborating with superhuman intelligence. 1 I N T R O D U C T I O N As the capabilities of Lar ge Language Models (LLMs) increasingly push performance frontiers (Bubeck et al., 2023), research focus will foreseeably shift from benchmark tracking to ward under - standing ho w to best collaborate with LLMs. 
We identify reasoning as the most critical capability of LLMs in a post-AGI landscape, as it provides the verifiable logical scaffolding necessary for autonomous systems to navigate novel, high-stakes scenarios (Huang & Chang, 2023). A rigorous understanding of these mechanisms enables more effective human-AGI collaboration in a wide range of scenarios, ensuring these systems act reliably when tackling complex challenges. Current LLMs provide a valuable window into the reasoning paradigms that may define a post-AGI world.

We see scientific reasoning as a robust testbed for probing these internal paradigms. Scientific reasoning provides essential foundational logic and structured frameworks, which can generalize to a variety of high-impact downstream tasks, including but not limited to scientific discovery, spatial navigation, and engineering. In this work, we examine such reasoning in two domains: (1) GPQA, which tests graduate-level scientific reasoning (Rein et al., 2023), and (2) a formally verified algebra dataset implemented in Lean (Yang et al., 2023b; Zheng et al., 2022). These benchmarks allow for the observation of complex logic under controlled and interpretable conditions.

Since natural language will likely be the primary interface for interacting with AGI systems, understanding how prompting affects reasoning capabilities is crucial for reliable collaboration. In this work, we first use Genetic Pareto (GEPA) to optimize prompts and discover which specific instructions result in the best performance (Agrawal et al., 2025; Khattab et al., 2023). We then analyze these results to identify the patterns that elicit higher-level reasoning in these models. We find that high-performing prompts often rely on model-specific heuristics that fail to generalize across models (Mirzadeh et al., 2025). Mapping these machine-preferred structures is vital for overseeing future general-purpose systems (Berglund et al., 2024).
This work establishes a foundation for decoding LLM reasoning to ensure safety frameworks are prepared for a post-AGI society.

2 Background

LLMs for Math and Science. LLMs are rapidly moving beyond simple linguistic pattern matching, developing complex multi-step reasoning skills that are necessary for mathematical and scientific general intelligence. Current methodologies often rely on specialized prompting paradigms; for instance, Chain-of-Thought (Wei et al., 2022) encourages explicit symbolic derivation, while Program-of-Thought (Chen et al., 2022) offloads complex scientific computing to external Python interpreters. More recently, reasoning-centric models such as OpenAI's o1 and DeepSeek's R1 have utilized large-scale reinforcement learning to internalize these logical trajectories (DeepSeek-AI et al., 2025). However, breaking records on benchmarks is only a partial milestone. As these systems approach AGI-level performance, research must shift from merely tracking performance to understanding how to best collaborate with these models. In this work, we argue that identifying the structural biases and implicit reasoning strategies within these models is a vital prerequisite for the effective oversight and safe deployment of AGI systems in high-stakes and unsupervised scenarios.

Automated Prompt Engineering. Optimization of model performance has evolved from the manual engineering of prompts to a systematic algorithmic search. Early works in automated prompt engineering have demonstrated that LLMs are often the best optimizers of their own instructions (Zhou et al., 2022; Yang et al., 2023a). Genetic Pareto (GEPA) utilizes an evolutionary approach to iteratively optimize high-performing prompts (Agrawal et al., 2025).
While previous works on automated prompt engineering focused primarily on maximizing accuracy, we reposition these algorithms as tools for understanding. By allowing GEPA to explore the vast search space of possible instructions, we treat the resulting optimized prompts and the trajectory of prompt evolution as valuable windows into the latent preferences of the model. We can then identify specific heuristics that can be reverse-engineered and transferred back to improve human-authored prompting methodologies.

Knowledge Transferability. A fundamental question on the path toward AGI is whether machine intelligence is a singular phenomenon or a collection of fragmented epistemologies (Quattrociocchi et al., 2025). Previous research into model distillation and generalization has shown that while knowledge can be transferred between architectures, internal reasoning protocols often remain model-specific (Hinton et al., 2015). This brittleness poses challenges for the post-AGI era. For AI systems to truly be of use in collaborative scientific discovery, their logic must be interoperable. If a reasoning strategy optimized for one model fails on another, it suggests a closed epistemology, which represents a detached form of intelligence that lacks a universal logical foundation (Pal et al., 2026).

3 Methodology

We assess LLM reasoning in two domains: (1) formal mathematical theorem proving via Lean from the MiniF2F dataset (Zheng et al., 2022) and (2) scientific reasoning via the GPQA Diamond benchmark (Rein et al., 2023). We refer to these benchmarks as "Algebra" and "GPQA" respectively. These domains assess scientific reasoning in complementary ways: Lean requires rigid, verifiable logic and is open-ended, while GPQA evaluates high-level conceptual reasoning and is multiple-choice. We apply a custom variant of GEPA which has been simplified and adapted for Lean theorem proving and GPQA. Implementation details are specified in Algorithm 1.
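In outline, the loop of Algorithm 1 (given below) can be sketched in Python. This is a minimal sketch, not the authors' implementation; the callables `llm`, `llm_critic`, `llm_evolve`, `lean_verify`, and `check_answer` are hypothetical stand-ins for the actual model, critic, and verifiers.

```python
import random

def dominates(a, b):
    """Score vector a dominates b: at least as good everywhere, strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def gepa(seed_prompt, lean_theorems, gpqa_questions, iters, n, m,
         llm, llm_critic, llm_evolve, lean_verify, check_answer):
    """Minimal sketch of the custom GEPA variant in Algorithm 1 (hypothetical callables)."""
    population = [seed_prompt]
    pareto = []  # (prompt, score_vector) pairs
    for _ in range(iters):
        # Line 4: sample from the Pareto front if non-empty, else the population
        prompt = random.choice([p for p, _ in pareto] or population)
        # Line 5: sample n Lean theorems and m GPQA questions
        cases = random.sample(lean_theorems, n) + random.sample(gpqa_questions, m)
        scores, errors = [], []
        for i, case in enumerate(cases):
            out = llm(prompt, case)
            # Line 6: Lean cases go through the verifier, GPQA through an answer check
            ok = lean_verify(out) if i < n else check_answer(out, case)
            scores.append(int(ok))
            if not ok:
                errors.append(case)
        # Line 7: Pareto update, drop dominated prompts, add this one if non-dominated
        pareto = [(p, s) for p, s in pareto if not dominates(scores, s)]
        if not any(dominates(s, scores) for _, s in pareto):
            pareto.append((prompt, scores))
        # Lines 8-11: critique the failures, evolve a child prompt, prune population
        child = llm_evolve(prompt, llm_critic(prompt, errors))
        population = [p for p, _ in pareto] + [child]
    return [p for p, _ in pareto]
```

The sketch follows Algorithm 1 literally, including its choice to sample candidates from the Pareto front once it is non-empty.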
We conduct the entirety of GEPA optimization using DeepSeek-V3.2, which currently has state-of-the-art performance on reasoning benchmarks. To gain insight into the transferability of optimized prompts across different models, we also test the same prompts on ChatGPT-5.4-mini, GLM 5, and Claude Sonnet 4.6, all of which were released within the past few months and have competitive performance. By analyzing the prompt optimization process and comparing prompt performance across models, we hope to gain insight into how prompting strategy affects reasoning performance.

For every combination of model and benchmark, we run evaluations across four prompts: (1) Hand-Crafted Simple: a hand-crafted simple prompt, intended to emulate the prompting ability of average technical users, which serves as a baseline; (2) Hand-Crafted CoT: a hand-crafted Chain-of-Thought prompt, intended to emulate best-practice prompting strategies to current knowledge, which serves as a second baseline; (3) GEPA Optimized Baseline: the initial prompt from the GEPA optimization process; and (4) GEPA Optimized Final: the final prompt from the GEPA optimization process. Prompts are further discussed in the Appendix, and examples are provided.

Algorithm 1: Custom Variant of Genetic Pareto (GEPA)

1: Input: seed prompt P0, Lean theorems L, GPQA questions G, iterations T, samples (n, m).
2: Initialize: Population ← {P0}, Pareto ← ∅.
3: for t = 1 to T do
4:   P ← Sample(Pareto ≠ ∅ ? Pareto : Population)
5:   S_L ← Sample(L, n), S_G ← Sample(G, m)
6:   Evaluate: for each c_i ∈ S_L ∪ S_G, v[i] ← LeanVerify(LLM(P, c_i)) if c_i ∈ S_L, else CheckAnswer(LLM(P, c_i)) if c_i ∈ S_G
7:   Update Pareto: add P if non-dominated by existing prompts; remove dominated prompts.
8:   Errors ← {c_i : v[i] = 0}
9:   Critique ← LLMCritic(P, Logs(Errors))
10:  P′ ← LLMEvolve(P, Critique)
11:  Population ← Pareto ∪ {P′}   (prune to Pareto + new child)
12: end for
13: return Pareto

4 Results

4.1 Benchmark Performance

Table 1: Performance comparison of models on benchmark datasets.

Model             | Method                  | Algebra | GPQA
DeepSeek-V3.2     | Hand-Crafted Simple     |  86.11% | 91.67%
                  | Hand-Crafted CoT        |  97.22% | 91.67%
                  | GEPA Optimized Baseline |  91.67% | 88.89%
                  | GEPA Optimized Final    | 100.00% | 94.44%
GPT-5.4-mini      | Hand-Crafted Simple     |  50.00% | 91.67%
                  | Hand-Crafted CoT        |  47.22% | 91.67%
                  | GEPA Optimized Baseline |  50.00% | 88.89%
                  | GEPA Optimized Final    |  61.11% | 91.67%
GLM 5             | Hand-Crafted Simple     |  91.67% | 91.67%
                  | Hand-Crafted CoT        |  97.22% | 86.11%
                  | GEPA Optimized Baseline |  91.67% | 88.89%
                  | GEPA Optimized Final    |  94.44% | 91.67%
Claude Sonnet 4.6 | Hand-Crafted Simple     |  30.56% | 77.78%
                  | Hand-Crafted CoT        |  52.78% | 83.33%
                  | GEPA Optimized Baseline |  50.00% | 80.56%
                  | GEPA Optimized Final    |  50.00% | 80.56%

Prompt performance on benchmarks is displayed in Table 1. Our first major observation is that the GEPA Optimized Final prompt achieves its most significant gains on DeepSeek-V3.2, reaching 100.00% on Algebra and 94.44% on GPQA. This confirms that the GEPA optimization process is highly effective when evaluated on the same model used during optimization.

Our second major observation is that the superiority of the GEPA-optimized prompts does not reliably transfer to other models. While some models show marginal benefits (for example, GPT-5.4-mini improves on Algebra from 50.00% to 61.11%), the optimized prompt is rarely the undisputed best performer elsewhere. Notably, for GLM 5 (Algebra) and Claude Sonnet 4.6 (both benchmarks), the Hand-Crafted CoT prompts actually outperform the GEPA Optimized Final prompt. These results suggest that the optimization process is highly model-specific, and point to a trade-off between prompt universality and performance.
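One way to read Table 1 is to compare the GEPA Optimized Final prompt against the stronger of the two hand-crafted baselines for each model. A quick sketch over the Algebra column (scores transcribed from Table 1):

```python
# Algebra scores (%) transcribed from Table 1
algebra = {
    "DeepSeek-V3.2":     {"Simple": 86.11, "CoT": 97.22, "GEPA Final": 100.00},
    "GPT-5.4-mini":      {"Simple": 50.00, "CoT": 47.22, "GEPA Final": 61.11},
    "GLM 5":             {"Simple": 91.67, "CoT": 97.22, "GEPA Final": 94.44},
    "Claude Sonnet 4.6": {"Simple": 30.56, "CoT": 52.78, "GEPA Final": 50.00},
}

def gepa_margin(scores):
    """GEPA Final minus the best hand-crafted baseline for one model;
    a negative margin means the optimized prompt did not transfer."""
    return round(scores["GEPA Final"] - max(scores["Simple"], scores["CoT"]), 2)

margins = {model: gepa_margin(s) for model, s in algebra.items()}
# Only DeepSeek-V3.2 (the optimizer model) and GPT-5.4-mini end up with
# positive margins; GLM 5 and Claude Sonnet 4.6 come out at -2.78.
```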
Since DeepSeek was used as the optimizer, the evolved prompts likely capture patterns specific to DeepSeek's architecture that do not resonate with the internal logic of other models. This highlights a significant lack of interoperability and reinforces the brittleness of automated prompt engineering across diverse foundation models.

Figure 1: The length of GEPA-proposed prompts increases over the course of optimization, with the final prompt often being about twice as long in characters as the initial prompt. This shows that detailed prompting is likely required to unlock better reasoning capabilities in LLMs.

4.2 Prompt Evolution Patterns

By manually analyzing the evolution of prompts over the course of GEPA optimization, we primarily find that prompts evolve from telling the model what to do to coaching it on how to do it, similar to a human expert. In the process, significant domain knowledge and best practices are often added to the context. For example, in the GEPA optimization process for the Algebra benchmark, prompts eventually referenced domain-specific strategies such as using Eisenstein's Criterion for minimal polynomials. Similarly, later GPQA prompts mentioned specific strategies such as quantum field theory loop counting. Later Algebra prompts also explicitly warn against common pitfalls, especially with regard to Lean formatting. Interestingly, a robust protocol for "Handling False Statements" is also introduced to prevent hallucinating proofs. The Appendix contains prompt examples.

This observation is also supported by analyzing prompt lengths over the course of optimization, as shown in Figure 1. The final prompt turns out to be around twice as long as the initial prompt. This demonstrates that LLMs often rely on detailed prompting to fully elicit their reasoning capabilities.
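The two trend analyses used in Section 4.2, prompt length over iterations and the drift of prompt embeddings in a consistent direction (discussed next and in the Appendix), can be sketched as follows. This is a minimal sketch; the embedding vectors are assumed to come from some unspecified off-the-shelf text encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def length_growth(prompts):
    """Final/initial character-length ratio; Figure 1 reports roughly 2x."""
    return len(prompts[-1]) / len(prompts[0])

def drift_consistency(embeddings):
    """Mean cosine similarity between successive displacement vectors in
    embedding space; values near 1 indicate drift in a consistent direction."""
    steps = [[b - a for a, b in zip(e1, e2)]
             for e1, e2 in zip(embeddings, embeddings[1:])]
    sims = [cosine(s, t) for s, t in zip(steps, steps[1:])]
    return sum(sims) / len(sims)
```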
By analyzing the embedding space of prompts over the optimization process, we also find that the embeddings of proposed prompts tend to drift in a consistent direction over the course of optimization. This suggests that for the same task, some regions of prompting space may offer more promising performance. A plot of this behavior is in the Appendix.

5 Discussion

Conclusion. In this work, we demonstrated that clever prompt optimization techniques can unlock reasoning potential in LLMs. We find that longer prompts, as well as prompts that guide the model along the how of the target task, tend to elicit superior reasoning capabilities in LLMs. Such prompts can be optimized autonomously, and can beat hand-crafted baselines, even when we hand-craft prompts using state-of-the-art best practices. Our findings suggest that if current trends continue, post-AGI reasoning may not manifest as a universal logic, but rather as a collection of task-specific and model-specific heuristics that require precise coaching to elicit.

Limitations. Our work is still relatively preliminary, and is limited to two benchmarks and four models. We would need a broader experimental scope to more confidently confirm the validity of our hypotheses. Furthermore, the stochastic nature of our LLM-driven prompt optimization loop could lead to inconsistency across multiple runs.

Future Work. Our preliminary results suggest that prompt optimization remains dangerously architecture-dependent. To prepare for a post-AGI landscape, future research should work on identifying "reasoning primitives", or logical structures invariant across models. If AGI develops logic that humans cannot understand, we need to build automated tools to keep these systems interpretable. Without this, we risk a future where our most capable tools are also our least predictable.

6 Statements

6.1 Ethics Statement

While the discovery of effective reasoning paths could potentially be repurposed for harmful tasks, we view our analysis of the logic inherent in these models primarily as a vital safeguard. If we better understand how LLMs reason, we can better utilize LLMs for reasoning-related tasks. Ultimately, identifying how LLMs approach reasoning tasks is essential for the development of robust, aligned, and transparent post-AGI systems. We also agree NOT to reveal examples from our evaluation datasets in plain text or images online, to reduce the risk of leakage into foundation model training corpora.

6.2 Reproducibility Statement

Our results are reproducible up to the nondeterminism in the outputs of the LLMs used.

References

Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. GEPA: Reflective prompt evolution can outperform reinforcement learning, 2025. https://arxiv.org/abs/2507.19457.

Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: LLMs trained on "A is B" fail to learn "B is A", 2024. https://arxiv.org/abs/2309.12288.

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with GPT-4, March 2023. https://www.microsoft.com/en-us/research/publication/sparks-of-artificial-general-intelligence-early-experiments-with-gpt-4/.

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen.
Program of Thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022.

DeepSeek-AI et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint, 2025.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey, 2023.

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into self-improving pipelines, 2023.

Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models, 2025.

Koyena Pal, David Bau, and Chandan Singh. Do explanations generalize across large reasoning models?, 2026.

Walter Quattrociocchi, Valerio Capraro, and Matjaž Perc. Epistemological fault lines between human and artificial intelligence, 2025.

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark, 2023.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023a.
Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar. LeanDojo: Theorem proving with retrieval-augmented language models, 2023b.

Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. MiniF2F: A cross-system benchmark for formal olympiad-level mathematics, 2022.

Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations, 2022.

A Appendix

A.1 Additional Results

Figure 2: The embeddings of GEPA-proposed prompts tend to drift over the course of optimization. For both Algebra and GPQA, there seems to be a significant jump in embedding space at around iteration 12. This suggests that for the same task, some regions of prompting space may offer more promising performance.

A.2 Prompts

The four prompts we tested for the Algebra benchmark are listed below. For the sake of brevity, we do not include the four GPQA prompts in this section, but they are similar in nature.

Listing 1: Hand-Crafted Simple Prompt

You are a mathematician and Lean 4 practitioner.
Your goal is to solve formal IMO-style problems and produce correct Lean 4 proofs.

Approach and tricks:
- Try to reduce the goal using simp/linarith/nlinarith or rewriting.
- Look for standard lemmas in Mathlib (algebra, number theory, inequalities, parity).
- Prefer structured proofs: have, refine, apply, and use `by` blocks.

Output format:
- Return the response in a single fenced code block with ```lean ... ``` containing valid Lean 4 code.
- Do not include any text outside the code block.

Listing 2: Hand-Crafted CoT Prompt

You are an expert mathematician and Lean 4 developer.
Your goal is to solve formal IMO-style problems and produce correct Lean 4 proofs.

### Output Format
- Return the response in a single fenced code block with ```lean ... ``` containing valid Lean 4 code.
- Do not include any text outside the code block.
- The code should be complete, including necessary `import` and `open` statements.

### Imports
- Do not use a single monolithic `import Mathlib`.
- Instead, import the specific files you need, e.g., `import Mathlib.Data.Real.Basic`, `import Mathlib.Algebra.Order.Field.Basic`, `import Mathlib.Tactic.Linarith`.

### Chain-of-Thought (internal)
- Think step-by-step and consider small test cases for `ℕ`/`ℤ` statements.
- Do not reveal chain-of-thought. Output only the Lean code block.

### General Approach & Tactics
- Start with structured proofs: `have`, `refine`, `apply`, `by` blocks. Use `calc` for equational reasoning.
- Use `simp`, `rw`, `ring`, `norm_num`. `linarith` for linear inequalities, `nlinarith` for non-linear.
- `ac_rfl` is useful for equalities up to associativity/commutativity.

### Handling False Statements
- If you find a counterexample, do not modify the statement.
- Provide a Lean code block that proves the counterexample, then leave the original theorem with `sorry`.

### Domain-Specific Hints
- Polynomial identities: `ring` is often a one-line proof.
- Inequalities: reduce to `0 ≤ x - y`, then show nonnegativity via squares or `positivity`.
- Linear systems over `ℂ`: use `linear_combination` or a `calc` block with `ring`.

Listing 3: GEPA Optimized Baseline Prompt

You are an expert mathematician and Lean 4 developer.
Your goal is to solve formal IMO-style problems and produce correct Lean 4 proofs.

### General Approach

1. **Analyze the Problem:** Read the theorem statement carefully. Restate the goal in your own words to ensure you understand it. Identify the core mathematical concepts involved (e.g., number theory, inequalities, algebra).
2. **Formulate a Mathematical Strategy:**
   * Before writing any code, sketch a proof plan on paper (mentally).
   * Consider standard theorems like AM-GM, Cauchy-Schwarz, Jensen's inequality, or properties of quadratic equations (discriminant).
   * For sums and series, look for telescoping patterns or opportunities for integral bounds. For example, `1/√k` is often bounded using `2(√k - √(k-1))`.
   * A very common and powerful trick for inequalities is to use the fact that `(x - y)^2 ≥ 0`, which expands to `x^2 + y^2 ≥ 2xy`.
3. **Structure the Lean Proof:**
   * **DRY (Don't Repeat Yourself):** If you find yourself proving the same thing multiple times with different variables, define a local helper `lemma` to abstract the repeated logic. For example, if you need `x^2/y + y ≥ 2x` for several pairs `(x, y)`, prove it once in a general lemma.
   * **Structured Proofs:** Prefer structured proofs using `have`, `let`, `calc`, `refine`, and `apply`. Use `by` blocks for short tactic sequences.
   * **Backward Reasoning:** Use `suffices` to simplify the goal to an easier-to-prove intermediate statement.

### Lean 4 Tactics and Tricks

* **Automation:**
  * `ring`: Use to prove equalities that hold in any commutative ring (polynomials without division).
  * `field_simp`: Use to prove equalities in fields (expressions with division). Provide non-zero hypotheses, e.g., `field_simp [h.ne.symm]`.
  * `linarith`: For proving linear arithmetic inequalities.
  * `nlinarith`: For proving non-linear arithmetic inequalities. It is very powerful when combined with a hypothesis like `sq_nonneg`.
  * `positivity`: To automatically prove goals of the form `0 < x` or `0 ≤ x`.

* **Rewriting and Simplification:**
  * `rw`: For rewriting terms using equalities.
  * `simp`: For general simplification. You can add specific lemmas to its ruleset, e.g., `simp [add_comm]`.

* **Finding Lemmas:**
  * Mathlib is vast. If you think a lemma for a common mathematical fact (e.g., telescoping sums) should exist, it probably does. For example, `Finset.sum_Icc_telescope` is available for telescoping sums over an `Icc`.
  * Use `exact?` or `apply?` to ask Lean to search for a lemma that solves the current goal.
  * **Crucially, only use lemma and tactic names that you are certain exist in Mathlib.** Do not invent names. If your code uses a non-existent identifier, it is incorrect. Double-check your toolset.

* **Common Mathlib Lemmas for Inequalities:**
  * `sq_nonneg (x : ℝ)`: Proves `0 ≤ x ^ 2`.
  * `pow_two_nonneg (x : ℝ)`: Same as `sq_nonneg`.
  * `div_nonneg {a b : ℝ} (ha : 0 ≤ a) (hb : 0 ≤ b) : 0 ≤ a / b`. You need to prove the numerator and denominator are non-negative.

### Output Format

* Return the response in a single fenced code block with ```lean ... ``` containing valid, complete, and compilable Lean 4 code.
* Ensure all necessary `import` and `open` statements are present.
* Do not include any text, explanation, or comments outside the code block.

Listing 4: GEPA Optimized Final Prompt

You are an expert mathematician and Lean 4 developer.
Your goal is to solve formal IMO-style problems and produce correct Lean 4 proofs.

### Output Format
- Return the response in a single fenced code block with ```lean ... ``` containing valid Lean 4 code.
- Do not include any text outside the code block.
- The code should be complete, including necessary `import` and `open` statements.

### General Approach & Tactics
- **Verify the statement first.** Before attempting a proof, especially for statements over `ℕ` or `ℤ`, quickly test small values (e.g., n=0, 1, 2). If you find a counterexample, follow the "Handling False Statements" protocol below.
- Start with structured proofs: `have`, `refine`, `apply`, `by` blocks. `calc` blocks are excellent for equational reasoning.
- Use `simp`, `rw`, `ring`, `norm_num`. `linarith` is for linear inequalities over ordered fields (`ℝ`, `ℚ`). `nlinarith` is for non-linear inequalities.
- `ac_rfl` is useful for proving equalities up to associativity and commutativity, which is often more robust than a sequence of `rw [add_comm, add_assoc]`.
- **DRY (Don't Repeat Yourself).** If you prove the same intermediate result multiple times with different variables, define a local helper lemma inside the proof using `have`. This improves clarity and reduces errors.
  ```lean
  have my_lemma : ∀ (x y : ℝ), 0 < y → x^2 / y + y ≥ 2 * x := by
    ... -- proof of the lemma
  -- then use it:
  have h1 := my_lemma a b hb
  have h2 := my_lemma b c hc
  ```

### Handling False Statements
- If you find a counterexample (e.g., the theorem fails for `n=1`), do not attempt to prove a modified version of the theorem (e.g., changing `<` to `≤` or adding a hypothesis like `n ≥ 2`).
- Your task is to prove the theorem *as given*. If it is false, you cannot complete the task.
- In this case, your response should be a Lean code block that formally proves the counterexample. Then, leave the original theorem with `sorry`. This correctly signals that the requested proof is impossible.
- **Example an assistant should produce for a false statement:**
  ```lean
  import Mathlib
  -- The user's theorem statement is false for n=1.
  -- Here is a proof of the counterexample.
  example : ¬ (∀ n : ℕ, (n : ℝ) ^ (1 / n : ℝ) < 2 - 1 / n) := by
    intro h
    specialize h 1
    norm_num at h -- This will fail, proving the negation.

  -- The original problem, which is unprovable.
  theorem algebra_ineq_nto1onlt2m1on (n : ℕ) : (n : ℝ) ^ (1 / n : ℝ) < 2 - 1 / n := by
    sorry
  ```

### Domain-Specific Strategies & Mathlib Knowledge

#### 1. Polynomial & Ring Identities
- For identities involving addition, subtraction, multiplication, and integer powers (i.e., polynomial identities), the `ring` tactic is extremely powerful. It works over any commutative (semi)ring like `ℤ`, `ℚ`, `ℝ`, `ℂ`. For simple identities, this is often a one-line proof.

#### 2. Inequalities over ℝ, ℚ (AM-GM, Cauchy-Schwarz, etc.)
- A common strategy is to prove `x ≥ y` by showing `x - y ≥ 0`.
- **Pattern: Prove `x ≥ y` by showing `x - y ≥ 0`**
  1. State the subgoal: `suffices 0 ≤ x - y by linarith`.
  2. Establish an algebraic identity for the difference, where the right-hand side is in a form that is clearly non-negative (e.g., a square or a sum of squares).
     ```lean
     have h_ident : x - y = (a - b)^2 / c := by
       field_simp -- Handles division. Provide non-zero hypotheses, e.g., `field_simp [ne_of_gt hc]`.
       ring -- Finishes the polynomial part.
     ```
  3. Prove the right-hand side is non-negative. This often uses `div_nonneg`, `mul_nonneg`, `sq_nonneg`, and `positivity`.
- **Pattern: Proving `(a-b) * (c-d) ≥ 0`**
  * This is common in induction proofs. The key is to show that `a-b` and `c-d` have the same sign.
  * A robust method is to use `mul_nonneg_of_signs_sync`. For example, to prove `0 ≤ (a - b) * (a^n - b^n)` for `0 ≤ a, b` and `1 ≤ n`:
    ```lean
    have h_iff : a ≤ b ↔ a ^ n ≤ b ^ n :=
      Real.pow_le_pow_iff_of_nonneg (by linarith) (by linarith) (by omega)
    exact mul_nonneg_of_signs_sync h_iff
    ```
  * This is much cleaner than a `by_cases` block.

#### 3. Linear Algebra over ℂ
- **`linarith` fails over `ℂ`** because it is not an ordered ring.
- **`linear_combination` Tactic:** This is the most direct way to solve systems of linear equations. It combines *other equational hypotheses*.
  ```lean
  -- h₀ : f + 3 * z = 11
  -- h₁ : 3 * f - 5 * z = -65
  have h_14z : 14 * z = 98 := by linear_combination 3 * h₀ - h₁
  ```
- **Warning:** Do not use `linear_combination` to manipulate a single equation with a constant (e.g., `linear_combination h₀ - 21` is incorrect). To solve for a variable in one equation, use `calc` or `rw`.
- **Solving Equations:**
  1. **From a system:** After using `linear_combination` to get e.g. `14 * z = 98`, use `mul_right_cancel₀` to solve for the variable.
     ```lean
     -- h_eq : 14 * z = 98
     have hz : z = 7 := by
       apply mul_right_cancel₀ (by norm_num : (14 : ℂ) ≠ 0)
       rw [h_eq] -- Goal becomes: 98 = 14 * 7
       norm_num -- Finishes the proof
     ```
  2. **From one equation:** To isolate a variable in an equation like `f + 21 = 11`, use a `calc` block.
     ```lean
     -- h₀ : f + 21 = 11
     have hf : f = -10 := by
       calc f = (f + 21) - 21 := by ring
         _ = 11 - 21 := by rw [h₀]
         _ = -10 := by norm_num
     ```

#### 4. Algebraic Numbers & Field Theory (Linear Independence)
- Problems of the form "if `a + b * m + c * m^2 = 0` with `a, b, c ∈ ℚ`, then `a = b = c = 0`" are about linear independence and minimal polynomials.
- The strategy:
  1. Identify the algebraic number (e.g., `m = 2^(1/3)`).
  2. Find its minimal polynomial over `ℚ` (e.g., `p = X^3 - 2`).
  3. Prove `p` is irreducible over `ℚ` using **Eisenstein's Criterion**: `Polynomial.irreducible_of_eisenstein_criterion`. For `X^3 - 2`, use the prime `p = 2`. (Requires `import Mathlib.RingTheory.Polynomial.Eisenstein`.)
  4. Verify `p` is the minimal polynomial using `minpoly.eq_of_irreducible_of_monic`.
  5. Use `minpoly.dvd ℚ m hq` to show the minimal polynomial divides your polynomial. (Requires `import Mathlib.FieldTheory.Minpoly.Basic`.)
  6. Get a contradiction from degrees (`Polynomial.eq_zero_of_dvd_of_degree_lt`).
  7. Deduce coefficients are zero using `Polynomial.coeff_eq_zero_of_eq_zero`.

### Troubleshooting
- **`rw` or `calc` block failures:** If `rw` or `calc` fails due to term ordering (e.g., `a * b` vs `b * a`), try `ac_rfl` or add an explicit `rw [mul_comm]`.
- **Unknown Identifier Errors:** If you get an 'unknown identifier' error for a common mathematical lemma (e.g., `pow_le_pow_left`), the name has likely changed in Mathlib4.
  - **Search the documentation.** The best way to find the current name is to search the Mathlib4 API documentation.
  - **Look for prefixes.** Many lemmas are now namespaced by their primary type. For example, lemmas about `Real` powers are often in the `Real` namespace (e.g., `Real.pow_le_pow_of_le_left`). Lemmas about `rpow` (real powers) are often named `Real.rpow_...`.
- **Common Power Inequality Lemmas (for `ℝ`):**
  - For `0 ≤ x ≤ y → x^n ≤ y^n`, use `Real.pow_le_pow_of_le_left`.
  - For `0 ≤ x < y → x^n < y^n`, use `Real.pow_lt_pow_of_lt_left`.
  - For `0 ≤ a, b` and `1 ≤ n`, `a ≤ b ↔ a^n ≤ b^n` is `Real.pow_le_pow_iff_of_nonneg`.
- **Complex Proofs:** If a proof is complex (e.g., induction with inequalities), break it down into many small, verifiable `have` statements to isolate errors. Don't write a single large `calc` block or proof term until you have verified the intermediate steps.
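As a concrete instance of the inequality pattern these prompts repeatedly recommend (reduce the goal to a non-negative square via `sq_nonneg` and `nlinarith`), the following is a minimal sketch assuming a recent Mathlib; it is our illustration, not one of the evaluated theorems.

```lean
import Mathlib

-- The (x - y)^2 ≥ 0 trick described in the prompts:
-- expanding the square gives x^2 + y^2 ≥ 2xy.
example (x y : ℝ) : 2 * x * y ≤ x ^ 2 + y ^ 2 := by
  nlinarith [sq_nonneg (x - y)]
```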
