The Agentic Researcher: A Practical Guide to AI-Assisted Research in Mathematics and Machine Learning


Authors: Max Zimmer, Nico Pelleriti, Christophe Roux, Sebastian Pokutta

Max Zimmer* Nico Pelleriti Christophe Roux Sebastian Pokutta
Department for AI in Society, Science, and Technology, Zuse Institute Berlin, Germany
Institute of Mathematics, Technische Universität Berlin, Germany
{zimmer, pelleriti, roux, pokutta}@zib.de

[Figure 1: screenshot of a terminal session with ten active background shells: six GPU training runs (torchrun on CUDA_VISIBLE_DEVICES 1-6) and several scheduled monitoring tasks (sleep ... && status checks), while the agent idles waiting for a task output.]

Figure 1: A command-line interface (CLI) agent during an autonomous research session: over 8 hours in, managing six parallel GPU training runs and three scheduled monitoring tasks. The same framework supports mathematical derivations, proofs, and verification alongside computational experiments. The agent is idle, consuming no tokens while waiting for a status check to complete.
Abstract

AI tools and agents are reshaping how researchers work, from proving theorems to training neural networks. Yet for many, it remains unclear how these tools fit into everyday research practice. This paper is a practical guide to AI-assisted research in mathematics and machine learning: We discuss how researchers can use modern AI systems productively, where these systems help most, and what kinds of guardrails are needed to use them responsibly. It is organized into three parts: (I) a five-level taxonomy of AI integration, (II) an open-source framework that, through a set of methodological rules formulated as agent prompts, turns CLI coding agents (e.g., Claude Code, Codex CLI, OpenCode) into autonomous research assistants, and (III) case studies from deep learning and mathematics. The framework runs inside a sandboxed container, works with any frontier LLM through existing CLI agents, is simple enough to install and use within minutes, and scales from personal-laptop prototyping to multi-node, multi-GPU experimentation across compute clusters. In practice, our longest autonomous session ran for over 20 hours, dispatching independent experiments across multiple nodes without human intervention. We stress that our framework is not intended to replace the researcher in the loop, but to augment them. Our code is publicly available at github.com/ZIB-IOL/The-Agentic-Researcher.

* We welcome contributions, issue reports, improvement suggestions, and additional case studies via issues and PRs at github.com/ZIB-IOL/The-Agentic-Researcher, to keep this guide up-to-date and useful.

1 Introduction

In 2024, DeepMind's AlphaProof (Hubert et al., 2025) combined with AlphaGeometry (Trinh et al.
, 2024) became the first AI system to achieve medal-level performance at the International Mathematical Olympiad (IMO), reaching silver-medal standard by solving four of the six competition problems through reinforcement learning and formal verification. AlphaEvolve (Novikov et al., 2025) demonstrated that LLM-guided evolutionary search can discover new mathematical constructions, rediscovering best-known solutions across a broad collection of problems and improving on them in several cases (Georgiev et al., 2025). Most recently, Aletheia (Feng et al., 2026b), an autonomous mathematical research agent, resolved several open problems originally posed by Erdős while operating with minimal human intervention. Aletheia also solved several open problems from First Proof (Abouzaid et al., 2026), a benchmark of previously unpublished research-level mathematics questions drawn from the authors' own research process, within weeks of its release. These results are remarkable, and recent systems now address not only well-defined benchmarks but also genuine open mathematical problems. In parallel, the Machine Learning (ML) community has seen a surge in agentic experimentation: for instance, Karpathy's autoresearch (Karpathy, 2026) demonstrated how agents can run automated ML experiment pipelines through iterative code modification, and such pipelines are becoming increasingly common.

Most of the current literature, including the works discussed above, focuses on what AI systems can achieve. Much less attention has been given to the complementary practical question of how researchers should integrate such systems into everyday research. In practice, research rarely proceeds by pursuing a fixed objective from the outset: researchers must decide which questions to ask, which experiments to run, when to reformulate a conjecture, and how to respond to unexpected results.
Supporting this kind of work requires workflows that accommodate shifting objectives, iterative experimentation, and sustained human guidance, yet how to build and use such workflows remains an open question. For most researchers, the challenge is not building a discovery pipeline from scratch but understanding which tools are available and how to use them effectively.

A growing body of work has begun to map this landscape, including conceptual frameworks for human-AI co-creativity (Haase & Pokutta, 2026), visions of the "augmented mathematician" (Henkel, 2025), formal-proof assistants (Yang et al., 2023; Song et al., 2025), and numerous first-hand accounts of AI-assisted research (Bubeck et al., 2025; Diez et al., 2025; Alexeev & Mixon, 2026; Ivanisvili & Xie, 2025; Feldman & Karbasi, 2025; Salim, 2025; Dobriban, 2025; Schmitt, 2025). Avigad (2026) makes this point especially clearly: mathematicians should not merely react to AI but should take an active role in deploying and shaping it for their own purposes. Yet none of these works provides actionable, end-to-end guidance that a researcher could follow today.

We hope to make some progress on these questions and aim to fill parts of that gap. The frameworks, approaches, and insights presented here have been developed over roughly the last one and a half years in the context of the MATH+ project Agentic AI in Mathematics [1], but apply beyond mathematics and have proven very powerful, e.g., in ML research. This also explains our choice of use cases in machine learning and mathematics. The four authors approached AI-assisted research from complementary directions: some built on existing CLI coding agents with either an experimental or a theoretical and proof-oriented focus, while others developed a custom multi-agent system from scratch. The insights gained from these diverse experiences form the basis of the unified framework we present here.
Contributions. Our contributions are as follows.

1. A practical taxonomy (Section 2). We identify five levels of AI integration into mathematical and ML research, ranging from full human control to high agent autonomy.

2. An open-source, sandboxed agentic research framework (Section 3). We present a set of methodological rules, formulated as agent prompts, which we call commandments, together with a sandboxed container environment and reporting conventions that turn general-purpose CLI coding agents into autonomous research assistants. The commandments encode the norms of scientific practice and guide the agent throughout the research workflow. The framework is model- and harness-agnostic, supports any frontier LLM through existing CLI agents (such as Claude Code (Anthropic), Codex CLI (OpenAI), or OpenCode (Anomaly)), and can be set up within minutes.

3. Case studies (Section 4). We demonstrate the framework in action across diverse domains, including deep learning as well as pure and applied mathematics, illustrating both successes and failure modes. We provide screenshots of the agent's reports as they were produced.

[1] https://iol.zib.de/project/agentmath.html

We want to emphasize what this paper is not: we do not claim that AI replaces research creativity, insight, or the researcher. Rather, we demonstrate that specific parts of the research workflow can be significantly accelerated when a researcher directs an AI agent in a structured way. Unlike approaches that seemingly remove the human from the research process entirely (cf., e.g., Lu et al., 2024), our framework keeps the researcher as the principal investigator, who can now operate at greater scale and speed. We believe that mathematical research is not a fully automatable task, and we will not speculate on whether this will change in the future.
What we do claim is that mathematicians, and researchers in general, should take an active role in this partial transformation of the field and, echoing Avigad (2026), should own the technology.

The rest of this paper is organized as follows. Section 2 presents our taxonomy of integration levels. Section 3 describes the agentic research framework in detail, the core contribution of this paper. Section 4 presents case studies, and Section 5 concludes with lessons learned, limitations, and future directions. We defer the survey of related work to Section 6 at the end of the paper.

2 Levels of AI Integration in Mathematical and ML Research

Inspired by Haase & Pokutta (2026), we propose a taxonomy of five levels that characterize how deeply AI is integrated into the research process, ranging from no AI involvement to fully autonomous research loops. These levels are not mutually exclusive, and a researcher might use different levels for different tasks, all within the same project. In particular, even (fully) autonomous systems can delegate subtasks to less autonomous components; this regularly happens in our setup when subagents are spawned to accomplish subtasks. In general, the key lies in recognizing which level is appropriate for which task. Table 1 summarizes the taxonomy, and we describe each level in detail below.

Level 0: Classical. The classical level is the baseline of our taxonomy and the traditional mode of mathematical and ML research. The researcher uses all traditional computational tools, including typesetting software (e.g., LaTeX), mathematical software (e.g., Mathematica, MATLAB), and programming languages for custom implementations (e.g., Python, Julia, PyTorch), but no AI assistance. This remains the predominant mode of research and is perfectly appropriate.
The goal of this paper is not to argue that AI should render it obsolete, but to show when and how AI can complement it.

Table 1: Five levels of AI integration in mathematical research. Each (not necessarily mutually exclusive) level represents a qualitatively different trade-off between agent autonomy and human involvement.

Level | Name | Tools | AI Tasks | Human Role
0 | Classical | LaTeX, math. software | No AI integration | Everything
1 | Consultant | LLM chatbots | Targeted queries for explanation, literature, brainstorming | Asks, evaluates
2 | Typist | Editor plugins (Copilot, Cursor) | Code and text generation without execution | Thinks, reviews, decides
3 | Collaborator | CLI coding agents | Human describes task, AI implements and iterates | Reviews each output, assigns next task
4 | Research Assoc. | Our framework | Autonomous experiment loop following structured research plan | Steers, audits

Level 1: AI as Consultant. The researcher uses LLM-based chatbots (e.g., ChatGPT, Claude, Gemini) for specific queries and assistance. Typical cases include concept explanation (Explain the difference between strong and weak duality in linear programming), literature search (What are the current best convergence rates for SGD with heavy-tailed noise?), brainstorming (What techniques exist for proving convergence of iterative algorithms when the operator is only approximately contractive?), and debugging ideas (Here is my proof attempt. Where does the argument break down?). The core intellectual work remains with the researcher; the AI provides targeted assistance. The key skill is asking the right questions and crafting sufficiently detailed prompts to guide the AI toward a useful answer. A clear limitation is that the interaction is stateless across sessions unless the user manually provides context.

Getting started: A web browser and access to an LLM chatbot (free tiers available from most providers). No setup required.

Level 2: AI as Typist.
The researcher uses AI for code and text generation, ranging from tab completion (e.g., GitHub Copilot predicting the next line) to more complex prompt-based generation that produces entire functions or LaTeX paragraphs from a natural-language description. Every output is reviewed by the researcher and accepted, edited, or rejected. The defining characteristic of this level is that the AI generates code or text but neither executes nor iterates on the results. The researcher remains responsible for all design decisions, and the AI accelerates the writing process without closing the loop between implementation and evaluation.

Getting started: Install a code editor plugin (e.g., Cursor, or VS Code with GitHub Copilot).

Level 3: AI as Collaborator. The full implementation and execution are delegated to a CLI coding agent, i.e., a terminal-based tool (e.g., Claude Code (Anthropic), OpenCode (Anomaly), Codex CLI (OpenAI)) that can read and edit files, execute shell commands, and iterate on results within a persistent project context. This differs qualitatively from Levels 1–2 because the agent possesses a much broader set of capabilities, including file modifications, code execution, and iteration based on results it has obtained, all within a single conversation. For a prompt like "Implement the Frank-Wolfe algorithm for the semidefinite relaxation of max-cut, with step size γ_t = 2/(t + 2)" or "Implement a learning rate scheduler with linear warmup," the agent reads the codebase, implements the algorithm, runs it, and re-evaluates if convergence shows unexpected behavior. The researcher describes each task in natural language and provides the necessary context, such as an existing codebase. After each completed task, the researcher reviews the output, decides what to do next, and assigns the next task; the agent handles how. At no point does the agent independently set the research direction.
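For illustration, the second prompt above, asking for a learning rate scheduler with linear warmup, might yield code along the following lines (a minimal sketch; the function name and the hold-constant behavior after warmup are our own assumptions, not prescribed by the framework):

```python
def warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linearly ramp the learning rate up to base_lr over warmup_steps,
    then hold it constant (real schedules typically decay afterwards)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# One value per optimizer step; the agent would wire this into the training loop.
schedule = [warmup_lr(s, base_lr=0.1, warmup_steps=4) for s in range(6)]
```

At Level 3, the agent would not stop here: it would run the training loop, inspect the loss curve, and iterate if convergence misbehaves, while the researcher reviews the result and assigns the next task.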
Getting started: Install a CLI coding agent and start a session in the project directory.

Level 4: AI as Research Associate. The highest degree of autonomy in our taxonomy. The researcher arrives with a research idea (initial intuitions, failed strategies, partial results, or simply a well-posed question) and outlines a research plan: goals, metrics, constraints, approaches already tried, and promising directions to explore. The agent then formulates a detailed plan and autonomously executes an experiment loop: formalizing mathematical ideas, implementing approaches, running evaluations, recording results, analyzing outcomes, and updating both a structured research report and a TODO.md. It iterates this loop, continuously refining and expanding the plan, operating for hours to days to achieve the research goal or uncover something unexpected. To operate for extended periods, structured and clear instructions that govern scientific rigor, documentation, and verification are needed: our framework (Section 3) provides exactly these. The key difference from Level 3 is that the agent does not wait for human input between experiments but follows a research plan and a set of commandments encoding the norms of good scientific practice: one variable per experiment, structured reporting, staged evaluation (from quick sanity checks to full benchmarks), and verification protocols, among others (cf. Section 3). Intermittent human review and course correction are an integral part of Level 4, not a fallback to Level 3: the researcher periodically inspects the report, adjusts priorities, and refines the research plan while the agent continues to execute autonomously. The researcher's role shifts from execution to direction-setting, periodic review, and evaluation. Level 4 is most appropriate when the search space is large.
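The staged evaluation mentioned above can be pictured, for a computational project, roughly as follows (a hypothetical sketch: the model, data, and metric are toy stand-ins of our own invention, not framework code):

```python
import random

def evaluate(model, samples):
    """Toy metric: fraction of samples the model labels correctly."""
    return sum(model(x) == (x % 2) for x in samples) / len(samples)

def run_tiers(model, full_data):
    # Tier 1 (seconds): does it run at all on a single input?
    try:
        model(full_data[0])
    except Exception as e:
        return {"tier1": f"crash: {e}"}
    # Tier 2 (minutes): any signal on a small subset? Catches bugs only;
    # no conclusions are drawn from this number.
    subset = random.sample(full_data, k=min(32, len(full_data)))
    tier2 = evaluate(model, subset)
    # Tier 3: the full evaluation whose result goes into the report.
    tier3 = evaluate(model, full_data)
    return {"tier1": "ok", "tier2": tier2, "tier3": tier3}

def parity_model(x):
    return x % 2  # toy "model" that is always right

results = run_tiers(parity_model, list(range(1000)))
```

Only a Tier 3 result would be recorded as an experiment outcome; Tiers 1 and 2 exist to fail fast and cheaply.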
[Figure 2 diagram. Top row, the three input categories with conceptual role and concrete realization: Research Question (problem formulation, hypotheses, objectives, evaluation criteria); Tools, Methods & Data (software stack & packages, datasets, compute resources, custom scripts); Prior Work & Domain Knowledge (existing codebase, LaTeX notes & derivations, references, preliminary results). Bottom row, two instantiations: Case Study A, Deep Learning: improve LLM pretraining by exploiting Muon's memory savings over AdamW, using PyTorch, CUDA, uv, the FineWeb dataset, multi-GPU allocation, and an LLM pretraining benchmark codebase; Case Study D, Mathematics: prove lower bounds for Frank-Wolfe on uniformly convex sets, using Python, Julia, uv, and two recent lower-bound proofs for the strongly convex case as references.]

Figure 2: Setting up a research project. Top: the three categories of input the researcher provides, with their conceptual role (dark) and concrete realization (light). Bottom: two examples from our case studies: a deep learning project (Section 4.1) and a mathematics project (Section 4.4).

Despite the guardrails described in Section 3, limitations remain. The agent may pursue an unproductive direction for too long, especially when the research plan lacks sufficient detail. Verification is only partially solved: while we provide strategies for symbolic and numerical verification of mathematical claims and implementations, a high (to full) degree of certainty requires the researcher to perform a rigorous review of the work. We consider this a feature, not a bug. Similarly, while the agent is instructed to search the literature, it cannot guarantee that its ideas are genuinely novel. Thorough knowledge of the related work remains the researcher's responsibility.
As such, the researcher still faces a non-trivial amount of work both throughout and toward the end of a project: reviewing intermediate results and providing steering, verifying correctness, deciding what results merit publication, and confirming originality as well as adding context and interpretation. However, instead of conducting the entire research process alone, the researcher now externalizes parts of the work to a capable research associate who delivers a structured, well-documented report. This report then requires careful and rigorous review with subsequent steering and guidance. Through repeated interactions of this kind, new results emerge in a process of Human-AI co-creation.

Getting started: Clone the project repository (github.com/ZIB-IOL/The-Agentic-Researcher) and follow the setup instructions in the README.md. The setup takes a couple of minutes, and the first autonomous experiments can begin immediately. A detailed description of the framework initialization is given in Section 3.1.

3 The Agentic Research Framework

We describe our core contribution: the agentic research framework, its design principles, and the ten commandments, distilled from our own experience, that guide the agent's behavior. The instructions described in the following subsections are provided to the agent through a persistent instruction file (INSTRUCTIONS.md) that is read at the start of every session. This configuration file contains universal instructions as well as a final section that serves as a template placeholder for project-specific instructions; these are automatically filled in by the agent once the researcher provides the research instructions.

3.1 Overview and Workflow

To start a new project, the researcher provides three things (Figure 2): a research question (problem formulation, hypotheses, evaluation criteria), the tools, methods, and data needed to investigate it (software stack, packages, datasets, compute resources), and any prior work or domain knowledge that should inform the investigation (existing codebase, LaTeX notes with derivations, references, preliminary results). In the following, we will use the term experiment to refer to one (broad) agentic iteration loop with the researcher: depending on the context, this can be one proof attempt, an actual computational experiment, or the design of a new algorithm. The framework is built around CLI coding agents, e.g., Claude Code (Anthropic), Codex CLI (OpenAI), Gemini CLI (Google), or OpenCode (Anomaly), which operate inside a sandboxed container that provides a secure, isolated workspace.

Starting a new project. The typical workflow is as follows:

1. The researcher begins in a project directory that contains the practical-layer materials described above (Figure 2). From this directory, they launch the sandbox and provide the research instructions to the agent. The more detailed the instructions, the better; we found it especially useful to provide a working codebase if one exists, along with a LaTeX write-up of the research problem and previously tried approaches.
2. The agent asks clarifying questions about scope, constraints, and evaluation metrics.
3. After this back-and-forth, the agent explores all relevant files and writes the final project-specific instructions into a persistent instruction file (INSTRUCTIONS.md), alongside the universal commandments that are already in place (Section 3.2).
4. The agent creates a plan and initializes report.tex and TODO.md, the two main artifacts of the research process.
Upon approval by the researcher (or after further refinement of the plan), the agent begins autonomous execution and only requires human intervention in case of unexpected behavior or when the research plan needs adjustment.

Why CLI agents. Across our research workflows, three practical requirements arose repeatedly. CLI agents are easy to use: they fit naturally into local working environments, can be launched inside an existing project, and operate directly on local files without additional infrastructure. They remain fully interactive: the researcher can intervene at any point to inspect progress, redirect the investigation, stop execution, or restart with revised instructions. Finally, they are extensible: the toolchain can be readily extended with custom utilities; in our case, this included scripts for handling literature and LaTeX sources, extracting relevant algorithmic sections, and running specialized search and verification routines. The same mechanism also supports hard guardrails: automated checks can be triggered after file edits or experiment runs, enforcing formatting, running tests, or updating reports. Because CLI agents are maintained by model providers and evolve with model capabilities, while our rules sit on top, the framework automatically benefits from improvements to the underlying tools. Figure 1 shows an autonomous session in practice.

Infrastructure. Because the framework is built around CLI agents, the surrounding infrastructure can remain intentionally minimal. The sandbox confines all actions to a container, enabling unattended sessions without the risk of damaging the host system. For compute-intensive projects, a multi-node launcher dispatches independent experiments to remote Slurm nodes. We recommend using reproducible, project-local package managers (uv for Python, Julia's Pkg, among others).

Structured reporting and experiment tracking.
All experimental progress is recorded in a single LaTeX file (report.tex) that accumulates experiments, derivations, and analysis, complemented by a TODO.md checklist for open questions, unverified claims, and deferred work. Each experiment subsection must contain the following fields, enforced by the commandments (Section 3.2):

Listing 1: Required fields for each experiment in report.tex.
\paragraph{Goal} What problem are we solving?
\paragraph{Hypothesis} Why should this approach work?
\paragraph{Method} Mathematical formulation with proper notation.
\paragraph{Implementation} Files and lines changed.
\paragraph{Results} Table with method, model/instance, metric, delta.
\paragraph{Analysis} Why it worked or didn't. What it reveals.
\paragraph{Next Steps} What to try based on these results.

Rather than introducing a separate experiment-tracking system, we use Git directly. Each experiment is recorded as a commit with a structured message of the form exp(EXXX): <description> -- <metric>=<delta>. Branches group related experiments, tags mark important outcomes, and Git's worktree feature allows multiple agent sessions to run concurrently on separate copies of the codebase without interference. This keeps the full experimental history lightweight, portable, and directly searchable through Git logs.

[Figure 3 diagram: the researcher writes a persistent instruction file ({CLAUDE,GEMINI,AGENTS}.md) that prompts and governs the CLI agent; the agent runs in a sandbox (Python, LaTeX, Git, GPU) and reports back. Each experiment follows the loop Explore → Plan → Implement → Evaluate → Analyze → Record → Commit → Iterate, with the Ten Commandments (I-X) attached to the corresponding steps and results flowing into the Git history and report.tex/TODO.md.]

Figure 3: Overview of the agentic research framework. Top: The researcher writes a persistent instruction file that governs the CLI agent operating within a sandboxed environment. Bottom: Each experiment follows an eight-step loop. All steps are governed by the Ten Commandments in Section 3.2.

Once running, each experiment follows the eight-step loop shown in Figure 3: Explore → Plan → Implement → Evaluate → Analyze → Record → Commit → Iterate. At the beginning of every session (or after a context window reset), the agent re-reads report.tex, TODO.md, and the git log to restore continuity.

3.2 The Ten Commandments

At the core of our framework are the lessons we distilled through experimentation into ten commandments that apply independently of the specific domain and research problem. They form a major part of the instructions given to the agent. The full instructions are available in our repository. In deriving the ten commandments through continuous improvement of the agent's behavior, we followed three guiding principles: (1) explicit over implicit: language models follow instructions literally; implicit expectations ("obviously you should record your results") are reliably violated, so every important behavior must be stated as a rule; (2) falsifiable over aspirational: "be rigorous" is not a commandment, "change exactly one variable per experiment" is, allowing both human and agent to assess compliance; (3) failure-driven over theory-driven: every commandment exists because we observed a specific failure mode in practice, not because it seemed theoretically desirable.

The commandments are grouped into categories, each addressing a specific aspect of the research process. Below, we state each rule and describe the failure mode it addresses. We present slightly shortened versions for brevity; the full prompts are available on GitHub. At the implementation level, each commandment is a prompt-engineering directive; we found that naming and structuring these behaviors as explicit rules makes them significantly easier to maintain, debug, and iterate on.
3.2.1 Integrity and Trust

The following three commandments address the integrity of the agent's promises and announced actions.

I. Never Break a Promise
If you say "I will do X," do it. Under-promise, over-deliver.
Failure mode: In early experiments, the agent frequently stated intentions ("I will now run the full evaluation") and then skipped steps or moved on to different tasks. After adding the commandment, the agent either follows through on all stated tasks or states upfront which tasks will be deferred and why.

II. Never Manipulate Evaluation
Do not change metrics, test sets, fixed hyperparameters, or problem definitions. Do not hard-code results or cherry-pick seeds.
Failure mode: The agent subtly changes evaluation conditions to make results look better. The LLM may adjust evaluation parameters "helpfully" to reach its goal, but this is not a genuine improvement. For instance, the agent changed the number of evaluation samples to "speed up evaluation", which happened to produce better metrics and created an unfair advantage over baseline methods.

III. Never Fabricate Citations
Every bibliography entry must be verified against the actual source before adding it. Search for the paper via web search. Confirm the exact title, full author list, year, venue, and identifier from the source. If you cannot find the paper, do not guess. Never write a citation from memory alone.
Failure mode: This commandment addresses a well-known limitation of these systems: they hallucinate plausible but incorrect bibliographic entries.

3.2.2 Autonomy and Efficiency

A major problem we encountered was that, despite having a long todo-list of potential tasks and experiments, the agent consistently stopped to ask whether it should continue. The following two commandments aim at maximizing productive work within each session.

IV. Complete All Autonomous Work Before Reporting
Finish every task that does not need user input. Report once with all results. Never skip work because you estimate it "takes too long to implement".
Failure mode: The agent frequently stops to ask whether it should continue, even when the research plan specifies many more experiments that could be executed without additional input from the researcher. A related failure mode is that the agent often discards approaches because they "would take too long to implement" and potentially "only have modest impact". Modest impact aside, agents drastically underestimate their own coding speed; in fact, the implementation typically takes less than a minute. The only valid time concern is actual compute runtime measured in days.

V. Make It Work Before Moving On
An experiment crash is a bug, not a bad idea. Do not discard methods because of implementation failures. Investigate, fix, and re-run.
Failure mode: When encountering an implementation failure, agents often claim that the approach "doesn't work" and move on to an alternative. In practice, however, most of these crashes are simple bugs that can be fixed easily. For instance, when hitting an out-of-memory error, the agent concluded that the method "doesn't scale". Upon further investigation, it found an unnecessary materialization of a memory-intensive matrix, replaced it, and the method ran successfully, yielding significant improvements over the baseline.

3.2.3 Scientific Rigor

The following three commandments ensure that the agent follows the norms of scientific practice.

VI. One Variable per Experiment
Change exactly one thing per experiment. If two things change and the metric improves, you cannot know which helped.
Failure mode: If one experiment is successful and the agent has an idea for further improvement, it is often tempted to combine both the successful change and the new idea simultaneously in the next experiment.
This makes it impossible to determine which change caused the improvement.

VII. Evaluate in Tiers
Tier 1 (seconds): does it run without crashing? Tier 2 (minutes): any signal on a small subset? Tier 3: full evaluation, i.e., the real metric that goes into the report. Use small-scale runs to catch bugs only. Never draw conclusions from small-scale results.
Failure mode: We want the agent to iterate quickly and distinguish between trivial and meaningful improvements. Consequently, we enforce that the agent (a) does not run a full, potentially expensive evaluation after every minor code change, and (b) does not discard ideas based on unsuccessful small-scale runs on toy problem instances.

VIII. Bound Your Expectations
Before implementing a heuristic, identify the theoretical best case, even if it is not realizable in practice. If you are “correcting” something, measure how much correction is theoretically possible.
Failure mode: To decide whether a method is successful, it is crucial to understand a theoretical upper bound on the possible improvement. The agent often observes a small improvement and reports it as a success, without assessing proximity to the theoretical maximum.

3.2.4 DOCUMENTATION AND REPRODUCIBILITY

The following two commandments ensure that the agent documents its work reproducibly. This is one of the most important categories, as it enables restarting the research process from any given point.

IX. Record Everything
Every experiment gets a subsection in the report: goal, hypothesis, method, results table, analysis, next steps. Include failures. If it is not in the report, it did not happen. Visualize, don’t just describe: create plots for distributions, comparisons, and scaling. Maintain TODO.md as a living checklist for open questions, unverified claims, and deferred work.
Failure mode: Without the rule, the agent runs experiments, observes results, and keeps them in its context window.
As soon as this context window is compacted or cleared, the information is lost. At the same time, the strict rule “if it is not in the report, it did not happen” ensures that the agent does not mistakenly believe it has already obtained a result that was never recorded. Apart from the report, which we save as a LaTeX document, we also maintain a TODO.md file, which is equally critical, as it prevents the agent from forgetting about open questions, unverified claims, and deferred work.

X. Verify Before Claiming
Assume you are wrong until verified. Write verification scripts, not just explanations. Actively try to falsify your own claims, test edge cases, randomize inputs, search for counterexamples. Grade claims: verified, partially verified, or unverified.
Failure mode: Mathematical verification remains a major challenge for LLMs. We observed significant improvements when enforcing at least numerical verification of claims. For instance, the agent derives a formula whose derivation contains an error (e.g., a missing factor of two), but the results look plausible. A verification script that checks the formula against a brute-force computation on small instances catches this immediately and prevents the agent from continuing its argument on a false premise. This active falsification, i.e., the process of deliberately trying to break your own hypothesis before confirming it, often reveals the key structural insight that makes the proof work.

3.3 DOMAIN-SPECIFIC COMMANDMENTS

The ten commandments presented above are intended to be universal. In addition, we found it beneficial to provide domain-specific commandments tailored to the research style of the domain, whether primarily theoretical or empirical. Beyond these broad categories, further specialization is useful: for instance, research in a specific subfield of mathematics benefits from commandments tailored to its particular challenges.
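As a concrete instance of the verify-before-claiming pattern of Commandment X, the following minimal sketch checks a closed-form formula against brute force on small instances; the formula and the planted “missing factor” error are illustrative stand-ins for an agent’s derivation, not taken from any session:

```python
# Sketch of Commandment X: verify a derived formula against brute force
# on small instances before building on it.  Formula and bug are illustrative.
def brute_force(n):
    # Ground truth by direct summation: 1^2 + 2^2 + ... + n^2.
    return sum(k * k for k in range(1, n + 1))

def derived(n):
    # Correct closed form: n(n+1)(2n+1)/6.
    return n * (n + 1) * (2 * n + 1) // 6

def derived_buggy(n):
    # The same derivation with a dropped factor of two (divides by 12).
    return n * (n + 1) * (2 * n + 1) // 12

# The verified claim passes on every small instance.
for n in range(1, 50):
    assert derived(n) == brute_force(n)

# The buggy variant looks plausible at a glance but fails immediately,
# which is exactly what a verification script is meant to catch.
assert any(derived_buggy(n) != brute_force(n) for n in range(1, 50))
```

The same pattern scales to more serious claims: replace the summation with a brute-force enumeration of the object the formula is supposed to describe.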
Domain: Compute-Intensive Research. For empirical projects involving GPU experiments, deep learning, or large-scale numerical simulations, we apply the following additional commandments:

• One experiment per GPU; use them all (C1). Check nvidia-smi before every batch of work. Assign each independent experiment to its own GPU. Never leave GPUs idle when independent tasks remain.
• Context window hygiene (C2). Prefer redirecting long-running output to log files and monitoring with tail. Only investigate logs in detail if something looks wrong.
• Memory management (C3). When observing out-of-memory (OOM) errors, do not conclude that the method “does not scale”. Instead, systematically reduce memory: clear the GPU cache between experiments (torch.cuda.empty_cache()), enable gradient checkpointing, or process layers sequentially instead of in parallel. Print torch.cuda.memory_summary() to identify the allocation that causes the spike. Only after these mitigations fail is it valid to report a genuine scaling limitation.
• Discover nodes first; dispatch independent experiments (C4). When a multi-node Slurm allocation is active, discover available nodes at session startup and dispatch independent experiments to remote nodes via remote-run. Each dispatched job runs in its own container on the target node with full GPU access. Never dispatch dependent work: only experiments that are fully independent may run on remote nodes.

Domain: Mathematical Research. For theory-heavy projects involving proofs and derivations, we apply the following additional commandments:

• Derivations before code (M1). Write derivations step-by-step before implementing. Cross-reference paper equations. Before implementing a new method, search for prior work to flag potential rediscovery.
• Precise notation (M2). Use precise index notation (G_jj, not G_j, for diagonal elements of a matrix). Define all notation before first use: dimensions, ranges, scalar vs. vector vs. matrix. Apply the same rigor to negative results as to positive ones.
• Counterexample-first reasoning (M3). Before attempting a proof, actively search for counterexamples: randomize inputs, test boundary cases, enumerate small instances exhaustively. If a counterexample exists, the search finds it faster than a failed proof attempt reveals the obstruction. If no counterexample survives, the search often exposes the structural property that makes the proof work.

4 CASE STUDIES

We present case studies demonstrating the framework across different research domains and integration levels. The first three (A–C) deal with LLM-related research questions: pretraining, pruning, and quantization. The remaining three (D–F) concern mathematical research: convex optimization, combinatorial optimization, and algebraic geometry. Each case study follows a consistent structure: domain, problem, what the agent did, results, and lessons learned. Throughout, we include figures, screenshots, and excerpts from the agent’s reports as they were produced (indicated by a thin border); minor errors or rendering artifacts are preserved and marked with [sic] where appropriate.

4.1 SYSTEMATIC OPTIMIZER EXPLORATION FOR LLM PRETRAINING

This case study demonstrates the framework’s core experimental loop on a computationally intensive deep learning task: systematic, single-variable experimentation across a non-trivial optimizer design space, with multiple GPUs running independent experiments in parallel.

Domain and problem. AdamW (Kingma & Ba, 2014; Loshchilov & Hutter, 2017) has long been the dominant optimizer for language model pretraining. It maintains two buffers per parameter (first and second moments), requiring 2N additional memory compared to vanilla Stochastic Gradient Descent (SGD), where N is the number of parameters. The Muon optimizer (Jordan et al.
, 2024) takes a fundamentally different approach: instead of adaptive step sizes, it computes a momentum buffer M_t = μ M_{t−1} + G_t and then applies Newton-Schulz (NS) orthogonalization to approximate U V^⊤ from the Singular Value Decomposition (SVD) of the momentum buffer M_t = U Σ V^⊤, so that W_{t+1} = W_t − η · NS(M_t). This operation equalizes all singular values of the update and achieves strong results on LLM pretraining while using only N additional memory units (one momentum buffer) compared to SGD, half of AdamW’s 2N. A natural question arises: can the spare N memory budget be exploited to make Muon better? The agent was given this open-ended research question, the codebase of Semenov et al. (2025) as a standardized LLM pretraining benchmark (124M-parameter Llama on FineWeb, 10,000 iterations), and a multi-GPU compute allocation.

What the agent did. After establishing baselines (Muon, AdamW), the agent explored modifications to the Muon update rule, changing exactly one variable per experiment (Commandment VI). The central insight was that Muon converges faster when the matrix it orthogonalizes is well-conditioned: normalizing the momentum buffer before orthogonalization means the same number of Newton-Schulz iterations yields a better update. The agent tested multiple normalization strategies, swept hyperparameters one at a time, and discovered two independent improvements: (1) a normalization technique applied before orthogonalization, and (2) the addition of weight decay to Muon’s matrix parameters. Weight decay is a standard regularization technique and its benefit is not surprising in itself; however, the reference codebase implemented Muon without it, and because the agent tested each modification in isolation (Commandment VI), it was able to quantify this contribution separately and still identify the normalization improvement on top of it.
A zero-overhead variant requiring no extra buffer was found to achieve nearly identical results. Following Commandment IX, each of the more than 40 experiments was documented in the agent’s report.tex with goal, hypothesis, method, results table, and analysis.

The agent also identified several independent papers exploring normalization in the context of Muon: NorMuon (Li et al., 2025), AdaMuon (Si et al., 2025), and Muon+ (Zhang et al., 2026), each proposing a different normalization strategy. It implemented two of these methods in its codebase and ran a detailed comparison, analyzing the theoretical and empirical differences between the approaches (Commandment V). The existence of multiple concurrent works exploring the same design space underscores the need to carefully characterize how the agent’s approach relates to and differs from each of them. While the agent conducted thorough literature searches, we cannot guarantee that its specific combination of modifications is truly novel. Accordingly, we keep the presentation at a high level and view these results primarily as initial directions to build on: the experiments are limited to a single architecture and dataset, and a full comparison across model scales, training setups, and concurrent methods would be necessary to draw any definitive conclusions. A standalone publication would further require a more in-depth prior-art investigation to establish precisely which aspects, if any, are new.

[Figure: bar chart of final validation perplexity (lower is better). AdamW 36.254; Muon baseline 35.128; row-norm (lr=0.01) 34.075 (3.0%); pre-NS (lr=0.01) 34.018 (3.2%); pre-NS + wd=0.1 33.705 (4.1%); row-norm + wd=0.1 33.698 (4.1%); row-norm + wd=0.05 33.427 (4.8%); pre-NS + wd=0.03 33.423 (4.9%); pre-NS + wd=0.05 33.352 (5.1%).]
Figure 4: Final validation perplexity [sic] from the agent’s report in Section 4.1. Lower is better.
The dashed line marks the Muon baseline; the agent’s modifications achieve ∼5% improvement over Muon and ∼8% over AdamW.

[Figure: validation perplexity vs. training iteration for Muon, AdamW, NewMuon (pre-NS), NewMuon (best), and NewMuon (row-norm); panel (a) full training, panel (b) final 3,000 iterations, zoomed (Muon 35.13 vs. best 33.35).]
Figure 5: Training curves [sic] from the agent’s report in Section 4.1. Left: full training run. Right: final 3,000 iterations (zoomed). The agent’s optimizer modifications consistently outperform both Muon and AdamW baselines throughout training, not only in the final iterations. Note that here, the agent named the new method NewMuon, which is inconsistent with the naming in Figure 4.

Results. Across more than 40 experiments documented in the agent’s report.tex, the best configuration achieved a ∼5% improvement in validation perplexity over Muon (and ∼8% over AdamW) at the same 2N memory budget as AdamW (Figure 4). The two improvements are nearly additive: normalization alone provides ∼3%, weight decay alone ∼2%, and the combination ∼5% (Figure 5). The zero-overhead variant achieves ∼4.8% improvement at the same N memory footprint as baseline Muon, within a fraction of a perplexity point of the full method. Results were replicated across random seeds and a broader hyperparameter sweep.

Lessons learned. The one-variable-at-a-time commandment (Commandment VI) was critical in this design space: the agent discovered that normalization and weight decay provide independent, nearly additive improvements only because it tested each in isolation before combining them. A 2×2 factorial ablation (normalization × weight decay) confirmed the near-additivity, which would have been obscured by testing them jointly from the start.
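The additivity check behind such a factorial ablation is a small computation once the individual runs exist. A minimal sketch, using the rough percentages from the report (∼3% normalization, ∼2% weight decay, ∼5% combined) as illustrative inputs rather than re-measured values:

```python
# Hypothetical 2x2 factorial additivity check (Commandment VI).
# Percentages mirror the rough numbers in the report but are illustrative;
# with these constructed inputs the interaction vanishes exactly, whereas
# real measurements would yield a small but nonzero interaction term.
baseline  = 35.13                     # Muon validation perplexity (report)
norm_only = baseline * (1 - 0.031)    # normalization alone, ~3%
wd_only   = baseline * (1 - 0.020)    # weight decay alone, ~2%
combined  = baseline * (1 - 0.051)    # both changes together, ~5%

effect_norm = baseline - norm_only
effect_wd   = baseline - wd_only
# Interaction: deviation of the combined run from pure additivity.
interaction = (baseline - combined) - (effect_norm + effect_wd)

# Near-zero interaction indicates the two changes are (almost) independent.
assert abs(interaction) < 0.05
```

A large interaction term, by contrast, would signal that the two modifications overlap and must be studied jointly.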
An interesting aspect of the agent’s research behavior is that, while the task explicitly granted an extra N memory budget, the agent proactively explored whether the same gains could be achieved without it, and found a zero-overhead variant that nearly matched the full method at the same N memory footprint as baseline Muon. The entire session ran for over twenty hours without human intervention. With multiple GPUs available, the agent ran independent experiments in parallel (one per GPU, Commandment C1); the framework’s multi-node dispatch capability (Section 3.3) enables large-scale concurrent experiments across compute nodes. Despite the long wall-clock time, actual token consumption remained modest: most time was spent waiting for training runs to finish while the agent redirected output to log files and monitored progress with lightweight commands (Figure 1), as encouraged by Commandment C2. The framework’s emphasis on literature verification (Commandment III) prompted the agent to proactively search for related work, identify concurrent papers, and implement their methods for comparison. While this is a useful first step, the limitations noted above show that such automated searches are not a substitute for the thorough prior-art investigation a human researcher would conduct before claiming novelty or asserting that the resulting method truly outperforms concurrent approaches.

4.2 WEIGHT RECONSTRUCTION IN LARGE LANGUAGE MODEL PRUNING

This case study illustrates a characteristic side effect of the agentic framework we propose: the agent was tasked with one research objective and discovered a different, more effective technique along the way (i.e., we observed serendipity).

Domain and problem.
Pruning large language models (LLMs) reduces memory and compute costs by zeroing out weights, i.e., selecting a binary sparsity mask M ∈ {0,1}^{d_out×d_in} per weight matrix (cf., e.g., Zimmer et al., 2023a; Frantar & Alistarh, 2023; Sun et al., 2024). The constraints on M determine the sparsity pattern and, with it, the potential for hardware acceleration: unstructured sparsity removes arbitrary individual weights (Han et al., 2015; Zimmer et al., 2023b; 2024; 2025), while semi-structured patterns such as N:M (Mishra et al., 2021; Zhang et al., 2023; Lasby et al., 2025) impose structure that is more amenable to hardware acceleration. The core challenge across all settings is mask selection: choosing which weights to zero out so that the pruned network’s output remains close to the original (Roux et al., 2025; Zimmer et al., 2026). Once a mask is fixed, the pruned model’s performance degrades compared to the dense original; one way to counteract this is weight reconstruction, i.e., adjusting the surviving weights to compensate for the removed connections (Frantar & Alistarh, 2023). Calibration data is drawn from C4 (Raffel et al., 2020); quality is measured by perplexity on the WikiText (Merity et al., 2016) test set (lower is better).

The project started with a concrete task: we had developed a pruning approach that aimed to find better masks, but it produced inconsistent results, sometimes failing catastrophically. The agent was provided with an existing codebase containing implementations of several pruning methods and the LaTeX derivation of our approach, and instructed to analyze why it failed, fix or replace the method, and empirically beat a set of baselines (Sun et al., 2024; Zhang et al., 2023) at 60% sparsity.

What the agent did. The agent first established that the existing approach was mathematically flawed and could not be repaired.
While analyzing why it failed, the agent studied how pruning distorts the post-layer activations of each weight matrix and observed a severe imbalance: some rows lose over 50% of their activation-weighted output magnitude while others lose less than 10%. This byproduct of debugging led the agent to propose a simple post-pruning weight correction that restores the activation balance across rows and columns. Following Commandment VIII, the agent first computed an oracle bound via least-squares reconstruction to determine the theoretical limit, then validated the new method through the tiered evaluation protocol (Commandment VII) across five model scales.

Results. The method consistently reduces perplexity by 18–50% across five model scales (125M to 9B parameters), three architectures (OPT, Qwen, Gemma), and two pruning methods (RIA, Wanda). It requires only 10 lines of code, adds less than 1% computational overhead, and needs no hyperparameter tuning. The oracle comparison shows that this simple heuristic captures 92% of the improvement achievable by full least-squares reconstruction, leaving little room for more sophisticated approaches. Across 27 experiments documented in the agent’s report, the improvements are robust and transfer to every model and pruning method tested. Figure 6 shows the scaling behavior across model sizes, reproduced [sic] from the agent’s report; note, for instance, that the 50% sparsity line in the left panel ends at 1.5B because the agent found the 60% setting more promising and did not complete the remaining experiments.

[Figure: left panel, “RIA+Recon Scaling Behavior Across Model Sizes”, perplexity improvement (%) vs. model size at 50% and 60% sparsity, peak 49.4%, stable ∼20% improvement; right panel, “Absolute Perplexity: RIA vs RIA+Recon at 60% Sparsity” on WikiText-2 — opt-125m 70.3 vs. 57.0, Qwen-1.5B 44.7 vs. 22.6, Qwen-3B 22.7 vs. 15.7, Qwen-7B 13.0 vs. 10.4, gemma-9B 17.3 vs. 13.9.]
Figure 6: Plots [sic] from the agent’s report for Section 4.2, produced by the agent. Left: relative perplexity improvement vs. model size. Right: absolute perplexity comparison showing that the weight reconstruction method consistently outperforms the baseline across all tested model sizes.

Lessons learned. The original task was to fix a broken pruning mask; the actual outcome was a novel weight reconstruction method. The commandments forced the agent to analyze why the approach failed rather than simply trying the next idea, and this systematic analysis led to the discovery. Computing the oracle baseline (Commandment VIII) early on established that 92% of the theoretical optimum was already achieved, preventing wasted effort on a nearly closed gap. Finally, several extensions showed no benefit on small models but 7–11% improvement at 1.5–7B scale; the tiered evaluation protocol (Commandment VII) caught this systematically.

4.3 COLUMN ORDERING IN LLM QUANTIZATION

This case study shows the framework operating as a systematic empirical researcher: given a well-defined design space, the agent mapped it comprehensively and discovered that the most important finding was not which method wins, but when and why it matters.

Domain and problem. Post-training quantization compresses a pretrained language model by representing its weights in lower precision, substantially reducing the memory footprint and enabling deployment on consumer-grade hardware. GPTQ (Frantar et al., 2023), a widely used method, processes each weight matrix W ∈ R^{d_out×d_in} column by column to minimize the layer-wise reconstruction error ∥(W − Ŵ)X∥²_F, where Ŵ denotes the quantized matrix and X ∈ R^{d_in×n} are calibration activations. Each column’s rounding error is propagated to subsequent columns via the inverse of the Hessian H = 2XX^⊤ ∈ R^{d_in×d_in}.
The order in which columns are processed affects the final quality. A post-publication variant known as “act-order”³ sorts columns by descending Hessian diagonal, with the intuition that high-sensitivity columns benefit from having more subsequent columns available for error compensation. The agent was tasked with investigating whether better orderings exist, how the effect depends on model architecture, and validating findings across model families. Calibration data is drawn from C4 (Raffel et al., 2020); quality is measured by perplexity on the WikiText (Merity et al., 2016) test set (lower is better).

What the agent did. The agent began with a mathematical analysis of why column ordering matters, then implemented and compared seven ordering strategies, first on single weight matrices, then at full model scale. Following Commandment X, it created verification scripts for all error propagation and refinement formulas before running any benchmarks (Figure 7). Cross-architecture validation (Commandment VII) across five model families (Qwen, Llama, Gemma, Mistral, Yi) revealed the central finding: the ordering effect varies by more than two orders of magnitude across architectures.

³ Commit a4c3c89, March 2023, in https://github.com/IST-DASLab/gptq.

Verification
What: GPTQ_new error propagation and refinement formulas
Method: Numeric tests on small matrices (32×64, 32×128)
Script: scripts/verify_gptq_new.py
Outcome: All 5 tests pass. No bugs found. One known approximation documented (within-chunk propagation in GPTQ_new).
Status: Complete.

Figure 7: A screenshot [sic] from the agent’s report in Section 4.3. Before running any benchmarks, the agent audited all error propagation and refinement formulas through numeric tests on small matrices (Commandment X).

Results.
Column ordering is the single most impactful improvement to GPTQ, but its magnitude is entirely architecture-dependent: it reduces perplexity by 74% on Llama-3.1-8B but only 0.1% on Gemma-2-9B at 4-bit. This finding would have been missed without systematic multi-architecture validation: on Qwen-1.5B alone, the effect is 20%, giving no indication that it ranges from 0.1% to 74% across architectures. Among the seven ordering strategies tested, alternatives that incorporate the quantization error magnitude alongside column sensitivity occasionally outperformed act-order (e.g., at 3-bit on certain architectures), but no single strategy dominated consistently across all architectures and bit widths. Nine of the 24 experiments produced negative results, each documented with the same rigor as positive ones (Commandment IX): many approaches failed because GPTQ’s error propagation via Ordinary Least Squares (OLS) already minimizes the correlations these methods would exploit. A critical implementation bug in group quantization was caught because the agent investigated a failure rather than abandoning the method (Commandment V): pre-computing scale parameters from initial instead of error-propagated weights produced catastrophic results (perplexity 437 vs. 9.22 after the fix). The agent’s report documents all 24 experiments and 11 key findings.

Lessons learned. The negative results (9 of 24 experiments) were more informative than the positive ones: each failure clarified why simpler methods work, revealing that GPTQ’s OLS-based error propagation already handles what sophisticated alternatives attempt. With four GPUs, the agent ran independent model evaluations in parallel (one per GPU, Commandment C1), efficiently covering five model families with multiple configurations each.
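The one-experiment-per-GPU pattern of Commandment C1 can be sketched in a few lines. Experiment names are hypothetical, and an echo placeholder stands in for a real evaluation command; an actual session would launch e.g. a training or evaluation script per GPU and redirect output to log files (Commandment C2):

```python
# Minimal sketch of Commandment C1: one independent experiment per GPU.
# Names and the echo placeholder are hypothetical, not from any session.
import subprocess

experiments = ["exp_a", "exp_b", "exp_c", "exp_d"]
gpus        = [0, 1, 2, 3]            # one GPU per independent experiment

procs = []
for gpu, exp in zip(gpus, experiments):
    # CUDA_VISIBLE_DEVICES pins the job to a single GPU; in a real run the
    # command would be a training/evaluation script with output redirected
    # to a per-experiment log file.
    cmd = f"CUDA_VISIBLE_DEVICES={gpu} echo launched {exp} on GPU {gpu}"
    procs.append(subprocess.Popen(cmd, shell=True))

# Wait for all runs; never leave GPUs idle while independent tasks remain.
for p in procs:
    p.wait()
```

Because the dispatched jobs are independent, failures can be retried per GPU without disturbing the other runs.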
The “Make It Work” commandment (Commandment V) prevented a false negative: group quantization initially appeared broken on Llama, but investigation revealed a subtle implementation bug whose fix turned a catastrophic failure into the best result.

4.4 TIGHT LOWER BOUNDS FOR FRANK-WOLFE ON UNIFORMLY CONVEX SETS

This case study demonstrates the framework on a problem in convex optimization, where the agent’s primary output is the proof of a new theorem. Unlike the computational and empirical case studies, the research here required sustained interaction between numerical exploration and theoretical development: the agent discovered the correct proof strategy through systematic experimentation before formalizing it.

Domain and problem. The Frank-Wolfe (FW) algorithm minimizes a smooth convex function over a convex constraint set using only a linear minimization oracle (LMO). On strongly convex sets, the known O(1/T²) upper bound was recently shown to be tight: Halbey et al. (2026) gave a lower bound for vanilla FW in dimension 2 by analyzing the dynamics of the iterates on a worst-case instance. Shortly after, Grimmer & Liu (2026) proved an information-theoretic lower bound in the high-dimensional setting for a broad class of LMO-based algorithms. For uniformly convex sets of order p > 2 (e.g., ℓ_p-balls), Kerdreux et al. (2021) established an upper bound of O(1/T^{p/(p−1)}), but no matching lower bound was known. The goal was to prove lower bounds for the uniformly convex setting based on the techniques used by Halbey et al. (2026) or Grimmer & Liu (2026).

What the agent did. The agent began by studying both existing lower-bound techniques and attempting to generalize the high-dimensional construction by Grimmer & Liu (2026) to ℓ_p-balls.
This did not succeed: the construction relies on decomposing strongly convex sets as intersections of shifted Euclidean balls, and the agent did not find a direct analogue for uniformly convex sets of order p > 2. Following Commandment IX, the agent documented this negative result and pivoted to the alternative approach of Halbey et al. (2026), which analyzes the FW iterates directly on a worst-case instance. The agent derived the FW dynamics on ℓ_p-balls in closed form and verified each component numerically (Commandment X). Experiments across multiple values of p revealed that the iterates alternate in sign and settle onto a low-dimensional curve whose shape can be characterized analytically, which suggested the right proof strategy. The agent first estimated the key constants numerically, then derived them in closed form, and finally assembled a rigorous proof for p ≥ 3 with explicit convergence rates. Each proof step was accompanied by Julia verification scripts using BigFloat arithmetic, totaling over 30 individual checks. The case p ∈ (2,3) was identified as qualitatively different: sign alternation breaks down intermittently, and the proof technique does not apply.

Results. The main result establishes a lower bound of Ω(1/T^{p/(p−1)}) for vanilla FW on p-uniformly convex sets for any p ≥ 3, matching the upper bound of Kerdreux et al. (2021) and resolving the open question for this regime. The proof provides explicit convergence constants, all verified numerically to < 0.2% relative error. The case p ∈ (2,3) remains open: numerical evidence supports the same rate, but the proof technique does not extend.

[Figure: three log-log panels of the primal gap h_t = ∥x_t − e_1∥² vs. iteration t, for p = 3 (fitted α ≈ 1.500 and 1.497, reference t^{−1.50}), p = 4 (α ≈ 1.333 and 1.329, reference t^{−1.33}), and p = 6 (α ≈ 1.200 and 1.193, reference t^{−1.20}), each with u_0 = 0.01 and R² = 1.0000.]
Figure 8: A plot [sic] from the agent’s report: log-log convergence of ∥x_t − e_1∥² for p ∈ {3, 4, 6} starting from x_0 = e_2 (blue) and from x_0 = x_0^slow(10^{−2}) (orange), where x_0^slow is the worst-case initialization from the proof and α is the fitted coefficient of t^{−α}.

Lessons learned. The correct proof strategy emerged from the agent’s numerical exploration: patterns observed in the iterates suggested the right analytical approach, and the key constants were first estimated computationally before being derived in closed form. This “conjecture from computation, then prove” loop, enabled by the framework’s emphasis on creating verification scripts alongside every mathematical claim (Commandment X), is a natural workflow for this type of problem. The failed generalization of Grimmer & Liu (2026) was equally informative: it helped us understand which parts of the proof are hard to extend to the uniformly convex setting, guiding the pivot to the successful approach. Following Commandment IX, this negative result was documented thoroughly.

4.5 MULTI-VARIABLE DUAL TIGHTENING FOR MIXED-INTEGER OPTIMIZATION

This case study demonstrates the framework in combinatorial optimization. Its main contribution is a multi-variable generalization of dual tightening, together with a prototype implementation in the Boscia solver.
The case study spans the full research cycle: deriving the result, proving it, implementing it, and evaluating it computationally.

Domain and problem. Boscia (Hendrych et al., 2025) is a Frank-Wolfe-based branch-and-bound solver for mixed-integer nonlinear optimization over polytopes (min_{x ∈ X ∩ Z^J} f(x) with f smooth convex), where X ⊆ R^n. A key pruning mechanism is dual tightening. At a relaxed solution x_t with gradient g = ∇f(x_t) and Frank-Wolfe dual gap γ(x_t) = max_{v ∈ X} ⟨g, x_t − v⟩, convexity implies that any feasible point x ∈ X with objective value at most some upper bound UB (e.g., from an incumbent) satisfies g_j(x_j − ℓ_j) ≤ RHS for each variable j, where RHS := UB − f(x_t) + γ(x_t) and ℓ_j is the lower bound of x_j. This allows variables to be fixed one at a time. The project investigated whether this extends to subsets: for a set S of variables at their lower bounds, Σ_{j∈S} g_j(x_j − ℓ_j) ≤ RHS, so when the combined gradient contribution exceeds the budget, a conflict constraint prevents all variables from simultaneously deviating from their current bounds. For binary variables, a pairwise conflict g_i + g_j > RHS implies x_i + x_j ≤ 1 (a conflict graph edge); higher-order conflicts (triples, quadruples) capture interactions that pairwise constraints miss. The goal was to derive the mathematical result, implement it as a conflict graph with constraint propagation integrated into Boscia via callbacks, and benchmark on a diverse set of Mixed-Integer Nonlinear Programming (MINLP) instances.

What the agent did. The agent started from Boscia’s existing single-variable dual tightening result (Theorem 3 of Hendrych et al. (2025)), identified the natural generalization via the convexity inequality, and formulated and proved a multi-variable dual tightening theorem with corollaries for pairwise and higher-order binary conflicts.
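The subset inequality can be checked numerically on a toy instance in the spirit of Commandment X. The sketch below is our own illustration, not Boscia code: it takes a box-constrained quadratic, an inexact relaxed solution, and the exact Frank-Wolfe gap over the box, then enumerates all binary points to confirm both the multi-variable bound and a pairwise conflict:

```python
# Toy numerical check of multi-variable dual tightening.  All numbers and
# names are illustrative; this is not the Boscia implementation.
from itertools import product

def f(x, c):
    # Smooth convex objective ||x - c||^2.
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def grad(x, c):
    return [2 * (xi - ci) for xi, ci in zip(x, c)]

c   = (-0.4, -0.3, 0.6, 1.2)
x_t = (0.0, 0.0, 0.5, 0.5)          # an (inexact) relaxed solution in [0,1]^4
g   = grad(x_t, c)

# Frank-Wolfe dual gap over the box X = [0,1]^4: max_{v in X} <g, x_t - v>,
# computed coordinate-wise since the box is separable.
gamma = sum(gi * xi for gi, xi in zip(g, x_t)) - sum(min(0.0, gi) for gi in g)

points = list(product([0, 1], repeat=4))          # all 2^n binary points
UB     = min(f(x, c) for x in points)             # incumbent value
RHS    = UB - f(x_t, c) + gamma
S      = [j for j in range(4) if x_t[j] == 0.0]   # variables at lower bound

# Every binary point that can still match the incumbent must satisfy
# sum_{j in S} g_j (x_j - l_j) <= RHS  (here l_j = 0).
for x in points:
    if f(x, c) <= UB + 1e-12:
        assert sum(g[j] * x[j] for j in S) <= RHS + 1e-12

# Pairwise conflict: g_i + g_j > RHS forbids x_i = x_j = 1 for improving points.
i, j = S[0], S[1]
if g[i] + g[j] > RHS:
    assert all(x[i] + x[j] <= 1 for x in points if f(x, c) <= UB + 1e-12)
```

On this instance the conflict is active (g_0 + g_1 exceeds the budget), so the check exercises exactly the kind of conflict-graph edge described above.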
Before implementation, the agent first tried to verify the proof both symbolically, using Symbolics.jl with 2,387 checks, and numerically, using a script that exhaustively enumerated all 2^n feasible points for small instances (487 checks). This verification caught an error in the initial derivation: the bound for the at-least set constraint had been inverted, which would have led to overly aggressive fixings for upper-bound variables. The agent then implemented a ConflictGraph data structure with constraint propagation and integrated it into Boscia via two callbacks (Figure 9), requiring no source modifications beyond fixing a pre-existing Dict type bug. A key design decision made by the agent was to derive conflicts only at the root node. Because these conflicts use the global Frank-Wolfe gap, they remain valid throughout the search tree, but are more conservative than conflicts derived locally at child nodes. The agent also explored tighter child-node conflicts, but early tests suggested that the additional overhead and numerical instability were not worth the potential gain.

Results. Across 33 instances in six problem categories (n = 12 to n = 300, 10-minute time limit), partition-constrained instances show the strongest improvement (up to 18.9% node reduction, from 127 to 103 nodes on a 48-variable instance), where partition constraints create tight cross-block coupling that the conflict graph captures. The root-only design is deliberately conservative, and most instances show 0% node reduction because the root budget is loose. However, this guarantees correctness, which is critical for an exact mixed-integer convex optimization solver, and all 33 instances produce identical optimal objectives in both modes. As expected, separable quadratic instances show no benefit because diagonal objectives create no cross-variable coupling, confirming the theoretical prediction.

Lessons learned.
This case study shows that the framework is effective for projects that combine theorem proving, verification, implementation, and experiments in a single workflow. The verification-first approach (Commandment X) was crucial for the overall correctness. It caught the inverted at-least bound bug before it entered the experiment phase. The negative results were useful as well. The lack of improvement on separable instances matched the theory, while the 26× overhead on a sparse regression instance with 150 indicator variables exposed a concrete bottleneck and pointed to straightforward fixes, including better data structures and a cap on propagated conflicts. Following Commandment IX, these outcomes were all documented in the report, which made the evaluation more transparent and more useful for guiding future improvements.

4.6 Finding Maximal Real Solutions in K_7 Power Networks

This case study shows the framework operating as a computational scientist for discovery. Starting from a published method for characterizing typical behavior, the agent reconstructed the pipeline and repurposed it for directed extremal search, discovering an improved lower bound.

4.1 Callback architecture

The conflict graph is integrated via two standard Boscia callbacks. No Boscia source modifications are required beyond the existing Dict type fix for settings.tightening (commit c8f86437b).

propagate_bounds(tree, node): Called at each node before the Frank-Wolfe solve. Propagates conflict-implied fixings from the root-derived conflict graph into node.local.bounds, rebuilds the LMO, and cleans the active set (see Section 4.2).

bnb_callback(tree, node): Called after each node is processed. At the root (node.std.id = 1): derives conflicts into the global graph and stores a gradient/iterate snapshot for re-scanning.
At non-root nodes: checks whether the incumbent improved and, if so, re-scans the root snapshot with the tighter RHS = UB_new − f(x_t^root) + τ · γ_root.

Figure 9: A screenshot [sic] from the agent's report: the callback architecture in Section 4.5. The conflict graph is integrated into Boscia via two standard callbacks, propagate_bounds (before each Frank-Wolfe solve) and bnb_callback (after each node), without modifying Boscia's source code.

Domain and problem. Electrical power grids can be modeled as networks of buses connected by transmission lines, where the physics imposes a system of polynomial equations whose real solutions correspond to feasible operating states. Solutions to these power flow equations define the operating points of the network and underpin decisions ranging from long-term planning and capital investment to day-to-day resource scheduling, market operations, and real-time stability analysis. The equations depend on tunable parameters (susceptances), which appear as coefficients in the system. This motivates a natural structural question, raised explicitly by Lindberg et al. (2020): for a fixed network topology, what is the maximum number of feasible operating states over all parameter choices? Lindberg et al. (2020) characterized the distribution of solution counts for several topologies, including K_7 (seven buses, every pair connected), using a continuation pipeline orders of magnitude faster than naive solving. However, they did not explicitly target extremal instances, i.e., those with a maximal number of real solutions. Our goal is therefore to adapt the sampling technique from Lindberg et al. (2020) to identify parameter settings that yield extremal instances.

What the agent did. The agent first reconstructed the pipeline of Lindberg et al., which was a nontrivial task.
Reproducing the published results required several rounds of refinement to align the implementation with the paper's symmetry conventions, parameterization choices, and solution-counting bookkeeping. Once this baseline was validated, the agent adapted the pipeline from sampling to extremal search. To explore the parameter space effectively, the agent combined several heuristic search strategies, including hill climbing, simulated annealing, and warm starts from the best susceptance vectors found so far. These methods were used iteratively to bias the search toward regions of parameter space with unusually large numbers of real solutions, with each successful run informing the next.

Results. Random sampling of 1.4 million parameter vectors, following the original paper's sampling protocol, found at most 120 (nontrivial) feasible states. Targeted search instead identified a parameter vector with 192 feasible states. The agent also perturbed this parameter vector to verify that the 192-solution count is not confined to an isolated parameter point, but persists in a neighborhood of parameter space. Figure 10 supports this interpretation by showing that, when only b_1, b_8, and b_9 are varied and the remaining 18 parameters are fixed, the 192-solution configuration lies in a small region with constant solution count. The maximum real solutions problem for K_7 remains open. However, adapting Lindberg et al.'s continuation pipeline for extremal search yields a substantially stronger computational lower bound.

Lessons learned.
This case study highlights the importance of verifiable intermediate artifacts: published tables and solution-count distributions were essential for checking that the reconstructed pipeline matched prior work before launching the extremal search (Commandment X). It also underscored the value of staged evaluation (Commandment VII): because individual searches can run for hours, the agent benefited from first validating correctness on cheaper checks and only then scaling up to long-running optimization runs. More broadly, the study shows that the agent need not rely on an existing codebase to begin exploration.

[Figure 10 panel: three-dimensional scatter over b_1, b_8, b_9 (each axis from −1.00 to 1.00), colored by the number of nontrivial real solutions on a discrete scale from 30 to 192.]

Figure 10: A plot [sic] from the agent's report: a three-parameter slice of the 21-dimensional K_7 susceptance space, obtained by varying b_1, b_8, and b_9 while fixing the remaining 18 parameters at the values of the best-found instance. Each point is colored by the number of nontrivial feasible operating states. Although the color map appears nearly continuous, it represents discrete solution counts and reveals a localized high-count region around the 192-solution configuration. This suggests that the best-found parameter vector lies in a small but open region of parameter space rather than at an isolated point.

5 Discussion and Conclusion

We have presented a practical framework for AI-assisted research in mathematics and machine learning, organized around a taxonomy of five integration levels, an open-source framework for working with general-purpose CLI coding agents, and case studies demonstrating this framework in practice. A central claim of this paper is that effective agentic research does not require a specialized system built from scratch.
Instead, it can be built around existing general-purpose agents, provided they are embedded in a disciplined and inspectable workflow. In our setup, the agent operates with persistent instructions, a sandboxed environment, written progress reports, TODO.md files, and a small set of methodological rules: change one variable at a time, evaluate in stages, and verify results before reporting them, among others. In practice, these additions were sufficient to extend the agent from a tool for isolated coding tasks into a useful research collaborator for exploratory and implementation-heavy work.

Our experience suggests a simple conclusion: model capability matters, but workflow design matters just as much. These systems are only useful when their outputs can be checked and their intermediate steps revisited. This keeps the researcher responsible for direction, judgment, and verification, even when substantial exploratory or technical work is delegated. At the same time, this approach does not eliminate the need for expert oversight or final verification. In our framework, however, oversight is not reserved only for the end of the process; it is built into the workflow itself. A central requirement is that the agent must be able to test, challenge, and potentially refute its own claims through staged evaluation, intermediate checks, and explicit internal validation procedures. In our experience, these internal verification mechanisms are crucial. Without them, experiments can easily become structured to simply confirm an initial hypothesis. Final expert verification remains necessary, but it is far more reliable when supported by a workflow that already produces inspectable and continuously tested intermediate results. We emphasize that the case studies and reports do not constitute finished papers that are ready for publication, but rather records of meaningful research progress.
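As a concrete illustration, the persistent instructions mentioned above typically live in a repository-scoped instruction file that the CLI agent reads at session start. The following is a minimal sketch in our own wording, not the released instruction set; file layout and phrasing are illustrative only, but every rule and artifact named is drawn from the workflow described in this paper.

```markdown
# Project instructions (read by the CLI agent at session start)

## Methodological rules
- Change one variable at a time; never bundle unrelated modifications.
- Evaluate in stages: cheap sanity checks first, long runs only after they pass.
- Verify every result (test, re-derivation, or verification script) before reporting it.
- Document negative results as thoroughly as positive ones.

## Artifacts
- Record progress and findings in report.tex.
- Track open and completed tasks in TODO.md; use it as a re-entry point.
- Route long command output to log files rather than the console.
```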
To make this approach usable by others, we release the instruction set, templates, and container definitions, with the broader goal of making AI-assisted research more systematic, reproducible, and accessible.

5.1 Limitations

Verification. A fundamental limitation of our framework, shared with other agentic systems, is result verification. Natural-language proofs remain difficult to verify and require manual inspection. While code is usually easier to check, subtle implementation errors can still invalidate conclusions. Citations must also be verified carefully, since agents may hallucinate references or bibliographic details. This is not only a technical limitation but also a matter of responsible use: researchers must invest substantial effort in verifying agent outputs, especially because such outputs may be even harder for others to assess independently. As Su (2022) argues, researchers are often the best reviewers of their own papers; likewise, we argue that they are ultimately responsible for verifying the work produced by their agents.

Context. Long experimental sessions with many runs and large outputs can exceed a model's context window and trigger compaction. Because compaction is inherently lossy, the agent may forget details from earlier in the session, revisit failed approaches, or miss important observations. Practical mitigations include routing long outputs to log files and monitoring them with tail, manually invoking compaction commands such as /compact, and relying on persistent artifacts such as report.tex and TODO.md as re-entry points and external memory. We also tested autonomous compaction, but found it to have no positive impact. Robust context management remains an open challenge.

Cost. Long autonomous sessions with frontier models can incur nontrivial API costs.
In practice, however, these costs are often relatively small, since much of the wall-clock time in Level 4 sessions is spent waiting for experiments to finish rather than generating tokens. Still, cost remains a meaningful limitation, particularly for long-running studies and large-scale evaluations.

5.2 Future Directions

Extension to other domains. While our paper focuses on the application of our framework to machine learning and mathematical research, in principle it could be applied more broadly to other disciplines, such as physics, chemistry, economics, or the social sciences. Adapting the framework to these settings would require domain-specific tools, evaluation protocols, and safety checks, but the general paradigm of iterative experimentation, artifact management, and human verification may transfer well beyond our current case studies.

More robust memory. Another important direction is improving how the system stores, retrieves, and updates information over long research sessions. Better memory mechanisms could help agents maintain continuity across experiments, avoid revisiting failed approaches, and make more effective use of prior observations. This would be especially valuable for complex projects that unfold over many iterations and generate substantial intermediate state.

Multi-user collaboration. Our setup is currently designed for a single user interacting with a single main agent. An important future direction is extending this setting to support collaboration among multiple users, multiple agents, or both. Such a setting raises new challenges in coordination, communication, provenance tracking, and conflict resolution, but it could also make agentic research workflows more effective for team-based projects.
6 Related Work

We survey three bodies of work: AI systems that produce mathematical results autonomously (Section 6.1), research on mathematicians actively using AI in their workflow (Section 6.2), and agentic frameworks for scientific discovery (Section 6.3). For broader surveys of AI for mathematics and scientific discovery, we refer to Ju & Dong (2026), Carbone (2025), and Zheng et al. (2025b).

6.1 AI Generating Mathematics

Competition-level mathematics. In recent years, progress in AI mathematical reasoning has been especially visible in competition-level mathematics, where performance is relatively easy to compare because problems typically have a single, closed-form final answer that can be scored automatically.⁴ Early results came from specialized systems: AlphaProof (Hubert et al., 2025) combined reinforcement learning with the Lean proof assistant to reach silver-medal performance at the 2024 IMO, while AlphaGeometry (Trinh et al., 2024) and AlphaGeometry2 (Chervonyi et al., 2025) paired a neural model with a symbolic deduction engine to achieve gold-medalist performance on historical olympiad geometry. More recently, the emphasis has shifted toward off-the-shelf frontier models strengthened by verification and refinement: Huang & Yang (2025) report a model-agnostic pipeline that, with Gemini 2.5 Pro, Grok-4, or GPT-5, solves five out of six problems on the 2025 IMO under contamination-avoiding protocols. In parallel, proprietary systems such as Aristotle (Achim et al., 2025) combine informal reasoning with formal verification to achieve gold-medal-equivalent performance on the 2025 IMO. Finally, the same verification-first approach is now claimed at the undergraduate level: AxiomMath (2025) reports that AxiomProver produced Lean-checked solutions to all Putnam 2025 problems (a perfect 120/120).⁵
To move beyond competition-style evaluation, recent benchmarks increasingly probe research-level questions arising in active mathematical workflows, such as the encrypted, author-curated problem set in First Proof (Abouzaid et al., 2026).

Constructions and algorithms. Beyond proving theorems, AI has generated novel mathematical constructions and faster classical algorithms by searching over programs: an LLM proposes candidate code, an automated evaluator scores it, and an iterative loop improves the best candidates. FunSearch (Romera-Paredes et al., 2024) introduced this template, yielding new constructions for the cap set problem and improved online bin packing heuristics. AlphaEvolve (Novikov et al., 2025) scales the same evolutionary idea; in large-scale mathematical experiments it rediscovered best-known solutions across 67 problems and improved several, including autocorrelation inequalities (Georgiev et al., 2025). Recent open-source works have proposed methodological extensions, including OpenEvolve, ShinkaEvolve, ThetaEvolve, DeltaEvolve, and AdaEvolve (Sharma, 2025; Lange et al., 2025; Wang et al., 2025b; Jiang et al., 2026; Cemri et al., 2026). Most such systems are closed-loop and largely non-interactive: progress comes from automated propose–evaluate iterations rather than back-and-forth dialogue with a human. Related approaches have also produced faster algorithms: AlphaTensor (Fawzi et al., 2022) discovered efficient tensor decompositions for matrix multiplication, and AlphaDev (Mankowitz et al., 2023) found improved sorting routines now deployed in production software.

Data-driven and learning-augmented mathematics. A complementary line of work uses AI to generate candidate mathematical objects from data, whose correctness is then verified either automatically (via symbolic or optimization-based methods) or by human experts.
Examples include data-driven conjecturing and candidate filtering (Davies et al., 2021; Mishra et al., 2023; Chuharski et al., 2024), learning-augmented Lyapunov, Sum-of-Squares, and Border basis pipelines (Alfarano et al., 2024; Zou et al., 2025; Pelleriti et al., 2025; Kera et al., 2025), neural-guided discovery of six-colorings for the Hadwiger–Nelson problem (Mundinger et al., 2024; 2025), and ML plus high-precision optimization uncovering unstable self-similar solutions in fluid dynamics (Wang et al., 2025c). Symbolic regression further extracts interpretable laws from data (Udrescu & Tegmark, 2020; Ruan et al., 2026).

⁴ Correct final answers need not imply correct proofs (Dekoninck et al., 2026).
⁵ cf. https://axiommath.ai/territory/from-seeing-why-to-checking-everything

Formal theorem proving and autoformalization. A rich ecosystem of LLM-based formal proving tools has emerged around Lean 4 (de Moura & Ullrich, 2021). LeanDojo (Yang et al., 2023) provides an interface to Lean proof states and retrieval over mathlib (mathlib Community, 2020), while Lean Copilot (Song et al., 2025) integrates LLM assistance into the Lean workflow. Dedicated provers include DeepSeek-Prover (Xin et al., 2024), which leverages large-scale synthetic proof data, and DeepSeek-Prover-V2 (Ren et al., 2025), which adds reinforcement learning with explicit subgoal decomposition and introduces ProverBench for evaluation. Goedel-Prover-V2 (Lin et al., 2025) scales expert iteration with scaffolded data synthesis and verifier-guided self-correction. Complementary directions focus on knowledge reuse and structured reasoning: LEGO-Prover (Wang et al., 2023) builds and reuses a growing library of verified lemmas, while Hilbert (Varambally et al., 2025) connects informal reasoning with formal verification through recursive decomposition. TheoremLlama (Wang et al., 2024) and Mathesis (Xuejun et al., 2025) explore adapting general-purpose models and end-to-end pipelines from natural language to Lean proofs. Recent agentic frameworks emphasize tool use and iterative compiler-feedback loops rather than one-shot generation: APOLLO (Ospanov et al., 2025) performs modular proof repair and sub-lemma isolation, Ax-Prover (Breen et al., 2025) uses multi-agent tool-based proving across scientific domains, and LeanAgent (Kumarappan et al., 2025) studies continual adaptation across evolving repositories. In a different direction, LeanProgress (George et al., 2026) guides search by predicting proof progress to improve performance on long proofs. On the data side, MUSTARD (Huang et al., 2024) generates uniform theorem-and-proof training data with formal verification. For evaluation, miniF2F (Zheng et al., 2022) and PutnamBench (Tsoukalas et al., 2024) provide competition-style benchmarks, while SorryDB introduces a dynamically updating stream of open sorry tasks mined from real-world Lean projects, mitigating contamination. Autoformalization, i.e., translating informal mathematics into machine-checkable form, was shown to be feasible with LLMs by Wu et al. (2022). Recent work addresses this through dependency-graph decomposition (Wang et al., 2025a), chain-of-states proof translation (Wang et al., 2025d), and evaluation on real-world mathematical definitions (Zhang et al., 2025b). Agentic end-to-end pipelines such as MerLean (Ren et al., 2026) extend this to scientific domains. We refer to Weng et al. (2025) for a comprehensive survey.

Frontier systems and research-level evaluation suites. Beyond competition benchmarks, several recent efforts target research-level mathematics. First Proof (Abouzaid et al., 2026) introduces an author-curated set of ten questions arising naturally in the authors' research, with answers not publicly released.
Other benchmarks include continuously refreshed collections drawn from arXiv papers (RealMath (Zhang et al., 2025a)) and curated sets of exceptionally challenging, unpublished problems reviewed by domain experts (FrontierMath (Glazer et al., 2025)). Aletheia was evaluated directly on First Proof: roughly three weeks after the challenge was introduced, Feng et al. (2026a) report that Aletheia autonomously solved six out of ten problems. Notably, some of these results are now accompanied by machine-checked proofs: for example, Sothanaphan (2026) provides a Lean formalization of a resolution of an Erdős problem attributed to Achim et al. (2025).

6.2 Mathematicians Using AI

Frameworks and perspectives. The literature on AI and mathematical practice is broad, so we highlight only those lines of work most directly relevant to our framework. Haase & Pokutta (2026) propose four levels of human-AI co-creativity: Digital Pen, AI Task Specialist, AI Assistant, and AI Co-Creator. These categories provide a conceptual vocabulary that we build on in Section 2. Their treatment is intentionally broad and domain-agnostic, serving primarily as a conceptual template to which domain-specific details can be added. Henkel (2025) offers a complementary perspective from mathematics, arguing that AI should augment rather than replace mathematical reasoning and proposing five guiding principles for its responsible use. Noorani et al. (2025) formalize the complementary strengths of humans and AI in uncertainty quantification, providing theoretical guarantees for collaborative prediction. Most recently, Avigad (2026) considers recent developments in AI-driven mathematics and argues that mathematicians should remain actively involved in the use of these systems. Our work shares these perspectives but addresses a different question: given these emerging capabilities, how should a working researcher use them in practice?
Documented case studies. Over the past several months, a growing number of papers have documented how mathematicians interact with chat-based AI systems to obtain new research results (Bubeck et al., 2025; Diez et al., 2025; Alexeev & Mixon, 2026; Ivanisvili & Xie, 2025; Feldman & Karbasi, 2025; Salim, 2025; Dobriban, 2025; Schmitt, 2025). More specialized agentic systems with varying degrees of autonomy are also being developed (Liu et al., 2025; Feng et al., 2026b) and have already produced new mathematical results (Lee & Seo, 2026; Feng, 2026). AI coding agents provide yet another pathway by enabling large computational searches: Knuth (2026) reports that Claude solved an open Hamiltonian cycle decomposition problem through iterative exploration. These examples likely represent only a small fraction of emerging workflows.

6.3 Agentic Research Frameworks

Automated scientific discovery. Lu et al. (2024) introduced The AI Scientist, an end-to-end system that generates hypotheses, runs experiments, and writes papers; its successor (Yamada et al., 2025) reported an AI-generated paper accepted at a peer-reviewed workshop. Subsequent systems explore adjacent design points, from semi-automated, code-centric experimentation (CodeScientist (Jansen et al., 2025)) to end-to-end agent pipelines that incorporate explicit mechanisms for human feedback and cumulative reporting (Schmidgall et al., 2025; Schmidgall & Moor, 2025). AlphaApollo (Zhou et al., 2026) combines multi-turn tool use, reinforcement learning, and iterative evolution with tool-assisted verification, showing improved performance on several mathematical reasoning benchmarks. As these pipelines grow more complex, rigorous benchmarking has emerged as a central challenge, with recent work proposing evaluations that target both full workflows and their individual steps (Chen et al., 2025; Bragg et al., 2025).
Taken together, these works highlight a common requirement: agent outputs must be checkable (e.g., as code, logs, or derived claims) and include explicit points for verification and human steering, rather than being treated as opaque end-to-end generations. Karpathy's autoresearch exemplifies a minimalist variant: an agent iteratively modifies a single file, runs fixed-budget training, and keeps or discards changes based on validation performance (Karpathy, 2026). Our framework targets the complementary regime of multi-file, multi-objective research with structured reporting and verification. For broader context, we refer to recent surveys (Ferrag et al., 2025; Zheng et al., 2025a).

Agentic coding tools. Terminal-based coding agents such as Claude Code, OpenCode, Codex CLI, and Gemini CLI (Anthropic; Anomaly; OpenAI; Google) extend AI assistance beyond chat by enabling users (Handa et al., 2025) (software engineers, analysts, and researchers alike) to delegate work within a persistent local project. These agents can read and edit files and invoke development tools (e.g., shells, test runners, linters, and formatters) from within a CLI interface, producing inspectable artifacts such as patches, diffs, and test outputs. This inspectable, file-based workflow is central to our setting: it enables reproducible iteration and makes it possible to attach verification hooks (tests, proofs, consistency checks) directly to the agent's actions. A key recent development is the growth of long-running autonomy: in Claude Code, the 99.9th-percentile turn duration nearly doubled from under 25 to over 45 minutes between October 2025 and January 2026 (McCain et al., 2026), reducing the need for constant supervision while increasing the importance of robust guardrails.
Finally, these tools separate the underlying model from a repository-scoped instruction file, allowing us to express our framework as a portable, model- and harness-agnostic procedure that applies across Claude Code, OpenCode, Codex CLI, and related CLI agents.

Acknowledgments

The frameworks, approaches, and insights presented here have been developed in the context of the MATH+ project Agentic AI in Mathematics.⁶ This research was partially supported by the Deutsche Forschungsgemeinschaft (DFG) through the DFG Cluster of Excellence MATH+ (EXC-2046/1, EXC-2046/2, project id 390685689), as well as by the German Federal Ministry of Research, Technology and Space (research campus Modal, fund numbers 05M14ZAM, 05M20ZBM) and the VDI/VDE Innovation + Technik GmbH (fund number 16IS23025B).

⁶ https://iol.zib.de/project/agentmath.html

References

Mohammed Abouzaid, Andrew J. Blumberg, Martin Hairer, Joe Kileel, Tamara G. Kolda, Paul D. Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, and Lauren Williams. First Proof, February 2026.

Tudor Achim, Alex Best, Alberto Bietti, Kevin Der, Mathïs Fédérico, Sergei Gukov, Daniel Halpern-Leistner, Kirsten Henningsgard, Yury Kudryashov, Alexander Meiburg, Martin Michelsen, Riley Patterson, Eric Rodriguez, Laura Scharff, Vikram Shanker, Vladmir Sicca, Hari Sowrirajan, Aidan Swope, Matyas Tamas, Vlad Tenev, Jonathan Thomm, Harold Williams, and Lawrence Wu. Aristotle: IMO-level Automated Theorem Proving, October 2025. URL https://arxiv.org/abs/2510.01346v2.

Boris Alexeev and Dustin G. Mixon. Forbidden Sidon subsets of perfect difference sets, featuring a human-assisted proof, January 2026.

Alberto Alfarano, François Charton, and Amaury Hayat. Global Lyapunov functions: A long-standing open problem in mathematics, with symbolic transformers, October 2024. URL http://arxiv.org/abs/2410.08304.

Anomaly. Opencode, February.
URL https://github.com/anomalyco/opencode.

Anthropic. Claude code. URL https://code.claude.com/docs/en/overview.

Jeremy Avigad. Mathematicians in the Age of AI. 2026.

AxiomMath. AxiomProver reports perfect score on Putnam 2025, 2025. URL https://github.com/AxiomMath/putnam2025. GitHub repository, accessed 2026-03-06.

Jonathan Bragg, Mike D'Arcy, Nishant Balepur, Dan Bareket, Bhavana Dalvi, Sergey Feldman, Dany Haddad, Jena D. Hwang, Peter Jansen, Varsha Kishore, Bodhisattwa Prasad Majumder, Aakanksha Naik, Sigal Rahamimov, Kyle Richardson, Amanpreet Singh, Harshit Surana, Aryeh Tiktinsky, Rosni Vasu, Guy Wiener, Chloe Anastasiades, Stefan Candra, Jason Dunkelberger, Dan Emery, Rob Evans, Malachi Hamada, Regan Huff, Rodney Kinney, Matt Latzke, Jaron Lochner, Ruben Lozano-Aguilera, Cecile Nguyen, Smita Rao, Amber Tanaka, Brooke Vlahos, Peter Clark, Doug Downey, Yoav Goldberg, Ashish Sabharwal, and Daniel S. Weld. AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite, October 2025.

Benjamin Breen, Marco Del Tredici, Jacob McCarran, Javier Aspuru Mijares, Weichen Winston Yin, Kfir Sulimany, Jacob M. Taylor, Frank H. L. Koppens, and Dirk Englund. Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics, November 2025.

Sébastien Bubeck, Christian Coester, Ronen Eldan, Timothy Gowers, Yin Tat Lee, Alexandru Lupsasca, Mehtaab Sawhney, Robert Scherrer, Mark Sellke, Brian K. Spears, Derya Unutmaz, Kevin Weil, Steven Yin, and Nikita Zhivotovskiy. Early science acceleration experiments with GPT-5, November 2025.

Lisa Carbone. Advancing mathematics research with generative AI, December 2025. URL http://arxiv.org/abs/2511.07420.

Mert Cemri, Shubham Agrawal, Akshat Gupta, Shu Liu, Audrey Cheng, Qiuyang Mang, Ashwin Naren, Lutfi Eren Erdogan, Koushik Sen, Matei Zaharia, Alex Dimakis, and Ion Stoica.
AdaEvolve: Adaptive LLM Driven Zeroth-Order Optimization, February 2026.

Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, and Huan Sun. ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery, March 2025.

Yuri Chervonyi, Trieu H. Trinh, Miroslav Olšák, Xiaomeng Yang, Hoang Nguyen, Marcelo Menegali, Junehyuk Jung, Junsu Kim, Vikas Verma, Quoc V. Le, and Thang Luong. Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2, December 2025. URL http://arxiv.org/abs/2502.03544.

Jake Chuharski, Elias Rojas Collins, and Mark Meringolo. Mining math conjectures from LLMs: A pruning approach. In The 4th Workshop on Mathematical Reasoning and AI at NeurIPS'24, 2024. URL https://openreview.net/forum?id=aYlKvzY6ob.

Alex Davies, Petar Veličković, Lars Buesing, Sam Blackwell, Daniel Zheng, Nenad Tomašev, Richard Tanburn, Peter Battaglia, Charles Blundell, András Juhász, Marc Lackenby, Geordie Williamson, Demis Hassabis, and Pushmeet Kohli. Advancing mathematics by guiding human intuition with AI. Nature, 600(7887):70–74, December 2021. ISSN 1476-4687. doi: 10.1038/s41586-021-04086-x. URL https://www.nature.com/articles/s41586-021-04086-x.

Leonardo de Moura and Sebastian Ullrich. The Lean 4 theorem prover and programming language. In André Platzer and Geoff Sutcliffe (eds.), Automated Deduction – CADE 28, pp. 625–635, Cham, 2021. Springer International Publishing. ISBN 978-3-030-79876-5.
Jasper Dekoninck, Ivo Petrov, Kristian Minchev, Mislav Balunovic, Martin Vechev, Miroslav Marinov, Maria Drencheva, Lyuba Konova, Milen Shumanov, Kaloyan Tsvetkov, Nikolay Drenchev, Lazar Todorov, Kalina Nikolova, Nikolay Georgiev, Vanesa Kalinkova, and Margulan Ismoldayev. The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs, January 2026.

Charles-Philippe Diez, Luis da Maia, and Ivan Nourdin. Mathematical research with GPT-5: A Malliavin-Stein experiment, September 2025.

Edgar Dobriban. Solving a Research Problem in Mathematical Statistics with AI Assistance, December 2025.

Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, David Silver, Demis Hassabis, and Pushmeet Kohli. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47–53, October 2022. ISSN 1476-4687. doi: 10.1038/s41586-022-05172-4. URL https://www.nature.com/articles/s41586-022-05172-4.

Moran Feldman and Amin Karbasi. Gödel Test: Can Large Language Models Solve Easy Conjectures?, September 2025.

Tony Feng. Eigenweights for arithmetic Hirzebruch Proportionality, February 2026.

Tony Feng, Junehyuk Jung, Sang-hyun Kim, Carlo Pagano, Sergei Gukov, Chiang-Chiang Tsai, David Woodruff, Adel Javanmard, Aryan Mokhtari, Dawsen Hwang, Yuri Chervonyi, Jonathan N. Lee, Garrett Bingham, Trieu H. Trinh, Vahab Mirrokni, Quoc V. Le, and Thang Luong. Aletheia tackles FirstProof autonomously, February 2026a.

Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang-hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N.
Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-Tze Cheng, Demis Hassabis, Koray Kavukcuoglu, Quoc V. Le, and Thang Luong. Towards Autonomous Mathematics Research, February 2026b.

Mohamed Amine Ferrag, Norbert Tihanyi, and Merouane Debbah. From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review, April 2025.

Elias Frantar and Dan Alistarh. SparseGPT: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pp. 10323–10337. PMLR, 2023.

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, March 2023. URL http://arxiv.org/abs/2210.17323.

Robert Joseph George, Suozhi Huang, Peiyang Song, and Anima Anandkumar. LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction, January 2026.

Bogdan Georgiev, Javier Gómez-Serrano, Terence Tao, and Adam Zsolt Wagner. Mathematical exploration and discovery at scale, November 2025.

Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, Olli Järviniemi, Matthew Barnett, Robert Sandler, Matej Vrzala, Jaime Sevilla, Qiuyu Ren, Elizabeth Pratt, Lionel Levine, Grant Barkley, Natalie Stewart, Bogdan Grechuk, Tetiana Grechuk, Shreepranav Varma Enugandla, and Mark Wildon. FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI, December 2025.

Google. Build, debug & deploy with AI. URL https://geminicli.com/.

Benjamin Grimmer and Ning Liu. Lower bounds for linear minimization oracle methods optimizing over strongly convex sets. arXiv preprint, 2026.

Jennifer Haase and Sebastian Pokutta.
Human-AI Co-Creativity: Exploring Synergies Across Levels of Creative Collaboration. pp. 205–221. 2026. doi: 10.1016/B978-0-443-34073-4.00009-5. URL http://arxiv.org/abs/2411.12527.

Jannis Halbey, Daniel Deza, Max Zimmer, Christophe Roux, Bartolomeo Stellato, and Sebastian Pokutta. Lower bounds for Frank–Wolfe on strongly convex sets. arXiv preprint arXiv:2602.04378, 2026.

Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems, 28, 2015.

Kunal Handa, Michael Stern, Saffron Huang, Jerry Hong, Esin Durmus, Miles McCain, Grace Yun, AJ Alt, Thomas Millar, Alex Tamkin, Jane Leibrock, Stuart Ritchie, and Deep Ganguli. Introducing Anthropic Interviewer: What 1,250 professionals told us about working with AI. https://anthropic.com/research/anthropic-interviewer, 2025-12-04.

Deborah Hendrych, Hannah Troppens, Mathieu Besançon, and Sebastian Pokutta. Convex mixed-integer optimization with Frank–Wolfe methods. Mathematical Programming Computation, 17(4):731–757, December 2025. ISSN 1867-2957. doi: 10.1007/s12532-025-00288-w. URL https://doi.org/10.1007/s12532-025-00288-w.

Jonas Henkel. The Mathematician's Assistant: Integrating AI into Research Practice, August 2025.

Yichen Huang and Lin F. Yang. Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline, September 2025.

Yinya Huang, Xiaohan Lin, Zhengying Liu, Qingxing Cao, Huajian Xin, Haiming Wang, Zhenguo Li, Linqi Song, and Xiaodan Liang. MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data, May 2024.

Thomas Hubert, Rishi Mehta, Laurent Sartran, Miklós Z.
Horváth, Goran Žužić, Eric Wieser, Aja Huang, Julian Schrittwieser, Yannick Schroecker, Hussain Masoom, Ottavia Bertolli, Tom Zahavy, Amol Mandhane, Jessica Yung, Iuliya Beloshapka, Borja Ibarz, Vivek Veeriah, Lei Yu, Oliver Nash, Paul Lezeau, Salvatore Mercuri, Calle Sönne, Bhavik Mehta, Alex Davies, Daniel Zheng, Fabian Pedregosa, Yin Li, Ingrid von Glehn, Mark Rowland, Samuel Albanie, Ameya Velingker, Simon Schmitt, Edward Lockhart, Edward Hughes, Henryk Michalewski, Nicolas Sonnerat, Demis Hassabis, Pushmeet Kohli, and David Silver. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature, pp. 1–3, November 2025. ISSN 1476-4687. doi: 10.1038/s41586-025-09833-y. URL https://www.nature.com/articles/s41586-025-09833-y.

Paata Ivanisvili and Xinyuan Xie. Counterexample to majority optimality in NICD with erasures, October 2025.

Peter Jansen, Oyvind Tafjord, Marissa Radensky, Pao Siangliulue, Tom Hope, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Daniel S. Weld, and Peter Clark. CodeScientist: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation, March 2025.

Jiachen Jiang, Tianyu Ding, and Zhihui Zhu. DeltaEvolve: Accelerating Scientific Discovery through Momentum-Driven Evolution, February 2026.

Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon.

Haocheng Ju and Bin Dong. AI for Mathematics: Progress, Challenges, and Prospects, February 2026.

Andrej Karpathy. autoresearch, 2026. URL https://github.com/karpathy/autoresearch. GitHub repository, accessed 2026-03-08.

Hiroshi Kera, Nico Pelleriti, Yuki Ishihara, Max Zimmer, and Sebastian Pokutta. Computational Algebra with Attention: Transformer Oracles for Border Basis Algorithms, May 2025.
Thomas Kerdreux, Alexandre d'Aspremont, and Sebastian Pokutta. Projection-free optimization on uniformly convex sets. In International Conference on Artificial Intelligence and Statistics, pp. 19–27. PMLR, 2021.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Donald E. Knuth. Claude's cycles, 2026. URL https://www-cs-faculty.stanford.edu/~knuth/papers/claude-cycles.pdf.

Adarsh Kumarappan, Mo Tiwari, Peiyang Song, Robert Joseph George, Chaowei Xiao, and Anima Anandkumar. LeanAgent: Lifelong Learning for Formal Theorem Proving, March 2025.

Robert Tjarko Lange, Yuki Imajuku, and Edoardo Cetin. ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution, September 2025.

Mike Lasby, Max Zimmer, Sebastian Pokutta, and Erik Schultheis. Compressed sparse tiles for memory-efficient unstructured and semi-structured sparsity. In Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference, 2025. URL https://openreview.net/forum?id=iso0KV2HVq.

Joonkyung Lee and Jaehyeon Seo. Lower bounds for multivariate independence polynomials and their generalisations, February 2026.

Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. NorMuon: Making Muon more efficient and scalable. arXiv preprint, 2025.

Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, Ximing Lu, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction, August 2025.

Julia Lindberg, Alisha Zachariah, Nigel Boston, and Bernard C. Lesieutre. The Distribution of the Number of Real Solutions to the Power Flow Equations, October 2020.

Yuanhang Liu, Beichen Wang, Peng Li, and Yang Liu.
AI Mathematician as a Partner in Advancing Mathematical Discovery – A Case Study in Homogenization Theory, October 2025. URL http://arxiv.org/abs/2510.26380.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery, September 2024. URL http://arxiv.org/abs/2408.06292.

Daniel J. Mankowitz, Andrea Michi, Anton Zhernov, Marco Gelmi, Marco Selvi, Cosmin Paduraru, Edouard Leurent, Shariq Iqbal, Jean-Baptiste Lespiau, Alex Ahern, Thomas Köppe, Kevin Millikin, Stephen Gaffney, Sophie Elster, Jackson Broshear, Chris Gamble, Kieran Milan, Robert Tung, Minjae Hwang, Taylan Cemgil, Mohammadamin Barekatain, Yujia Li, Amol Mandhane, Thomas Hubert, Julian Schrittwieser, Demis Hassabis, Pushmeet Kohli, Martin Riedmiller, Oriol Vinyals, and David Silver. Faster sorting algorithms discovered using deep reinforcement learning. Nature, 618(7964):257–263, June 2023. ISSN 1476-4687. doi: 10.1038/s41586-023-06004-9. URL https://www.nature.com/articles/s41586-023-06004-9.

The mathlib Community. The Lean mathematical library. In Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, pp. 367–381, January 2020. doi: 10.1145/3372885.3373824.

Miles McCain, Thomas Millar, Saffron Huang, Jake Eaton, Kunal Handa, Michael Stern, Alex Tamkin, Matt Kearney, Esin Durmus, Judy Shen, Jerry Hong, Brian Calvert, Jun Shern Chan, Francesco Mosconi, David Saunders, Tyler Neylon, Gabriel Nicholas, Sarah Pollack, Jack Clark, and Deep Ganguli. Measuring AI agent autonomy in practice. https://anthropic.com/research/measuring-agent-autonomy, 2026.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint, 2016.
Asit Mishra, Jorge Albericio Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, and Paulius Micikevicius. Accelerating sparse deep neural networks. arXiv preprint arXiv:2104.08378, 2021.

Challenger Mishra, Subhayan Roy Moulik, and Rahul Sarkar. Mathematical conjecture generation using machine intelligence, June 2023.

Konrad Mundinger, Sebastian Pokutta, Christoph Spiegel, and Max Zimmer. Extending the continuum of six-colorings. Geombinatorics Quarterly, XXXIV, 2024. URL https://geombina.uccs.edu/past-issues/volume-xxxiv.

Konrad Mundinger, Max Zimmer, Aldo Kiem, Christoph Spiegel, and Sebastian Pokutta. Neural discovery in mathematics: Do machines dream of colored planes? In Forty-Second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=7Tp9zjP9At.

Sima Noorani, Shayan Kiyani, George Pappas, and Hamed Hassani. Human-AI Collaborative Uncertainty Quantification, October 2025.

Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. AlphaEvolve: A coding agent for scientific and algorithmic discovery, June 2025.

OpenAI. Codex | AI Coding Partner from OpenAI. URL https://openai.com/codex/.

Azim Ospanov, Farzan Farnia, and Roozbeh Yousefzadeh. APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning, November 2025.

Nico Pelleriti, Christoph Spiegel, Shiwei Liu, David Martínez-Rubio, Max Zimmer, and Sebastian Pokutta. Neural Sum-of-Squares: Certifying the Nonnegativity of Polynomials with Transformers, October 2025.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.

Yuanjie Ren, Jinzheng Li, and Yidi Qi. MerLean: An Agentic Framework for Autoformalization in Quantum Computation, February 2026.

Z. Z. Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, Z. F. Wu, Zhibin Gou, Shirong Ma, Hongxuan Tang, Yuxuan Liu, Wenjun Gao, Daya Guo, and Chong Ruan. DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition, July 2025.

Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, January 2024. ISSN 1476-4687. doi: 10.1038/s41586-023-06924-6. URL https://www.nature.com/articles/s41586-023-06924-6.

Christophe Roux, Max Zimmer, Alexandre d'Aspremont, and Sebastian Pokutta. Don't be greedy, just relax! Pruning LLMs via Frank–Wolfe. arXiv preprint arXiv:2510.13713, 2025.

Kai Ruan, Yilong Xu, Ze-Feng Gao, Yang Liu, Yike Guo, Ji-Rong Wen, and Hao Sun. Discovering physical laws with parallel symbolic enumeration. Nature Computational Science, 6(1):53–66, January 2026. ISSN 2662-8457. doi: 10.1038/s43588-025-00904-8. URL https://www.nature.com/articles/s43588-025-00904-8.

Adil Salim. Accelerating mathematical research with language models: A case study of an interaction with GPT-5-Pro on a convex analysis problem, October 2025.

Samuel Schmidgall and Michael Moor. AgentRxiv: Towards Collaborative Autonomous Research, March 2025.

Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum.
Agent Laboratory: Using LLM Agents as Research Assistants, June 2025.

Johannes Schmitt. Extremal descendant integrals on moduli spaces of curves: An inequality discovered and proved in collaboration with AI, December 2025.

Andrei Semenov, Matteo Pagliardini, and Martin Jaggi. Benchmarking optimizers for large language model pretraining. arXiv preprint, 2025.

Asankhaya Sharma. OpenEvolve: An open-source evolutionary coding agent. GitHub, 2025.

Chongjie Si, Debing Zhang, and Wei Shen. AdaMuon: Adaptive Muon optimizer. arXiv preprint arXiv:2507.11005, 2025.

Peiyang Song, Kaiyu Yang, and Anima Anandkumar. Lean Copilot: Large Language Models as Copilots for Theorem Proving in Lean, May 2025.

Nat Sothanaphan. Resolution of Erdős Problem #728: A writeup of Aristotle's Lean proof, January 2026.

Weijie J. Su. You Are the Best Reviewer of Your Own Papers: An Owner-Assisted Scoring Mechanism, June 2022.

Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A Simple and Effective Pruning Approach for Large Language Models, May 2024.

Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476–482, January 2024. ISSN 1476-4687. doi: 10.1038/s41586-023-06747-5. URL https://www.nature.com/articles/s41586-023-06747-5.

George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition, November 2024.

Silviu-Marian Udrescu and Max Tegmark. AI Feynman: A physics-inspired method for symbolic regression. Science Advances, 6(16):eaay2631, April 2020. doi: 10.1126/sciadv.aay2631. URL https://www.science.org/doi/10.1126/sciadv.aay2631.

Sumanth Varambally, Thomas Voice, Yanchao Sun, Zhifeng Chen, Rose Yu, and Ke Ye.
Hilbert: Recursively Building Formal Proofs with Informal Reasoning, September 2025. URL http://arxiv.org/abs/2509.22819.

Haiming Wang, Huajian Xin, Chuanyang Zheng, Lin Li, Zhengying Liu, Qingxing Cao, Yinya Huang, Jing Xiong, Han Shi, Enze Xie, Jian Yin, Zhenguo Li, Heng Liao, and Xiaodan Liang. LEGO-Prover: Neural Theorem Proving with Growing Libraries, October 2023. URL http://arxiv.org/abs/2310.00656.

Hanyu Wang, Ruohan Xie, Yutong Wang, Guoxiong Gao, Xintao Yu, and Bin Dong. Aria: An Agent For Retrieval and Iterative Auto-Formalization via Dependency Graph, October 2025a.

Ruida Wang, Jipeng Zhang, Yizhen Jia, Rui Pan, Shizhe Diao, Renjie Pi, and Tong Zhang. TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts, October 2024. URL http://arxiv.org/abs/2407.03203.

Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, Hao Cheng, Pengcheng He, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, and Yelong Shen. ThetaEvolve: Test-time Learning on Open Problems, November 2025b.

Yongji Wang, Mehdi Bennani, James Martens, Sébastien Racanière, Sam Blackwell, Alex Matthews, Stanislav Nikolov, Gonzalo Cao-Labora, Daniel S. Park, Martin Arjovsky, Daniel Worrall, Chongli Qin, Ferran Alet, Borislav Kozlovskii, Nenad Tomašev, Alex Davies, Pushmeet Kohli, Tristan Buckmaster, Bogdan Georgiev, Javier Gómez-Serrano, Ray Jiang, and Ching-Yao Lai. Discovery of Unstable Singularities, September 2025c. URL http://arxiv.org/abs/2509.14185.

Ziyu Wang, Bowen Yang, Chenyi Li, Yuan Zhang, Shihao Zhou, Bin Dong, and Zaiwen Wen. Translating Informal Proofs into Formal Proofs Using a Chain of States, December 2025d. URL http://arxiv.org/abs/2512.10317.

Ke Weng, Lun Du, Sirui Li, Wangyue Lu, Haozhe Sun, Hengyu Liu, and Tiancheng Zhang. Autoformalization in the Era of Large Language Models: A Survey, May 2025. URL http://arxiv.org/abs/2505.23486.
Yuhuai Wu, Albert Q. Jiang, Wenda Li, Markus N. Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. Autoformalization with Large Language Models, May 2022. URL http://arxiv.org/abs/2205.12615.

Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang. DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data, May 2024.

Yu Xuejun, Jianyuan Zhong, Zijin Feng, Pengyi Zhai, Roozbeh Yousefzadeh, Wei Chong Ng, Haoxiong Liu, Ziyi Shou, Jing Xiong, Yudong Zhou, Claudia Beth Ong, Austen Jeremy Sugiarto, Yaoxi Zhang, Wai Ming Tai, Huan Cao, Dongcai Lu, Jiacheng Sun, Qiang Xu, Shen Xin, and Zhenguo Li. Mathesis: Towards Formal Theorem Proving from Natural Languages, June 2025. URL http://arxiv.org/abs/2506.07047.

Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search, April 2025.

Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models, October 2023.

Jie Zhang, Cezara Petrui, Kristina Nikolić, and Florian Tramèr. RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics, October 2025a.

Lan Zhang, Marco Valentino, and Andre Freitas. Autoformalization in the Wild: Assessing LLMs on Real-World Mathematical Definitions. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 1720–1738, Suzhou, China, November 2025b. Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.90. URL https://aclanthology.org/2025.emnlp-main.90/.
Ruijie Zhang, Yequan Zhao, Ziyue Liu, Zhengyang Wang, and Zheng Zhang. Muon+: Towards better Muon via one additional normalization step. arXiv preprint, 2026.

Yingtao Zhang, Haoli Bai, Haokun Lin, Jialin Zhao, Lu Hou, and Carlo Vittorio Cannistraci. Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models. In The Twelfth International Conference on Learning Representations, October 2023. URL https://openreview.net/forum?id=Tr0lPx9woF.

Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. MiniF2F: A cross-system benchmark for formal Olympiad-level mathematics, February 2022.

Tianshi Zheng, Zheye Deng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Zihao Wang, and Yangqiu Song. From automation to autonomy: A survey on large language models in scientific discovery. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 17733–17750, Suzhou, China, November 2025a. Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.895.

Tianshi Zheng, Zheye Deng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Zihao Wang, and Yangqiu Song. From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery, September 2025b.

Zhanke Zhou, Chentao Cao, Xiao Feng, Xuan Li, Zongze Li, Xiangyu Lu, Jiangchao Yao, Weikai Huang, Tian Cheng, Jianghangfan Zhang, Tangyu Jiang, Linrui Xu, Yiming Zheng, Brando Miranda, Tongliang Liu, Sanmi Koyejo, Masashi Sugiyama, and Bo Han. AlphaApollo: A System for Deep Agentic Reasoning, March 2026.

Max Zimmer, Megi Andoni, Christoph Spiegel, and Sebastian Pokutta. PERP: Rethinking the prune-retrain paradigm in the era of LLMs. arXiv preprint, 2023a.

Max Zimmer, Christoph Spiegel, and Sebastian Pokutta. How I Learned To Stop Worrying And Love Retraining.
In The Eleventh International Conference on Learning Representations, 2023b. URL https://openreview.net/forum?id=_nF5imFKQI.

Max Zimmer, Christoph Spiegel, and Sebastian Pokutta. Sparse model soups: A recipe for improved pruning via model averaging. In The Twelfth International Conference on Learning Representations, 2024.

Max Zimmer, Christoph Spiegel, and Sebastian Pokutta. Compression-aware Training of Neural Networks using Frank–Wolfe, pp. 137–168. De Gruyter, Berlin, Boston, 2025. ISBN 9783111376776. doi: 10.1515/9783111376776-010. URL https://doi.org/10.1515/9783111376776-010.

Max Zimmer, Christophe Roux, Moritz Wagner, Deborah Hendrych, and Sebastian Pokutta. SparseSwaps: Tractable LLM Pruning Mask Refinement at Scale, February 2026. URL http://arxiv.org/abs/2512.10922.

Haohan Zou, Jie Feng, Hao Zhao, and Yuanyuan Shi. Analytical Lyapunov Function Discovery: An RL-based Generative Approach, June 2025.
