
📝 Original Info

  • ArXiv ID: 2512.21373

📝 Abstract

We introduce AInsteinBench, a large-scale benchmark for evaluating whether large language model (LLM) agents can operate as scientific computing development agents within real research software ecosystems. Unlike existing scientific reasoning benchmarks, which focus on conceptual knowledge, or software engineering benchmarks, which emphasize generic feature implementation and issue resolution, AInsteinBench evaluates models in end-to-end scientific development settings grounded in production-grade scientific repositories. The benchmark consists of tasks derived from maintainer-authored pull requests across six widely used scientific codebases, spanning quantum chemistry, quantum computing, molecular dynamics, numerical relativity, fluid dynamics, and cheminformatics. All benchmark tasks are carefully curated through multi-stage filtering and expert review to ensure scientific challenge, adequate test coverage, and well-calibrated difficulty. By leveraging evaluation in executable environments, scientifically meaningful failure modes, and test-driven verification, AInsteinBench measures a model's ability to move beyond surface-level code generation toward the core competencies required for computational scientific research.

📄 Full Content

Modern scientific discovery is increasingly mediated by large-scale computational systems. Across disciplines such as chemistry, physics, and materials science, new scientific insights are no longer derived solely from analytical derivations or isolated simulations, but from sustained interaction with complex scientific software ecosystems. Researchers routinely formulate hypotheses, encode physical or mathematical assumptions into simulation code, and iteratively refine implementations based on empirical feedback from tests, solvers, and large-scale simulation runs. As illustrated in Figure 2, such a workflow follows an iterative cycle of reading literature, experimenting, and solving questions, with active interaction with the environment: the academic institution and compute infrastructure. In this setting, scientific discoveries depend not only on a literal understanding of the subject, but also on the ability to implement the theory in production codebases and execute it at scale.

Recent advances in large language models (LLMs) have spurred growing interest in their potential to assist scientific computing research [1][2][3]. To function as genuine research assistants, LLM-based agents must be able to navigate long-lived scientific repositories, understand domain-specific invariants and numerical constraints, and implement or debug code in ways that preserve scientific correctness. Whether current models are capable of the transition from surface-level scientific reasoning toward more analytic and physically grounded reasoning remains an open question. As shown in Fig. 2, analogous to human researchers, LLM agents engage in a workflow of loading files into context, executing bash commands, and committing changes, with active interaction with their environment: usually a Docker container we build.

Current scientific reasoning benchmarks evaluate high-level conceptual understanding [4][5][6][7][8] or isolated coding questions [9], but they stop short of requiring LLMs to navigate, interpret, and modify real scientific repositories. As a result, these benchmarks fundamentally underrepresent the skills needed to be a successful scientist. Across nearly all of academia, there are numerous groups whose primary workflow entails developing ideas within large scientific repositories; this is perhaps best demonstrated by the author lists of popular scientific repositories [10][11][12][13][14][15][16][17][18]. Existing SWE benchmarks [19,20] are not much better, as they do not reflect the coding paradigms, numerical stability constraints, cross-language heterogeneity, and HPC execution environments that dominate scientific computing. Although SWE-bench includes one scientific repository, Astropy, this is by no means a representative sample. As a result, a fundamental gap remains: no benchmark currently measures an LLM’s ability to resolve issues or implement new features through patch synthesis inside real, large-scale scientific software ecosystems.

To address this gap, we introduce AInsteinBench, a new benchmark designed to evaluate performance in scientific computing development as practiced in real research environments. Rather than extending existing software-engineering benchmarks, AInsteinBench is constructed to measure how well an agent can operate as a scientific researcher working through code: developing, diagnosing, and maintaining computational implementations that embody modern scientific theory. The benchmark is grounded in large, production-grade scientific repositories spanning quantum chemistry (PySCF), quantum computing (Qiskit), cheminformatics (RDKit), fluid dynamics (AMReX), numerical relativity (Einstein Toolkit), and molecular dynamics simulation (OpenMM). Unlike standard software projects, these codebases encode domain-specific scientific structure: from density-functional theory and self-consistent field formulations to symplectic integration, constrained optimization, and PDE discretizations. Successfully modifying such systems requires reasoning not only about code, but also about the underlying scientific assumptions, invariants, and numerical behavior. By curating tasks from these environments, AInsteinBench moves toward evaluating whether models can translate scientific understanding into correct and stable computational practice. This focus aligns the benchmark with the realities of scientific programming, where progress is measured by preserving physical correctness and numerical robustness.

We construct this benchmark through a rigorous multi-phase pipeline inspired by SWE-bench but extended to meet scientific reproducibility standards. First, we identify high-impact scientific repositories with active developer communities and well-maintained test suites, ensuring both scientific relevance and executability. Second, we collect scientifically meaningful issue-linked pull requests and extract before/after code states, environment specifications, and domain-specific metadata. Third, we build fully containerized execution environments, capturing dependencies such as MPI, domain-specific compilers, and scientific development stacks. Fourth, we apply strict filtering criteria that isolate issues exhibiting clear numerical or logical errors.

Scientific Reasoning Benchmarks. Existing scientific reasoning benchmarks are largely question answering-oriented [5]. FrontierScience pushes further toward research-oriented assessment by separating an Olympiad track from a rubric-graded Research track of domain-authored tasks [27].

Foundation Models for Science. A growing line of work aims to endow language models with stronger scientific priors through domain-targeted pre-training and instruction tuning, without introducing explicit agentic control loops. Early efforts emphasized scientific and biomedical language modeling via in-domain pre-training (e.g., SciBERT, BioBERT, PubMedBERT), improving scientific NLP as a substrate for downstream reasoning and knowledge use [28][29][30]. More recent work extends this paradigm to generative scientific foundation models, including domain-specialized biomedical generators such as BioGPT and broader science-focused LLMs such as Galactica, which are trained on large scientific corpora and evaluated on scientific QA, knowledge-intensive tasks, and technical text generation [31,32]. In parallel, instruction- and adaptation-based approaches build explicit scientific instruction data or specialized scientific LLMs: [33] propose SciInstruct and report gains in college-level scientific reasoning, [34] introduce ChemDFM for chemistry-centric QA and molecular reasoning, and [35] develop SciDFM, a mixture-of-experts scientific LLM trained at large scale.

LLM Agent for Science. A complementary line of research studies LLM-based agents that couple language reasoning with tools, simulators, or multi-step interaction loops to perform scientific tasks beyond standalone question answering. Representative systems integrate domain tools for chemistry and materials science [2,36], or evaluate agents as biomedical hypothesis generators that iteratively propose, critique, and refine hypotheses from background evidence [37]. Recent benchmarks further connect agents to executable scientific simulators, enabling iterative in silico experimentation and feedback-driven reasoning [38]. While these works emphasize planning-acting-observing loops and experimental orchestration, they largely focus on hypothesis generation or simulation control, rather than repository-scale scientific software development and code-level reasoning.

Agentic Coding Benchmarks. Recent efforts have introduced increasingly realistic benchmarks for evaluating LLMs on software engineering tasks. SWE-Bench formulates issue resolution as patch generation over real Python repositories [39]. Its extension, SWE-Bench Verified, incorporates human validation to improve alignment between issues and reference patches, although the setting remains monolingual and primarily oriented toward small, localized bug fixes [40]. Building on this line of work, SWE-Gym converts SWE-Bench tasks into an interactive environment, enabling multi-step agentic editing rather than single-shot patch generation, while inheriting the same task distribution and Python-only scope [41]. More recently, Multi-SWE-Bench expands the evaluation to a multilingual setting, introducing issue-resolution tasks across seven widely used programming languages [20].

The Need for a Benchmark for Scientific Computing Agents. Prior evaluation of LLMs for science largely falls into two regimes: closed-world scientific reasoning and QA benchmarks (e.g., GPQA and related expert-level test sets) [4,5,42], and agentic systems that close the loop via tools or simulators (e.g., SciGym and hypothesis-generation frameworks) [37,38]. In parallel, software-engineering benchmarks such as SWE-Bench, SWE-Bench Verified, SWE-Gym, and Multi-SWE-Bench evaluate repository-based issue resolution [19,20,40,41], but they overwhelmingly reflect general-purpose software and underrepresent the knowledge and skills required in scientific computing. As a result, existing evaluations do not directly measure whether agents can perform the core computational work of modern research: navigating, modifying, and validating production scientific repositories where correctness is scientific and numerical, not merely syntactic or local. AInsteinBench is introduced to fill this missing intersection.

In this section, we explain the process by which we produce, select, and verify the questions in the benchmark. We first list the scientific computing projects we select and why we select them. Then two categories of tasks, feature implementation and issue resolution, are defined on these projects. After that, we explain the details of how we produce questions in Data Curation. Finally, the questions are scrutinized and verified by the human experts we recruited.

Table 1 Overview of selected scientific repositories and their primary features.

• PySCF (Quantum Chemistry): High-accuracy electronic-structure solvers; modular Python interfaces.
• Qiskit (Quantum Computing theory and infrastructure): Quantum circuit construction and simulation; hardware-backend execution interfaces.
• Einstein Toolkit (Computational Astrophysics): Large-scale relativistic simulations; data-calibrated pipelines connecting theory and observation.
• AMReX (Computational Physics and Numerical Infrastructure): High-performance AMR grid framework; reusable solvers and mesh utilities.
• OpenMM (Molecular Dynamics): Force-field molecular dynamics simulations; optimized kernels for biomolecular systems.
• RDKit (Cheminformatics): Molecular property prediction; cheminformatics algorithms and fingerprint utilities.

Our benchmark focuses on scientific codebases that are widely used and representative of real research workflows. We select repositories using four criteria:

• Acceptance. We prioritize repositories that have become de facto standards within their respective research domains. Such projects typically accumulate substantial user bases, community discussions, and long-term citations, reflecting both practical relevance and sustained scientific trust.

• Breadth. We include projects that differ in disciplinary focus (e.g., quantum chemistry, numerical relativity, molecular dynamics, and fluid simulation) and in software stacks, spanning Python front ends and performance-critical C/C++/Fortran kernels.

• Maintenance. We select repositories with active issue discussions, code reviews, and regular releases, and with well-structured test suites that exercise core functionality and enable reliable evaluation.

• Unit tests. Each repository provides a self-contained, reproducible build-and-test workflow that can be executed at fixed commits, allowing controlled evaluation without relying on external data or environment-specific configuration.

Based on these principles, we select a set of widely used scientific software repositories that satisfy all four criteria and collectively represent the diversity of modern computational research practice. These repositories are summarized in Table 1.

Each question in the benchmark represents a real scientific computing project’s development task. For every task, we provide the full problem context: the issue/PR description (for repo-crawled questions), the “before” code state, the test patch used to expose the error, and the “after” code state representing the ground-truth fix. The agent interacts with this environment through file edits and test executions.
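For concreteness, one can picture each instance as a small record bundling this context. The sketch below is illustrative only; the field names are ours rather than the dataset's actual schema.

from dataclasses import dataclass, field

@dataclass
class BenchmarkInstance:
    """Hypothetical container for one AInsteinBench task (illustrative only)."""
    repo: str                 # e.g. "openmm/openmm"
    base_commit: str          # "before" code state
    problem_statement: str    # issue/PR description, possibly supplemented with API docs
    test_patch: str           # diff adding/strengthening the tests that expose the error
    gold_patch: str           # "after" code state, kept as a reference answer only
    fail_to_pass: list[str] = field(default_factory=list)  # tests that must flip to passing
    pass_to_pass: list[str] = field(default_factory=list)  # tests that must keep passing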

Feature-implementation tasks. FEA-Bench has addressed the question of whether LLMs are capable of incremental software development across large and heterogeneous codebases [43]. In contrast, our benchmark focuses on a more domain-specific capability: assessing whether models can correctly implement concrete computational modules within established scientific computing repositories. For instance, in our production of synthesized EinsteinToolkit questions, we ablate a feature (a “Thorn” of Cactus) by removing part of the source code and ask the LLM to implement the ablated feature by completing the missing code. In addition, a few repo-crawled questions also involve feature implementation.

Bug-fixing tasks. Tasks that involve fixing a science-related bug, touching, for example, physical conservation laws, group representations, or calculus. The relevant bugs typically propagate across multiple files and require understanding both the scientific context and the software architecture. Most of the repo-crawled questions fall into this category.

The questions in our benchmark are produced by two kinds of pipelines: Synthesized Questions, where we ablate some functionality of a codebase and ask the LLM to reimplement it, and Repo-crawled Questions, where each question comes from a pull request and its related issues on GitHub.

• Pull Request Crawling. We begin by cloning each repository and collecting all pull requests via the GitHub API, extracting metadata such as the issue description, base commit, fix patch, and associated test updates. This yields the “before” and “after” code states for potential benchmark instances. To ensure high-quality tasks, we retain only PRs that satisfy three criteria: (1) they address a clearly defined scientific or numerical bug rather than documentation, refactoring, or large feature additions; (2) they include test updates that expose the failure and validate the fix; and (3) they are merged into the main branch, indicating correctness and compatibility with the repository. This filtering stage produces a curated pool of scientifically meaningful, test-verifiable code changes from which benchmark tasks are constructed.

• Environment Configuration. Software repositories evolve over time (dependencies change, APIs shift, and default behaviors vary across releases), making historical reproducibility nontrivial. To ensure robust and consistent evaluation, we reconstruct an environment for each pull request that mirrors the dependency stack at or near the time it was created. Each instance is executed inside a minimal, isolated Docker container that installs the required libraries, configures build paths, and enables deterministic test execution. This design guarantees cross-machine reproducibility and prevents interference between repositories or dependency versions.

• FP Filtering. Following the data-construction pipelines of SWE-Bench and Multi-SWE-Bench [19,20], we collect PRs from actively maintained scientific repositories, retain those that resolve documented issues and modify tests, and verify that each instance produces a clear fail-to-pass transition when executed in the reconstructed environment.

• New Components Extraction and Feature Implementation. Furthermore, to enhance the quality of the benchmark dataset, specifically for problems involving new feature requests, we implemented a supplementation pipeline that augments issue descriptions with precise API documentation. In many cases, when a PR introduces new functionality (features), the original issue description, or the PR description itself, lacks sufficient detail about the new API’s signature and usage. This scarcity of information makes it nearly impossible for AI agents to implement the new features, since doing so requires guessing the new function signatures. Rather than discarding these valuable problems, we can ‘supplement’ the issue descriptions.

Our supplementation process focuses on “adding docstrings” in the form of typed API signatures to the problem description. This ensures that the agent has immediate access to the definition of new tools available in the codebase. The pipeline consists of the following steps: (1) We analyze the git diff of the PR’s “fix patch” to identify all newly introduced function and class definitions and extract the raw signatures, including parameter names and available type annotations. (2) We filter the extracted APIs. By parsing the “test patch” associated with the PR, we identify which of the new symbols are actually utilized in the tests. Only these relevant APIs are selected for supplementation, ensuring the added context is directly applicable to the verification logic. (3) Fully typed API signatures are formatted into a standardized “Minimal API” block. This block includes the function/class location, signature, and a structured list of inputs and outputs. This documentation is appended to the original issue description in the dataset, effectively “adding docstrings” that describe the new feature’s interface to the problem statement.
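For the Python portions of a patch, the signature-extraction step can be sketched as follows. This is an illustrative simplification (function name and scope are ours); a real pipeline would additionally restrict attention to definitions introduced by the fix patch and exercised by the test patch.

import ast

def extract_signatures(post_patch_source: str) -> list[str]:
    """Return typed signatures of top-level functions/classes in a (post-patch) Python file.

    Illustrative only: the benchmark pipeline further filters these symbols down
    to those actually used by the associated test patch.
    """
    tree = ast.parse(post_patch_source)
    signatures = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(
                a.arg + (f": {ast.unparse(a.annotation)}" if a.annotation else "")
                for a in node.args.args
            )
            ret = f" -> {ast.unparse(node.returns)}" if node.returns else ""
            signatures.append(f"def {node.name}({args}){ret}")
        elif isinstance(node, ast.ClassDef):
            signatures.append(f"class {node.name}")
    return signatures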

• Question Production. We begin by identifying computational modules or features in key scientific software repositories, such as “Thorns” in the EinsteinToolkit framework. Similar workflows can be produced analogously for other repositories. We selectively ablate a feature, typically by removing or masking parts of the source code within the module (e.g., files in src/), and construct a task prompt based on the accompanying documentation, interface specifications, remaining code context, and problem description. Each instance is designed to simulate a realistic developer scenario in which an agent must reconstruct or implement a missing component from the intended functionality, such as filling the grid with a certain spacetime metric, implementing a source term for the right-hand side of differential equations, or performing a spherical decomposition for finding apparent horizons. An example is given in Appendix B.

• Test Verification. Since a desired functionality can usually be implemented in different ways, the original code serves only as a reference answer, not as a verification oracle. Instead, each feature task is paired with one or more test simulations defined by parameter files and output checks (e.g., reference data in test/ directories for EinsteinToolkit thorns) that validate the correct behavior of the implementation. An agent’s solution is accepted if, when inserted in the appropriate code location, it passes all tests in the reconstructed environment, thereby ensuring that the feature behaves as intended and satisfies the scientific correctness criteria.
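For illustration, an output check of this kind can be as simple as comparing simulation output against stored reference data within a numerical tolerance. The sketch below is ours; the file formats and tolerances are placeholders, not the Einstein Toolkit's actual test harness.

import numpy as np

def check_against_reference(output_path: str, reference_path: str,
                            rtol: float = 1e-8, atol: float = 1e-10) -> bool:
    """Accept the implementation if its output matches the reference data within tolerance."""
    out = np.loadtxt(output_path)   # e.g. a grid-function norm sampled over time
    ref = np.loadtxt(reference_path)
    if out.shape != ref.shape:
        return False
    return bool(np.allclose(out, ref, rtol=rtol, atol=atol))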

To ensure scientific fidelity and evaluation reliability, we perform domain-expert human verification as the final curation stage. This step serves two purposes: (i) confirming that each instance encodes a clear and meaningful scientific background, and (ii) auditing and, when necessary, revising the associated tests to ensure adequate coverage.

Issue Verification. Each candidate instance is reviewed by domain experts to determine whether the underlying change reflects a scientifically meaningful issue. Issues that are purely engineering-driven and lack scientific relevance are excluded. Crucially, experts also assess clarity: the issue/PR description must provide enough information for an agent to infer the failure mode and the expected behavior using only in-repo context. Instances with ambiguous intent, missing reproduction conditions, or underspecified expected outcomes are filtered out; when the intent is sound but the description is incomplete (most commonly for feature requests), we supplement the prompt with minimal, test-relevant API documentation. Finally, experts assign an empirical difficulty tier (e.g., easy/medium/hard) reflecting the scientific depth and engineering complexity required to solve the instance, which we later use for stratified analysis; see the difficulty grading rubrics in Appendix A.

Test Verification. A benchmark is only as reliable as its tests. For each retained instance, we audit the test patch to ensure that it faithfully captures the intended scientific requirement and is robust to common shortcut solutions. We explicitly check two failure modes and revise tests accordingly: (i) under-coverage: tests are too weak and admit spurious fixes; in these cases, we strengthen tests by adding regression cases. (ii) over-coverage: tests over-constrain the solution by encoding a particular implementation strategy, thereby rejecting alternative but scientifically valid fixes; in these cases, we remove tests that enforce irrelevant or overly prescriptive implementation details.

False Positives. A false positive occurs when a model-generated patch passes the benchmark tests but does not resolve the underlying scientific issue. This typically arises from under-coverage, where tests validate superficial behavior (e.g., an exception path or a loose numerical bound) rather than the scientific constraint itself. We mitigate this by strengthening tests and by adversarially inspecting candidate patches for common shortcuts; Case Study C provides an illustrative example.

False Negatives. A false negative occurs when a patch resolves the scientific issue but fails the benchmark due to over-coverage. We mitigate this by removing prescriptive assertions and validating scientific properties up to appropriate numerical tolerance; Case Study C illustrates how overly strict iteration caps can incorrectly reject valid fixes.

In total, AInsteinBench contains 244 problems drawn from real pull requests and feature implementations across six repositories and multiple scientific domains. Each instance corresponds to a maintainer-validated change gated by repository tests, yielding an evaluation suite. The difficulty tiers across the five domains are shown in Fig. 4.

Coding Agent. We deploy the CodeAct Agent, a modular framework that integrates LLMs with a containerized execution environment and programmatic tools [44]. For each benchmark instance, the agent provisions an isolated Docker container with the appropriate repository snapshot and dependencies, providing a restricted command-line interface. This setup enables the agent to inspect files, modify code, install packages, and run tests in a reproducible, sandboxed environment. The agent follows a plan-act loop: first, it gathers information (e.g., searching the codebase, inspecting test failures), then proposes concrete edits. After each edit, the agent invokes custom reproduction scripts, analyzes failures, and iteratively refines its patch, ensuring it produces changes that robustly pass the repository’s tests.
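The sketch below gives a schematic view of such a plan-act loop. It is our own simplification, not the CodeAct framework's actual API; the reproduction-script name and helper functions are hypothetical.

import subprocess

def run_in_container(container: str, command: str) -> tuple[int, str]:
    """Execute a shell command inside the task's Docker container (sandboxed environment)."""
    proc = subprocess.run(
        ["docker", "exec", container, "bash", "-lc", command],
        capture_output=True, text=True,
    )
    return proc.returncode, proc.stdout + proc.stderr

def agent_loop(container: str, propose_action, max_steps: int = 100) -> bool:
    """Schematic plan-act loop: inspect the repo, edit, run a reproduction script, repeat.

    `propose_action` stands in for the LLM policy: given the latest observation it
    returns a shell command (a search, a file edit, a test run, ...).
    """
    _, observation = run_in_container(container, "git status && ls")
    for _ in range(max_steps):
        command = propose_action(observation)              # LLM decides the next action
        _, observation = run_in_container(container, command)
        exit_code, _ = run_in_container(container, "python reproduce_issue.py")  # hypothetical script
        if exit_code == 0:                                  # reproduction now passes
            return True
    return False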

Models. We evaluate a diverse set of state-of-the-art large language models, spanning both proprietary and open-source models. Specifically, we include GPT-4o, GPT-5.1, GPT-5-high, Claude 4.5 Sonnet, Claude 4.5 Opus, Grok-4, Gemini 3 Pro, Gemini 2.5 Pro, and Doubao Seed 1.8. We additionally report results for open-source models Kimi K2, GLM 4.6, and DeepSeek V3.

Evaluation Metric. Given the structure of this dataset, the natural evaluation metric is the proportion of tasks successfully addressed by the agent’s patch. Beyond the basic problem solve rate, we evaluate the models’ ability along the following axes: (1) scientific domain, measuring variation across physics-, chemistry-, and engineering-focused repositories; (2) issue localization, assessing whether models edit the correct files implicated by a bug; (3) problem difficulty, evaluating robustness across empirically stratified tiers; and (4) efficiency, quantifying how efficiently models find, fix/implement, and test the prescribed issue/feature. Together, these axes highlight both structural challenges in scientific codebases and systematic strengths and weaknesses of current LLM-based coding agents.
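For concreteness, the headline resolve-rate computation might look like the following sketch; the per-instance record layout is hypothetical.

from collections import defaultdict

def resolve_rates(results: list[dict]) -> tuple[float, dict[str, float]]:
    """Overall resolve rate and a per-domain breakdown.

    Each result is assumed to look like
    {"instance_id": "...", "domain": "quantum chemistry", "resolved": True}.
    """
    overall = sum(r["resolved"] for r in results) / max(len(results), 1)
    by_domain = defaultdict(list)
    for r in results:
        by_domain[r["domain"]].append(r["resolved"])
    per_domain = {d: sum(v) / len(v) for d, v in by_domain.items()}
    return overall, per_domain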

Performance Across Scientific Domains. Table 2 summarizes resolution rates across repositories and scientific fields. We observe substantial variation across domains. Cheminformatics-oriented repositories such as RDKit and HPC-focused repositories such as AMReX exhibit lower solve rates. In the case of AMReX, this is due to the significant engineering difficulty of the issues pulled from the repository. For RDKit, the low resolve rate stems from the demand for genuine spatial reasoning, a capability where current models still lag behind [45].

In particular, a large fraction of RDKit tasks hinge on correctly interpreting subtle aspects of a molecule’s 3D structure. In contrast, repositories like Einstein Toolkit benefit from rich in-repo tooling and testing harnesses, and hinge much more on correct mathematical implementations of scientific equations. This reduces the agent’s burden, yielding higher success rates.

Performance Across Problem Difficulty.

We assess the distribution of resolve rates across difficulty tiers. We bucket difficulty scores as easy [1, 2.33], medium (2.33, 3.66], and hard (3.66, 5]. Overall, performance decreases monotonically with difficulty (Fig. 5). The difficulty-stratified results incorporate the human review from recruited domain experts, whose feedback is based not only on the question statement, but also on sampled trajectories of LLM agents solving the question in the environment. For each question, the difficulty is marked not only by a score, but also by textual feedback, so that different experts can cross-validate.
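A minimal sketch of this bucketing rule (tier boundaries taken from the text above; the implementation is ours):

def difficulty_tier(score: float) -> str:
    """Map an expert difficulty score in [1, 5] to easy [1, 2.33], medium (2.33, 3.66], hard (3.66, 5]."""
    if score <= 2.33:
        return "easy"
    if score <= 3.66:
        return "medium"
    return "hard"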

Efficiency and Search Cost. During the agent’s work, each additional round of conversation not only incurs computational cost but also consumes the model’s context window and increases the risk of compounding errors [46]. To quantify efficiency, we measure the number of iterations each method requires to reach a terminal outcome (success or failure) (Figure 6).

The trajectory-level breakdown reveals that virtually all problems take at least 20 iterations, a startup period in which the models explore the structure of the repository, localize the problem, and attempt to recreate it. Most solutions accumulate quickly in the first 20-60 steps, after which additional steps show only moderate improvement. Beyond this point, additional steps primarily contribute to failed trajectories rather than converting failures into successes [47]. This pattern suggests that success depends on satisfying a small number of tightly coupled global conditions, such as identifying the correct files and implementing a fix that is consistent across all affected components. In contrast, trajectories that fail to meet these conditions early are rarely able to correct their mistakes and transition into successful solutions, leading to a steady accumulation of unsuccessful attempts as step budgets increase.

Issue Localization. Fixing scientific software often requires identifying where in a large, multi-language codebase the underlying defect resides. We evaluate localization independently of correctness by determining whether a candidate patch interacts with the required files. Here ‘interacts’ is defined as editing, viewing the contents of, or deleting the file in question. Due to the complexity of these patches and the existence of redundant files (such as Readme.md, certain header files, etc.), we require a pipeline for determining what the ‘required’ changes are.
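A minimal sketch of how such a localization score could be computed from per-trajectory logs; the set-based scoring shown here is our simplification of the pipeline described above.

def localization_score(interacted_files: set[str], required_files: set[str]) -> float:
    """Fraction of required files the agent interacted with (edited, viewed, or deleted)."""
    if not required_files:
        return 1.0
    return len(interacted_files & required_files) / len(required_files)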

As illustrated in the benchmark results, we observe a distinct stratification of model capabilities that suggests a shift in the primary bottlenecks of the field. While localizing complex repository-level issues remains an open challenge, top-performing models are increasingly proficient at identifying the relevant codebase coordinates. Models such as DeepSeek V3.2 and Claude Opus 4.5 achieve localization scores exceeding 0.70, approaching a ceiling where the “search” phase is not a dominant failure mode. This indicates that the frontier of improvements ought to shift from retrieval-augmented discovery toward the higher-order reasoning required for functional synthesis.

In an idealized evaluation framework, implementation correctness is a strictly stronger condition than localization; a correct fix inherently implies the relevant files were identified. In practice, however, binary localization metrics are inherently brittle. As the number of required files grows, the quality of the localization score tends to degrade, since models begin to identify valid alternative fix locations that differ from the ground-truth “required files”. That said, Fig. 8 indicates that most issues involve few required files, reducing ambiguity and preserving the practical utility of the localization metric. (Overall difficulty is defined as the mean of engineering and scientific difficulty.)

One stark exception to this is the AMReX repository, which exhibits the highest median number of required files, at five. This increased dependency correlates with markedly lower resolve rates, highlighting a distinct failure mode in current frontier models: the lack of robust cross-file reasoning. The higher-order synthesis required to implement a consistent fix across five or more files remains a primary bottleneck, even among top-performing models.

Aggregated success rates capture whether a model ultimately resolves a task, but they obscure the reasoning processes that lead to success or failure. To better interpret the results reported above, we examine several representative case studies that make explicit the kinds of scientific judgments required.

In molecular simulation (Case Study C.1), the task is not just an issue to fix; the implementation is missing an entire physical contribution: a self-energy term required for invariance under changes to PME parameters. This case study is particularly illustrative because resolution did not stop at correcting the underlying implementation, but required extending the existing testing infrastructure to distinguish genuine physical invariance from residual numerical error by systematically tightening grid resolution and tolerance parameters, reflecting domain-aware reasoning about Ewald truncation and discretization effects. In quantum computing (Case Study C.2), resolution begins by recognizing that, in Qiskit’s implementation, the recursive part of the Solovay-Kitaev algorithm operates on rotation matrices in SO(3) rather than on unitary matrices in SU(2). To apply the algorithm to decompose a unitary in SU(2), a mapping from SU(2) to SO(3) is therefore required. This mapping introduces a global-phase ambiguity, since two unitary matrices that differ only by a global phase of -1 are mapped to the same element in SO(3). An LLM needs to understand both why the SU(2)-to-SO(3) conversion is necessary and what information may be lost in this process; however, LLM agents struggle with both aspects. Although it is also possible to implement the Solovay-Kitaev algorithm entirely within SU(2), doing so requires a deeper understanding of the algorithm and more extensive modifications to Qiskit’s code, which presents an equally significant challenge for LLM agents.

In numerical relativity (Case Study C.3), the overall correctness is determined by the numerical accuracy of the simulation using the generated code versus the reference data (provided in the test). In the reasoning process, enforcing physical symmetries and understanding analytic limits are critical to implementing the code accurately. For instance, the example demonstrates the mistakes the model made in (a) the calculus limit at n → 0 and (b) missing absolute-value signs. On the other hand, once we recognize these weaknesses and correct the code by changing just two lines, the simulation results become accurate. This shows that modern models are able to connect concise analytic constructions from the literature to their concrete numerical realization in large, unfamiliar HPC codebases, while maintaining stability and physical consistency.

Failure Modes. Beyond the software engineering failure modes emphasized by SWE-bench-style evaluations [47], our benchmark exposes a distinct class of limitations that arise from scientific correctness rather than code repair. In these settings, failures are not only about locating files, satisfying tests, or repairing broken APIs, but about respecting domain-encoded invariants. Unlike the former, these constraints are fundamental to the underlying science. As a result, models often produce implementations that are locally plausible and numerically stable, but scientifically invalid.

Across repositories and scientific domains, we observe recurring errors that reflect gaps in mathematical precision, conservation laws, spatial reasoning, and adherence to established scientific conventions. These failures frequently manifest as subtle global inconsistencies: a correct-looking local fix that silently violates an invariant whose consequences only emerge downstream. The failure modes summarized below illustrate that current frontier LLMs struggle most when correctness depends on real physical and mathematical laws, geometric reasoning, or community conventions. Together, these results suggest that scientific software repair poses challenges that are qualitatively different from existing SWE-centric benchmarks.

• Implements the right structure but gets a crucial analytic detail wrong (e.g., missing a limit term or a sign/absolute value in a series, Case C.3; failing to notice that the SU(2)-to-SO(3) conversion introduces a phase ambiguity, Case C.2).

• Proposed fix breaks an exact invariant or violates strict conservation conditions, including conservation of momentum, energy, or particle number (e.g., electron-number conservation under smearing for odd N; Case C.7).

• Inability to reliably reason about global three-dimensional geometry and spatial constraints, resulting in implementations that satisfy local or axis-aligned criteria but violate true geometric guarantees (Case C.5).

• Ignores or misapplies established scientific or codebase conventions that encode global invariants, causing subtle nonlocal inconsistencies despite locally reasonable fixes (Case C.6).

• Fails to distinguish scientifically meaningful constraints from representational artifacts, leading to fixes that stabilize behavior across input orderings or encodings while violating domain correctness (Case C.8).

Despite its scale and scientific grounding, AInsteinBench has several limitations.

First, test suites remain imperfect scientific oracles. Although all instances are drawn from maintainer-accepted pull requests and undergo human verification, unit tests in scientific repositories frequently codify imperfect correctness conditions rather than fool-proof physical or mathematical invariants. As demonstrated by our false-positive and false-negative case studies, tests may be either too permissive, allowing scientifically invalid solutions, or overly prescriptive, rejecting alternative but valid implementations. While our curation pipeline mitigates the most egregious cases, completely eliminating such effects remains a major bottleneck in this pipeline. Second, the benchmark is inherently biased by repository practices. Scientific communities differ widely in testing culture, documentation quality, and software architecture. Repositories with well-maintained code, established and consistent infrastructure, and rich test harnesses are more amenable to PR-based benchmarking, while equally important scientific codes with sparse tests or monolithic designs are underrepresented. As a result, we acknowledge that benchmark coverage is limited to projects that adhere to software-engineering norms.

Third, the benchmark is constrained to self-contained, lightweight execution environments. Each task must run reliably inside a container on a small number of CPUs, and rely on off-the-shelf packages. As a result, we exclude pull requests whose validation depends on intensive computational resources (e.g., long-running simulations, large MPI jobs, GPU-heavy kernels), or runtime communication with external software systems. While such tasks are scientifically valuable and representative of real research workflows, incorporating them substantially increases the difficulty of the benchmark, with complexity compounding across environment reconstruction, data dependencies, and evaluation design. Crucially, this would also reduce reproducibility and accessibility for third parties, making independent verification and reuse significantly more challenging.

AInsteinBench introduces the first large-scale benchmark that evaluates LLM agents inside real, production scientific software repositories. By grounding evaluation in executable environments, domain-specific failure modes, and maintainer-authored fixes, the benchmark moves beyond abstract scientific reasoning and conventional software-engineering tasks to assess whether models can function as genuine scientific computing agents. Our results highlight both encouraging progress and persistent gaps, particularly where complex spatial reasoning, hardware-level engineering, and cross-file reasoning are required. We hope this benchmark serves as a foundation for future work on reliable, domain-aware agents capable of assisting scientific software development at scale.

AInsteinBench introduces the first large-scale benchmark that evaluates LLM agents within real, production scientific software repositories, grounded in executable environments, maintainer-authored fixes, and domain-specific correctness criteria. By design, the benchmark moves beyond abstract scientific reasoning and conventional software-engineering tasks, probing whether models can operate as genuine scientific computing agents capable of respecting physical invariants, mathematical structure, and established community conventions. Our results reveal both clear progress and persistent gaps, particularly when problems demand global spatial reasoning, hardware-level engineering insight, or tightly coupled cross-file updates.

At the same time, the benchmark surfaces encouraging evidence that frontier models can succeed on nontrivial scientific reasoning when the relevant structure is properly recognized. Successful trajectories include resolving subtle global-phase ambiguities in quantum compilation (Case Study C.2), enforcing analytic limits and physical symmetries in numerical relativity (Case Study C.3), and identifying missing physical contributions in molecular simulation that require both implementation changes and test-suite extensions to validate invariance properties (Case Study C.1). In these cases, models demonstrate the ability to connect compact theoretical insights to concrete, large-scale scientific codebases, going beyond surface-level fixes to reason about physical correctness.

Together, these results position AInsteinBench as both a stress test and a progress marker: it exposes scientific failure modes that lie outside the scope of existing SWE benchmarks, while also documenting that LLM agents are beginning to exhibit the kinds of domain-aware reasoning required to assist scientific software development at scale. We hope this benchmark serves as a foundation for future work on reliable, domain-aware scientific computing agents.

The authors are listed in alphabetical order by their first names. Some names refer to the authors’ internal aliases at the company.

Our expert team consists of 10 experts with PhD degrees in the relevant fields. We provide them with the following rubrics for their annotation of problem difficulty. Scientific Depth (1-5). Scientific depth measures the minimum level of domain-specific scientific knowledge required to correctly understand and resolve a task.

• Score 1. The task can be completed without domain-specific scientific knowledge; a non-expert can solve it by following explicit instructions or existing code patterns.

A question in our synthesized EinsteinToolkit data can be formalized by the following statement. This is also the prompt template we send to the LLM when evaluating single-round performance (i.e., without using an agent), where {src_filename}, {thorn_name}, {interface}, {schedule}, {param}, {configuration}, {doc_context_section}, and {context} are specific to each question. See the sample questions for an example.

A question can be uniquely specified by {thorn_name} and {src_filename}, meaning that the corresponding source file is removed from the Thorn and the LLM is asked to generate the missing code. For the LLM agent, we provide a Docker container where the specified file is deleted and the agent can interact with the environment via shell commands. For single-round generation, we provide the relevant context using the prompt template shown above.

As an example, for the question where the LLM is tasked with implementing the Misner metric for a wormhole spacetime, the corresponding {thorn_name} is EinsteinInitialData/IDAnalyticBH, and the corresponding {src_filename} is Misner_standard.c.

For the LLM agent, we provide a Docker container where the relevant file is deleted. For single-round generation, we provided the following content (to be substituted into the corresponding {…} positions in the prompt template above).

{interface}

# Schwarzschild parameters
# ------------------------
REAL mass "Mass of black hole"
{
  : :: "Not sure if it can be negative or not"
} 2.0

# Kerr parameters
# ------------------------
REAL a_Kerr "Angular momentum parameter of black hole"
{
  -1:1 :: "Between +1 and -1"
} 0.1
# ------------------------

Problem: The energy calculations for systems using the AMOEBA and HIPPO force fields do not account for the correction due to the neutralizing plasma in periodic boundary conditions. This omission can lead to inaccuracies in the computed potential energy for systems with a net nonzero charge, particularly when using Particle Mesh Ewald (PME) for long-range electrostatics. Expected Behavior: The energy calculations should include a correction term for the neutralizing plasma when PME is used. This ensures that the potential energy of systems with a net charge is accurately computed, consistent with the physical behavior of periodic systems. The energy should also remain consistent across different PME parameters, such as the alpha value, for systems with nonzero net charge.

The model first explored the OpenMM repository, locating the AMOEBA/HIPPO reference and common kernels and confirming the Python wrappers for ‘AmoebaMultipoleForce’. It then constructed a focused Python reproduction: a single charged particle in a periodic box with AMOEBA multipoles and PME enabled, comparing energies at α = 3.0 and α = 4.0; the large ∼1.3 kJ/mol discrepancy confirmed strong, unphysical α-dependence. Using Ewald theory, the model hypothesized that the missing neutralizing plasma term was the culprit, derived the correction

E_plasma = -(1/(4πϵ0)) · π q_tot² / (2 α² V),

and mapped each symbol to OpenMM’s internal quantities: the total charge q_tot from multipole parameters, the volume V from periodic box vectors, and 1/4πϵ0 from the existing ‘EPSILON0’/‘electric’ constants. It then implemented the correction in the AMOEBA common PME kernel (and analogously for HIPPO), ensuring the term is applied only when PME is active and accumulating the total charge once during parameter initialization/updates.
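A small numerical sketch of this correction term follows; the constant and unit conventions below are our assumptions, while the fix described above applies the equivalent expression inside OpenMM's AMOEBA/HIPPO PME kernels.

import math

ONE_4PI_EPS0 = 138.935456  # Coulomb constant in kJ·mol⁻¹·nm·e⁻² (assumed MD-style units)

def neutralizing_plasma_energy(q_total: float, alpha: float, volume: float) -> float:
    """Uniform-background (neutralizing plasma) correction for a net-charged periodic system.

    E_plasma = -(1/(4πϵ0)) · π · q_total² / (2 α² V),
    with q_total in elementary charges, alpha in nm⁻¹, volume in nm³ → energy in kJ/mol.
    """
    return -ONE_4PI_EPS0 * math.pi * q_total**2 / (2.0 * alpha**2 * volume)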

After inserting the correction, the model rebuilt OpenMM and reran the reproduction: the α-dependence shrank from ∼1.3 kJ/mol to ∼10⁻³ kJ/mol, and further tuning of the PME grid dimensions and error tolerances reduced the residual difference to ∼2 × 10⁻⁵ kJ/mol. The model also extended the fix to HIPPO by summing ‘coreCharge + valenceCharge’ per particle, and verified that existing C++ tests for the AMOEBA and HIPPO reference implementations still passed, indicating no regression in other electrostatics behavior. Crucially, the patch respects unit consistency and physical constants, correctly handles the periodic box volume, and ensures that the correction is applied only in the PME pathway, aligning with the analytical Ewald limit for charged periodic systems.

The original tests did not explicitly enforce α-invariance for net-charged AMOEBA/HIPPO systems, so the missing plasma term could have remained hidden. The final verification distinguished true physical invariance from residual numerical PME errors by tightening grid and tolerance parameters, demonstrating domain-aware reasoning about Ewald truncation and discretization effects.

The global phase of some circuits generated by SolovayKitaev seems wrong.

>>> from qiskit.transpiler.passes.synthesis import SolovayKitaev
>>> from qiskit.quantum_info import Operator
>>> from qiskit import QuantumCircuit, Aer, transpile, assemble
>>> c = QuantumCircuit(1)
>>> c.y(0)
>>> skd = SolovayKitaev(recursion_degree=3)
>>> dc = skd(c)
>>> dc.draw()  # global phase: 3π/4

Given that Operator(dc).data is approximately equal to [[0, i], [-i, 0]], which equals -Operator(c).data, the global phase of dc should be π rather than 3π/4 (the correct global phase should in fact be -π/4 instead of π).

In Qiskit’s implementation of the Solovay-Kitaev algorithm, the decomposition is performed in SO(3) rather than in SU(2). The conversion map from SU(2) to SO(3) is two-to-one and therefore introduces a possible global phase flip of ±1. The fix adds an explicit phase-alignment check and adjusts the resulting operator’s global phase angle by π whenever the negated approximation matches the target better.

The Solovay-Kitaev algorithm efficiently approximates an arbitrary single-qubit gate in SU(2) using gates from a fixed, finite gate set, typically {T, T†, H}. In Qiskit’s implementation, this is achieved by first converting a unitary matrix in SU(2) to a rotation matrix in SO(3), and then performing the core decomposition in SO(3). This example illustrates a scientifically meaningful bug arising from the fact that the homomorphism from SU(2) to SO(3) is two-to-one. As a result, the global phase has to be stored separately in order to reconstruct the original SU(2) unitary from its SO(3) representation. In the core fix, this phase ambiguity is resolved through an extra phase adjustment. It demonstrates the need for domain-aware debugging beyond standard software-engineering fixes.
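The sketch below illustrates the underlying mathematics rather than Qiskit's actual code: converting an SU(2) unitary to its SO(3) image via the adjoint map, and checking whether the approximation or its negation is closer to the target so the global phase can be adjusted by π.

import numpy as np

PAULIS = [
    np.array([[0, 1], [1, 0]], dtype=complex),      # σ_x
    np.array([[0, -1j], [1j, 0]], dtype=complex),   # σ_y
    np.array([[1, 0], [0, -1]], dtype=complex),     # σ_z
]

def su2_to_so3(u: np.ndarray) -> np.ndarray:
    """Adjoint map R_ij = (1/2) Tr(σ_i U σ_j U†); both U and -U map to the same rotation."""
    return np.array([[0.5 * np.trace(si @ u @ sj @ u.conj().T).real
                      for sj in PAULIS] for si in PAULIS])

def needs_phase_flip(target: np.ndarray, approx: np.ndarray) -> bool:
    """True if -approx matches the SU(2) target better than +approx.

    In that case the synthesized circuit's global phase should be shifted by π.
    """
    return np.linalg.norm(target + approx) < np.linalg.norm(target - approx)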

The LLM was asked to complete the Misner_standard.c for setting up initial data of wormhole spacetime in the EinsteinToolkit framework. The IDAnalyticBH thorn provides analytically-specified initial data for black hole evolutions, including the Misner solution [48], an early attempt to represent wormholes as initial data for Einstein’s equations.

Misner’s 1960 solution describes two massive objects instantaneously at rest whose “throats” are connected through a wormhole topology. Unlike a simple superposition, this is an exact solution to Einstein’s constraint equations. In a properly chosen coordinate system, the spacetime can be represented by a time-symmetric (vanishing extrinsic curvature Kij = 0) and conformally flat metric, gij = ψ⁴ δij.

The physics is encoded entirely in the conformal factor ψ, which solves the Hamiltonian constraint (reducing to ∇²ψ = 0 in this case). Misner’s ingenious method of images constructs ψ by summing contributions from image sources at locations determined by the wormhole topology.

The IDAnalyticBH thorn implementation inherits metric grid functions from ADMBase and conformal-factor variables from StaticConformal. The code must: (1) iterate over all spatial grid points, (2) at each point (x, y, z), sum the series for ψ up to the truncation limit nmax, (3) set the conformal metric components to gij = δij and the extrinsic curvature to zero, and (4) when metric_type=“static conformal”, compute and store the analytical derivatives ψ,i and possibly ψ,ij. The parameter µ (denoted mu in code) controls the throat of the wormhole.

Implementing this requires understanding: 1. Initial value formulation: The Einstein equations split into constraint equations (to be satisfied on an initial slice) and evolution equations. For time-symmetric data with Kij = 0, the constraint reduces to ⁽³⁾R = 0 on the spatial slice. 2. Conformal method: Writing gij = ψ⁴ δij with a flat conformal metric δij reduces the constraint to ∇²ψ = 0 (Laplace’s equation), enabling the method of images.

3. Method of images: The wormhole has periodicity in a toroidal coordinate µ with period 2µ0. Image sources appear at zn = coth(µ0n) for all integers n, each contributing mass ∝ |sinh(µ0n)|⁻¹. 4. Coordinate transformation: The natural toroidal coordinates (µ, θ, φ) must be converted to Cartesian coordinates (x, y, z) for numerical simulation. Each image source at zn contributes csch(µ0n)/rn, where rn = √(x² + y² + (z - zn)²).

The key computational task is evaluating the conformal factor ψ at each grid point. From Misner’s derivation, going to Cartesian coordinates, the conformal factor is

ψ(x, y, z) = Σ_{n=-∞}^{+∞} |sinh(µ0n)|⁻¹ / √(x² + y² + (z + coth(µ0n))²),

where the n = 0 term is understood as its limiting value, which equals 1.

The generated code failed numerical tests (grr_max: 2.26 vs. benchmark 80.18) despite a correct overall structure. It chose to iterate explicitly from -N to +N rather than exploiting symmetry:

psi_val = 0.0;  // Error: n=0 limit is 1.0, not 0
for (n = -nmax; n <= nmax; n++) {
  if (n == 0) continue;
  csch_val = 1.0 / sinh(mu_val * n);  // Negative for n < 0
  r1 = sqrt(xx*xx + yy*yy + (zz + coth_val)*(zz + coth_val));
  psi_val += csch_val / r1;  // Error: missing fabs()
}

Two physical errors: (1) The n = 0 term, lim_{n→0} [sinh(µn)]⁻¹ / √(x² + y² + (z + coth(µn))²), evaluates to 1.0 (seen explicitly by applying L’Hôpital’s rule), not 0. (2) Since sinh(µn) is negative for n < 0, omitting fabs() causes negative mass contributions, violating the physics; the absolute value in the original formula comes from a calculus trick that pulls a complete square out of the root. The reference code avoids both pitfalls by pairing the ±n terms and summing only over positive n. The fix is simple: treat the n = 0 term as its limit of 1.0 and take the absolute value of the csch factor, a change of just two lines.

This failure demonstrates the gap between general coding ability and domain expertise. The LLM successfully navigated the Cactus framework (understanding ADMBase and StaticConformal conventions, computing derivatives conditionally, handling grid iteration) and translated the mathematics to code competently. However, it missed physics-driven implementation choices: the reference code’s paired summation is not arbitrary style but encodes the physical symmetry n ↔ -n, automatically ensuring positive mass contributions. The LLM’s explicit -N to +N iteration is mathematically equivalent, but the model fails to recognize that (1) the absolute value in the formula is not decorative but enforces physical positivity, requiring fabs() when the symmetry is not exploited, and (2) the n = 0 singularity requires a limiting analysis yielding 1.0, not naive exclusion from a zero-initialized sum. These are not programming errors detectable by compilers or unit tests. Only domain knowledge and numerical validation against known solutions reveal them.
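For illustration, a correct evaluation of the summation described above can be sketched in Python as follows; the actual thorn is C code inside Cactus, and this standalone version only demonstrates the two physics-driven choices: the n = 0 limit and the paired, positive-weight ±n images.

import math

def misner_psi(x: float, y: float, z: float, mu0: float, nmax: int) -> float:
    """Misner conformal factor via the method of images (illustrative sketch).

    The n = 0 image term has the finite limit 1.0, and the ±n images are paired so
    every contribution enters with positive weight |csch(mu0*n)|.
    """
    psi = 1.0  # n = 0 limit of csch(mu0*n)/sqrt(x^2 + y^2 + (z + coth(mu0*n))^2)
    for n in range(1, nmax + 1):
        csch = 1.0 / math.sinh(mu0 * n)               # positive for n >= 1
        coth = math.cosh(mu0 * n) / math.sinh(mu0 * n)
        r_plus = math.sqrt(x * x + y * y + (z + coth) ** 2)
        r_minus = math.sqrt(x * x + y * y + (z - coth) ** 2)
        psi += csch * (1.0 / r_plus + 1.0 / r_minus)  # ±n images paired
    return psi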


The agent was asked to extend PySCF’s Full Configuration Interaction (FCI) solver with new APIs that produce reduced density matrices (RDMs) up to fourth order, including spin-resolved blocks. Concretely, two feature functions were to be added to pyscf/fci/direct_spin1.py: 1. make_rdm1234: return spin-traced 1-, 2-, 3-, and 4-RDMs following PySCF’s FCI conventions. 2. make_rdm1234s: return spin-separated components, including 3-RDM blocks (aaa, aab, abb, bbb) and 4-RDM blocks (aaaa, aaab, aabb, abbb, bbbb). The implementation needed to respect existing operator-order conventions and a reorder flag that maps internal kernel outputs into the user-facing canonical ordering. The accompanying tests assert that spin-separated blocks reconstruct the spin-traced tensors when summed with the required index transpositions.

The problem statement provided typed function signatures:

def make_rdm1234(fcivec, norb, nelec, link_index, reorder) -> numpy.ndarray: …
def make_rdm1234s(fcivec, norb, nelec, link_index, reorder) -> numpy.ndarray: …

Despite the brevity, the naming and parameter patterns encoded strong affordances: make_rdm1234 clearly extends the existing make_rdm123 family in direct_spin1.py, while the suffix s matches established spin-separated routines (make_rdm1s, make_rdm12s, make_rdm123s). The shared argument schema (fcivec, norb, nelec, link_index, reorder) aligns with PySCF’s FCI interfaces and effectively points to the correct file and helper kernels.

Reduced density matrices encode k-body correlations of an N-electron wavefunction |Ψ⟩ via expectation values of strings of creation/annihilation operators. PySCF adopts a consistent convention in which higher-order tensors follow the index pattern dm[p, q, r, s, …], with the 1-RDM matching the mean-field convention. Spin separation decomposes the RDM into blocks based on the spin labels (α or β) of the involved orbital indices. For example, the 3-RDM decomposes into aaa, aab, abb, and bbb blocks, and the 4-RDM into aaaa, aaab, aabb, abbb, and bbbb blocks. These spin-resolved objects enable diagnostics and development work where the distribution of spin among correlated electrons matters.

The solution mirrors the established design patterns presented by the minimal API spec, as well as the pattern present in pyscf/fci/direct_spin1.py for lower-order RDMs:

  1. Spin-traced 1-4 RDMs. The implementation delegates all heavy computation to the existing kernel rdm.make_dm1234, followed by an optional call to rdm.reorder_dm1234 to enforce PySCF’s canonical operator ordering.

  2. Spin-separated 1-4 RDMs. As in make_rdm123s, the CI vector is first converted into a spinless representation over doubled orbitals. The spin-traced RDMs are then computed in this enlarged space and sliced into α/β blocks using spatial index ranges. The slicing logic generalizes the existing 2- and 3-RDM implementations to the full set of 3- and 4-RDM spin blocks.

pyscf/fci/direct_spin1.py (excerpt; code truncated):

def make_rdm1234(fcivec, norb, nelec, link_index=None, reorder=True):
    dm1, dm2, dm3, dm4 = rdm.make_dm1234(…)
    if reorder:
        dm1, dm2, dm3, dm4 = rdm.reorder_dm1234(…)
    return dm1, dm2, dm3, dm4

def make_rdm1234s(fcivec, norb, nelec, link_index=None, reorder=True):
    ci_spinless = civec_spinless_repr(…)
    rdm1, rdm2, rdm3, rdm4 = make_rdm1234(…)
    # slice α/β blocks for 1-4 RDMs
    # …
    return (rdm1a, rdm1b), (rdm2aa, rdm2ab, rdm2bb), …

This case illustrates how high-leverage scientific extensions can be driven by surprisingly small surface specifications when the codebase’s naming conventions and helper utilities are coherent. By aligning with PySCF’s established RDM conventions and reusing rdm.make_dm1234 and rdm.reorder_dm1234, the implementation cleanly extends the FCI solver to provide spin-resolved access to 3- and 4-body correlators. In practice, these APIs ensure that downstream methods can reliably reconstruct spin-traced tensors.
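As a usage illustration, the kind of consistency check the accompanying tests rely on can be sketched for the 1-RDM as follows. This assumes a standard PySCF installation; the H2/STO-3G system is an arbitrary choice of ours, and the higher-order checks additionally involve index transpositions of the mixed-spin blocks.

import numpy as np
from pyscf import gto, scf, fci

# Small illustrative system: solve FCI, then compare spin-resolved vs. spin-traced 1-RDMs.
mol = gto.M(atom="H 0 0 0; H 0 0 0.74", basis="sto-3g")
mf = scf.RHF(mol).run()
e, civec = fci.FCI(mf).kernel()

norb, nelec = mf.mo_coeff.shape[1], mol.nelec
dm1 = fci.direct_spin1.make_rdm1(civec, norb, nelec)
dm1a, dm1b = fci.direct_spin1.make_rdm1s(civec, norb, nelec)
assert np.allclose(dm1, dm1a + dm1b)  # spin blocks must reconstruct the spin-traced tensor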

Pull Request Description.

The RDKit conformation generator EmbedMolecule() should fail when a molecule contains a geometrically impossible tetrahedral center. In PR #7564, certain molecules return -1, indicating failure to embed. The maintainer-provided fix introduces a new d_structureFlags field and conditionally relaxes the tetrahedral volume threshold only for chiral centers located in fused small rings (high angular strain).

Model-Generated Patch (Key Diff Excerpts).

The model produced a patch that globally relaxes two critical constants:

• MIN_TETRAHEDRAL_CHIRAL_VOL: 0.50 → 0.20
• TETRAHEDRAL_CENTERINVOLUME_TOL: 0.30 → 0.15

It also adds minor changes unrelated to the scientific logic (such as skipping hydrogens in linearity checks), but it does not modify the ChiralSet data structure and does not implement structural-context-dependent logic.

Why is this a false positive? Under the benchmark test suite, this globally relaxed version of the embedding algorithm passes because some previously failing SMILES strings now embed successfully. However, it constitutes a scientific regression: molecules with geometrically impossible tetrahedral centers should not be embeddable. The model’s patch embeds them by lowering global thresholds rather than addressing the actual geometric constraints.

The test suite introduces bulk test cases for valid small-ring embeddings but does not include dedicated “impossible geometry” regressions. The model’s incorrect patch passes all tests even though it fundamentally violates the intended physical meaning of the embedding algorithm. This is a canonical example of a false positive: a patch that satisfies the unit tests while failing the scientific semantics.
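A dedicated regression of the missing kind could look like the following sketch; the SMILES placeholder stands in for one of the geometrically impossible structures from the issue, which are not reproduced here:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Placeholder: substitute a SMILES with a geometrically impossible tetrahedral
# center (the concrete examples from the issue are intentionally omitted here).
IMPOSSIBLE_SMILES = "..."

mol = Chem.AddHs(Chem.MolFromSmiles(IMPOSSIBLE_SMILES))
conf_id = AllChem.EmbedMolecule(mol, randomSeed=42)

# EmbedMolecule() returns -1 on failure; an impossible center must keep
# failing even after the small-ring relaxation is introduced.
assert conf_id == -1
```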

A globally relaxed chirality threshold makes impossible molecules embeddable, letting the model’s patch pass tests while violating physical constraints. This shows how scientific software can yield false positives without domain-aware checks.

This pull request addresses a DIIS instability for a symmetry-adapted ROHF calculation on atomic oxygen: with DIIS enabled, the triplet O state (spin=2, basis=‘cc-pVDZ’, symmetry=True) fails to converge, whereas disabling DIIS yields a stable solution. The maintainer fix modifies DIIS error-vector construction and symmetry handling (e.g., in diis.py, hf_symm.py, uhf_symm.py, ghf_symm.py) so that symmetry-forbidden components are masked out and symmetry-adapted SCF paths consistently require mol.symmetry to be enabled. The corresponding test test_diis_for_symmetry_adapted_scf asserts that ROHF with DIIS now converges for triplet O within a fixed maximum of 9 SCF cycles and reaches a reference energy of -74.7874921601008 Ha.
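The masking idea can be pictured with a minimal sketch (illustrative only, not PySCF’s diis.py; the construction of the symmetry mask is omitted, and the name orbsym_allowed is hypothetical):

```python
import numpy as np

def masked_diis_error(fock, dm, ovlp, orbsym_allowed):
    """Commutator-style SCF error vector F*D*S - S*D*F with
    symmetry-forbidden components zeroed out by a boolean mask."""
    err = fock @ dm @ ovlp - ovlp @ dm @ fock
    return np.where(orbsym_allowed, err, 0.0)
```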

A ROHF-specific DIIS redesign in PySCF restores convergence for the problematic triplet-oxygen calculation and yields the correct total energy, but it fails a benchmark test that requires convergence within 9 iterations. Fixing this false negative is simple: remove the 9-iteration requirement from the test.

The agent was asked to extend OpenMM’s Modeller.addSolvent() method to support non-cubic periodic boxes, specifically truncated octahedra and rhombic dodecahedra. The implementation needed to correctly calculate periodic box vectors based on a user-defined padding distance p, ensuring that the minimum distance between any point of the solute and its periodic images in all directions is at least p.

Molecular dynamics simulations use periodic boundary conditions (PBC) to simulate bulk properties. While cubic boxes are standard, compact shapes like the truncated octahedron (the Wigner-Seitz cell of a BCC lattice) and the rhombic dodecahedron (the Voronoi cell of an FCC lattice) are more computationally efficient. They require 23% and 29% less volume respectively than a cube for the same minimum image distance, reducing the number of solvent molecules needed. For non-orthogonal boxes, a naive axis-aligned bounding box (AABB) approach is insufficient. If a solute is elongated along a non-axial direction, the distance to a tilted face may be smaller than the distance to an axial boundary. To guarantee a minimum padding p, the box must be sized relative to the solute’s bounding sphere of radius R. The periodic box width L must then satisfy L = 2R + p (or 2R + 2p depending on the convention) to ensure no part of the solute interacts with its own image within the cutoff distance.

Key concepts required for this implementation include:

  1. Bounding Sphere vs. AABB: sizing a box based on max(x, y, z) (AABB) only guarantees padding along the axes. For triclinic boxes, the minimal distance to a face often occurs off-axis, necessitating a sphere-based radius R = max_i ∥r_i − c∥.

  2. Reduced Triclinic Vectors: OpenMM requires box vectors in a “reduced” form where vector a lies along x, vector b lies in the xy-plane, and components are shifted to satisfy b_x ≤ a_x/2, etc.

  3. Shape Geometry: a truncated octahedron has angles α = β = γ = arccos(−1/3) ≈ 109.47°. A rhombic dodecahedron typically uses 60° or 120° angles.

  4. Volume/Density Heuristics: when adding a specific number of molecules, the volume of the shape (V_TO = L³/2; V_RD = L³/√2) must be accounted for to estimate the correct padding.

[Image comparing bounding sphere and axis-aligned bounding box for a non-spherical solute]

The correct implementation calculates the bounding sphere radius and applies shape-specific coefficients to generate reduced vectors:
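A minimal sketch of that construction, assuming standard reduced-form vectors for the two shapes (the function and variable names are illustrative, not the actual Modeller.addSolvent() code):

```python
import numpy as np

def reduced_box_vectors(positions, padding, box_shape="cube"):
    """Size a periodic box from the solute's bounding sphere and return
    reduced triclinic box vectors for the requested shape.

    positions : (N, 3) array of solute coordinates
    padding   : requested minimum solute-image distance p
    """
    center = positions.mean(axis=0)
    radius = np.max(np.linalg.norm(positions - center, axis=1))  # bounding-sphere R
    width = 2.0 * radius + padding  # L = 2R + p (the alternative convention uses 2R + 2p)

    if box_shape == "cube":
        return np.diag([width, width, width])
    if box_shape == "truncatedOctahedron":
        # Reduced-form vectors of a truncated octahedron with image distance `width`
        return width * np.array([[1.0, 0.0, 0.0],
                                 [1.0 / 3.0, 2.0 * np.sqrt(2.0) / 3.0, 0.0],
                                 [-1.0 / 3.0, np.sqrt(2.0) / 3.0, np.sqrt(6.0) / 3.0]])
    if box_shape == "dodecahedron":
        # Reduced-form vectors of a rhombic dodecahedron (square xy cross-section)
        return width * np.array([[1.0, 0.0, 0.0],
                                 [0.0, 1.0, 0.0],
                                 [0.5, 0.5, np.sqrt(2.0) / 2.0]])
    raise ValueError(f"unknown box shape: {box_shape}")
```

The key point is that `radius` comes from the bounding sphere, not from an axis-aligned bounding box, so the padding guarantee also holds against the tilted faces.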

Model-Generated Patch (Key Diff Excerpts).

Significance. This is a “geometric algorithmic miscalculation”: the agent’s code is syntactically correct and uses the right library functions, but it violates the underlying physical requirement of the task (the padding guarantee). In scientific computing such errors are dangerous because the simulation runs without crashing while the physics is wrong; in this case, it could cause artificial, physically incorrect interactions between the solute and its periodic images. This highlights that LLMs must understand the geometric constraints of physical systems.

The Task

Coupled-cluster singles and doubles (CCSD) is an ab initio many-body electronic structure method widely used for high-accuracy correlation energy calculations. The agent was asked to diagnose and fix a non-convergence bug in PySCF’s periodic CCSD implementation that exploits k-point space-group symmetry (pyscf.pbc.cc.kccsd_rhf_ksymm.KsymAdaptedRCCSD). A user reported that a diamond primitive cell with a 3 × 3 × 3 k-point mesh fails to converge when k-point symmetry is enabled, while the non-symmetry-adapted KRCCSD converges. The goal was to identify the scientific root cause in the symmetry machinery that maps orbitals and integrals between the irreducible Brillouin zone (IBZ) and the full Brillouin zone (BZ), and to implement the minimal corrective change so that the CCSD calculation converges.

In periodic electronic structure theory, crystalline orbitals are Bloch functions ψ_{n,k}(r) obeying Bloch’s theorem, ψ_{n,k}(r + R) = e^{i k·R} ψ_{n,k}(r), for any lattice vector R.

Space-group operations g ≡ (R | τ) act as r → Rr + τ and induce k → Rk. When generating symmetry-related quantities from the IBZ to the full BZ, one must include the correct Bloch phase associated with any net lattice translation L incurred when mapping atomic centers between symmetry-equivalent positions. If the sign convention in the phase e^{±i k·L} is wrong, the transformed orbitals/integrals acquire inconsistent complex phases across the star of k-points, breaking the Hermiticity and momentum-conservation relationships that CCSD relies on.

A key diagnostic is the MP2 correlation energy used to initialize CCSD,

E^{(2)} = (1/4) Σ_{ijab} |⟨ij‖ab⟩|² / (ε_i + ε_j − ε_a − ε_b).

For an insulator, the occupied ε_i, ε_j lie below the virtual ε_a, ε_b, making the denominators negative and yielding a negative E^{(2)}. With an incorrect symmetry phase, the ERIs can become pathologically inconsistent, producing unphysical MP2 energies (including positive values and large imaginary parts), and CCSD iterations can fail to converge.

Key concepts required for this diagnosis include:

  1. Space-group actions and IBZ↔BZ reconstruction: mapping orbitals/integrals between symmetry-related k-points requires consistent rotation of k plus translation-induced phases tied to how atoms are permuted and shifted by lattice vectors.

  2. Hermiticity and momentum conservation in k-space ERIs: correct symmetry reconstruction preserves algebraic constraints; phase errors corrupt complex ERIs and can induce large imaginary components in energies/intermediates.

  3. Energy-sign sanity checks: a positive MP2 correlation energy in a gapped system is a red flag for inconsistent integrals/phase conventions rather than mere orbital ordering.

The root cause was upstream of CCSD: the Bloch phase sign used during symmetry transformations between the IBZ and the full BZ was incorrect. In PySCF this occurs in pyscf/pbc/symm/symmetry.py (routine _get_phase), which computes phase factors associated with mapping atomic centers under a space-group operation, including a lattice shift Lshift. The original code used the complex conjugate of the required phase e^{2πi k·Lshift} (with k in scaled reciprocal coordinates); the gold patch flips the sign of the exponent to match Bloch’s theorem.

This one-line fix restores consistent phases for orbitals/integrals generated from the IBZ, yielding a physically reasonable (negative) MP2 initialization energy and eliminating the spurious imaginary components that destabilized CCSD convergence.
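Conceptually, the convention can be isolated in a single helper (an illustrative sketch, not the actual _get_phase implementation):

```python
import numpy as np

def bloch_phase(kpt_scaled, L_shift, sign=+1):
    """Phase acquired when a symmetry operation maps an atomic center through
    a net lattice translation L_shift (integer lattice units), for a k-point
    expressed in scaled (fractional) reciprocal coordinates. The buggy
    convention corresponds to the opposite sign, which complex-conjugates
    every phase and scrambles the inter-k-point relationships."""
    return np.exp(sign * 2j * np.pi * np.dot(kpt_scaled, L_shift))
```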

The agent misdiagnosed the symptom (non-convergence and unphysical MP2 energy) as an orbital-ordering issue and patched the CCSD ERI builder instead of the symmetry phase convention. Concretely, it modified pyscf/pbc/cc/kccsd_rhf_ksymm.py to compute k-point Fock matrices, take their diagonals as mo_energy, and then sort mo_energy and the corresponding mo_coeff columns when not monotone.

This does not (and cannot) repair a global phase inconsistency across symmetry-related k-points: sorting only reorders eigenvectors locally at each k-point, while the true error corrupts the inter-k-point phase relationships that define the symmetry-adapted reconstruction of orbitals and ERIs. In degenerate manifolds, such sorting may further introduce arbitrary gauge choices, potentially masking the real defect and destabilizing reproducibility.

This case highlights a characteristic failure mode in scientific software repair: the code can be modified in a superficially reasonable way (“orbital energies aren’t sorted”), yet still miss a subtle but foundational physics convention. Here, a single sign in the Bloch phase e^{±i k·L} determines whether IBZ→BZ transformations preserve the complex phase structure required for Hermiticity and momentum conservation. With the wrong sign, downstream correlated methods exhibit dramatic, qualitative pathologies (positive MP2 energies in an insulator, large imaginary parts, and CCSD non-convergence). The episode illustrates that effective debugging in computational chemistry often hinges on recognizing invariants and conventions that are only weakly encoded in unit tests or APIs; without that domain grounding, even clean-looking patches can be scientifically inert or actively harmful.

The agent was asked to fix PySCF’s smearing implementation for restricted Hartree-Fock (RHF) so that the total electron number is conserved for systems with an odd number of electrons. The bug appears when pyscf.scf.addons.smearing is applied to an RHF object with odd mol.nelectron: the resulting occupations mo_occ sum to an even integer (e.g., 16) instead of the correct odd electron count (e.g., 15). The relevant code lives in pyscf/scf/addons.py, specifically _SmearingSCF.get_occ (RHF branch) and the helper _get_fermi used to seed the chemical potential µ.

In Hartree-Fock theory, electrons occupy molecular orbitals (MOs) determined by a self-consistent Fock operator. In restricted HF, α and β spins share the same spatial orbitals, so each spatial MO carries a two-fold spin degeneracy and can host up to two electrons. At zero temperature, RHF occupations are typically 0 or 2, but smearing introduces fractional occupations to stabilize SCF convergence by approximating a finite-temperature ensemble. For Fermi-Dirac smearing, the (spin-resolved) occupation of an orbital with energy ε_i is f_i(μ, σ) = 1 / (exp((ε_i − μ)/σ) + 1), and PySCF also supports a Gaussian-like form, e.g. (1/2) erfc((ε_i − μ)/σ). The chemical potential μ must be chosen to enforce particle-number conservation. For RHF, the constraint is 2 Σ_i f_i(μ, σ) = N, so the spatial-orbital occupation sum must equal N/2, which is fractional when N is odd. If one instead rounds the spatial occupation count to an integer and then doubles, the total becomes even, violating particle-number conservation.

Key concepts required for this fix include:

  1. RHF degeneracy and electron counting: enforce Σ_i n_i = N with n_i ∈ [0, 2] per spatial orbital, i.e. solve the smearing constraint in the spatial basis with target N/2.

  2. Chemical potential as a constraint variable: μ is not a free parameter; it must be adjusted so the occupation function satisfies the electron-number constraint.

  3. Fractional N/2 for odd-electron RHF: odd N implies a half-filled spatial level on average, which requires allowing a non-integer target N/2 in the solver.

  4. Robust initialization of μ: seeding μ near the HOMO/LUMO region requires turning a fractional target (e.g. 7.5) into a valid index, e.g. via ⌈N/2⌉.

  5. Consistent entropy scaling: when occupations are computed per spin channel and then doubled for RHF, the entropy (free-energy term) must be doubled as well.

The correct fix enforces the RHF electron-number constraint in the spatial basis and correctly handles fractional N/2 when seeding µ.

With this change, a 15-electron system targets Σ_i f_i = 7.5 in the spatial basis and, after doubling, yields mo_occ.sum() = 15 exactly.
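A minimal sketch of the constraint solve under these assumptions (illustrative only, not PySCF’s _SmearingSCF code; the bisection bracket is chosen naively):

```python
import numpy as np
from scipy.optimize import brentq

def fermi_occ(mo_energy, mu, sigma):
    # Spin-resolved Fermi-Dirac occupations in [0, 1].
    return 1.0 / (np.exp((mo_energy - mu) / sigma) + 1.0)

def rhf_smearing_occ(mo_energy, nelectron, sigma):
    """RHF smearing occupations: solve 2 * sum_i f_i(mu) = N exactly,
    allowing the fractional spatial target N/2 (e.g. 7.5 for N = 15)."""
    target = nelectron / 2.0                     # NOT rounded to an integer
    mu = brentq(lambda m: fermi_occ(mo_energy, m, sigma).sum() - target,
                mo_energy.min() - 10 * sigma, mo_energy.max() + 10 * sigma)
    return 2.0 * fermi_occ(mo_energy, mu, sigma)  # doubled for spin degeneracy

mo_energy = np.linspace(-1.0, 1.0, 10)            # illustrative orbital energies
mo_occ = rhf_smearing_occ(mo_energy, nelectron=15, sigma=0.1)
assert abs(mo_occ.sum() - 15.0) < 1e-8
```

Solving for μ against the fractional target is what keeps the doubled occupations summing to the odd electron count.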

The agent’s patch instead preserved an integer-rounded spatial occupation target in RHF. For odd N this guarantees an even result: e.g. N = 15 ⇒ (N + 1)//2 = 8, so the solver enforces Σ_i f_i = 8 and then doubles to 16. The code is numerically stable, compiles, and can converge, but it violates the conservation law the method is supposed to enforce.
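Schematically, the difference is a single expression (an illustration of the rounding, not the agent’s literal diff):

```python
nelectron = 15
wrong_target = (nelectron + 1) // 2   # = 8; doubling gives 16 electrons
right_target = nelectron / 2.0        # = 7.5; doubling gives exactly 15
```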

This case is a prototypical conservation-law violation triggered by an “innocent” engineering choice (integer rounding) that is valid for closed-shell even-N RHF but becomes unphysical under smearing for odd-electron systems. The failure is subtle: SCF still converges and most regression tests (often dominated by even-electron molecules) can pass, yet the algorithm breaks a fundamental invariant (N conservation). The fix requires explicitly reasoning about RHF spin degeneracy and recognizing that N/2 is allowed to be fractional in the smeared ensemble, illustrating how domain constraints govern scientific software validity.

The Task

The agent was asked to diagnose and fix an order-dependence bug in RDKit’s tautomer hash function HetAtomTautomerv2 (implemented in Code/GraphMol/MolHash/hashfunctions.cpp). The bug manifested as different hash outputs for the same molecule when constructed from a V3000 molblock versus from canonical SMILES, indicating that traversal/bond iteration order (or atom numbering) was leaking into the hash. The requirement was to make MolHash::TautomerHashv2 deterministic and canonical (independent of bond order and atom indices) while preserving scientific correctness about which atoms/bonds belong to tautomeric systems and maintaining existing semantics and tests.

Tautomerism is a chemically constrained rearrangement involving proton relocation and adjacent π-bond shifts (e.g., keto-enol). A “tautomeric hash” aims to produce a canonical identifier for all members of a tautomer set: equivalent tautomers should map to the same hash, while chemically distinct molecules should not be spuriously merged. Computationally, the algorithm identifies a subgraph G = (V, E) of eligible bonds and selects a subset S ⊆ E representing tautomeric/conjugated systems under domain constraints (conjugation, heteroatom behavior, and functional group exclusions). It then normalizes that system (e.g., aromaticization of bonds in S, charge/H bookkeeping), and emits a canonical SMILES-like representation with additional invariants (e.g., H-count / net charge suffixes). Determinism is not a pure “graph ordering” property: it must arise from chemically-defined inclusion/exclusion rules so that S is stable across alternate atom numbering and input encodings.

Key concepts required for a correct fix include:

  1. Conjugated system identification: tautomeric motion occurs along conjugated pathways, so inclusion decisions must depend on conjugation predicates (e.g., conjugated bond neighborhoods), not on incidental traversal order.

  2. Functional group exclusions via SMARTS flags: many motifs (amides, nitro, phosphates, sulfates, guanidines, isocyanates, etc.) must not be treated as tautomerizing; these are excluded by SMARTS-driven bond flags used during traversal.

  3. Overreach prevention near heteroatoms: in patterns like enamine-like C-N-C=C, naive expansion can incorrectly absorb the first adjacent single bond near a heteroatom. A targeted overreach heuristic based on local conjugation density prevents chemically invalid system growth.

  4. Flag-aware eligibility: atom participation tests must respect propagated atom/bond flags (e.g., using isCandidateAtom(oatom, atomFlags) rather than an overload that ignores flags).

  5. Determinism via chemistry, not sorting: robust canonicalization requires stabilizing which bonds are selected (chemistry), not merely stabilizing how bonds are iterated (data structure order); see the sketch after this list.
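The principle in item 5 can be sketched in Python pseudocode over RDKit’s Python API; the three predicate arguments are hypothetical stand-ins for the C++ heuristics described below, not real RDKit functions:

```python
def select_tautomeric_bonds(mol, is_excluded, is_conjugated, overreaches):
    """Order-independent selection: whether a bond belongs to the tautomeric
    system depends only on chemical predicates, never on traversal order, so
    the same set is produced for any atom numbering of the molecule."""
    selected = set()
    for bond in mol.GetBonds():          # RDKit Python API
        if is_excluded(bond):            # SMARTS-flagged functional groups
            continue
        if not is_conjugated(bond):      # tautomeric motion needs conjugation
            continue
        if overreaches(bond):            # heteroatom-adjacent overreach check
            continue
        selected.add(bond.GetIdx())
    return selected
```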

The reference solution stabilizes TautomerHashv2 by augmenting chemical heuristics rather than applying generic ordering tricks. Concretely, it: (i) extends getBondFlags with additional SMARTS patterns to flag/exclude non-tautomeric functional groups; (ii) introduces a per-atom getNumConjugatedNeighbors() statistic to quantify local conjugation density; (iii) implements checkForOverreach() to block specific heteroatom-adjacent expansions; and (iv) integrates these checks into neighbor exploration, deduplicating queued neighbor bonds, tracking conjugated atoms, and using flag-aware candidate checks. These changes make the selected tautomeric system S chemically stable across representations, yielding deterministic hashes without breaking chemically meaningful distinctions.

The agent misdiagnosed a chemistry-driven selection problem as a generic iteration-order problem, and attempted to enforce determinism by sorting and cardinality-based set selection. It (a) sorted bonds by atom indices, (b) re-initialized bondsConsidered per traversal seed, and (c) chose systems using either a union-of-all strategy or a “pick smallest disjoint systems” heuristic. None of these encode the required chemical constraints (exclusions, overreach near heteroatoms), leading to broad regressions: union-of-all over-tautomerized fused/hetero systems, while smallest-disjoint selection under-selected in cases where chemically correct reachability differs across tautomer forms.

    // Agent's incorrect approach (conceptual excerpt)
    std::vector<Bond *> sortedBonds(mol->bonds().begin(), mol->bonds().end());
    std::sort(sortedBonds.begin(), sortedBonds.end(), byMinMaxAtomIdx);
    for (auto *bond : sortedBonds) {
      // ... order-based traversal and cardinality-based system selection ...
    }

This trajectory illustrates a recurring failure mode in scientific software repair: the model can produce plausible, compilable C++ and even add a targeted reproduction test, yet still fail because the invariant is domain-defined rather than purely algorithmic. Here, canonical tautomer hashing depends on nuanced chemical heuristics (functional group exclusions, conjugation density, heteroatom-adjacent overreach prevention). Generic graph/order fixes cannot substitute for those rules, and can silently introduce scientifically meaningful regressions (e.g., incorrectly merging or separating tautomeric systems). The case underscores why “passes a reproduction” is insufficient in chemistry-heavy code: correctness is encoded in specialized heuristics that constrain which transformations are chemically admissible.

Models evaluated on the benchmark for each domain with resolved rates. Difficulty is reported as pass rate (%) over easy, medium, and hard instances. The five scientific domains and corresponding codebases are: Numerical Relativity (NR): EinsteinToolkit and AMReX; Quantum Information (QI): Qiskit; Molecular Dynamics (MD): OpenMM; Quantum Chemistry (QC): PySCF; Cheminformatics (CI): RDKit.

Engineering Difficulty (1-5). Engineering difficulty measures the software engineering complexity involved in implementing a correct solution.

• Score 1. Minimal engineering effort; the task can be completed without understanding internal APIs or implementation logic (e.g., simple copy-paste or parameter changes).
• Score 2. …

Prompt template excerpt (truncated): “Create C/C++/Fortran code for the file {src_filename} in thorn {thorn_name}. IMPORTANT: … Code: …”

Scientific intent vs. numerical artifacts.

Scientific intent vs. test design. The scientific requirement of the issue is that ROHF with DIIS should converge for the specified triplet O system and reproduce the correct energy, just as the DIIS-disabled calculation does. The model’s patch satisfies these physical constraints but deviates from the maintainer’s chosen implementation path and convergence speed. Because the test suite over-specifies iteration constraints, it effectively enforces one particular implementation style rather than the underlying quantum-chemical behavior.
