We introduce QMBench, a comprehensive benchmark designed to evaluate the capability of large language model agents in quantum materials research. This specialized benchmark assesses a model's ability to apply condensed matter physics knowledge and computational techniques such as density functional theory to solve research problems in quantum materials science. QMBench encompasses different domains of quantum materials research, including structural properties, electronic properties, thermodynamic and other properties, symmetry principles, and computational methodologies. By providing a standardized evaluation framework, QMBench aims to accelerate the development of an AI scientist capable of making creative contributions to quantum materials research. We expect QMBench to be developed and continually improved by the research community.
Benchmarks are essential for measuring and guiding the development of machine learning models. The rapid progress of large language models (LLMs) in recent years has usually been measured by benchmarks on common tasks such as math, coding, and language processing [1,2,3,4]. Specialized benchmarks such as GPQA [5] have also been developed, but mainly at the level of course exam problems. Among the potential applications of LLMs, scientific research is one of the most exciting areas being explored. If LLMs can execute research tasks with high enough accuracy, they will be able to play the role of a research assistant or even a research collaborator. More and more attempts have been made to build AI scientist agents in multiple fields, including biology [6,7,8], chemistry [9,10], and more generic tasks [11,12,13,14]. However, broader application of LLMs in frontier scientific research remains challenging. A main difficulty is that research-level benchmarks are still scarce. To evaluate AI's performance on frontier research tasks, research benchmarks need to satisfy the following criteria:
Proper granularity. Benchmarks covering a broad area such as physics or math are too coarse to evaluate AI's research capability in a particular subfield. The scope of a benchmark needs to be defined by experts in the field, similar to how the scope of an academic conference is defined.
Research-level questions. Even in domains where AI performs well on graduate-level exam problems, substantial differences remain between these tasks and the actual research problems encountered by researchers. Identifying truly research-level problems requires them to be selected by domain experts.
Coevolving with the field. Research problems are constantly evolving: new discoveries are made and new concepts are introduced. Benchmarks need to be regularly updated to reflect the current knowledge and interests of each subfield. Such a mechanism has not yet been developed.
Quantum materials constitute a broad class of solids in which quantum-mechanical effects at the level of electrons, spins, and lattice degrees of freedom give rise to qualitatively new phases of matter and functionalities. Prototypical examples include topological insulators and semimetals [15,16,17], low-dimensional materials [18,19], unconventional superconductors [20,21], and strongly correlated oxides [22,23], where band structure, electronic correlations, and symmetry intertwine in nontrivial ways. Progress in this field requires deep expertise in quantum mechanics and solid-state physics, together with a diverse skill set: the ability to perform analytical derivations and approximate calculations, a firm grasp of crystal symmetry, group theory, and related mathematical structures, proficiency with first-principles electronic-structure simulation methods such as density functional theory (DFT) and beyond, and an understanding of the numerical algorithms and high-performance computing that underlie modern simulations. These features make quantum materials research an especially demanding yet attractive testbed for evaluating the emerging capabilities of AI assistants and agents.
Motivated by the need for research-level benchmarks, in this paper we introduce QMBench, a benchmark set for quantum materials research. In particular, we focus on crystalline materials. The problems in this benchmark cover different aspects of the field, classified by physical properties and research methods. In addition to fundamental principles and physical properties, we put an emphasis on density functional theory, since it plays a critical role in the theoretical understanding of solid-state materials. Our benchmark uncovers the strengths and weaknesses of current LLMs in this field, which we discuss in more detail later in the paper.
The remainder of the paper is organized as follows. In Sec. 2 we describe the problems in QMBench and provide example problems. In Sec. 3 we summarize the performance of the leading models, including the scores and detailed analysis of some example problems. In Sec. 4 we discuss other related works, and finally we conclude in Sec. 5. Our benchmark is developed and posted on https://bench.science, an open platform to facilitate collaboration and sharing of scientific research benchmarks.
2 Detailed description of QMBench
Quantum materials research constitutes an inherently multimodal endeavor that integrates theoretical derivation, computational implementation, and the interpretation of simulation data. To comprehensively evaluate these capabilities, our benchmark dataset addresses the entire research workflow, ranging from conceptual formulation to practical execution. As detailed in Table 1, the tasks require four distinct output modalities: multiple-choice selections, numerical values, free-text responses, and atomic structure files in the POSCAR format [24], a standard within the DFT community. The dataset comprises 103 problems organized into five thematic domains: structural properties, symmetry principles, computational methodologies, electronic properties, and thermal, optical, and magnetic properties. Table 2 illustrates the distribution across these categories, which contain 14, 28, 18, 43, and 16 problems, respectively, with multi-label tagging allowing a single problem to span multiple domains. Furthermore, we stratify the problems into three difficulty tiers corresponding to undergraduate-level fundamentals, graduate-level knowledge, and frontier research challenges.

An example problem requiring POSCAR output:

Output the atomic positions of a slab of five van der Waals layers of Bi2Te3 in POSCAR format for DFT calculations. Please set the thickness of the vacuum layer to be 15 angstroms. Your output should be the content of the POSCAR file, in a string.
Problem Type: POSCAR
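For readers unfamiliar with the POSCAR format, it consists of a comment line, a global scaling factor, the three lattice vectors, the element symbols and counts, the coordinate mode, and the atomic positions. The minimal example below, for the two-atom primitive cell of silicon, is our own illustration and is not one of the benchmark problems or answers:

Si primitive cell (diamond structure)
1.0
  0.000000  2.715000  2.715000
  2.715000  0.000000  2.715000
  2.715000  2.715000  0.000000
Si
2
Direct
  0.000000  0.000000  0.000000
  0.250000  0.250000  0.250000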
Electronic properties. This category examines the AI scientist’s capacity to understand and study the electronic structure of materials. It covers theoretical derivations within band theory, including band topology, first-principles calculations of electronic structures, and knowledge of fundamental facts related to electronic structure.
Band gap estimation from figure
Please provide an estimate of the band gap of this material in electronvolts (eV). Only give the numerical value, without including the unit.
Symmetry constraint of nearest-neighbor hopping in a tight-binding model
Consider atoms on a two-dimensional triangular lattice with lattice constant a and basis vectors a_1 = (a, 0), a_2 = (a/2, √3a/2). The system has D_3h symmetry and time-reversal symmetry. On each site, consider the three d-orbitals d_z2, d_xy and d_x2-y2, and neglect spin. The lattice has D_3h symmetry group with mirror symmetry along the xy and yz planes. Define the nearest-neighbor hopping matrix elements along the a_1 direction as t^(1)_ij = ⟨a_1, i|H|0, j⟩ with i, j = 1, 2, 3 the orbital index. [The answer options, which express hopping matrix elements along other bond directions as linear combinations of t^(1)_22 and t^(1)_33 with coefficients such as 1/2, 3/4, and 1/4, are omitted here.]
Problem Type: Multiple Choice
Thermal, optical, and magnetic properties. This category evaluates the AI scientist’s ability to study how materials respond to external conditions such as temperature, magnetic fields, and light, encompassing a range of thermal, optical, and magnetic phenomena.
The specific heat contribution from electrons in a 3D Dirac semimetal at low temperature (much lower than the Fermi temperature) is proportional to T^n, where T is the temperature. Please give the power n.
Problem Type: Numerics
To automatically evaluate the AI scientist (student model) across diverse problems in quantum materials, we define several standardized problem types and a grading algorithm for each type. In more detail, the problem types are the following:
Multiple-choice. The answer is extracted and compared with the ground truth. When the ground truth includes multiple answers, an incomplete answer receives partial credit, but any wrong answer receives zero credit. For example, if the ground truth is A, B, D, answer A, B will receive 2/3 of the full score, while A, C will receive zero.
Numerics. The answer is an integer or a floating-point number (with an error tolerance). For an integer answer, credit is given only for an exact match. For a floating-point answer, the ground truth is a range (such as [0.1, 0.12]) and the answer is considered correct if it falls within this range.
Text. For questions with a more free-form answer, we allow the answer to be text, which is graded by another LLM agent. To minimize subjectivity in LLM grading, each problem includes a detailed rubric that explicitly enumerates the required points and their corresponding scores, which provides an objective and reproducible basis for evaluation.

POSCAR. Some problems require the model to output a POSCAR file, a commonly used format for crystal structure data. The POSCAR is compared with the ground truth by a symmetry-aware grading function. Different atomic coordinates can correspond to the same structure, since they may be related by a translation or a rotation. Our grading function accounts for this coordinate-transformation ambiguity and also allows an error tolerance: 1° for lattice angles and 5% for cell lengths. The function generates candidate rotation/reflection matrices and translation vectors to search for a possible match. By relying on this symmetry-aware comparison, we can robustly handle equivalent structures that are presented in different cell orientations or with different origins. Full credit is awarded if the two structures differ only by a rigid-body translation or rotation, and their lattice constants and atomic coordinates lie within the stated tolerances.
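To make this grading pipeline concrete, the sketch below shows one possible implementation of the automatic graders. The function names, the partial-credit rule, and the use of pymatgen's StructureMatcher as the symmetry-aware structure comparison are our own illustrative choices; the benchmark's actual grading code may differ.

from pymatgen.core import Structure
from pymatgen.analysis.structure_matcher import StructureMatcher

def grade_multiple_choice(answer, truth):
    """Partial credit for incomplete answers; any wrong choice gives zero."""
    answer, truth = set(answer), set(truth)
    if answer - truth:                      # contains a wrong option
        return 0.0
    return len(answer & truth) / len(truth)

def grade_numeric(answer, truth):
    """`truth` is either an exact integer or a (low, high) acceptance range."""
    if isinstance(truth, int):
        return 1.0 if answer == truth else 0.0
    low, high = truth
    return 1.0 if low <= answer <= high else 0.0

def grade_poscar(answer_str, truth_str):
    """Symmetry-aware comparison of two structures given as POSCAR strings."""
    s1 = Structure.from_str(answer_str, fmt="poscar")
    s2 = Structure.from_str(truth_str, fmt="poscar")
    # 5% tolerance on cell lengths and 1 degree on lattice angles,
    # matching the tolerances quoted in the text.
    matcher = StructureMatcher(ltol=0.05, stol=0.1, angle_tol=1.0,
                               primitive_cell=False, scale=False)
    return 1.0 if matcher.fit(s1, s2) else 0.0

print(grade_multiple_choice({"A", "B"}, {"A", "B", "D"}))  # 0.666...
print(grade_numeric(0.11, (0.1, 0.12)))                    # 1.0

StructureMatcher internally searches over lattice settings, rotations, and origin shifts when testing equivalence, which plays the same role as the candidate rotation/translation search described above.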
3 Model Performance
We evaluate two groups of LLMs on our benchmark: text-only models, which are tested on the 92 purely textual problems, and multimodal models, which are tested on the full set of 103 problems including 11 requiring visual interpretation. All problems are scored out of 10 points.
For the text-only setting, five representative open-source and commercial models were evaluated: DeepSeek v3.1, DeepSeek-R1, Kimi-K2, Qwen-3, and GPT-OSS-120B. Overall accuracy remained low, with model averages ranging from 4.2 to 5.1 points. The strongest performer was DeepSeek-R1 (5.05), followed by GPT-OSS-120B (4.80), Qwen-3 (4.75), DeepSeek v3.1 (4.47), and Kimi-K2 (4.17).
For the multimodal benchmark, which includes problems requiring interpretation of figures from computational and experimental work, the participating models were Grok-4, Claude 4 Sonnet, GPT-5-mini, Gemini 2.5 Pro, Gemini 2.5 Flash, GPT-5, and O3. Performance was somewhat higher at the top end, with GPT-5 achieving the highest overall average (7.30), followed by Grok-4 (6.53), O3 (6.41), Gemini 2.5 Pro (6.27), and GPT-5-mini (6.21). Claude 4 Sonnet averaged 5.53, while Gemini 2.5 Flash lagged significantly at only 1.39. Since 92 of the 103 questions are purely textual, these performance differences should be attributed primarily to stronger language and reasoning capabilities rather than to superior image understanding.
Across categories, several consistent trends emerge. Thermal, Optical, and Magnetic Properties were relatively the easiest for multimodal models, with average scores exceeding 7.0 and top models above 8.0. Electronic Properties posed significant difficulty for all models, with text-only systems averaging below 3.0, while multimodal systems achieved moderate gains (average 4.48, with GPT-5 reaching 6.30). Computational Methodologies and Symmetry Principles fell in the intermediate range, with averages around 4.2-5.0 for text-only models and 5.5-5.7 for multimodal models. Structural Properties remained challenging, with scores clustering near 4.2 in the text-only group and 5.1-5.4 in the multimodal group.
Taken together, these results demonstrate that our benchmark is difficult for current AI systems. Even frontier multimodal models such as GPT-5 and O3 fail to exceed 7.5/10 in most categories, and text-only models rarely reach 5/10. Persistent weaknesses are evident in electronic structure reasoning, computational setup, and symmetry analysis, while the ability to handle physical responses to external fields (thermal, optical, magnetic) remains uneven across models. These findings highlight the benchmark’s discriminative power and point to clear research challenges in building AI scientists capable of robust quantum materials research.
A detailed analysis of model performance on QMBench reveals a sharp dichotomy in capabilities. The leading models performed exceptionally well on knowledge-oriented questions, functioning as highly effective knowledge resources. Across categories, they demonstrated strong performance on items requiring the identification of standard terminology, the summarization of textbook-level relationships, or the recall of canonical examples (e.g., prototypical topological materials, or the common choice of exchange-correlation functionals).
However, this proficiency in conceptual recall stands in stark contrast to their performance in tasks requiring applied reasoning and practical execution. Beyond aggregate scores, a closer inspection of error patterns reveals several systematic limitations.
First, problems requiring rigorous analytical calculations and derivations proved exceptionally challenging. Even the best-performing model (GPT-5) answered a total of 13 such questions incorrectly. These failures often reflected a fundamental inability to reliably apply group-theoretical arguments or execute multi-step algebraic manipulations, even when the relevant foundational concepts had been correctly articulated in earlier parts of the response.
Second, despite nominal multimodal capabilities, current models faltered on questions requiring quantitative figure interpretation. Errors persisted even on fundamental tasks, such as accurately enumerating the number of bands crossing the Fermi level. This suggests that current LLM-based agents struggle with the meticulous visual inspection and figure-based summarization that are central to interpreting computational and experimental results in quantum materials research.
Third, regarding questions involving atomistic structures, the models demonstrated an incomplete command of structural representations. They showed reasonable familiarity with standard formats such as POSCAR files: for example, among the four tasks that require the generation or modification of a POSCAR, GPT-5 solved two correctly. However, all models failed on more complex structural tasks, such as slab geometry construction, which demands consistent handling of surface terminations and vacuum regions. We expect such problems to be useful tests of the effectiveness of external tools, which can be provided to the LLM to carry out specialized computations. In the current evaluation we have not included such tools.
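As a concrete illustration of the kind of external tool we have in mind, a structure-building library such as ASE can handle surface terminations and vacuum regions programmatically. The short sketch below is our own example; it builds a generic five-layer Au(111) slab rather than the Bi2Te3 slab of the benchmark problem, and writes it in POSCAR format.

from ase.build import fcc111
from ase.io import write

# Five-layer Au(111) slab; ASE's `vacuum` adds the given amount of vacuum on
# each side of the slab, so 7.5 A here corresponds to a 15 A vacuum gap
# between periodic images.
slab = fcc111("Au", size=(1, 1, 5), vacuum=7.5)

# Write the slab in VASP POSCAR format with fractional (direct) coordinates.
write("POSCAR", slab, format="vasp", direct=True)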
Our findings collectively highlight this distinct performance profile. LLMs already function as effective knowledge resources for quantum materials science, but substantial gaps persist in their ability to perform sustained quantitative reasoning, rigorous derivations, precise figure interpretation, and robust programming workflows. In the following, we carry out a more detailed analysis of the LLMs' answers to some example problems.
Problem: Heterostructure supercell construction

I plan to study a heterostructure between graphene and monolayer CrSBr. They do not match in lattice parameters. What is the optimal supercell I can choose in DFT calculations? Please specify it in units of the CrSBr lattice vectors.

Analysis: Almost all LLMs correctly obtained the lattice constants of graphene and CrSBr to within approximately 3%, which falls within the typical tolerance range of DFT calculations; the exceptions were GPT-OSS-120B, GPT-5-mini, and Qwen-3.
The second step involves recognizing that graphene possesses a hexagonal lattice, whereas CrSBr has a rectangular one. This requires artificially expanding graphene’s hexagonal unit cell into an equivalent rectangular cell of dimensions a_graphene × √3 a_graphene, allowing a proper comparison with CrSBr’s rectangular lattice. At this stage, a few additional models, such as DeepSeek-R1 and Kimi-K2, failed to make this conversion. The third step is identifying the correct superlattice matching condition. While several models (e.g., Claude 4 Sonnet and Gemini 2.5 Flash) provided only vague or incomplete reasoning, others, such as O3, GPT-5, and DeepSeek V3, performed better by deriving near-accurate lattice-matching relations. However, these models failed to take into account the two distinct ways to align the rectangular unit cells, e.g., a_CrSBr ∥ a_graphene or b_CrSBr ∥ a_graphene. Only Gemini 2.5 Pro and Grok-4 correctly considered both configurations and arrived at the optimal lattice-matching ratio.

In another problem, which asks for the number of Fermi pockets at different points of the Brillouin zone, all models correctly identified the single Fermi pocket around the Γ point (n_Γ = 1). However, most models miscounted pockets along the Brillouin zone (BZ) edges, where periodic boundaries must be patched across the X (X'), Z (Z'), and D corners, which led to their failure to obtain the correct answer n_X = 2, n_Z = 2, and n_D = 2. Typical errors include Claude 4 Sonnet (n_X = 4, n_D = 0), Gemini 2.5 Pro (n_X = n_Z = 4, n_D = 1), and Gemini 2.5 Flash (n_D = 4).
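To make the lattice-matching step of the heterostructure problem above concrete, the sketch below searches for small, nearly commensurate supercells between the rectangular graphene cell and the CrSBr cell, trying both possible alignments. The lattice constants are approximate literature values inserted purely for illustration (the benchmark expects the model to retrieve them itself), and the 3% strain tolerance mirrors the typical DFT tolerance mentioned above; the printed output is therefore indicative, not the benchmark's reference answer.

import itertools
import numpy as np

a_graphene = 2.46                # approximate graphene lattice constant (angstrom)
a_crsbr, b_crsbr = 3.50, 4.76    # approximate CrSBr in-plane lattice constants

# Rectangular supercell of graphene: a_g x sqrt(3) a_g
rect_g = (a_graphene, np.sqrt(3) * a_graphene)

def smallest_match(len_g, len_c, tol=0.03, n_max=12):
    """Return (n, m, strain) with the smallest CrSBr multiple n such that
    m * len_g matches n * len_c to within the strain tolerance."""
    matches = []
    for m, n in itertools.product(range(1, n_max + 1), repeat=2):
        strain = abs(m * len_g - n * len_c) / (n * len_c)
        if strain <= tol:
            matches.append((n, m, strain))
    return min(matches) if matches else None

# Two ways to align the rectangular cells.
for label, (c1, c2) in [("a_CrSBr || a_graphene", (a_crsbr, b_crsbr)),
                        ("b_CrSBr || a_graphene", (b_crsbr, a_crsbr))]:
    match1 = smallest_match(rect_g[0], c1)
    match2 = smallest_match(rect_g[1], c2)
    if match1 is None or match2 is None:
        print(f"{label}: no commensurate cell within tolerance")
        continue
    n1, m1, s1 = match1
    n2, m2, s2 = match2
    print(f"{label}: {n1} x {n2} CrSBr cells vs {m1} x {m2} rectangular "
          f"graphene cells (strain {s1:.1%}, {s2:.1%})")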
Problem: Number of parameters in tight-binding model
Consider atoms on a two-dimensional triangular lattice with lattice constant a and basis vectors a_1 = (a, 0), a_2 = (a/2, √3a/2). The system has D_3h symmetry and time-reversal symmetry. On each site, consider the three d-orbitals d_z2, d_xy and d_x2-y2, and neglect spin. The lattice has D_3h symmetry group with mirror symmetry along the xy and yz planes. Define the nearest-neighbor hopping matrix elements along the a_1 direction as t^(1)_ij = ⟨a_1, i|H|0, j⟩ with i, j = 1, 2, 3 the orbital index. How many independent parameters are there in the matrix elements t^(1)_ij? Count the number of real-valued parameters (so that one complex number counts as two real numbers). (A) 3 (B) 5 (C) 6 (D) 9
Model Output Excerpts:
1. Identify the 3 × 3 hopping matrix elements.
Gemini 2.5 Pro: t^(1)_ij = ⟨a_1, i|H|0, j⟩ is a 3 × 3 matrix. Without any symmetry, it would have 9 complex elements, which amounts to 18 real parameters.
2. Implement the time-reversal constraint.
Claude 4 Sonnet, DeepSeek V3: T requires the Hamiltonian to be real. This means all matrix elements must be real. So we start with 9 real parameters.
Gemini 2.5 Pro: Under T, the hopping matrix must satisfy t_ij = t*_ji.
3. Identify the representation of the mirror M_yz.
Gemini 2.5 Pro: The representation matrix of M_yz is D(M_yz) = diag(1, -1, 1).
4. Implement the spatial symmetry (σ_v,yz mirror) constraint.
Gemini 2.5 Pro, Kimi K2: For an operation g that relates the hopping vector R to gR, the constraint is t(R) = D(g)* t(gR) D(g). The hopping matrix to the site -a_1 is related to the hermitian conjugate of the hopping matrix to a_1: t^(-a_1) = (t^(a_1))† = T†. The spatial symmetry constraint then simplifies to T = D T^T D, so t^(1) must satisfy t_12 = -t_21, t_13 = t_31, and t_23 = -t_32.

Analysis: Almost all LLMs correctly identify the hopping matrix to be 3 × 3, as well as the representation matrix of M_yz to be diag(1, -1, 1). However, when implementing T and M_yz, some models, including DeepSeek V3 and Kimi K2, did not correctly consider the transformation of the matrix elements upon the flip of the hopping direction a_1 → -a_1 under T or M_yz.
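The counting itself can be verified mechanically: with time reversal making T real, the mirror condition T = D T^T D is a linear constraint on the nine entries of T, and the number of independent real parameters is the dimension of its solution space. The short numpy sketch below is our own check of this counting, not part of the benchmark's grading code.

import numpy as np

# Orbital ordering (d_z2, d_xy, d_x2-y2): under the yz mirror, d_xy is odd
# while the other two orbitals are even.
D = np.diag([1.0, -1.0, 1.0])

# Permutation matrix P with P @ vec(T) = vec(T^T), using row-major vectorization.
P = np.zeros((9, 9))
for i in range(3):
    for j in range(3):
        P[3 * i + j, 3 * j + i] = 1.0

# Time reversal makes t^(1) real (9 real parameters).  The mirror maps the
# bond a_1 to -a_1; combined with hermiticity this gives T = D T^T D, i.e.
# (I - (D kron D) P) vec(T) = 0.
M = np.eye(9) - np.kron(D, D) @ P
print(9 - np.linalg.matrix_rank(M))   # number of independent real parameters: 6

The three diagonal entries are unconstrained, the pair (t_13, t_31) is forced symmetric, and the pairs (t_12, t_21) and (t_23, t_32) are forced antisymmetric, giving 3 + 3 = 6 independent real parameters.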
Problem: Four-terminal transport in a two-band lattice model

We consider a two-band tight-binding Hamiltonian on a square lattice, where H_0 is the onsite energy term [the remaining terms of the Hamiltonian and the parameter values are given in the full problem statement]. Please calculate the transmission matrix among the four leads according to the Landauer-Büttiker formalism, at chemical potential μ, assuming zero temperature. Here we order the leads clockwise, starting from the lead in the negative-x direction, and the basis for the transmission matrix is arranged in the same order. Please keep the diagonal terms of the transmission matrix, which represent the reflection probability at each lead.
Model Output Excerpts: from Gemini 2.5 Pro: The given Hamiltonian is a lattice realization of a QAHI. We analyze its bulk properties using the parameters A = 0.5, m_0 = 0.3, m_2 = -1. The energy spectrum of the bulk has a gap. Although the given Fermi energy μ = 0.5 appears to be in the bulk conduction band based on a simple analysis at the Γ point, problems of this type on topological insulators are almost always designed to probe the physics of edge states, which requires the Fermi energy to be within the bulk gap. We proceed under the physically reasonable assumption that the parameters are intended to place the system in the topological regime with the Fermi energy inside the gap. In this regime, electron transport occurs exclusively through topologically protected edge states.
Analysis: Even though Gemini 2.5 Pro realizes the system is in a metallic phase, it tends to reinterpret or revise the problem into a more familiar form, possibly reflecting biases inherited from the types of problems seen during training. It therefore proceeds as if the system were in a topological phase with Chern number C = 1, leading to a simplified picture of perfect chiral transport with no backscattering. In contrast, the true transmission matrix elements in the metallic phase are not quantized and must be obtained from an explicit computation.
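For reference, the explicit computation the problem asks for can be carried out with a quantum transport package such as Kwant. The sketch below is our own illustration: it builds a four-terminal square-lattice device with two orbitals per site and evaluates the Landauer-Büttiker transmission matrix at the chemical potential. Because the original Hamiltonian is not reproduced above, the onsite and hopping matrices are generic two-band placeholders that must be replaced by the ones defined in the problem.

import numpy as np
import kwant

sigma_x = np.array([[0, 1], [1, 0]], dtype=complex)
sigma_y = np.array([[0, -1j], [1j, 0]])
sigma_z = np.array([[1, 0], [0, -1]], dtype=complex)

# Placeholder two-band model; replace onsite/hoppings with the problem's Hamiltonian.
A, m0, m2, mu = 0.5, 0.3, -1.0, 0.5
onsite = m0 * sigma_z
hop_x = 0.5 * m2 * sigma_z + 0.5j * A * sigma_x
hop_y = 0.5 * m2 * sigma_z + 0.5j * A * sigma_y

L = W = 20                       # scattering-region size (illustrative)
lat = kwant.lattice.square(a=1, norbs=2)
syst = kwant.Builder()
syst[(lat(x, y) for x in range(L) for y in range(W))] = onsite
syst[kwant.builder.HoppingKind((1, 0), lat, lat)] = hop_x
syst[kwant.builder.HoppingKind((0, 1), lat, lat)] = hop_y

# Four leads, ordered clockwise starting from the negative-x direction.
for direction in [(-1, 0), (0, 1), (1, 0), (0, -1)]:
    lead = kwant.Builder(kwant.TranslationalSymmetry(lat.vec(direction)))
    if direction[0] != 0:
        lead[(lat(0, y) for y in range(W))] = onsite
    else:
        lead[(lat(x, 0) for x in range(L))] = onsite
    lead[kwant.builder.HoppingKind((1, 0), lat, lat)] = hop_x
    lead[kwant.builder.HoppingKind((0, 1), lat, lat)] = hop_y
    syst.attach_lead(lead)

smat = kwant.smatrix(syst.finalized(), energy=mu)
T = np.array([[smat.transmission(i, j) for j in range(4)] for i in range(4)])
print(np.round(T, 3))            # diagonal entries are the reflection at each lead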
4 Related Work

A widely used science benchmark is the Graduate-level Google-Proof Q&A benchmark (GPQA) [5], but it is mainly at the coursework rather than research level. SciCode [25] is a benchmark focusing on coding tasks in science. Humanity’s Last Exam (HLE) [26] is a set of difficult questions covering different fields of math and science. Several more field-specific benchmarks have been developed this year. The Theoretical Physics Benchmark (TPBench) [27] is a benchmark on theoretical physics, mainly in the areas of high-energy physics and cosmology.
Ref. [28] introduced PhySense, a benchmark focusing on physics reasoning. Two recent works have introduced benchmarks in condensed matter physics [29,30]. Among them, Ref. [30] covered many numerical methods such as exact diagonalization, Monte Carlo, DMRG, Hartree-Fock, etc. Ref. [29] developed an innovative approach to automatic evaluation based on Scalable Expression Edit Distance (SEED).
5 Conclusion

In summary, we have developed QMBench, a comprehensive benchmark designed to evaluate the research capabilities of large language models in quantum materials science.
Our findings reveal a distinct performance dichotomy. On one hand, current models function as effective encyclopedic resources, demonstrating strong performance on knowledge-oriented questions that test the recall of established concepts, definitions, and qualitative trends.
On the other hand, we identify substantial gaps in tasks requiring applied reasoning and practical execution. These limitations are systematic and include:
A failure to perform rigorous analytical derivations and multi-step algebraic manipulations.
Poor performance in the quantitative interpretation of figures, such as band-structure plots, despite nominal multimodal capabilities.
An incomplete command of complex atomistic structural manipulations, particularly for non-trivial geometries like slab construction.
Collectively, these findings indicate that while LLMs have mastered the knowledge base of quantum materials, significant challenges remain in bridging the gap from conceptual recall to the robust, multi-faceted reasoning and practical application required for authentic scientific research.
We would like to make some further remarks on different ways to use a benchmark to probe an AI model. In our benchmark, we include problems that would be much simpler for a model with access to tools such as computational software or code execution. Although our current evaluation is carried out on models without such capabilities, we can also apply the benchmark to AI agents that have these tools. In general, an AI agent is defined by three aspects: (1) the foundation model(s); (2) the prompts and the collaboration architecture (if there are multiple agents); and (3) the tools. If a benchmark contains questions for which certain tools are useful, one can use the benchmark to independently evaluate these three parts. For example, we can fix the foundation model and compare the performance with and without different tools, which provides an evaluation of the usefulness of the tools. We can fix the tools and the architecture and switch the foundation model to evaluate the capability of the models. By comparing the capability of the models with and without tools, we can also evaluate the tool-use capability of the models on this family of tasks. We can also fix both the model and the tools, and test how prompt engineering or adjustments to the multi-agent collaboration pattern affect the result.
Finally, we would like to make some further comments about the platform https://bench.science. The goal of this platform is to facilitate collaboration on benchmarks in scientific research. A group of researchers can collaborate by posting questions, evaluating the questions against a list of models, testing the grading by the grading model, and commenting on and approving each other’s questions. When the set of benchmarks is ready, it can be published, which provides a uniquely identifiable version of the benchmark. The published benchmark is not entirely public: the organizers of the benchmark can set certain problems to be public and keep other problems private, to avoid data contamination. If the organizers want, they can also accept new questions submitted by the community. The goal is to provide a “GitHub” for benchmark collaborations. We welcome submissions of new questions in the field of quantum materials to our benchmark; you can submit your questions at our project page.