In this work, we introduce PhononBench, the first large-scale benchmark for dynamical stability in AI-generated crystals. Leveraging the recently developed MatterSim interatomic potential, which achieves DFT-level accuracy in phonon predictions across more than 10,000 materials, PhononBench enables efficient large-scale phonon calculations and dynamical-stability analysis for 108,843 crystal structures generated by six leading crystal generation models. PhononBench reveals a widespread limitation of current generative models in ensuring dynamical stability: the average dynamical-stability rate across all generated structures is only 25.83%, with the top-performing model, MatterGen, reaching just 41.0%. Further case studies show that in property-targeted generation (illustrated here by band-gap conditioning with MatterGen), the dynamical-stability rate remains as low as 23.5% even at the optimal band-gap condition of 0.5 eV. In space-group-controlled generation, higher-symmetry crystals exhibit better stability (e.g., cubic systems achieve rates up to 49.2%), yet the average stability rate across all controlled generations is still only 34.4%. An important additional outcome of this study is the identification of 28,119 crystal structures that are phonon-stable across the entire Brillouin zone, providing a substantial pool of reliable candidates for future materials exploration. By establishing the first large-scale dynamical-stability benchmark, this work systematically highlights the current limitations of crystal generation models and offers evaluation criteria and guidance for their future development toward the design and discovery of physically viable materials. All model-generated crystal structures, phonon calculation results, and the high-throughput evaluation workflows developed in PhononBench will be openly released at https://github.com/xqh19970407/PhononBench.
In recent years, AI-driven inverse materials design has advanced rapidly, achieving breakthroughs across multiple aspects of crystal materials design [1][2][3][4]. Diffusion-based approaches [5,6], such as DiffCSP [7], have demonstrated significant improvements in both speed and accuracy for crystal structure prediction compared with traditional DFT-based search workflows. CrystalFlow [8] further enhances efficiency by reformulating the underlying algorithm using flow matching, substantially accelerating inference. For functional materials design, MatterGen [1] exhibits strong performance across key metrics including Stability, Uniqueness, and Novelty, and leverages adapter-based fine-tuning to enable precise generation of various classes of functional materials. InvDesFlow-AL [9], an active-learning-based workflow for inverse functional materials design, has achieved notable success in the design of high-temperature superconductors; together with other generative approaches for functional materials [10][11][12], it further demonstrates the capability of AI to explore complex material systems. Meanwhile, crystal generation models that incorporate space-group control, such as CrystalFormer [13,14] and DiffCSP++ [15], significantly improve the symmetry and physical plausibility of generated structures. With the rapid development of large language models (LLMs) [16,17], methods such as CrystaLLM [18] and FlowLLM [19] integrate natural language into crystal generation, enabling more intuitive and flexible conditioning for materials design.
However, it is essential to recognize the limitations of current progress. Despite the rapid development of crystal generative models, their assessment of material stability has largely focused on thermodynamic stability, typically evaluated using metrics such as E_hull (energy above the convex hull) [1,9,20,21]. Yet the synthesizability and practical existence of materials depend not only on thermodynamic stability but, more critically, on dynamical stability: whether a structure resides in a local potential well and can withstand small perturbations without collapsing. Dynamical instability often manifests as imaginary phonon modes [3,22,23], which not only directly signal mechanical instability but have also long posed a persistent challenge for DFT practitioners [24]. The origins of imaginary modes are diverse, potentially arising from symmetry breaking, insufficient q-mesh resolution, pseudopotential choices, approximation errors, or intrinsic structural instability [25,26]. More importantly, because rigorous dynamical-stability validation (e.g., full phonon dispersion calculations) is computationally demanding [27], most existing generative models have never been subjected to systematic dynamical-stability tests on their generated structures. As a result, these models may produce a large number of structures that appear thermodynamically stable but are in fact dynamically unstable [28], undermining both the reliability and practical utility of their predictions. Therefore, building on the successes of current generative models, establishing systematic and efficient dynamical-stability evaluation for generated structures has become a key challenge and an important frontier for guiding computationally generated materials toward experimental synthesis and enhancing the trustworthiness of model predictions.
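The link between imaginary phonon modes and dynamical instability can be illustrated on a toy model. The following minimal sketch is not part of PhononBench and does not use MatterSim or DFT; it assumes a one-dimensional monatomic chain with harmonic nearest-neighbour (k1) and next-nearest-neighbour (k2) springs, for which the squared phonon frequency is known analytically as omega^2(q) = (2/m) [k1 (1 - cos(qa)) + k2 (1 - cos(2qa))]. A mode is imaginary wherever omega^2(q) < 0. All function names, parameter values, and the tolerance are hypothetical choices made for illustration.

```python
import math


def omega_squared(q, k1, k2, mass=1.0, a=1.0):
    """Squared phonon frequency of a 1D monatomic chain (toy model).

    k1, k2: nearest- and next-nearest-neighbour harmonic spring constants.
    A negative return value corresponds to an imaginary phonon frequency.
    """
    return (2.0 / mass) * (
        k1 * (1.0 - math.cos(q * a)) + k2 * (1.0 - math.cos(2.0 * q * a))
    )


def q_mesh(n_points, a=1.0):
    """Uniform q-mesh over the first Brillouin zone [-pi/a, pi/a], endpoints included."""
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    step = 2.0 * math.pi / (a * (n_points - 1))
    return [-math.pi / a + i * step for i in range(n_points)]


def min_omega_squared(k1, k2, n_points=201, mass=1.0, a=1.0):
    """Smallest omega^2 found on the sampled q-mesh."""
    return min(omega_squared(q, k1, k2, mass, a) for q in q_mesh(n_points, a))


def is_dynamically_stable(k1, k2, n_points=201, tol=1e-8, mass=1.0, a=1.0):
    """Stable if no sampled mode has omega^2 below -tol.

    The tolerance (an assumed value) absorbs floating-point noise around the
    acoustic mode at q = 0, where omega^2 vanishes exactly.
    """
    return min_omega_squared(k1, k2, n_points, mass, a) >= -tol
```

For k1 > 0, the long-wavelength limit omega^2 ≈ (k1 + 4 k2) (qa)^2 / m shows that this chain is stable only if k1 + 4 k2 >= 0. Accordingly, `is_dynamically_stable(1.0, -0.2)` reports a stable chain and `is_dynamically_stable(1.0, -0.3)` an unstable one. Sampling only three q-points (`n_points=3`) misses the instability in the second case, a reminder that the verdict depends on how densely the Brillouin zone is sampled.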
With the rapid development of universal machine-learning interatomic potentials (uMLIPs) [21][29][30][31][32], a growing number of models have achieved energy and force prediction accuracies that approach or even surpass those of DFT, thereby greatly enhancing computational efficiency in atomistic materials simulations [33][34][35][36][37]. Representative advances include MEGNet [38], which reaches a formation energy accuracy of 21 meV on GNoME [2], and M3GNet [39], which incorporates atomic coordinates, lattice vectors, and three-body interactions and has become one of the most prominent uMLIPs. Building on M3GNet, MatterSim [29] is pretrained on 17 million first-principles data points, enabling zero-shot generalization across the first 89 elements of the periodic table over temperatures of 0-5000 K and pressures of 0-1000 GPa. Its accuracy in predicting energies, forces, and stresses surpasses previous models such as MACE [40] by roughly an order of magnitude, enabling reliable calculations of lattice dynamical, mechanical, and thermodynamic properties. More importantly, Miguel A. L. Marques et al. conducted a systematic assessment based on phonon calculations for over 10,000 materials (Figure 1) [41]. Their results show that MatterSim achieves phonon-spectrum prediction accuracy comparable to DFT, with average errors even smaller than the inherent differences between the PBE and PBEsol functionals, and that it markedly outperforms all other models. Furthermore, in dynamical-stability classification, MatterSim attains a 95% true-positive rate, achieving a level of rel