A critical assessment of bonding descriptors for predicting materials properties
Most machine learning models for materials science rely on descriptors based on materials compositions and structures, even though the chemical bond has been proven to be a valuable concept for predicting materials properties. Over the years, various theoretical frameworks have been developed to characterize bonding in solid-state materials. However, integrating bonding information from these frameworks into machine learning pipelines at scale has been limited by the lack of a systematically generated and validated database. Recent advances in high-throughput bonding analysis workflows have addressed this issue, and our previously computed Quantum-Chemical Bonding Database for Solid-State Materials was extended to include approximately 13,000 materials. This database is then used to derive a new set of quantum-chemical bonding descriptors. A systematic assessment is performed using statistical significance tests to evaluate how the inclusion of these descriptors influences the performance of machine-learning models that otherwise rely solely on structure- and composition-derived features. Models are built to predict elastic, vibrational, and thermodynamic properties typically associated with chemical bonding in materials. The results demonstrate that incorporating quantum-chemical bonding descriptors not only improves predictive performance but also helps identify intuitive expressions for properties such as the projected force constant and lattice thermal conductivity via symbolic regression.
💡 Research Summary
This paper presents a comprehensive study of how quantum-chemical bonding descriptors affect machine-learning predictions of solid-state material properties. The authors first extend their previously published Quantum-Chemical Bonding Database by performing high-throughput LOBSTER analyses on approximately 13,000 materials from the Materials Project. From each LOBSTER calculation they use the automated LobsterPy workflow to extract a suite of descriptors, including COHP and COBI data, integrated COHP/COOP values (ICOHP, ICOOP), Mulliken and Löwdin charges, bond-order statistics, and various measures of bond heterogeneity.
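To make the descriptor extraction more concrete, the following is a minimal sketch of how an ICOHP value and a few per-structure bonding statistics could be computed from raw data. It is not the LobsterPy implementation: the function names (`icohp`, `bonding_stats`), the dictionary keys, and the toy input are hypothetical, and the energy grid is assumed to be sorted in ascending order and expressed relative to the Fermi level.

```python
from statistics import mean, pstdev


def icohp(energies, cohp, e_fermi=0.0):
    """Trapezoidal integral of COHP(E) over grid points with E <= e_fermi.

    Assumes `energies` is sorted ascending. A partial interval that
    straddles the Fermi level is ignored in this simplified sketch.
    """
    total = 0.0
    for i in range(1, len(energies)):
        if energies[i] > e_fermi:
            break
        width = energies[i] - energies[i - 1]
        total += 0.5 * (cohp[i] + cohp[i - 1]) * width
    return total


def skewness(values):
    """Population skewness (third standardized moment)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return 0.0
    return mean([((v - mu) / sigma) ** 3 for v in values])


def bonding_stats(bond_icohps):
    """Hypothetical summary descriptors over all bonds of one structure."""
    return {
        "icohp_sum": sum(bond_icohps),
        "icohp_mean": mean(bond_icohps),
        "icohp_skew": skewness(bond_icohps),
    }


# Toy data: a COHP curve on a coarse grid and three bond ICOHP values.
print(icohp([-2.0, -1.0, 0.0, 1.0], [0.0, -1.0, -1.0, 5.0]))
print(bonding_stats([-1.0, -2.0, -3.0]))
```

Summary statistics such as the sum and skewness turn a variable number of bonds per structure into a fixed-length feature vector, which is what tabular models require.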
With this database in hand, they construct two families of feature sets: (i) the conventional “MATMINER” set derived from crystal structure and composition (e.g., elemental fractions, lattice parameters, local geometry) and (ii) the new “LOBSTER” set containing the quantum‑chemical bonding descriptors. They evaluate the predictive power of (a) MATMINER alone, (b) LOBSTER alone, and (c) the combined MATMINER+LOBSTER set.
Machine-learning models are built using Random Forest regressors and MODNet (a feed-forward neural network with built-in feature selection). Five-fold cross-validation provides mean absolute errors (MAE) and their standard deviations; the statistical significance of any performance gain is assessed with a corrected resampled t-test based on ten-fold CV results. Feature importance is examined with SHAP and permutation feature importance (PFI). When a descriptor set yields a statistically significant improvement (at the p < 0.05, 0.01, or 0.001 level), this is highlighted in the results tables.
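The significance test mentioned above can be sketched as follows. This assumes the test is the variance-corrected resampled t-test of Nadeau and Bengio, which inflates the variance of the per-fold score differences to account for overlapping training sets. The function name, arguments, and the example MAE values are illustrative assumptions, not taken from the paper.

```python
import math
from statistics import mean, variance


def corrected_resampled_t(scores_a, scores_b, n_train, n_test):
    """Return (t statistic, degrees of freedom) for paired per-fold scores.

    scores_a, scores_b: per-fold errors (e.g., MAE) of two models evaluated
    on the same folds. n_train, n_test: sample counts per fold.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    if k < 2:
        raise ValueError("need at least two folds")
    var = variance(diffs)  # sample variance (ddof = 1)
    if var == 0:
        raise ValueError("zero variance in fold differences")
    # The naive paired t-test uses var / k; the correction adds n_test / n_train.
    corrected_var = (1.0 / k + n_test / n_train) * var
    return mean(diffs) / math.sqrt(corrected_var), k - 1


# Hypothetical per-fold MAEs: baseline features vs. baseline plus bonding features.
mae_baseline = [1.0, 1.2, 0.9, 1.1, 1.0]
mae_combined = [0.8, 0.9, 0.8, 0.9, 0.8]
t_stat, dof = corrected_resampled_t(mae_baseline, mae_combined, n_train=80, n_test=20)
print(t_stat, dof)
```

A two-sided p-value follows from the Student t distribution with `dof` degrees of freedom. Because the correction enlarges the variance, the statistic is always smaller in magnitude than that of an uncorrected paired t-test on the same folds, which makes the test more conservative.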
Target properties fall into three groups: (1) bonding-related quantities: maximum projected force constant (max pfc), highest phonon density-of-states peak (last phonon peak), lattice thermal conductivity (log k_lat), bulk and shear moduli (log K_VRH, log G_VRH), and mean-squared displacement (log MSD); (2) thermodynamic observables: heat capacity (Cv), entropy (Sv), Helmholtz free energy (Hv), and internal energy (Uv), each at three temperatures (25 K, 305 K, 705 K); (3) auxiliary vibrational data.
Key findings:
- For properties directly linked to bond strength (max pfc, last phonon peak, lattice thermal conductivity, elastic moduli, MSD), the inclusion of LOBSTER descriptors leads to consistent MAE reductions of roughly 10–30 % relative to MATMINER-only models. The improvement for max pfc and last phonon peak is highly significant (p < 0.001). SHAP analysis identifies ICOHP-sum, bond-strength skewness, and charge asymmetry metrics as the most influential variables.
- For thermodynamic quantities (Cv, Sv, Hv, Uv) the MATMINER descriptors already capture the dominant variance; adding LOBSTER features does not improve, and sometimes slightly worsens, predictive accuracy. Distance‑correlation and dependency‑graph analyses confirm that LOBSTER descriptors have weak or negligible association with these targets.
- Correlation analyses (distance correlation and Random-Forest-based dependency graphs) reveal that LOBSTER descriptors carry independent, non-linear information for max pfc, whereas for Cv they add little beyond what the MATMINER set already captures. Combining both sets yields the largest performance boost only for the bonding-related targets.
- Symbolic regression with SISSO uncovers compact, physically interpretable formulas linking bonding descriptors to target properties, e.g., max pfc ≈ α · (ICOHP_sum) · (bond‑order_skew)⁻¹. These expressions corroborate traditional chemical intuition while providing data‑driven validation.
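The distance correlation used in the correlation analyses above detects non-linear dependence that Pearson correlation misses. The following is a minimal pure-Python sketch of the standard sample estimator (Székely's definition) for two one-dimensional samples; it is an illustration, not the authors' code, and the function names and toy data are assumptions.

```python
import math


def _centered_distances(x):
    """Pairwise |xi - xj| matrix, double-centered by row, column, and grand means."""
    n = len(x)
    d = [[abs(xi - xj) for xj in x] for xi in x]
    row_means = [sum(row) / n for row in d]  # equal to column means (symmetric matrix)
    grand_mean = sum(row_means) / n
    return [
        [d[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Sample distance correlation in [0, 1] for two equal-length 1-D samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)

    def mean_product(p, q):
        return sum(p[i][j] * q[i][j] for i in range(n) for j in range(n)) / (n * n)

    dcov2 = mean_product(a, b)
    denom = math.sqrt(mean_product(a, a) * mean_product(b, b))
    if denom == 0:
        return 0.0  # at least one sample is constant
    return math.sqrt(max(dcov2, 0.0) / denom)


# Toy data: y = x^2 on a symmetric grid has zero Pearson correlation with x,
# but its distance correlation with x is clearly non-zero.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v * v for v in x]
print(distance_correlation(x, y))
```

This estimator scales quadratically with the number of samples, so for descriptor tables with thousands of materials a vectorized or dedicated library implementation would be preferable.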
The authors also discuss the computational cost of LOBSTER calculations and propose a future direction: training surrogate models (e.g., graph neural networks) to predict quantum‑chemical bonding descriptors directly from crystal structures, thereby bypassing explicit LOBSTER runs. This hybrid approach could enable large‑scale materials screening that leverages chemically meaningful bonding information without prohibitive DFT overhead.
In summary, the work demonstrates that quantum‑chemical bonding descriptors are valuable, complementary features for machine‑learning models of properties that are intrinsically tied to interatomic forces. Their systematic integration improves accuracy, yields interpretable physical insights, and opens pathways toward more chemically informed, high‑throughput materials discovery pipelines.