Active Learning Strategies for Efficient Machine-Learned Interatomic Potentials Across Diverse Material Systems
Efficient materials discovery requires reducing costly first-principles calculations for training machine-learned interatomic potentials (MLIPs). We develop an active learning (AL) framework that iteratively selects informative structures from the Materials Project and Open Quantum Materials Database (OQMD) using compositional and property-based descriptors with a neural network ensemble model. Query-by-Committee enables real-time uncertainty quantification. We compare four strategies: random sampling (baseline), uncertainty-based sampling, diversity-based sampling (k-means clustering with farthest-point refinement), and a hybrid approach. Experiments across four material systems (C, Si, Fe, and TiO2) with 5 random seeds demonstrate that diversity sampling achieves competitive or superior performance, with 10.9% improvement on TiO2. Our approach achieves equivalent accuracy with 5-13% fewer labeled samples than random baselines. The complete pipeline executes on Google Colab in under 4 hours per system using less than 8 GB RAM, democratizing MLIP development for resource-limited researchers. Open-source code and configurations are available on GitHub. This multi-system evaluation provides practical guidelines for data-efficient MLIP training and highlights integration with symmetry-aware architectures as a promising future direction.
💡 Research Summary
The paper presents a systematic study of active learning (AL) strategies for training machine‑learned interatomic potentials (MLIPs) with reduced reliance on expensive density‑functional theory (DFT) calculations. Using public APIs, the authors retrieve up to 600 crystal structures per material from the Materials Project (MP) and the Open Quantum Materials Database (OQMD) for four chemically distinct systems: elemental carbon (C), silicon (Si), iron (Fe), and a binary titanium‑oxide compound (Ti–O). After filtering for reasonable cell sizes (2–50 atoms), complete formation‑energy and band‑gap entries, and removing near‑duplicate structures (ΔE < 1 meV), each dataset is split 80 % for training/pool and 20 % for testing.
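The filtering steps above (cell size, completeness, near-duplicate removal) can be sketched as follows. This is a minimal illustration with a hypothetical record schema; the paper's actual pipeline queries the MP and OQMD APIs directly.

```python
def filter_structures(records, min_atoms=2, max_atoms=50, dedup_tol=1e-3):
    """Keep structures with 2-50 atoms and complete formation-energy /
    band-gap entries; drop near-duplicates (energies within 1 meV).

    `records` is a list of dicts with hypothetical keys `n_atoms`,
    `formation_energy` (eV/atom), and `band_gap` (eV).
    """
    kept, seen_energies = [], []
    for rec in records:
        # Size filter: discard cells outside the 2-50 atom window.
        if not (min_atoms <= rec["n_atoms"] <= max_atoms):
            continue
        # Completeness filter: both target properties must be present.
        if rec.get("formation_energy") is None or rec.get("band_gap") is None:
            continue
        # Near-duplicate filter: simplified here to an energy-only check
        # against all previously kept entries.
        e = rec["formation_energy"]
        if any(abs(e - prev) < dedup_tol for prev in seen_energies):
            continue
        seen_energies.append(e)
        kept.append(rec)
    return kept
```

Comparing energies against every kept entry is a simplification; a production pipeline would also compare compositions before treating two structures as duplicates.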
For every structure a 17‑dimensional descriptor vector is computed, comprising eight compositional statistics (atomic number, mass, electronegativity, etc.) and nine property‑based features (formation energy, band gap, density, stability metrics). Features are standardized independently for each training run to avoid leakage.
The learning model is an ensemble of five feed‑forward neural networks. Each network has an input layer of size 17, two hidden layers of 128 ReLU units, and a single linear output predicting formation energy per atom. Training uses the Adam optimizer (learning rate = 1e‑3) and mean‑squared‑error loss. The ensemble provides a mean prediction and an epistemic uncertainty estimate given by the variance across the five members (Query‑by‑Committee).
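A minimal sketch of this Query-by-Committee ensemble follows, using scikit-learn's `MLPRegressor`. The architecture (17 inputs, two 128-unit ReLU layers, Adam with lr = 1e-3) matches the description; the number of training iterations and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_committee(X, y, n_members=5, seed=0):
    """Five identical MLPs differing only in random initialization."""
    committee = []
    for m in range(n_members):
        net = MLPRegressor(
            hidden_layer_sizes=(128, 128),  # two hidden layers of 128 ReLU units
            activation="relu",
            solver="adam",
            learning_rate_init=1e-3,
            max_iter=500,                   # illustrative; not stated in the paper
            random_state=seed + m,          # decorrelates committee members
        )
        committee.append(net.fit(X, y))
    return committee

def committee_predict(committee, X):
    """Mean prediction and epistemic uncertainty (variance across members)."""
    preds = np.stack([net.predict(X) for net in committee])  # shape (M, N)
    return preds.mean(axis=0), preds.var(axis=0)
```

The across-member variance is the quantity the uncertainty and hybrid strategies rank candidates by.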
Active learning proceeds in a pool‑based loop. An initial labeled set of 30 structures is expanded through query rounds of 15 structures each, yielding a final labeled set of 105. Four query strategies are compared:
- Random – uniform sampling (baseline).
- Uncertainty – select the 15 unlabeled structures with highest ensemble variance.
- Diversity – apply k‑means clustering (k = 15) in descriptor space to the unlabeled pool and pick the structure nearest each cluster centroid.
- Hybrid – combine normalized uncertainty (weight α = 0.6) and normalized diversity scores via a weighted sum.
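The diversity and hybrid queries above can be sketched as follows, assuming batch size 15 and α = 0.6 as in the paper; the uncertainty scores would come from the ensemble variance, and the min-max normalization used here is one plausible choice where the paper only says "normalized".

```python
import numpy as np
from sklearn.cluster import KMeans

def diversity_query(X_pool, batch_size=15, seed=0):
    """k-means (k = batch_size) in descriptor space; return the index of
    the pool structure nearest each cluster centroid."""
    km = KMeans(n_clusters=batch_size, n_init=10, random_state=seed).fit(X_pool)
    picks = []
    for centroid in km.cluster_centers_:
        dist = np.linalg.norm(X_pool - centroid, axis=1)
        dist[picks] = np.inf  # never select the same structure twice
        picks.append(int(np.argmin(dist)))
    return picks

def hybrid_query(uncertainty, diversity_score, batch_size=15, alpha=0.6):
    """Weighted sum of min-max-normalized uncertainty and diversity scores;
    return the indices of the top-scoring candidates."""
    def minmax(v):
        v = np.asarray(v, dtype=float)
        span = v.max() - v.min()
        return (v - v.min()) / span if span > 0 else np.zeros_like(v)
    score = alpha * minmax(uncertainty) + (1 - alpha) * minmax(diversity_score)
    return np.argsort(score)[-batch_size:][::-1].tolist()
```

Picking the pool point nearest each centroid (rather than the centroid itself) guarantees every queried structure actually exists in the unlabeled pool.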
Performance is measured by mean absolute error (MAE, eV/atom) and coefficient of determination (R²) on the held‑out test set. Each configuration is repeated with five different random seeds; results are reported as mean ± standard deviation, and paired two‑tailed t‑tests (α = 0.05) assess statistical significance against the random baseline.
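The evaluation protocol above amounts to computing per-seed MAE and R² on the held-out test set, then running a paired two-tailed t-test of each strategy against the random baseline over the seed-matched results. A minimal sketch with scikit-learn and SciPy:

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.metrics import mean_absolute_error, r2_score

def evaluate(y_true, y_pred):
    """Test-set metrics for one seed: MAE (eV/atom) and R^2."""
    return mean_absolute_error(y_true, y_pred), r2_score(y_true, y_pred)

def compare_to_baseline(mae_strategy, mae_random, alpha=0.05):
    """Paired two-tailed t-test over seed-matched MAE values (five seeds
    in the paper). Pairing by seed controls for the shared initial
    labeled set and data split within each run."""
    t_stat, p_value = ttest_rel(mae_strategy, mae_random)
    return float(p_value), p_value < alpha
```

The pairing is what makes five seeds enough to detect, e.g., the reported Ti–O improvement at p = 0.008.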
Key findings:
- Diversity sampling consistently yields the lowest or comparable MAE across all four systems. The most pronounced gain appears for Ti–O, where MAE drops from 0.912 ± 0.041 (random) to 0.813 ± 0.035 (diversity), a 10.9 % reduction (p = 0.008).
- For Si, all strategies converge to similar performance, reflecting the relative simplicity of its energy landscape.
- For Fe, diversity improves MAE by ~4 % relative to random, while uncertainty sampling performs slightly worse.
- Carbon shows only marginal differences between strategies: uncertainty and diversity achieve similar MAE, with diversity slightly ahead.
- Learning curves illustrate that, especially for complex systems, diversity sampling maintains an advantage throughout the labeling process, not just at the final stage.
- Cross‑database validation (training on MP, testing on OQMD and vice‑versa) reveals asymmetric transfer errors, with MP→OQMD generally performing better. Diversity‑driven AL reduces these transfer gaps more effectively than uncertainty‑driven AL, suggesting that broader coverage of descriptor space enhances robustness to domain shift.
Practical considerations are emphasized: the entire pipeline—data retrieval, feature computation, model training, active‑learning loop, and evaluation—runs on a free Google Colab instance in under four hours per material system, using less than 8 GB of RAM. All code, configuration files, and detailed instructions are released on GitHub, enabling researchers with limited computational resources to develop high‑quality MLIPs.
The authors acknowledge limitations: the handcrafted 17‑dimensional descriptor set may not capture all relevant physics for highly complex chemistries; k‑means clustering introduces sensitivity to initialization; and the study focuses solely on formation‑energy regression. Future work is proposed to integrate symmetry‑aware architectures such as E(3)‑equivariant graph neural networks, to explore dynamic weighting between uncertainty and diversity, and to extend the framework to other target properties (forces, stress tensors) and larger, multi‑component systems.
In summary, this work delivers a rigorously evaluated, multi‑system benchmark of active‑learning strategies for MLIP training, demonstrating that diversity‑oriented sampling offers substantial data‑efficiency gains, especially for chemically heterogeneous materials. By providing an accessible, open‑source implementation, the study lowers the barrier for the broader community to adopt active learning in materials informatics, accelerating discovery while curbing the computational expense of first‑principles calculations.