How to Train a Shallow Ensemble

Reading time: 5 minutes
...

📝 Original Info

  • Title: How to Train a Shallow Ensemble
  • ArXiv ID: 2602.15747
  • Date: 2026-02-17
  • Authors: **Author information was not provided in the paper.**

📝 Abstract

Shallow ensembles provide a convenient and computationally efficient strategy for uncertainty quantification in machine learning interatomic potentials, because the different ensemble members share a large part of the model weights. In this work, we systematically investigate training strategies for shallow ensembles to balance calibration performance with computational cost. We first demonstrate that explicit optimization of a negative log-likelihood (NLL) loss improves calibration relative to approaches based on ensembles of randomly initialized models, or on a last-layer Laplace approximation. However, models trained solely on energy objectives yield miscalibrated force estimates. We show that explicitly modeling force uncertainties via an NLL objective is essential for reliable calibration, though it typically incurs a significant computational overhead. To address this, we validate an efficient protocol: full-model fine-tuning of a shallow ensemble originally trained with a probabilistic energy loss, or one sampled from the Laplace posterior. This approach results in negligible reduction in calibration quality compared to training from scratch, while reducing training time by up to 96%. We evaluate this protocol across a diverse range of materials, including amorphous carbon, ionic liquids (BMIM), liquid water (H$_2$O), barium titanate (BaTiO$_3$), and a model tetrapeptide (Ac-Ala3-NHMe), establishing practical guidelines for reliable uncertainty quantification in atomistic machine learning.

💡 Deep Analysis

📄 Full Content

Introducing machine learning (ML) surrogate models and machine learning interatomic potentials (MLIPs) into first-principles atomistic modeling workflows should always be approached with care. Most ML models originate in statistical learning frameworks and therefore introduce additional uncertainty into their predictions, whether from limited knowledge (epistemic uncertainty) [1], irreducible noise in the training data (aleatoric uncertainty) [2,3], or the inability of the chosen ML architecture to capture complex physical interactions (model misspecification) [4]. These uncertainties add to the pre-existing discrepancies between observables computed with the electronic-structure method targeted for ML acceleration and experimental observations [5,6].

In practice, these additional sources of uncertainty can negate any advantages gained from using faster ML models. The successful deployment of ML models and trustworthy interpretation of ML-accelerated simulations therefore rely not only on accurate point estimates but also on the ability to quantify the reliability of predictions using well-calibrated uncertainty estimates [7] and, ultimately, on practical ways to propagate predicted model uncertainties to derived quantities [8,9]. Beyond data generation via active learning procedures [10][11][12][13], calibrated uncertainties are essential for production simulations, where they can flag unreliable results or be propagated through downstream workflows, for example to quantify the error on average thermodynamic quantities [9].

Neural-network-based (NN) ML models gained popularity because they offer favourable asymptotic scaling for both training and inference, and they are supported by highly optimized implementations in general machine-learning frameworks. In contrast to models such as Gaussian Process Regressors (GPR), which provide built-in posterior inference [14], traditional NNs, which consist of a single set of model weights obtained from maximum-likelihood or MAP training, are point estimators and must be equipped with approximate uncertainty quantification (UQ) schemes [15,16]. Approximate NN UQ schemes typically either approximate the model posterior or estimate quantities that can serve as proxies for the confidence of model predictions.

Common strategies to approximate the weight posterior include Bayesian Neural Networks [17], Monte Carlo dropout [18], and Deep Ensembles [7]. Alternative approaches seek to reduce computational cost via direct Mean-Variance Estimation (MVE) [19][20][21] or by utilizing distance-based metrics in latent space, such as conformal prediction [22][23][24]. However, for MLIPs, full ensembles consisting of independently trained models are often favored for their robustness and simplicity [10].
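As a concrete illustration of the MVE idea mentioned above, the sketch below shows a per-sample Gaussian negative log-likelihood in PyTorch. The function name `gaussian_nll` and the two-head setup are illustrative assumptions for this post, not the implementation used in any of the cited works.

```python
import torch

def gaussian_nll(mean, var, target, eps=1e-6):
    """Per-sample Gaussian negative log-likelihood for mean-variance
    estimation (MVE): the model predicts both a mean and a variance."""
    var = var.clamp(min=eps)  # keep the predicted variance strictly positive
    return 0.5 * (torch.log(var) + (target - mean) ** 2 / var)

# Hypothetical usage: `mean` and `log_var` would come from two output
# heads of the same network; predicting log-variance keeps var > 0.
mean = torch.randn(8, requires_grad=True)
log_var = torch.zeros(8, requires_grad=True)
target = torch.randn(8)
loss = gaussian_nll(mean, log_var.exp(), target).mean()
loss.backward()
```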

We recently introduced the “direct propagation of shallow ensembles” (DPOSE) scheme [25], an ensemble-based UQ scheme that strikes a good balance between accuracy and evaluation cost, can be applied to any architecture, and has low implementation complexity. DPOSE reduces the overhead of traditional full ensembles by sharing the model backbone and ensembling only the last layer, training all ensemble members jointly with a probabilistic Gaussian negative log-likelihood (NLL) loss function. By integrating uncertainty awareness into the training procedure, DPOSE ensures well-calibrated uncertainty estimates with negligible additional training and evaluation cost. DPOSE has been successfully applied in materials modelling applications such as propagating the uncertainties of the universal machine-learning interatomic potential PET-MAD to melting-point calculations [26], performing active learning in surface catalysis [27], modelling general organic reactions [28], and detecting out-of-domain samples [29].
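The following is a minimal PyTorch sketch of the shallow-ensemble idea behind DPOSE: a shared backbone feeds several linear last-layer heads, and the spread of the heads provides the variance that enters a Gaussian NLL loss. All names and sizes here (`ShallowEnsemble`, the placeholder backbone, `n_members`) are assumptions for illustration, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class ShallowEnsemble(nn.Module):
    """Minimal shallow ensemble: one shared backbone, K independent
    last-layer heads whose spread serves as the uncertainty estimate."""

    def __init__(self, n_features: int, n_members: int = 16):
        super().__init__()
        self.backbone = nn.Sequential(             # placeholder backbone;
            nn.Linear(n_features, 64), nn.SiLU(),  # a real MLIP backbone
            nn.Linear(64, 64), nn.SiLU(),          # would go here
        )
        # A single Linear with n_members outputs acts as K linear heads
        # evaluated jointly on the shared features.
        self.heads = nn.Linear(64, n_members)

    def forward(self, x):
        z = self.backbone(x)
        members = self.heads(z)          # (batch, n_members)
        mean = members.mean(dim=-1)      # committee mean
        var = members.var(dim=-1)        # committee spread ~ sigma^2
        return mean, var

def ensemble_nll(mean, var, target, eps=1e-6):
    var = var.clamp(min=eps)
    return (0.5 * (torch.log(var) + (target - mean) ** 2 / var)).mean()

model = ShallowEnsemble(n_features=32)
x, y = torch.randn(10, 32), torch.randn(10)
mean, var = model(x)
loss = ensemble_nll(mean, var, y)  # all members trained jointly via NLL
loss.backward()
```

Because the backbone is shared, a forward pass costs essentially the same as a single model, while the NLL objective pushes the committee spread toward the actual prediction error.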

An alternative approach to generating shallow ensembles is the Last-Layer Prediction Rigidity (LLPR) formalism [30][31][32]. It is based on a Laplace approximation of the last-layer posterior and, unlike DPOSE, is applied post-training to a single, MSE-trained MLIP. In its original formulation, the LLPR method was constructed only for direct predictions, such as energies, and not for derivatives such as forces; this extension was, however, performed for the MACE-MP0 foundation model [33]. While the DPOSE method was not initially designed with explicit training for force uncertainty, it still showed potential for providing reliable force estimates. However, the actual impact of adding force-uncertainty-aware training to the DPOSE model remained untested until now. The last-layer-based nature of the two approaches naturally raises the question of their similarity and efficacy. Specifically, it is unclear whether DPOSE is merely a Monte Carlo approximation of the same posterior targeted by LLPR, or whether the calibration of uncertainties arises from the learned features of the whole model. Furthermore, including a probabilistic fo…
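For contrast, here is a minimal sketch of a Laplace-style last-layer variance of the kind that underlies LLPR-type approaches, assuming a Gaussian noise scale `sigma2` and prior scale `alpha2` (both hypothetical names); the actual LLPR formalism in [30][31][32] includes calibration details not reproduced here.

```python
import torch

def llpr_variance(f_train, f_test, sigma2=1.0, alpha2=1.0):
    """Last-layer Laplace sketch. The posterior covariance of the
    last-layer weights is approximated as
        Sigma = (F^T F / sigma2 + I / alpha2)^{-1}
    and the predictive variance at a test point with features f is
        var(f) = f^T Sigma f.
    `f_train` (n_train, d) and `f_test` (n_test, d) are last-layer
    features extracted from an already-trained, MSE-fitted model."""
    d = f_train.shape[1]
    precision = f_train.T @ f_train / sigma2 + torch.eye(d) / alpha2
    cov = torch.linalg.inv(precision)
    # Per-sample predictive variance f^T Sigma f (epistemic part only).
    return torch.einsum("nd,de,ne->n", f_test, cov, f_test)

f_train = torch.randn(100, 16)  # hypothetical stored training features
f_test = torch.randn(5, 16)
var = llpr_variance(f_train, f_test)
```

Note that this requires only one pass over the trained model's features, which is why such Laplace-based schemes can be applied post hoc, whereas DPOSE bakes the uncertainty into training.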

Reference

This content was AI-processed from open-access arXiv data.
