Proof-Carrying Materials: Falsifiable Safety Certificates for Machine-Learned Interatomic Potentials
Machine-learned interatomic potentials (MLIPs) are deployed for high-throughput materials screening without formal reliability guarantees. We show that a single MLIP used as a stability filter misses 93% of density functional theory (DFT)-stable materials (recall 0.07) on a 25,000-material benchmark. Proof-Carrying Materials (PCM) closes this gap through three stages: adversarial falsification across compositional space, bootstrap envelope refinement with 95% confidence intervals, and Lean 4 formal certification. Auditing CHGNet, TensorNet, and MACE reveals architecture-specific blind spots with near-zero pairwise error correlations (r <= 0.13; n = 5,000), confirmed by independent Quantum ESPRESSO validation (20/20 converged; median DFT/CHGNet force ratio 12x). A risk model trained on PCM-discovered features predicts failures on unseen materials (AUC-ROC = 0.938 +/- 0.004) and transfers across architectures (cross-MLIP AUC-ROC ~ 0.70; feature importance r = 0.877). In a thermoelectric screening case study, PCM-audited protocols discover 62 additional stable materials missed by single-MLIP screening, a 25% improvement in discovery yield.
💡 Research Summary
The paper addresses a critical gap in high-throughput materials discovery: machine-learned interatomic potentials (MLIPs) are widely deployed without formal safety guarantees, leading to massive false negatives when used as stability filters. On a benchmark of 25,000 Materials Project structures, a single MLIP (e.g., CHGNet) recovers only 7% of the DFT-stable compounds, missing 93% of viable candidates. To close this gap, the authors introduce Proof-Carrying Materials (PCM), a three-stage pipeline that combines adversarial falsification, bootstrap envelope refinement, and formal certification in Lean 4.
In the first stage, six adversarial strategies—random, heuristic, grid, Latin hypercube sampling, Sobol sequences, and large‑language‑model (LLM) generators—probe composition space for counterexamples. With a modest budget of 200 queries, the LLM adversary (Gemini 3 Flash) discovers high‑Z, multi‑element regions that other strategies miss, yielding 36 unique failure cases concentrated on functionally important chemistries (e.g., topological insulators, lead‑free perovskites).
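The space-filling strategies in this stage can be illustrated with a minimal sketch: Latin hypercube samples in the unit cube, crudely projected onto the composition simplex by row normalization. The function names and the normalization step are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def latin_hypercube(n, d, rng):
    # One sample per stratum in each dimension; strata are independently
    # permuted per dimension, so each column covers [0, 1) evenly.
    u = rng.random((n, d))
    strata = np.array([rng.permutation(n) for _ in range(d)]).T
    return (strata + u) / n

def compositions(n, d, seed=0):
    # Crude projection of LHS points onto the composition simplex.
    # Note: row normalization distorts strict LHS stratification; it is
    # only meant to yield valid element fractions that sum to 1.
    x = latin_hypercube(n, d, np.random.default_rng(seed))
    return x / x.sum(axis=1, keepdims=True)
```

Each row of `compositions(n, d)` can then be interpreted as element fractions for a d-element candidate and handed to the MLIP under audit; Sobol sequences would slot into the same interface in place of `latin_hypercube`.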
The second stage uses the discovered counterexamples to tighten the safety specification. By bootstrapping 95% confidence intervals around the error distribution, PCM produces an "envelope" that contains only 97 materials (≈7% of the benchmark) while maintaining precision comparable to a full-set conformal predictor. This represents a 20-fold reduction in envelope size, focusing verification effort where it matters most.
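A percentile bootstrap of this kind can be sketched as follows, here producing a 95% confidence interval for a high quantile of an error sample. The function name and the choice of statistic (the 95th percentile of the errors) are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def bootstrap_envelope(errors, q=0.95, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the q-th quantile of an error sample."""
    rng = np.random.default_rng(seed)
    errors = np.asarray(errors, dtype=float)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Resample with replacement and recompute the quantile each time.
        resample = rng.choice(errors, size=errors.size, replace=True)
        stats[b] = np.quantile(resample, q)
    # The (alpha/2, 1 - alpha/2) quantiles of the bootstrap distribution
    # give the percentile confidence interval.
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

The upper end of the interval is the natural candidate for an envelope bound: a threshold that, with the stated confidence, is not exceeded by the chosen error quantile.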
The third stage compiles the refined envelope into machine‑checkable proofs using Lean 4, encoding physical axioms such as energy conservation and force continuity. The resulting safety certificate can be automatically verified, providing a formal guarantee that any material falling inside the envelope satisfies the prescribed error bounds.
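The certification step can be pictured as compiling envelope bounds into a decidable predicate plus a proof of membership. A minimal Lean 4 sketch, with hypothetical names that are not the paper's actual formalization:

```lean
-- Hypothetical sketch of an envelope certificate; the paper's actual
-- axioms (energy conservation, force continuity) are richer than this.
structure Envelope where
  maxForceErr  : Float
  maxEnergyErr : Float

-- A material's measured errors satisfy the certificate iff both bounds hold.
def withinEnvelope (e : Envelope) (forceErr energyErr : Float) : Prop :=
  forceErr ≤ e.maxForceErr ∧ energyErr ≤ e.maxEnergyErr

-- Membership follows directly from the two bound checks.
theorem withinEnvelope_of_bounds (e : Envelope) {f g : Float}
    (hf : f ≤ e.maxForceErr) (hg : g ≤ e.maxEnergyErr) :
    withinEnvelope e f g :=
  ⟨hf, hg⟩
```

The point of routing this through Lean is that the kernel re-checks the proof term, so a downstream consumer can verify the certificate mechanically rather than trusting the auditing pipeline.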
Empirical evaluation on three architecturally distinct MLIPs (CHGNet, TensorNet, and MACE) over 5,000 synthetic "golden-ratio" structures reveals stark architecture-specific blind spots. Failure rates are 31% (CHGNet), 76% (TensorNet), and 73% (MACE). Pairwise force-error correlations are near zero (r ≤ 0.13), and the sets of failing compositions are largely disjoint, confirming that the failure patterns stem from model architecture rather than training data alone. Traditional perturbation-based uncertainty quantification shows virtually no correlation (r = 0.039) with these compositional failures, indicating that adversarial compositional testing captures an orthogonal failure mode.
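The pairwise-correlation check reduces to a Pearson correlation between per-structure error vectors from different models; a minimal sketch (the dict-based interface is an assumption):

```python
import numpy as np

def pairwise_error_correlation(errors_by_model):
    """Pearson r between per-structure error vectors of different MLIPs.

    errors_by_model: dict mapping model name -> 1-D array of errors on the
    SAME structures in the same order. Returns (names, correlation matrix).
    """
    names = sorted(errors_by_model)
    mat = np.vstack([errors_by_model[n] for n in names])
    return names, np.corrcoef(mat)
```

Near-zero off-diagonal entries, as reported for CHGNet/TensorNet/MACE, indicate that the models fail on largely unrelated structures, which is what makes multi-model auditing informative.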
A risk model trained on features extracted from PCM-discovered failures predicts unseen failures with AUC-ROC = 0.938 ± 0.004 and perfect precision within the highest-risk quintile (the top 20% of predicted risk). Cross-MLIP transfer is also strong: a CHGNet-trained risk model predicts MACE failures with AUC-ROC ≈ 0.70 and feature-importance correlation r = 0.877.
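The headline AUC-ROC numbers can be computed from any risk score via the rank-based (Mann-Whitney) formulation; a self-contained sketch, independent of the paper's actual risk-model features:

```python
import numpy as np

def auc_roc(y_true, scores):
    """Rank-based AUC-ROC (Mann-Whitney U / (n_pos * n_neg)).

    y_true: 0/1 failure labels; scores: predicted risk (higher = riskier).
    """
    y = np.asarray(y_true, dtype=bool)
    s = np.asarray(scores, dtype=float)
    # 1-based ranks of the scores, with tied scores given their mean rank.
    order = np.argsort(s, kind="mergesort")
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    for v in np.unique(s):
        mask = s == v
        ranks[mask] = ranks[mask].mean()
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return (ranks[y].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

An AUC of 0.938 means a randomly chosen failing material is ranked riskier than a randomly chosen safe one about 94% of the time, which is the sense in which the risk model's top quintile can be trusted.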
A thermoelectric screening case study demonstrates practical impact: PCM‑audited pipelines uncover 62 additional DFT‑stable compounds that a single‑MLIP filter would reject, a 25 % increase in discovery yield. The entire audit costs under $20, with the most cost‑effective LLM adversary delivering 17 unique counterexamples for just $0.05.
In summary, the work shows that (i) single‑model benchmarks cannot capture reliability, (ii) architecture‑specific blind spots are pervasive, (iii) adversarial compositional falsification combined with bootstrap envelope refinement yields tight, high‑confidence safety specifications, and (iv) formal Lean 4 certification provides provable safety guarantees. PCM thus establishes a new paradigm for MLIP validation that blends predictive risk modeling, adversarial testing, and formal methods, paving the way for trustworthy, high‑throughput materials discovery.