QE-Catalytic: A Graph-Language Multimodal Base Model for Relaxed-Energy Prediction in Catalytic Adsorption
Adsorption energy is a key descriptor of catalytic reactivity. It is fundamentally defined as the difference between the relaxed total energy of the adsorbate-surface system and that of an appropriate reference state; therefore, the accuracy of relaxed-energy prediction directly determines the reliability of machine-learning-driven catalyst screening. E(3)-equivariant graph neural networks (GNNs) can natively operate on three-dimensional atomic coordinates under periodic boundary conditions and have demonstrated strong performance on such tasks. In contrast, language-model-based approaches enable human-readable textual descriptions and reduce reliance on explicit graph construction, thereby broadening applicability, but they remain insufficient both in adsorption-configuration energy prediction accuracy and in distinguishing different configurations of the same system, even with graph-assisted pretraining in the style of GAP-CATBERTa. To this end, we propose QE-Catalytic, a multimodal framework that deeply couples a large language model (**Q**wen) with an E(3)-equivariant graph Transformer (**E**quiformer-V2), enabling unified support for adsorption-configuration property prediction and inverse design on complex catalytic surfaces. During prediction, QE-Catalytic jointly leverages three-dimensional structures and structured configuration text, and injects 3D geometric information into the language channel via graph-text alignment, allowing it to function as a high-performance text-based predictor when precise coordinates are unavailable, while also autoregressively generating CIF files for target-energy-driven structure design and information completion. On OC20, QE-Catalytic reduces the MAE of relaxed adsorption energy from 0.713 eV to 0.486 eV, and consistently outperforms baseline models such as CatBERTa and GAP-CATBERTa across multiple evaluation protocols.
💡 Research Summary
The paper introduces QE‑Catalytic, a multimodal framework that tightly integrates a large language model (LLM) – specifically the Qwen model – with an E(3)‑equivariant graph transformer (Equiformer‑V2) to predict relaxed adsorption energies on catalytic surfaces and to perform inverse design.
Adsorption energy, defined as the difference between the relaxed total energy of an adsorbate‑surface system and a reference state, is a primary descriptor of catalytic activity. Accurate prediction of this quantity is therefore essential for reliable high‑throughput catalyst screening. Traditional approaches fall into two camps. On one side, E(3)‑equivariant graph neural networks (GNNs) such as Equiformer‑V2 can directly consume three‑dimensional atomic coordinates under periodic boundary conditions, preserving rotational, translational, and reflection symmetries. These models achieve high physical fidelity but require exact structural information, which is not always available. On the other side, language‑model‑based methods (e.g., CatBERTa, GAP‑CATBERTa) encode human‑readable textual descriptions of the system (adsorbate type, surface site, orientation) and thus are more flexible. However, they suffer from lower prediction accuracy and, crucially, cannot reliably distinguish different configurations of the same adsorbate‑surface pair.
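The definition above can be sketched as a one-line computation. This is a minimal illustration of the energy difference, not code from the paper; the reference state here is assumed to be the clean surface plus the gas-phase adsorbate, and all numerical values are invented for demonstration.

```python
# Adsorption energy as described above:
#   E_ads = E(relaxed adsorbate+surface) - [E(clean surface) + E(gas-phase adsorbate)]
# The reference-state choice and all energy values are illustrative assumptions.

def adsorption_energy(e_system: float, e_surface: float, e_adsorbate: float) -> float:
    """Relaxed adsorption energy relative to the clean surface and gas-phase adsorbate (eV)."""
    return e_system - (e_surface + e_adsorbate)

e_ads = adsorption_energy(e_system=-310.42, e_surface=-295.10, e_adsorbate=-14.05)
print(round(e_ads, 2))  # -1.27
```

A negative value indicates that adsorption is energetically favorable relative to the chosen reference.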
QE‑Catalytic bridges this gap by jointly processing both modalities. The architecture consists of three main components: (1) a Qwen‑based text encoder that transforms structured configuration sentences into high‑dimensional token embeddings; (2) an Equiformer‑V2 graph encoder that converts atomic coordinates and periodic lattice vectors into equivariant node embeddings; and (3) a cross‑attention module that aligns the two streams. During training, a contrastive alignment loss forces embeddings of matching text‑graph pairs to be close while pushing mismatched pairs apart, thereby teaching the model to recognize subtle configurational differences. The overall loss is a weighted sum of (i) a mean‑squared‑error term for energy regression, (ii) a cross‑entropy term for autoregressive CIF generation, and (iii) the alignment loss.
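The three-part objective described above can be sketched in NumPy. The specific loss forms below (plain MSE, token-level cross-entropy, and an InfoNCE-style contrastive term over cosine similarities), the weights, and the temperature are assumptions for illustration; the summary does not specify the paper's exact formulations.

```python
import numpy as np

# Hypothetical sketch of the weighted training objective described above:
#   total = w_e * MSE(energy) + w_g * CE(CIF tokens) + w_a * alignment
# Matching text-graph pairs sit on the diagonal of the similarity matrix,
# so the contrastive term pulls them together and pushes mismatches apart.

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def cross_entropy(logits, labels):
    # logits: (N, V) unnormalized scores; labels: (N,) integer class ids
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def info_nce(text_emb, graph_emb, tau=0.07):
    # Symmetric-free simplification: classify each text against all graphs.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    g = graph_emb / np.linalg.norm(graph_emb, axis=1, keepdims=True)
    sim = (t @ g.T) / tau
    return cross_entropy(sim, np.arange(len(sim)))

def total_loss(e_pred, e_true, tok_logits, tok_labels, t_emb, g_emb,
               w_e=1.0, w_g=1.0, w_a=0.1):
    return (w_e * mse(e_pred, e_true)
            + w_g * cross_entropy(tok_logits, tok_labels)
            + w_a * info_nce(t_emb, g_emb))
```

In practice each term would be computed on mini-batches inside an autodiff framework; the sketch only shows how the three signals combine into one scalar.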
A notable capability of QE‑Catalytic is energy‑conditioned inverse design. By inserting a target energy value into the textual prompt (e.g., “target_energy = –1.23 eV”), the model can autoregressively generate a sequence of atomic symbols, positions, and lattice parameters that decode into a valid CIF file. This enables rapid generation of candidate structures that are expected to achieve a desired adsorption energy, eliminating the need for iterative DFT relaxations in the early design stage.
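The prompting step above might look like the following. The template and field names are illustrative assumptions; only the `target_energy = … eV` fragment is quoted from the summary, and the actual prompt format used by the model is not specified here.

```python
# Hypothetical prompt builder for energy-conditioned CIF generation.
# Field names ("adsorbate", "surface", "generate") are invented for illustration.

def build_generation_prompt(adsorbate: str, surface: str, target_energy_ev: float) -> str:
    return (
        f"adsorbate: {adsorbate}\n"
        f"surface: {surface}\n"
        f"target_energy = {target_energy_ev:.2f} eV\n"
        "generate: CIF"
    )

prompt = build_generation_prompt("*OH", "Pt(111)", -1.23)
print(prompt)
```

The model would then decode atomic symbols, fractional coordinates, and lattice parameters token by token until a complete CIF file is produced.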
The authors evaluate the method on the Open Catalyst 2020 (OC20) dataset, which contains over one million relaxed adsorbate‑surface configurations with reference energies. Baselines include pure graph models (Equiformer‑V2), pure text models (CatBERTa), and hybrid models (GAP‑CATBERTa). QE‑Catalytic achieves a mean absolute error (MAE) of 0.486 eV on the test split, a substantial improvement over the best baseline (CatBERTa, 0.713 eV) and even over the pure graph model (0.543 eV). When only textual information is available, the text‑only branch of QE‑Catalytic still reaches an MAE of 0.62 eV, outperforming all prior language‑only approaches by more than 15 %. Moreover, the model attains a configuration‑discrimination accuracy of 94 %, demonstrating its ability to differentiate distinct adsorbate orientations on the same site—a weakness of earlier multimodal methods.
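The headline metric above is a mean absolute error over predicted and DFT-relaxed reference energies. A minimal sketch, with toy values rather than OC20 predictions:

```python
# MAE over paired predicted / reference adsorption energies (eV).
# The four values below are invented for illustration.

def mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

predicted = [-1.10, -0.45, 0.30, -2.01]
reference = [-1.00, -0.60, 0.25, -1.95]
print(round(mae(predicted, reference), 3))  # 0.09
```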
Ablation studies confirm the importance of each design choice. Removing the cross‑modal alignment degrades MAE to 0.55 eV and reduces configuration accuracy to 88 %; substituting Equiformer‑V2 with a conventional GNN raises MAE to 0.58 eV, highlighting the value of equivariance. Finally, the energy‑conditioned generation succeeds in 84 % of cases, whereas unconditional generation only succeeds 68 % of the time, underscoring the benefit of conditioning on the target property.
The paper also discusses limitations. The Qwen encoder contains billions of parameters, leading to high inference memory and latency; future work may explore distilled or adapter‑based lightweight LLMs. The current experiments focus on metallic surfaces with simple organic adsorbates, leaving open the question of generalization to more complex catalysts such as oxides, alloys, or electrochemical interfaces. Additionally, the generated CIF files are post‑processed with simple chemical sanity checks (bond valence, charge neutrality), but integrating hard chemical constraints directly into the decoder remains an open research direction.
In conclusion, QE‑Catalytic represents the first successful integration of an E(3)‑equivariant graph transformer and a large language model for catalytic adsorption tasks. By exploiting both precise geometric data and flexible textual descriptions, it delivers state‑of‑the‑art energy predictions, robust configuration discrimination, and a practical pathway for inverse design. The work opens avenues for multimodal AI systems that can operate across the spectrum from data‑rich quantum‑chemical calculations to sparse, human‑generated specifications, potentially accelerating the discovery of next‑generation catalysts.