Coupled Cluster con MōLe: Molecular Orbital Learning for Neural Wavefunctions



Luca Thiede 1,2,†, Abdulrahman Aldossary 2,3,†, Andreas Burger 1,2, Jorge Arturo Campos-Gonzalez-Angulo 2,3, Ning Wang 4, Alexander Zook 5, Melisa Alkan 5, Kouhei Nakaji 5, Taylor Lee Patti 5, Jérôme Florian Gonthier 5, Mohammad Ghazi Vakili 1,3, Alán Aspuru-Guzik 1,2,3,5,6,7,8,9,∗

1 Department of Computer Science, University of Toronto, Sandford Fleming Building, 10 King's College Road, ON M5S 3G4, Toronto, Canada
2 Vector Institute for Artificial Intelligence, 661 University Ave. Suite 710, ON M5G 1M1, Toronto, Canada
3 Department of Chemistry, University of Toronto, Lash Miller Chemical Laboratories, 80 St. George Street, ON M5S 3H6, Toronto, Canada
4 Key Lab of Electromagnetic Processing of Materials, Ministry of Education, Northeastern University, 314 Mailbox, Shenyang, 110819, People's Republic of China
5 NVIDIA, 431 King St W #6th, M5V 1K4, Toronto, Canada
6 Department of Materials Science & Engineering, University of Toronto, 184 College St., M5S 3E4, Toronto, Canada
7 Department of Chemical Engineering & Applied Chemistry, University of Toronto, 200 College St., ON M5S 3E5, Toronto, Canada
8 Acceleration Consortium, 700 University Ave., M7A 2S4, Toronto, Canada
9 Senior Fellow, Canadian Institute for Advanced Research (CIFAR), 661 University Ave., M5G 1M1, Toronto, Canada
† Contributed equally to this work.

Density functional theory (DFT) is the most widely used method for calculating molecular properties; however, its accuracy is often insufficient for quantitative predictions. Coupled-cluster (CC) theory is the most successful method for achieving accuracy beyond DFT and for predicting properties that closely align with experiment. It is known as the "gold standard" of quantum chemistry.
Unfortunately, the high computational cost of CC limits its widespread applicability. In this work, we present the Molecular Orbital Learning (MōLe) architecture, an equivariant machine learning model that directly predicts CC's core mathematical objects, the excitation amplitudes, from the mean-field Hartree-Fock molecular orbitals as inputs. We test various aspects of our model and demonstrate its remarkable data efficiency and out-of-distribution generalization to larger molecules and off-equilibrium geometries, despite being trained only on small equilibrium geometries. Finally, we also examine its ability to reduce the number of cycles required to converge CC calculations. MōLe can set the foundations for high-accuracy wavefunction-based ML architectures to accelerate molecular design and complement force-field approaches.

Date: February 25, 2026
Correspondence: Alán Aspuru-Guzik at aspuru@nvidia.com

Computational chemistry aims to predict the properties of molecules in silico, thereby facilitating insight into atomistic processes and advancing fields ranging from materials science to drug discovery. Density functional theory (DFT) has become the mainstream workhorse in computational chemistry, offering a good trade-off between speed and accuracy 1-3. DFT solves a self-consistent field (SCF) problem to obtain the electron density, from which other properties can be derived. To further speed up DFT, recent work uses ML models to predict good initial guesses in the form of Hamiltonian matrices 4-7, density matrices 8-10, or the electron density 11-19. Aside from DFT acceleration, a rapidly growing class of machine-learned interatomic potentials (MLIPs) provides accurate (relative to DFT) energies and forces at low cost by learning to imitate DFT properties 20-23.
However, even the best DFT functionals are limited in accuracy, in turn making MLIPs unreliable for applications that demand a high level of precision. Thus, higher-accuracy ab initio methods are essential for reliably describing molecular properties. Among these methods, coupled cluster (CC) theory with single, double, and perturbative triple excitations [CCSD(T)] is often considered the "gold standard" of quantum chemistry, achieving "chemical accuracy" (errors ≲ 1.6 mHa) with respect to experimental data across various systems 24-27. Unfortunately, the high computational cost of CCSD(T), which scales as O(N^7) with system size, limits its routine application to small molecules. Non-perturbative versions of coupled cluster, either excluding triple excitations in CCSD or including them fully as in CCSDT, scale as O(N^6) and O(N^8), respectively.

To overcome the limitations of DFT and bring down the cost of correlated wavefunction methods, machine learning approaches have explored learning from coupled cluster via property prediction with MLIPs 28-31, effective Hamiltonian matrix prediction 32, direct prediction with manually designed physics-inspired features using XGBoost and k-nearest neighbors 33,34, and neural DFT functionals 35-37, as well as other beyond-DFT targets such as ML Green's-function representations 38-40, active orbital prediction for selected CI 41, and neural networks as generalizing wavefunction representations 42,43 in variational quantum Monte Carlo.

The CC (de-)excitation amplitudes encode the complete correlated response of the electronic system. A model that learns these amplitudes can therefore recover the full CC property manifold. In this work, we propose the Molecular Orbital Learning model (MōLe), which extends the idea of neural surrogates for densities to CC amplitudes.
By lowering the complexity barrier of CC methods, we hope to expand their use and enable highly accurate property predictions across more applications than is currently possible. Our contributions are the following:

1. We design the first symmetry-aware neural architecture that takes molecular orbitals as input and outputs CC T-amplitudes.
2. We recalculate the QM7 44,45 dataset at the CCSD/def2-SVP level of theory, along with several out-of-distribution datasets using out-of-equilibrium geometries and much larger systems compared to QM7.
3. We demonstrate that our model successfully predicts the CCSD amplitudes, resulting in energy errors of ~0.1 mHa, electron densities more accurate than MP2, and significantly fewer SCF iterations.
4. We show that, compared to MLIPs, even when those are trained with ∆-MP2-learning 46, our model is more data efficient in- and out-of-distribution.

1 Theory

1.1 Hartree-Fock and molecular orbitals

In the Hartree-Fock (HF) approximation 2,47, the many-electron wavefunction is represented by a single Slater determinant of orthonormal molecular orbitals (MOs) ψ_p(r | {R_A}), where {R_A} are the nuclear coordinates. Each MO ψ_p(r | {R_A}) is expanded in an atomic-orbital (AO) basis as

$$\psi_p(\mathbf{r}) = \sum_{A} \sum_{k \in \mathcal{K}_A} \sum_{\ell \in \mathcal{L}_{A,k}} \sum_{m=-\ell}^{\ell} C_{pA,k\ell m}\, \phi_{k\ell}^{m}(\underbrace{\mathbf{r} - \mathbf{R}_A}_{:=\,\mathbf{r}_A}).$$

Here, A labels atoms with positions R_A, while k, ℓ, m are the principal, azimuthal, and magnetic quantum numbers. The sets K_A and L_{A,k} refer to the element-dependent shells and sub-shells, respectively, included in the chosen basis on atom A, and C_{pA,kℓm} are the MO coefficients. There are usually as many molecular spatial orbitals as there are atomic orbitals N_AO. The first N_electron molecular orbitals are called occupied orbitals, while the remaining N_AO - N_electron are called virtual orbitals.
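The expansion above can be illustrated numerically for a toy MO built from two s-type Gaussian AOs (all exponents, positions, and coefficients below are made-up values, not a real basis set):

```python
import numpy as np

def gaussian_s_ao(r, R_A, alpha):
    """Toy s-type (l = 0, m = 0) Gaussian atomic orbital centered on atom A."""
    r_A = r - R_A  # shift into the atom-centered frame, r_A := r - R_A
    return np.exp(-alpha * np.dot(r_A, r_A))

def mo_value(r, centers, alphas, coeffs):
    """psi_p(r) = sum over atoms/shells of C_{pA,k00} * phi_{k0}^0(r - R_A)."""
    return sum(c * gaussian_s_ao(r, R, a)
               for R, a, c in zip(centers, alphas, coeffs))

# Two "atoms", one s shell each (hypothetical exponents and MO coefficients).
centers = [np.zeros(3), np.array([0.0, 0.0, 1.4])]
alphas = [1.0, 0.8]
coeffs = [0.6, 0.6]  # C_{pA,k l m} entries for this single MO

value = mo_value(np.array([0.0, 0.0, 0.7]), centers, alphas, coeffs)
```

In a real calculation the radial parts are pre-tabulated contracted Gaussians and higher-ℓ shells carry spherical harmonics; the sketch only shows how the coefficient tensor weights the atom-centered basis functions.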
Throughout the paper, we will use the letters p and q to index all orbitals; i and j refer to occupied orbitals, and a and b to virtual orbitals. The AOs ϕ^m_{kℓ}(r_A) = R_k(|r_A|) Y^m_ℓ(r̂_A) are constructed from the pre-tabulated radial basis function R_k(r) and the spherical harmonic Y^m_ℓ(r̂). To calculate the MO coefficients, we solve the Roothaan-Hall equations

$$F(C)\, C = C\, \varepsilon, \qquad (1)$$

with F(C) the Fock matrix, C the MO coefficient matrix, and ε the diagonal matrix of orbital energies. Since F depends on C, this equation is solved iteratively. The columns of C are eigenvectors, and therefore the MO coefficients are only determined up to an arbitrary sign. See Section B.1 for more details.

Intuitively, the Hartree-Fock wavefunction is the best approximate solution to the Schrödinger equation under the assumption that the electronic positions do not explicitly correlate with each other. To move past this limiting assumption, "post-Hartree-Fock" methods build on the Hartree-Fock wavefunction and introduce correlation in a systematically improvable manner. Møller-Plesset perturbation theory (MPn) and coupled cluster are among the most broadly used post-Hartree-Fock correlation methods.

1.2 Møller-Plesset-2 (MP2)

MPn 2,48 is a perturbative expansion of the electronic correlation energy around the Hartree-Fock reference up to the n-th order. MP2 is the first nontrivial correction, given by:

$$E_{\mathrm{MP2}} = \frac{1}{4} \sum_{i,j,a,b} \underbrace{\frac{\langle ij \| ab \rangle}{\varepsilon_i + \varepsilon_j - \varepsilon_a - \varepsilon_b}}_{t^{ab}_{ij,\mathrm{MP2}}} \langle ij \| ab \rangle, \qquad (2)$$

with molecular orbital energies ε_p and ⟨ij||ab⟩ the two-electron integral between orbitals i, j, a, b; see Section B.2 for background on integrals. MP2 formally scales as O(N^5) with a relatively small prefactor, since there is no need to solve an equation self-consistently.
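Equation (2) translates directly into array operations; a sketch with synthetic orbital energies and a random antisymmetrized integral tensor (not a real molecule):

```python
import numpy as np

rng = np.random.default_rng(0)
n_occ, n_vir = 3, 5

# Synthetic stand-ins: orbital energies and an antisymmetrized two-electron
# tensor <ij||ab> (random numbers, not computed from any molecule).
eps_occ = -1.0 - rng.random(n_occ)        # occupied orbital energies (negative)
eps_vir = 0.5 + rng.random(n_vir)         # virtual orbital energies (positive)
eris = rng.normal(size=(n_occ, n_occ, n_vir, n_vir)) * 0.1
eris = eris - eris.transpose(1, 0, 2, 3)  # antisymmetry in i <-> j
eris = eris - eris.transpose(0, 1, 3, 2)  # antisymmetry in a <-> b

# Denominator eps_i + eps_j - eps_a - eps_b, broadcast to shape (i, j, a, b).
denom = (eps_occ[:, None, None, None] + eps_occ[None, :, None, None]
         - eps_vir[None, None, :, None] - eps_vir[None, None, None, :])

t2_mp2 = eris / denom                                # first-order doubles amplitudes
e_mp2 = 0.25 * np.einsum('ijab,ijab->', t2_mp2, eris)
```

Because the denominator is negative for a gapped system while the squared integrals are positive, the resulting correlation energy is negative, as it must be.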
1.3 Coupled cluster with single and double excitations

The coupled cluster method 2,49 expresses the exact electronic wavefunction as an exponential excitation of a reference Hartree-Fock wavefunction:

$$|\Psi_{\mathrm{CC}}\rangle = e^{\hat T} |\Phi_{\mathrm{HF}}\rangle, \qquad (3)$$

where the cluster operator T̂ = T̂_1 + T̂_2 + T̂_3 + ⋯ generates single, double, triple, and higher excitations out of the reference state. Truncating T̂ at a given excitation rank defines a hierarchy of systematically improvable methods: coupled cluster with singles and doubles (CCSD) includes single and double excitations, CCSDT adds triples, CCSDTQ adds quadruples, and so on. Therefore, in CCSD, the correlated wavefunction is

$$|\mathrm{CCSD}\rangle = \exp\!\left(\hat T_1 + \hat T_2\right) |\Phi_{\mathrm{HF}}\rangle, \qquad (4)$$

with

$$\hat T_1 = \sum_{ia} t^a_i\, \hat a^\dagger_a \hat a_i, \qquad \hat T_2 = \frac{1}{4} \sum_{ijab} t^{ab}_{ij}\, \hat a^\dagger_a \hat a^\dagger_b \hat a_j \hat a_i, \qquad (5)$$

where t^a_i and t^{ab}_{ij} are the singles and doubles amplitudes, and â and â† are the fermionic annihilation and creation operators. The CCSD amplitudes t^a_i and t^{ab}_{ij} are the key objects, which we usually obtain by solving a non-linear system of equations; see more in Section B.3. Solving the CCSD equations formally scales as O(N^6), with a large prefactor.

MP2 can be understood as the first-order approximation to coupled-cluster theory:

$$t^{ab}_{ij,\mathrm{CCSD}} \approx t^{ab}_{ij,\mathrm{MP2}}. \qquad (6)$$

Since MP2 is computationally inexpensive compared to CCSD, it is typically used to initialize CCSD calculations.

An essential characteristic of the T-amplitudes t^a_i and t^{ab}_{ij} is their asymptotic behavior as a function of the separation of the associated molecular orbitals: for two infinitely separated, closed-shell molecular systems A and B, after localization (see Section 1.4), the molecular orbitals will be uniquely associated with either system A or B. Therefore, any excitation amplitude involving orbitals from both noninteracting subsystems A and B vanishes.
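The exponential ansatz of Eq. (3) can be illustrated in a toy two-determinant space, where a single excitation operator is nilpotent and the exponential series terminates (a sketch, not a real CC calculation):

```python
import numpy as np

# Minimal two-determinant space: basis index 0 = |Phi_HF>, 1 = |Phi_ij^ab>.
# A single excitation operator t * E with E nilpotent (E @ E = 0) gives
# exp(T) = 1 + T exactly, illustrating |Psi_CC> = exp(T) |Phi_HF>.
t = 0.1
E_op = np.array([[0.0, 0.0],
                 [1.0, 0.0]])  # promotes |Phi_HF> to the doubly excited determinant
T = t * E_op

exp_T = np.eye(2) + T + 0.5 * T @ T  # series terminates because T @ T = 0
phi_hf = np.array([1.0, 0.0])
psi_cc = exp_T @ phi_hf              # = |Phi_HF> + t |Phi_ij^ab>
```

With many excitation operators the products of different T̂ components no longer vanish, which is precisely how the exponential generates disconnected higher excitations from low-rank amplitudes.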
The CC amplitudes are all that is required to approximately calculate any electronic ground-state property of the system 50. Most notably, this includes the correlation energy

$$E_{\mathrm{correlation}} = \sum_{ijab} \left( \frac{1}{4} t^{ab}_{ij} + \frac{1}{2} t^a_i t^b_j \right) \langle ij \| ab \rangle. \qquad (7)$$

This expression also holds for higher orders of non-perturbative CC methods, i.e., CCSDT, CCSDTQ, etc. Thus, we only ever need to predict the T1 and T2 tensors for energies, opening the door for our MōLe model to move beyond CCSD energies in the future.

1.4 Localized Molecular Orbitals

Canonical molecular orbitals (MOs), as defined in Eq. (1), are generally delocalized over the entire molecule. We can transform MOs into a localized representation that leaves all observables invariant; see Section D. We hypothesize that locality represents an essential inductive bias 51 that facilitates easier training, which we experimentally confirm in Table S1 in the Appendix. Therefore, we use localized orbitals for all of our models.

2 Equivariance and equivariant models

Many mappings of physical properties are equivariant under coordinate-system transformations, meaning they transform predictably. Formally, a function f : X → Y is equivariant with respect to the elements g of a group G, which acts on X and Y via D_X(g) and D_Y(g), if

$$f(D_X(g)\, x) = D_Y(g)\, f(x). \qquad (8)$$

Oftentimes, the group G of interest is SO(3), the group of rotations in three-dimensional space, parameterized by the Euler angles Q = (α, β, γ) 52,53. The irreducible representations (irreps) of this group are known as the Wigner-D matrices, with elements D^ℓ_{mm′}(Q), where the parameter ℓ is determined by the space they act upon. For example, the Hartree-Fock algorithm is the map that assigns to each molecular geometry {R} a set of angular-momentum-labeled coefficient vectors that represent the molecular orbitals, C_{pA,kℓ}({R}) : R^{3N_Atoms} → R^{2ℓ+1}.
This map is equivariant with respect to spatial rotations:

$$C_{pA,k\ell}\big(D^{1}(Q)\, \{R\}\big) = D^{\ell}(Q)\, C_{pA,k\ell}(\{R\}). \qquad (9)$$

In contrast, the amplitudes t^a_i and t^{ab}_{ij} are invariant under rotations of the molecule:

$$t^{ab}_{ij}\big(D^{1}(Q)\, \{R\}\big) = t^{ab}_{ij}(\{R\}). \qquad (10)$$

Another important symmetry is sign equivariance: if the sign of one of the molecular orbitals i, j, a, or b associated with t^{ab}_{ij} flips, the amplitude also flips its sign. Formally, for the amplitude t^{ab}_{ij} = t(C_i, C_j, C_a, C_b) we have

$$t(-C_i, C_j, C_a, C_b) = t(C_i, -C_j, C_a, C_b) = \dots = -t(C_i, C_j, C_a, C_b). \qquad (11)$$

2.1 Equivariant Neural Networks

Equivariant neural networks are models that, by design, respect the equivariance property in Eq. (8), often leading to greater data efficiency 54-56. ML models for atomistic modeling are usually designed to respect permutation of feature labels and O(3) symmetries by modeling the system as a graph neural network (GNN) with features h_{i,k,ℓ,m} that carry the O(3) irreps and therefore transform under rotation with the Wigner D-matrices. Two symmetry-adapted features, p^{ℓ_1} and g^{ℓ_2}, can interact with each other while preserving equivariance using the bilinear Clebsch-Gordan tensor product 52

$$h^{\ell_3}_{m_3} = \sum_{m_1=-\ell_1}^{\ell_1} \sum_{m_2=-\ell_2}^{\ell_2} C^{\ell_3 m_3}_{\ell_1 m_1,\, \ell_2 m_2}\, p^{\ell_1}_{m_1}\, g^{\ell_2}_{m_2}, \qquad (12)$$

where C^{ℓ_3 m_3}_{ℓ_1 m_1, ℓ_2 m_2} is the Clebsch-Gordan coefficient that implements the O(3)-equivariant coupling of the irreps ℓ_1 and ℓ_2 into the output irrep ℓ_3, mapping basis elements (ℓ_1, m_1) and (ℓ_2, m_2) to (ℓ_3, m_3). Equation (12) is used to build equivariant message passing from irrep-carrying node and edge features. See Section A for more details. The MACE architecture 57 is one of the most prominent equivariant message-passing architectures and will be an important component of our model design; see Section 3.2.2.
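For the special case ℓ1 = ℓ2 = 1 and ℓ3 = 0, the Clebsch-Gordan coupling of Eq. (12) in the real basis reduces to a dot product up to a constant, which makes the rotation invariance of the scalar output easy to verify numerically (a sketch):

```python
import numpy as np

def couple_to_scalar(p, g):
    """Clebsch-Gordan coupling of two l = 1 features into the l3 = 0 output.
    In the real (Cartesian) basis this coupling is the dot product up to a
    normalization constant, and is therefore manifestly rotation invariant."""
    return p @ g / np.sqrt(3.0)

def rot_z(angle):
    """Rotation matrix about the z axis (acts on l = 1 features directly)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

rng = np.random.default_rng(2)
p, g = rng.normal(size=3), rng.normal(size=3)
R = rot_z(0.7)

before = couple_to_scalar(p, g)
after = couple_to_scalar(R @ p, R @ g)  # rotating both inputs leaves the scalar unchanged
```

Couplings into ℓ3 > 0 outputs transform with the corresponding Wigner-D matrix instead of staying fixed; libraries such as e3nn implement the general case.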
3 MōLe architecture

Figure 1 Given a molecule, a Hartree-Fock calculation provides the molecular orbitals represented by their coefficient matrix C. The coefficient vector is padded for each atom to ensure that they all have the same number of basis coefficients, enabling their embedding in an equivariant neural network. The model then alternates message passing to mix information within the MOs and attention layers to mix information between MOs. Finally, the embeddings are read out by "outer product-like" operations, outputting the T1 and T2 amplitudes.

Our MōLe architecture design is guided by satisfying MO and CC wavefunction symmetries and asymptotics:

1. MO coefficients are rotation equivariant, Eq. (9).
2. T-amplitudes are rotation invariant, Eq. (10).
3. T-amplitudes are sign equivariant, Eq. (11).
4. The T-amplitudes coupling the localized MOs of two infinitely separated closed-shell molecules are 0.

An overview of the model architecture is shown in Fig. 1, and details are provided in Fig. 2.

Figure 2 On the left, we summarize the key components of the MōLe architecture, consisting of a padding and embedding layer, (T) transformer layers, and finally the T1 and T2 readout. On the right, we further detail the attention mechanism, as well as the T1 and T2 readout of our architecture.

The model proceeds in four main steps: first, we run a Hartree-Fock calculation and localize the MOs. We then "embed" the coefficients onto "graph-states", one graph per MO, by initializing equivariant GNN features with the MO coefficients. The features are then processed by (T) transformer blocks, each combining a MACE-like layer, a layer norm, and an attention mechanism to capture local atomic and long-range orbital correlations. Finally, a readout layer converts the latent MO features into T1 and T2 amplitudes.
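The four steps can be summarized in a schematic forward pass; every block below is a shape-level stand-in (plain matrix mixing and an elementwise nonlinearity), not the actual equivariant MōLe layers:

```python
import numpy as np

def mole_forward(x, n_occ, n_layers=2):
    """Schematic forward pass over embedded MO features.

    x: array of shape (n_mo, n_feat) standing in for the padded, embedded
       MO coefficients (one feature row per molecular orbital).
    Returns placeholder T1 (occ, vir) and T2 (occ, occ, vir, vir) tensors.
    """
    n_mo, n_feat = x.shape
    for _ in range(n_layers):
        scores = x @ x.T               # attention stand-in: couple MOs pairwise
        x = x + scores @ x / n_mo      # mix information *between* MOs
        x = np.tanh(x)                 # nonlinearity standing in for the MACE-like block
    occ, vir = x[:n_occ], x[n_occ:]
    # "Outer product-like" readouts over occupied/virtual feature rows.
    t1 = occ @ vir.T
    t2 = np.einsum('if,jf,af,bf->ijab', occ, occ, vir, vir)
    return t1, t2

x = np.random.default_rng(3).normal(size=(6, 4))  # 6 MOs, 4 latent features
t1, t2 = mole_forward(x, n_occ=2)
```

The real architecture replaces each stand-in with the sign- and rotation-equivariant blocks detailed in the following subsections.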
3.1 MO embedding

Since different atom types have different numbers of basis functions, but GNNs expect all atoms to have the same number of features, we need to pad the MO coefficients before we can use them as initial features in our equivariant GNN; see Fig. 1 for an illustration and Section E.1 for details. Using an equivariant linear layer W_{ℓ,kk̄}, the initial MO coefficients are then embedded into a hidden dimension K, k ∈ {1, ..., K}, that will be kept throughout the transformer blocks:

$$x^{(0)}_{pA,k\ell m} = \sum_{\bar k} W^{(0)}_{\ell, k \bar k}\, C^{\mathrm{padded}}_{pA, \bar k \ell m}, \qquad (13)$$

where the W_{ℓ,kk̄} are learnable, equivariance-preserving weight matrices. Note that, in contrast to machine learning force fields, where a single graph neural network maps to a target property, here there are N_MO graph neural networks in parallel, one for each MO, with the same weights shared across them.

3.2 Equivariant Transformer block

Next, an equivariant transformer block processes the MO embeddings across multiple transformer layers, where each layer transforms the input features while preserving their shape, rotation and sign equivariance, and only makes MOs interact if they belong to the same fragment. The t-th layer transforms the input features x^{(t)}_{pA} = {x^{(t)}_{pA,kℓm}} to produce output features x^{(t+1)}_{pA} of the same shape. Throughout the paper, we use bold letters to denote arrays without specifying their indices. The transformer consists of three main components: an MO-Attention block to capture correlation between MOs, a normalization layer, and "Odd-MACE" to mix features within an MO, interleaved with skip connections as shown in Fig. 2.

3.2.1 MO-Attention

The MO-Attention mechanism processes the MO embeddings through multi-head attention, where each head applies learnable weight matrices to construct the query, key, and value representations.
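Both the embedding of Eq. (13) and the per-head projections of the attention block are ℓ-wise linear maps that mix only the channel index k while leaving m untouched; a numpy sketch with toy shapes (the channel sizes and padding pattern below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
K_in, K_hid = 2, 4

# Padded MO coefficients on one atom, stored per l as (K_in, 2l+1) arrays;
# shells absent from an element's basis are zero-padded (toy values here).
C_padded = {0: rng.normal(size=(K_in, 1)),
            1: rng.normal(size=(K_in, 3))}
C_padded[1][1] = 0.0  # e.g. pad a missing second p shell with zeros

# One weight matrix per l, shared across all m components (Eq. 13): mixing
# only the channel index k is what preserves rotation equivariance.
W = {l: rng.normal(size=(K_hid, K_in)) for l in (0, 1)}
x0 = {l: np.einsum('kj,jm->km', W[l], C_padded[l]) for l in (0, 1)}
```

Because the map is linear and touches only k, rotating the m components of the input and then embedding gives the same result as embedding first and rotating the output, which is exactly the equivariance requirement.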
For each attention head h, the weight matrices W^{(Q,h,t)}, W^{(K,h,t)}, and W^{(V,h,t)} project x^{(t)}_{pA,kℓm} to the query, key, and value features:

$$Q^{(h,t)}_{pA,k\ell m} = \sum_{\bar k} W^{(Q,h,t)}_{\ell, k \bar k}\, x^{(t)}_{pA, \bar k \ell m}, \qquad (14)$$

and similarly for K and V. Next, we normalize the L2 norm of the features:

$$\tilde Q^{(h,t)}_{pA,k\ell m} = \mathrm{Norm}_\epsilon\big(Q^{(h,t)}_{p}\big), \quad \tilde K^{(h,t)}_{pA,k\ell m} = \mathrm{Norm}_\epsilon\big(K^{(h,t)}_{p}\big), \quad \text{with} \quad \mathrm{Norm}_\epsilon\big(X^{(h,t)}_{p}\big) = \frac{X^{(h,t)}_{pA,k\ell m}}{\sqrt{\sum_{Ak\ell m} \big|X^{(h,t)}_{pA,k\ell m}\big|^2 + |\epsilon|}}, \qquad (15)$$

where ϵ is learnable. The attention score S^{(h,t)}_{pq} between MO p and MO q is computed as:

$$S^{(h,t)}_{pq} = \sum_{Ak\ell m} \tilde Q^{(h,t)}_{pA,k\ell m} \cdot \tilde K^{(h,t)}_{qA,k\ell m}. \qquad (16)$$

To preserve sign equivariance, we skip the usual softmax normalization and directly sum and renormalize the attention-weighted contributions, which is structurally similar to previously proposed transformer variants like the TransNormer 58:

$$\tilde o^{(h,t)}_{pA,k\ell m} = \sum_{q} \tilde S^{(h,t)}_{pq}\, V^{(h,t)}_{qA,k\ell m}, \qquad o^{(h,t)}_{pA,k\ell m} = \mathrm{Norm}_\epsilon\big(\tilde o^{(h,t)}_{p}\big).$$

Finally, we sum over the different heads:

$$x^{(t+1)}_{pA,k\ell m} = \sum_{h} w^{(t)}_{h}\, o^{(h,t)}_{pA,k\ell m},$$

where the w_h are learnable parameters. The updated node features x^{(t+1)}_{pA} are then passed to the "Odd-MACE" block.

Since the attention only uses inner products between features with the same ℓ, which are rotation invariant, and an L2 norm, which is rotation invariant, the attention weights are also rotation invariant. Summing the rotation-equivariant features with invariant weights therefore preserves the rotation equivariance of the attention mechanism. Sign equivariance is preserved because the inner product is sign equivariant with respect to both MO features. Finally, if two MOs are localized on two non-interacting fragments, their inner product is 0, since there are no overlapping non-zero coefficients. Therefore, the attention mechanism preserves size extensivity.
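The softmax-free, sign-equivariant attention can be sketched with plain numpy (a single head with queries, keys, and values all derived from the same features; the real model uses learned per-ℓ projections and a learnable ϵ):

```python
import numpy as np

def normed(x, eps=1e-8):
    """L2-normalize each MO's feature row; an odd map, since normed(-x) = -normed(x)."""
    return x / np.sqrt((x ** 2).sum(axis=-1, keepdims=True) + eps)

def mo_attention(x):
    """Softmax-free attention over MO features. Skipping the softmax keeps the
    map sign equivariant in each MO: flipping one MO's sign flips only its output."""
    q, k, v = normed(x), normed(x), x
    scores = q @ k.T          # invariant inner products between MOs
    return normed(scores @ v)

rng = np.random.default_rng(5)
x = rng.normal(size=(4, 8))   # 4 MOs, 8 latent features

out = mo_attention(x)
x_flipped = x.copy()
x_flipped[2] *= -1.0          # flip the sign of MO 2's features
out_flipped = mo_attention(x_flipped)
```

A softmax would destroy this property because exp(·) is not odd; the sum-and-renormalize scheme keeps the whole map odd in each MO's features.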
3.2.2 Odd-MACE

To allow mixing within each MO feature, one MACE layer is applied after each attention block. MACE is an equivariant neural network whose key equivariant non-linear operation can be viewed as a tensor polynomial in the irrep-carrying features,

$$x^{(\nu)}_{pA,k} = \sum_{\xi=0}^{\nu} \sum_{k'} W^{(\xi)}_{k,k'} \big(x_{pA,k'}\big)^{\otimes \xi}. \qquad (17)$$

The tensor power (x_{pA,k'})^{⊗ξ} denotes the stack of all order-ξ equivariant tensor products constructed by repeated application of Eq. (12) to the components of x_{pA,k'}, and W^{(ξ)}_{k,k'} is a learned linear map. The tensor polynomial is applied to the node features after each message-passing layer. In practice, Eq. (17) is implemented efficiently using Horner's method. It is important to note that the even tensor monomials are sign invariant, while the odd tensor monomials are sign equivariant. Thus, to enforce sign equivariance in our model, we restrict the monomial orders ξ in Eq. (17) to the odd ones, which we call "Odd-MACE". Since MACE is an equivariant neural network, it preserves the rotational equivariance of the MO features. Message passing itself is sign equivariant with respect to its input features, such that, together with the restriction to odd monomials, Odd-MACE ensures sign equivariance. Finally, the cutoff radius enforced in message passing prevents interactions between far-separated fragments, thereby ensuring size extensivity.

3.3 Layer Normalization

Our architecture interleaves equivariant layer normalization blocks between each attention and Odd-MACE block, as shown in Fig. 2. The layer normalization is inspired by 59; however, we found it beneficial to make ϵ learnable. Please see Section E.2 for details.

3.4 Readout

The readout layers transform the latent MO features into the final T1 and T2 amplitudes.
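The claim that only odd monomial orders preserve sign equivariance can be checked on a toy, channel-wise version of Eq. (17) (a sketch: real Odd-MACE couples irreps through Clebsch-Gordan products rather than elementwise powers):

```python
import numpy as np

def odd_poly(x, w1, w3):
    """Toy tensor polynomial restricted to the odd monomial orders 1 and 3.
    Adding an even term such as w2 * x**2 would break f(-x) = -f(x)."""
    return w1 * x + w3 * x ** 3

rng = np.random.default_rng(6)
x = rng.normal(size=5)
w1, w3 = rng.normal(size=5), rng.normal(size=5)

y = odd_poly(x, w1, w3)
y_neg = odd_poly(-x, w1, w3)  # sign-flipped input gives exactly the negated output
```

The same parity argument applies order by order to the full equivariant tensor monomials, which is why dropping the even orders suffices.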
3.4.1 T1 readout

The T1 readout combines features from two molecular orbitals (one occupied orbital i and one virtual orbital a) to produce single excitation amplitudes. We start by normalizing each feature using:

$$\tilde x^{(T)}_{iA,k\ell m} = \mathrm{Norm}_\epsilon\big(x^{(T)}_{i}\big). \qquad (18)$$

Using an element-wise tensor product, we calculate invariant pairwise features for every ℓ:

$$\tilde y^{(T)}_{iA,k\ell m} = \sum_{\bar k} W^{\mathrm{single}}_{\ell, k \bar k}\, \tilde x^{(T)}_{iA, \bar k \ell m}, \qquad \tilde y^{(T)}_{aA,k\ell m} = \sum_{\bar k} W^{\mathrm{single}}_{\ell, k \bar k}\, \tilde x^{(T)}_{aA, \bar k \ell m},$$
$$y_{ia,k\ell} = \sum_{A, m_1, m_2} \tilde y^{(T)}_{iA, k \ell m_1}\, \tilde y^{(T)}_{aA, k \ell m_2}\, C^{00}_{\ell m_1,\, \ell m_2}, \qquad (19)$$

where the W^{single}_{ℓ,kk̄} are learnable weights. Finally, a sign-equivariant "Odd-MLP" (no bias and tanh nonlinearity) processes the features to predict the final amplitudes:

$$t^a_i = \text{Odd-MLP}(y_{ia}). \qquad (20)$$

3.4.2 T2 readout

The T2 readout combines features from two occupied orbitals i and j, and two virtual orbitals a and b, to produce double excitation amplitudes. First, we calculate intermediate pair-features:

$$\tilde f_{iaA,k\ell m} = \sum_{\bar k} \sum_{\ell_1 m_1 \ell_2 m_2} w^{(1)}_{\bar k \ell \ell_1 \ell_2}\, x^{(T)}_{iA,k\ell_1 m_1}\, x^{(T)}_{aA,\bar k \ell_2 m_2}\, C^{\ell m}_{\ell_1 m_1,\, \ell_2 m_2},$$
$$f_{iaA,k\ell m} = \mathrm{Norm}_\epsilon\big(\tilde f_{ia}\big), \qquad f'_{iaA,k\ell m} = \sum_{\bar k} W_{\ell, k \bar k}\, f_{iaA, \bar k \ell m},$$

and similarly for the j, b orbitals, resulting in f'_{jbA,kℓm}. These intermediate representations are then further combined to create the final four-orbital features:

$$y_{ijab,k\ell} = \sum_{A} \sum_{m} f'_{iaA,k\ell m}\, f'_{jbA,k\ell m}\, C^{00}_{\ell m,\, \ell m}, \qquad (21)$$

used to predict the amplitudes with an Odd-MLP:

$$t^{ab}_{ij} = \text{Odd-MLP}(y_{ijab}) + t^{ab}_{ij,\mathrm{MP2}}. \qquad (22)$$

Since we need the two-particle integrals for evaluating the energies anyway (see Eq. (7)), we get the MP2 amplitudes for free. Therefore, we train our model to predict only the difference between MP2 and CCSD amplitudes, a technique denoted ∆-MP2-learning 46.
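The Odd-MLP used in Eqs. (20) and (22) is a bias-free MLP with tanh nonlinearities; since both components are odd functions, the composition is sign equivariant. A minimal sketch (toy layer sizes):

```python
import numpy as np

def odd_mlp(y, W1, W2):
    """Bias-free two-layer MLP with tanh. Both the linear maps (no bias) and
    tanh are odd functions, so the composition satisfies f(-y) = -f(y)."""
    return W2 @ np.tanh(W1 @ y)

rng = np.random.default_rng(7)
W1, W2 = rng.normal(size=(16, 8)), rng.normal(size=(1, 16))
y = rng.normal(size=8)         # stand-in for the invariant pair features y_{ia}

t1_pred = odd_mlp(y, W1, W2)
t1_flip = odd_mlp(-y, W1, W2)  # flipping an orbital's sign flips y, hence the amplitude
```

Adding a bias term anywhere would break the identity f(0) = 0 and with it the sign equivariance of the predicted amplitudes.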
4 Experiments

Our experimental goal is to assess the efficacy of our neural network surrogate for coupled-cluster (CC) amplitudes in predicting corrections from MP2 to CCSD amplitudes, and whether these predictions yield strong performance on downstream tasks. Fundamentally, our thesis is that CC methods are too expensive to afford large-scale datasets on large molecules. This is already true for CCSD, but it would be even more severe if we moved to higher-order CC methods, such as CCSDT. Therefore, we restrict ourselves to training on the relatively small QM7 dataset. We are then interested in six aspects of our model:

1. In-distribution accuracy: How well does MōLe reproduce CCSD energies on the QM7 test split?
2. Size extrapolation: How well does the model generalize to molecules significantly larger than those in the training distribution?
3. Off-equilibrium extrapolation: Can the model remain accurate for off-equilibrium geometries, even though it is trained only on equilibrium structures?
4. Ultra-low data regime: How well does MōLe perform when trained on only 100 samples?
5. Other properties: To what extent do amplitude predictions support accurate prediction of other observables, such as the electron density?
6. SCF convergence: Can the predicted amplitudes serve as a good initial guess, reducing CCSD solver iterations?

Given the restriction to small datasets, we are particularly interested in the data efficiency of our model. To contextualize the data efficiency, we compare it against the MACE and eSEN 60 MLIP architectures. For fairness, we train the MLIPs in a ∆-learning setting, where we predict only the correction from MP2 to CCSD energies. We also train a model without delta learning to add context for the results one would obtain with standard MACE training.
We emphasize that MLIPs and amplitude-based models target different objectives: MLIPs prioritize speed and scalability, whereas MōLe is designed for data efficiency, completeness of properties, and CC-compatibility. Therefore, the comparison should be interpreted as a reference point rather than a claim that MōLe is a universal replacement for MLIPs.

Table 1 Mean absolute error of energy predictions in mHa. "MACE (∆-MP2)" and "eSEN (∆-MP2)" models are trained to predict ∆E = E_CCSD - E_MP2, while "MACE" is trained to predict E_CCSD directly. MōLe was trained to predict ∆t^{ab}_{ij} = t^{ab}_{ij,CCSD} - t^{ab}_{ij,MP2}. We train each model on a 100-molecule subset (denoted "Model"-100) and on the full training split (5732 molecules) of QM7 (denoted "Model"). The 100-molecule models are averaged over three seeds. Pure MACE without MP2 scales quadratically up to a certain system size, and then linearly due to cutoff radii. Similar cutoffs can be made for wavefunction methods as well but are harder to implement, which is why we list them at the fifth power here.

                                     Train set   Size extrapolation       Out-of-equilibrium
Model              Complexity        QM7         Amino acids   PubChem    Diels-Alder   Dihedral scan   Chair-to-boat
MACE-100           O(N)-O(N^2)       19.15       26.41         64.89      15.01         15.37           26.56
MACE-100 (∆-MP2)   O(N^5)            1.64        5.53          5.29       3.17          0.79            1.16
eSEN-100 (∆-MP2)   O(N^5)            1.48        7.43          17.41      5.55          2.74            2.53
MōLe-100           O(N^5)            0.66        0.67          2.80       1.50          0.33            0.24
MACE               O(N)-O(N^2)       1.83        10.60         17.74      7.05          4.61            4.50
MACE (∆-MP2)       O(N^5)            0.16        0.49          2.24       1.57          0.35            0.39
eSEN (∆-MP2)       O(N^5)            0.13        1.56          4.66       1.15          0.59            0.66
MōLe               O(N^5)            0.12        0.78          1.63       1.16          0.22            0.08

4.1 QM7 experiments

QM7 consists of 7165 small organic molecules composed of C, N, O, S, and H. We use the geometries provided in the original papers 44,45 and recompute all labels at the CCSD/def2-SVP 61 level of theory. For each molecule, we store the full set of coupled-cluster amplitudes.
We train a model on an 80/20 split, and additionally three models on 100-molecule subsets using three random seeds. Hyperparameters and training details for MōLe and the MLIPs are provided in Section H. We use the checkpoint obtained from training on QM7 for all the following evaluations.

The mean absolute error (MAE) on the QM7 test set is 6.5 × 10^-5 for the T1 amplitudes and 9.84 × 10^-7 for the T2 amplitudes. This translates to an excellent energy error of 0.12 mHa, slightly better than the ∆-MP2 MLIPs (see Table 1) and significantly better than the non-∆-learning MACE model. Surprisingly, even on the much smaller 100-molecule subset, MōLe achieves strong energy predictions of 0.66 mHa. In this low-data regime, the data-efficiency gap to the MLIPs becomes very clear. As an example, we plot the predictions of MōLe and the ground-truth amplitudes in Fig. S1 in the Appendix, illustrating that MōLe successfully predicts the highly non-trivial amplitude pattern. Next, we evaluate our model on several out-of-distribution datasets, each of which we generate at the CCSD/def2-SVP level of theory.

4.2 Size extrapolation experiments

To study size generalization, we evaluate MōLe on two out-of-distribution datasets containing molecules twice the size of those in QM7. This is particularly important given the steep scaling cost of CC methods, which renders the generation of datasets with large molecules prohibitive.

Amino acids The amino acids dataset contains 18 structurally diverse amino acids with up to 15 heavy atoms.

PubChem We further construct a diverse benchmark by randomly sampling 100 molecules from the PubChem database, each containing 14 heavy atoms made from C, H, N, and O. The PubChem test set represents the most diverse and chemically challenging benchmark in our evaluation.

Table 1 shows that MōLe generalizes better to larger molecules, in particular in the ultra-low data regime.
4.3 Off-equilibrium experiments

We also test the model's ability to generalize to off-equilibrium geometries, despite being trained only on equilibrium structures, using three distinct chemical systems:

Diels-Alder reaction The Diels-Alder reaction is one of the most commonly studied reactions in chemistry. As an example, we use the Diels-Alder reaction path turning ethylene and 1,3-butadiene into cyclohexene. The energy error along the intrinsic reaction coordinate is shown in Fig. 3a.

Dihedral scan Next, we perform a dihedral scan of the butane molecule, as it forces the geometry through a transition point. The energy error and zero-shifted relative energy scan are plotted in Fig. 3b.

Chair-to-boat conformation change The chair-to-boat transition of cyclohexane is a famous example of conformational isomerism, where cyclohexane undergoes a "ring flip" by rotating its carbon-carbon single bonds. This process forces the molecule through a high-energy half-chair transition state and a local-minimum twist-boat before reaching the boat conformation.

Table 1 shows that MōLe has significantly lower error for each system, except for the Diels-Alder reaction where it is tied with eSEN, again particularly in the ultra-low data regime.

Figure 3 The energy error of MōLe, MP2+MACE (i.e., ∆-MP2), and MACE along two scans: (a) the Diels-Alder reaction of ethene and 1,3-butadiene turning into cyclohexene, and (b) a dihedral scan of the butane molecule. In the inset, the potential energy surface is shown, with the black line indicating the ground-truth energies. MōLe achieves lower error particularly in the transition-state region, where MACE would overestimate the activation energies.

4.4 One-particle properties

From the amplitudes, we can derive the one-particle reduced density matrix (1-RDM) γ(r, r′) = γ_pq ψ_p(r) ψ_q(r′) with XCCSD 50; see Section C.
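Given γ_pq and tabulated MO values, the diagonal of the 1-RDM (the electron density) can be evaluated on a grid; a sketch with synthetic stand-ins, not values from a real calculation:

```python
import numpy as np

rng = np.random.default_rng(8)
n_mo, n_grid = 4, 50

# Synthetic stand-ins: a symmetric 1-RDM gamma_pq and MO values psi_p(r)
# tabulated on a set of grid points (illustrative numbers only).
gamma = rng.normal(size=(n_mo, n_mo))
gamma = 0.5 * (gamma + gamma.T)
psi = rng.normal(size=(n_mo, n_grid))   # psi[p, g] = psi_p(r_g)

# rho(r) = sum_pq gamma_pq psi_p(r) psi_q(r), i.e. the diagonal gamma(r, r).
rho = np.einsum('pq,pg,qg->g', gamma, psi, psi)
```

The same contraction pattern with r ≠ r′ gives the full off-diagonal γ(r, r′).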
From the 1-RDM, we can derive any one-particle property, including spatially resolved ones. The most fundamental of these is the electron density ρ(r) = γ(r, r), which in turn lets us calculate other spatially resolved quantities such as the electron localization function (ELF) [62], the Fukui function [63, 64], Bader's QTAIM [65], and X-ray diffraction patterns. We report the Frobenius norm of the density-matrix error in Table 2, which shows that our density matrices are much better than MP2. As an example, we plot the error of the electron density |ρ_Model(r) − ρ_CCSD(r)| in Fig. 4. We see that MōLe performs significantly better than MP2.

Table 2 The Frobenius norm of the error in the density matrices of MP2 and MōLe relative to CCSD.

Model | Amino acids | PubChem
MP2   | 0.59        | 0.90
MōLe  | 0.13        | 0.36

Figure 4 The electron density error of MP2 and MōLe on the L-arginine amino acid. The error is plotted at the 95th percentile of the MP2 error.

4.5 CCSD cycle reduction

In cases where the MōLe model is not fully trusted, for example for unusual molecular structures, the guarantees provided by rigorous theory can be desirable. In that case, the predicted MōLe amplitudes can serve as an initial guess for a CCSD solver to reduce the number of cycles needed for convergence. We test this on the out-of-distribution molecules from the amino acids and PubChem datasets. We turn Direct Inversion in the Iterative Subspace (DIIS) [66] off to save memory and set our convergence threshold to 10⁻³ for the L2 norm of the T-amplitude changes. This threshold corresponds to energy errors |E − E_tight| ≲ 0.1 mHa compared to a very tightly converged calculation. The amplitudes predicted by MōLe result in a 40-50% reduction of CCSD solver iterations; the exact numbers are in Table 3.
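The benefit of a better starting point for an iterative solver can be illustrated with a toy fixed-point problem; this is only an analogy for warm-starting the CCSD iterations with predicted amplitudes, not the actual solver:

```python
import numpy as np

def fixed_point_iterations(x0, tol=1e-8, max_iter=200):
    """Iterate x <- cos(x) and return (solution, iteration count)."""
    x = x0
    for it in range(1, max_iter + 1):
        x_new = np.cos(x)
        if abs(x_new - x) < tol:
            return x_new, it
        x = x_new
    return x, max_iter

# crude starting point (loosely analogous to the default MP2 guess) ...
x_far, n_far = fixed_point_iterations(0.0)
# ... versus an informed one (loosely analogous to ML-predicted amplitudes)
x_near, n_near = fixed_point_iterations(0.739)

assert n_near < n_far  # a better guess converges in fewer cycles
```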
Notably, for the PubChem molecules, three out of one hundred systems failed to converge with the default MP2 guess, two of which did converge with our predicted amplitudes, demonstrating the high quality and practical utility of MōLe's predictions.

Table 3 The average number of CCSD cycles needed for convergence and the number of unconverged systems with the default MP2 guess vs. MōLe amplitudes. The convergence criterion is set to obtain an energy error of ≲ 0.1 mHa compared to a very tightly converged calculation.

Model               | Amino acids: Avg. cycles | Num. unconverged | PubChem: Avg. cycles | Num. unconverged
MP2 (default guess) | 7.61                     | 0                | 10.11                | 3
MōLe                | 3.83                     | 0                | 6.3                  | 1

4.6 Impact of model scaling

We investigate the performance of our model as a function of transformer depth and width. In particular, we scale the transformer depth from 1 to 4 layers and its width from 32 to 128 irreps. We train the models for one week on the QM7 training set. The models' T2 error on the QM7 test set is plotted in Figure 5. Our model's performance improves monotonically with increasing transformer size, suggesting that the bottleneck is mainly capacity rather than data efficiency, and that scaling could therefore improve results further.

4.7 Time complexity

CCSD has a formal time complexity of O(N⁶), while MōLe should have a theoretical time complexity of O(N⁵), as there are O(N⁴) amplitudes, each costing O(N) to compute. We validate these expectations empirically by timing increasingly large alkane systems and fitting a power law, confirming the expected scalings with fitted exponents of 5.9 for CCSD and 4.9 for MōLe. See Fig. 6 for a scaling plot. For the systems measured, MōLe is about 20× faster than CCSD, with both running on the same GPU.

Figure 6 Timings of CCSD as implemented by GPU4PySCF [67, 68] and of MōLe for increasingly large alkane systems, confirming the expected O(N⁶) and O(N⁵) scalings.
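Exponent fits of this kind amount to a linear regression in log-log space; a minimal sketch with synthetic timings (not the measured data):

```python
import numpy as np

# Synthetic wall-clock timings following t = c * N^k with k = 5, standing in
# for the alkane measurements; the scaling exponent is recovered as the slope
# of a straight-line fit in log-log space.
N = np.array([10.0, 20.0, 40.0, 80.0, 160.0])  # system sizes
t = 3e-9 * N**5                                # hypothetical timings (s)

slope, intercept = np.polyfit(np.log(N), np.log(t), 1)
print(f"fitted scaling exponent: {slope:.2f}")
```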
5 Conclusions and Outlook

In this work, we presented the first equivariant neural architecture to predict CC amplitudes from molecular-orbital inputs. We trained our model on QM7 at the CCSD/def2-SVP level of theory and evaluated it on a diverse set of datasets and tasks, ranging from energies to electron densities and cycle reductions. The model performs well on both the QM7 test set and several out-of-distribution splits, achieving significantly higher data efficiency than MLIPs. Since higher levels of CC theory also need only the T1 and T2 amplitudes to compute energies, our work lays the groundwork for learning from very high levels of theory (e.g., CCSDT) with very high data efficiency. Future work will include sparse amplitude prediction to escape the quartic scaling of the number of amplitudes, larger basis sets, and prediction of the lambda tensor for response properties.

Figure 5 Left: Scaling the transformer's depth monotonically decreases the prediction error up to four layers. Right: Scaling the transformer width also decreases the prediction error monotonically.

Acknowledgments

A.A. gratefully acknowledges King Abdullah University of Science and Technology (KAUST) for the KAUST Ibn Rushd Postdoctoral Fellowship. A.B. and L.T. acknowledge the AIST support to the Matter Lab for the project titled "SIP project - Quantum Computing". J.A.C.-G.-A. acknowledges funding of this project by the National Sciences and Engineering Research Council of Canada (NSERC) Alliance Grant #ALLRP587593-23 (Quantamole) and also acknowledges support from the Council for Science, Technology and Innovation (CSTI), Cross-ministerial Strategic Innovation Promotion Program (SIP), "Promoting the application of advanced quantum technology platforms to social issues" (Funding agency: QST). A.A.-G. thanks Anders G. Frøseth for his generous support. A.A.-G.
also acknowledges the generous support of Natural Resources Canada and the Canada 150 Research Chairs program. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute. This research is part of the University of Toronto's Acceleration Consortium, which receives funding from the CFREF-2022-00042 Canada First Research Excellence Fund. This research was enabled in part by support provided by SciNet and the Digital Research Alliance of Canada (alliancecan.ca).

References

[1] Walter Kohn and Lu Jeu Sham. Self-consistent equations including exchange and correlation effects. Physical Review, 140(4A):A1133, 1965.
[2] Attila Szabo and Neil S Ostlund. Modern Quantum Chemistry: Introduction to Advanced Electronic Structure Theory. Courier Corporation, 2012.
[3] Narbe Mardirossian and Martin Head-Gordon. Thirty years of density functional theory in computational chemistry: an overview and extensive assessment of 200 density functionals. Molecular Physics, 115(19):2315–2372, 2017.
[4] Haiyang Yu, Zhao Xu, Xiaofeng Qian, Xiaoning Qian, and Shuiwang Ji. Efficient and equivariant graph networks for predicting quantum hamiltonian. In International Conference on Machine Learning, pages 40412–40424. PMLR, 2023.
[5] Erpai Luo, Xinran Wei, Lin Huang, Yunyang Li, Han Yang, Zaishuo Xia, Zun Wang, Chang Liu, Bin Shao, and Jia Zhang. Efficient and scalable density functional theory hamiltonian prediction through adaptive sparsity. arXiv preprint arXiv:2502.01171, 2025.
[6] Seongsu Kim, Nayoung Kim, Dongwoo Kim, and Sungsoo Ahn. High-order equivariant flow matching for density functional theory hamiltonian prediction. arXiv preprint arXiv:2505.18817, 2025.
[7] Yunyang Li, Zaishuo Xia, Lin Huang, Xinran Wei, Han Yang, Sam Harshe, Zun Wang, Chang Liu, Jia Zhang, Bin Shao, et al.
Enhancing the scalability and applicability of kohn-sham hamiltonians for molecular systems. arXiv preprint arXiv:2502.19227, 2025.
[8] Xuecheng Shao, Lukas Paetow, Mark E Tuckerman, and Michele Pavanello. Machine learning electronic structure methods based on the one-electron reduced density matrix. Nature Communications, 14(1):6281, 2023.
[9] S Hazra, U Patil, and S Sanvito. Predicting the one-particle density matrix with machine learning. Journal of Chemical Theory and Computation, 20(11):4569–4578, 2024.
[10] Pol Febrer, Peter Bjørn Jørgensen, Miguel Pruneda, Alberto García, Pablo Ordejón, and Arghya Bhowmik. Graph2mat: universal graph to matrix conversion for electron density prediction. Machine Learning: Science and Technology, 6(2):025013, 2025.
[11] Jonas Elsborg, Luca Thiede, Alán Aspuru-Guzik, Tejs Vegge, and Arghya Bhowmik. Electra: A cartesian network for 3d charge density prediction with floating orbitals. arXiv preprint, 2025.
[12] Teddy Koker, Keegan Quigley, Eric Taw, Kevin Tibbetts, and Lin Li. Higher-order equivariant neural networks for charge density prediction in materials. npj Computational Materials, 10(1):161, 2024.
[13] Felix Brockherde, Leslie Vogt, Li Li, Mark E Tuckerman, Kieron Burke, and Klaus-Robert Müller. Bypassing the kohn-sham equations with machine learning. Nature Communications, 8(1):872, 2017.
[14] Bruno Focassio, Michelangelo Domina, Urvesh Patil, Adalberto Fazzio, and Stefano Sanvito. Linear jacobi-legendre expansion of the charge density for machine learning-accelerated electronic structure calculations. npj Computational Materials, 9(1):87, 2023.
[15] Peter Bjørn Jørgensen and Arghya Bhowmik. Equivariant graph neural networks for fast electron density estimation of molecules, liquids, and solids. npj Computational Materials, 8(1):183, 2022.
[16] Joshua A Rackers, Lucas Tecot, Mario Geiger, and Tess E Smidt.
A recipe for cracking the quantum scaling limit with machine learned electron densities. Machine Learning: Science and Technology, 4(1):015027, 2023.
[17] Andrea Grisafi, Alberto Fabrizio, Benjamin Meyer, David M Wilkins, Clemence Corminboeuf, and Michele Ceriotti. Transferable machine-learning model of the electron density. ACS Central Science, 5(1):57–64, 2018.
[18] Zhe Liu, Yuyan Ni, Zhichen Pu, Qiming Sun, Siyuan Liu, and Wen Yan. Towards a universally transferable acceleration method for density functional theory. arXiv preprint arXiv:2509.25724, 2025.
[19] Eduardo Soares, Dmitry Zubarev, Victor Yukio Shirasuna, Emilio Vital Brazil, Breno W. S. R. Carvalho, Brandi Ransom, Holt Bui, Krystelle Lionti, Caio Rodrigues Gama, and Daniel Djinishian de Briquez. A Foundation Model for Simulation-Grade Molecular Electron Densities. In AI for Accelerated Materials Design - ICLR 2025, April 2025. URL https://openreview.net/forum?id=9O4KmwYma0.
[20] Ardavan Mehdizadeh and Peter Schindler. Surface Stability Modeling with Universal Machine Learning Interatomic Potentials: A Comprehensive Cleavage Energy Benchmarking Study. AI for Science, 1(2):025002, December 2025. ISSN 3050-287X. doi:10.1088/3050-287X/ae1408.
[21] Mikołaj Martyka, Xin-Yu Tong, Joanna Jankowska, and Pavlo O. Dral. OMNI-P2x: A Universal Neural Network Potential for Excited-State Simulations, October 2025. URL https://chemrxiv.org/engage/chemrxiv/article-details/68fdbb0daec32c65683cdfb3.
[22] Yuxinxin Chen and Pavlo O. Dral. One to Rule Them All: A Universal Interatomic Potential Learning across Quantum Chemical Levels. Journal of Chemical Theory and Computation, 21(18):8762–8772, September 2025. ISSN 1549-9618, 1549-9626. doi:10.1021/acs.jctc.5c00858. URL https://pubs.acs.org/doi/10.1021/acs.jctc.5c00858.
[23] Roopshree Banchode, Surajit Das, Shampa Raghunathan, and Raghunathan Ramakrishnan.
Machine-Learned Potentials for Solvation Modeling, October 2025.
[24] Isaiah Shavitt and Rodney J Bartlett. Many-Body Methods in Chemistry and Physics: MBPT and Coupled-Cluster Theory. Cambridge University Press, 2009.
[25] J Paldus, J Čížek, and I Shavitt. Correlation problems in atomic and molecular systems. iv. extended coupled-pair many-electron theory and its application to the BH3 molecule. Physical Review A, 5(1):50, 1972.
[26] Janus J. Eriksen, Kasper Kristensen, Thomas Kjærgaard, Poul Jørgensen, and Jürgen Gauss. A Lagrangian framework for deriving triples and quadruples corrections to the CCSD energy. The Journal of Chemical Physics, 140(6):064108, February 2014. ISSN 0021-9606. doi:10.1063/1.4862501. URL https://doi.org/10.1063/1.4862501.
[27] Trygve Helgaker, Poul Jorgensen, and Jeppe Olsen. Molecular Electronic-Structure Theory. John Wiley & Sons, 2013.
[28] Justin S Smith, Roman Zubatyuk, Benjamin Nebgen, Nicholas Lubbers, Kipton Barros, Adrian E Roitberg, Olexandr Isayev, and Sergei Tretiak. The ani-1ccx and ani-1x data sets, coupled-cluster and density functional theory properties for molecules. Scientific Data, 7(1):134, 2020.
[29] Yuji Ikeda, Axel Forslund, Pranav Kumar, Yongliang Ou, Jong Hyun Jung, Andreas Köhn, and Blazej Grabowski. Machine-learning interatomic potentials achieving ccsd(t) accuracy for systems with extended covalent networks and van der waals interactions. arXiv preprint arXiv:2508.14306, 2025.
[30] Mitchell Messerly, Sakib Matin, Alice EA Allen, Benjamin Nebgen, Kipton Barros, Justin S Smith, Nicholas Lubbers, and Richard Messerly. Multi-fidelity learning for interatomic potentials: Low-level forces and high-level energies are all you need. arXiv preprint arXiv:2505.01590, 2025.
[31] Niamh O'Neill, Benjamin X Shi, William J Baldwin, William C Witt, Gábor Csányi, Julian D Gale, Angelos Michaelides, and Christoph Schran.
Towards routine condensed phase simulations with delta-learned coupled cluster accuracy: Application to liquid water. Journal of Chemical Theory and Computation, 21(22):11710–11720, 2025.
[32] Hao Tang, Brian Xiao, Wenhao He, Pero Subasic, Avetik R. Harutyunyan, Yao Wang, Fang Liu, Haowei Xu, and Ju Li. Multi-task learning for molecular electronic structure approaching coupled-cluster accuracy, June 2024. arXiv:2405.12229 [physics].
[33] Jacob Townsend and Konstantinos D Vogiatzis. Transferable mp2-based machine learning for accurate coupled-cluster energies. Journal of Chemical Theory and Computation, 16(12):7453–7461, 2020.
[34] Jacob Townsend and Konstantinos D Vogiatzis. Data-driven acceleration of the coupled-cluster singles and doubles iterative solver. The Journal of Physical Chemistry Letters, 10(14):4129–4135, 2019.
[35] Giulia Luise, Chin-Wei Huang, Thijs Vogels, Derk P. Kooi, Sebastian Ehlert, Stephanie Lanius, Klaas J. H. Giesbertz, Amir Karton, Deniz Gunceler, Megan Stanley, Wessel P. Bruinsma, Lin Huang, Xinran Wei, José Garrido Torres, Abylay Katbashev, Rodrigo Chavez Zavaleta, Bálint Máté, Sékou-Oumar Kaba, Roberto Sordillo, Yingrong Chen, David B. Williams-Young, Christopher M. Bishop, Jan Hermann, Rianne van den Berg, and Paola Gori-Giorgi. Accurate and scalable exchange-correlation with deep learning, June 2025. arXiv:2506.14665 [physics].
[36] Nicholas Gao, Eike Eberhard, and Stephan Günnemann. Learning equivariant non-local electron density functionals. arXiv preprint arXiv:2410.07972, 2024.
[37] Bikash Kanungo, Jeffrey Hatch, Paul M. Zimmerman, and Vikram Gavini. Learning local and semi-local density functionals from exact exchange-correlation potentials and energies. Science Advances, 11(38):eady8962, September 2025. doi:10.1126/sciadv.ady8962. URL https://www.science.org/doi/full/10.1126/sciadv.ady8962.
[38] Christian Venturella, Christopher Hillenbrand, Jiachen Li, and Tianyu Zhu. Machine Learning Many-Body Green's Functions for Molecular Excitation Spectra. Journal of Chemical Theory and Computation, 20(1):143–154, January 2024. ISSN 1549-9618. doi:10.1021/acs.jctc.3c01146. URL https://doi.org/10.1021/acs.jctc.3c01146.
[39] Christian Venturella, Jiachen Li, Christopher Hillenbrand, Ximena Leyva Peralta, Jessica Liu, and Tianyu Zhu. Unified deep learning framework for many-body quantum chemistry via Green's functions. Nature Computational Science, 5(6):502–513, June 2025. ISSN 2662-8457. doi:10.1038/s43588-025-00810-z. URL https://www.nature.com/articles/s43588-025-00810-z.
[40] Honghui Shang, Chu Guo, Yangjun Wu, Zhenyu Li, and Jinlong Yang. Solving the many-electron Schrödinger equation with a transformer-based framework. Nature Communications, 16(1):8464, September 2025. ISSN 2041-1723. doi:10.1038/s41467-025-63219-2. URL https://www.nature.com/articles/s41467-025-63219-2.
[41] Daniel S. King, Daniel Grzenda, Ray Zhu, Nathaniel Hudson, Ian Foster, Bingqing Cheng, and Laura Gagliardi. Cartesian equivariant representations for learning and understanding molecular orbitals. Proceedings of the National Academy of Sciences, 122(48):e2510235122, December 2025. ISSN 0027-8424, 1091-6490. doi:10.1073/pnas.2510235122. URL https://pnas.org/doi/10.1073/pnas.2510235122.
[42] Nicholas Gao and Stephan Günnemann. Generalizing Neural Wave Functions. In Proceedings of the 40th International Conference on Machine Learning, pages 10708–10726. PMLR, July 2023. URL https://proceedings.mlr.press/v202/gao23c.html. ISSN: 2640-3498.
[43] Michael Scherbela, Leon Gerard, and Philipp Grohs.
Towards a transferable fermionic neural wavefunction for molecules. Nature Communications, 15(1):120, January 2024. ISSN 2041-1723. doi:10.1038/s41467-023-44216-9. URL https://www.nature.com/articles/s41467-023-44216-9.
[44] M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Physical Review Letters, 108:058301, 2012.
[45] L. C. Blum and J.-L. Reymond. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J. Am. Chem. Soc., 131:8732, 2009.
[46] Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole Von Lilienfeld. Big data meets quantum chemistry approximations: the δ-machine learning approach. Journal of Chemical Theory and Computation, 11(5):2087–2096, 2015.
[47] Clemens Carel Johannes Roothaan. New developments in molecular orbital theory. Reviews of Modern Physics, 23(2):69, 1951.
[48] Chr Møller and Milton S Plesset. Note on an approximation treatment for many-electron systems. Physical Review, 46(7):618, 1934.
[49] George D Purvis III and Rodney J Bartlett. A full coupled-cluster singles and doubles model: The inclusion of disconnected triples. The Journal of Chemical Physics, 76(4):1910–1918, 1982.
[50] Tatiana Korona and Bogumil Jeziorski. One-electron properties and electrostatic interaction energies from the expectation value expression and wave function of singles and doubles coupled cluster theory. The Journal of Chemical Physics, 125(18):184109, November 2006. ISSN 0021-9606. doi:10.1063/1.2364489. URL https://doi.org/10.1063/1.2364489.
[51] Ke Chen, Christian Kunkel, Bingqing Cheng, Karsten Reuter, and Johannes T. Margraf. Physics-inspired machine learning of localized intensive properties. Chemical Science, 14(18):4913–4922, May 2023. ISSN 2041-6539. doi:10.1039/D3SC00841J.
URL https://pubs.rsc.org/en/content/articlelanding/2023/sc/d3sc00841j.
[52] Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds. arXiv preprint, February 2018.
[53] Mario Geiger and Tess Smidt. e3nn: Euclidean Neural Networks, July 2022. arXiv:2207.09453 [cs].
[54] Johann Brehmer, Sönke Behrends, Pim de Haan, and Taco Cohen. Does equivariance matter at scale?, July 2025. arXiv:2410.23179 [cs].
[55] Khang Ngo and Siamak Ravanbakhsh. Scaling Laws and Symmetry, Evidence from Neural Force Fields, October 2025. arXiv:2510.09768 [cs].
[56] Divya Suman, Jigyasa Nigam, Sandra Saade, Paolo Pegolo, Hanna Tuerk, Xing Zhang, Garnet Kin-Lic Chan, and Michele Ceriotti. Exploring the design space of machine-learning models for quantum chemistry with a fully differentiable framework. Journal of Chemical Theory and Computation, 21(13):6505–6516, July 2025. ISSN 1549-9618, 1549-9626. doi:10.1021/acs.jctc.5c00522.
[57] Ilyes Batatia, David P. Kovacs, Gregor Simm, Christoph Ortner, and Gabor Csanyi. MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields. Advances in Neural Information Processing Systems, 35:11423–11436, December 2022.
[58] Zhen Qin, XiaoDong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes, and Yiran Zhong. The Devil in Linear Transformer. arXiv preprint, October 2022.
[59] Yi-Lun Liao, Brandon Wood, Abhishek Das, and Tess Smidt. Equiformerv2: Improved equivariant transformer for scaling to higher-degree representations. arXiv preprint arXiv:2306.12059, 2023.
[60] Xiang Fu, Brandon M Wood, Luis Barroso-Luque, Daniel S Levine, Meng Gao, Misko Dzamba, and C Lawrence Zitnick.
Learning smooth and expressive interatomic potentials for physical property prediction. arXiv preprint arXiv:2502.12147, 2025.
[61] Florian Weigend and Reinhart Ahlrichs. Balanced basis sets of split valence, triple zeta valence and quadruple zeta valence quality for h to rn: Design and assessment of accuracy. Physical Chemistry Chemical Physics, 7(18):3297–3305, 2005.
[62] Axel D Becke and Kenneth E Edgecombe. A simple measure of electron localization in atomic and molecular systems. The Journal of Chemical Physics, 92(9):5397–5403, 1990.
[63] John Faver and Kenneth M Merz Jr. Utility of the hard/soft acid-base principle via the fukui function in biological systems. Journal of Chemical Theory and Computation, 6(2):548–559, 2010.
[64] Nur Zalin Khaleda Razali, Wan Nur Shakirah Wan Hassan, Sheikh Ahmad Izaddin Sheikh Mohd Ghazali, Siti Noriah Mohd Shotor, and Nur Nadia Dzulkifli. Dft, fukui indices, and molecular dynamic simulation studies on corrosion inhibition characteristics: a review. Chemical Papers, 78(2):715–731, 2024.
[65] RFW Bader. Atoms in Molecules: A Quantum Theory, 1990.
[66] Gustavo E Scuseria, Timothy J Lee, and Henry F Schaefer III. Accelerating the convergence of the coupled-cluster approach: The use of the diis method. Chemical Physics Letters, 130(3):236–239, 1986.
[67] Rui Li, Qiming Sun, Xing Zhang, and Garnet Kin-Lic Chan. Introducing gpu-acceleration into the python-based simulations of chemistry framework, 2024.
[68] Xiaojie Wu, Qiming Sun, Zhichen Pu, Tianze Zheng, Wenzhi Ma, Wen Yan, Xia Yu, Zhengxiao Wu, Mian Huo, Xiang Li, Weiluo Ren, Sheng Gong, Yumin Zhang, and Weihao Gao. Enhancing gpu-acceleration in the python-based simulations of chemistry framework, 2024.
[69] Gerald Knizia. Intrinsic atomic orbitals: An unbiased bridge between quantum theory and chemical concepts. Journal of Chemical Theory and Computation, 9(11):4834–4843, 2013.
[70] János Pipek and Paul G Mezey. A fast intrinsic localization procedure applicable for ab initio and semiempirical linear combination of atomic orbital wave functions. The Journal of Chemical Physics, 90(9):4916–4926, 1989.
[71] Joseph E Subotnik, Anthony D Dutoi, and Martin Head-Gordon. Fast localized orthonormal virtual orbitals which depend smoothly on nuclear coordinates. The Journal of Chemical Physics, 123(11), 2005.
[72] Zhenling Wang, Abdulrahman Aldossary, and Martin Head-Gordon. Sparsity of the electron repulsion integral tensor using different localized virtual orbital representations in local second-order møller–plesset theory. The Journal of Chemical Physics, 158(6), 2023.
[73] Elvira R Sayfutyarova, Qiming Sun, Garnet Kin-Lic Chan, and Gerald Knizia. Automated construction of molecular active spaces from atomic valence orbitals. Journal of Chemical Theory and Computation, 13(9):4063–4078, 2017.

Appendix

Table of Contents

A Equivariant graph neural networks
  A.1 Graph Construction
B Extended background on Correlated Methods
  B.1 Hartree-Fock
  B.2 Integrals in quantum chemistry
  B.3 Coupled Cluster
    B.3.1 T1-transformed integrals
    B.3.2 Solving the CCSD equations as a root-finding problem
    B.3.3 Quasi-Newton solution algorithm
C Density matrix calculation
D Molecular orbital localization
E Architecture details
  E.1 Padding
  E.2 Layer Normalization
F Visualization of results
  F.1 Amplitude visualization
G CCSD solver iteration numbers
H Hyperparameters

A Equivariant graph neural networks

SO(3)-equivariant graph neural networks (GNNs) currently represent the most successful class of machine-learning models for interatomic potentials. In these approaches, a molecule is represented as a graph G = (V, E) embedded in three-dimensional space R³, where nodes I ∈ V correspond to atoms located at positions r_I ∈ R³, and edges (A, B) ∈ E are defined by a distance cutoff. At layer t, each node carries a feature vector x_A^{(t)} made up of a direct sum of irreducible representations (irreps) of SO(3),

\[
x_A^{(t)} = \bigoplus_{\ell=0}^{\ell_{\max}} x_{A,\ell}^{(t)}, \qquad x_{A,\ell}^{(t)} = \{ x_{A,\ell,m}^{(t)} \}_{m=-\ell}^{\ell}, \tag{23}
\]

where each order-ℓ array has 2ℓ + 1 components. Feature updates are performed by exchanging messages along edges, aggregating them in a permutation-invariant manner (typically by summation or attention-weighted summation), and combining the result with the target node's current features. In SO(3)-equivariant GNNs, node features transform covariantly under a global rotation Q according to the Wigner D^ℓ(Q)-matrices,

\[
x_{A,\ell,m}^{(t)}\!\left( D^1(Q)\{ r_k \}_{k=1}^{N} \right) = \sum_{m'=-\ell}^{\ell} D^{\ell}_{mm'}(Q) \, x_{A,\ell,m'}^{(t)}\!\left( \{ r_k \}_{k=1}^{N} \right). \tag{24}
\]

Translation equivariance is ensured by constructing messages only from relative displacements r_{A,B} = r_B − r_A. If full O(3) equivariance is desired, each order-ℓ feature additionally carries a parity label specifying its behavior under spatial inversion.
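The covariance condition in Eq. (24) can be checked numerically for the simplest case ℓ = 1, where (in the Cartesian basis) the Wigner matrix D¹(Q) is the rotation matrix Q itself. The normalized displacement below is a stand-in ℓ = 1 feature for illustration, not the actual network:

```python
import numpy as np

rng = np.random.default_rng(0)

def l1_feature(r):
    # simplest equivariant l = 1 "feature": the normalized displacement
    return r / np.linalg.norm(r)

# random proper rotation from a QR decomposition
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
Q *= np.linalg.det(Q)  # flip the sign if needed so that det(Q) = +1

r = rng.normal(size=3)
lhs = l1_feature(Q @ r)   # feature computed from the rotated input
rhs = Q @ l1_feature(r)   # Wigner-rotated feature of the original input
assert np.allclose(lhs, rhs)
```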
Couplings between irreps p_{ℓ₁} and g_{ℓ₂} are formed via the Clebsch-Gordan tensor product [52],

\[
x_{\ell_3, m_3} = \left( p_{\ell_1} \otimes^{\ell_3}_{\ell_1,\ell_2} g_{\ell_2} \right)_{m_3} = \sum_{m_1=-\ell_1}^{\ell_1} \sum_{m_2=-\ell_2}^{\ell_2} C^{(\ell_3,m_3)}_{(\ell_1,m_1),(\ell_2,m_2)} \, p_{\ell_1,m_1} \, g_{\ell_2,m_2}. \tag{25}
\]

We use this to construct messages sent from a source atom B to a target atom A as

\[
v_{A,B,\ell_3} = v_{\ell_3}(x_A, x_B, r_{A,B}) = \sum_{\ell_1,\ell_2} w_{\ell_1,\ell_2,\ell_3}(\| r_{A,B} \|) \left( f_{\ell_1}(x_A, x_B) \otimes^{\ell_3}_{\ell_1,\ell_2} Y_{\ell_2}\!\left( \frac{r_{A,B}}{\| r_{A,B} \|} \right) \right), \tag{26}
\]

where Y_{ℓ₂} denotes the (2ℓ₂ + 1)-dimensional vector of spherical harmonics of degree ℓ₂, and w_{ℓ₁,ℓ₂,ℓ₃} : R → R is a learned radial weighting function, usually implemented as a neural network. The function f_{ℓ₁} is a simple map combining node features (which reduces to f_{ℓ₁}(x_A, x_B) = x_{B,ℓ₁} in MACE).

A.1 Graph Construction

Node features: For each molecular orbital p, an independent molecular graph is constructed in which each node represents an atom. The node features for atom A in the graph corresponding to orbital p are simply the MO embeddings x_{pA}^{(0)} = { x_{pA,kℓm}^{(0)} }.

Node attributes: Each node stores information about its atomic species as attributes. The atomic species of atom A is encoded as a one-hot vector z_A, where each component corresponds to a different atomic element. This atomic-type information is essential for the model to distinguish between elements and their chemical properties. In the context of our model, it is important to understand the basis set.

Radial weighting function: The edge features are computed using a radial embedding block that creates rich distance-dependent representations. Let r_A be the position of atom A, and d_{AB} = ||r_B − r_A|| the interatomic distance between atoms A and B.
The radial embedding uses Bessel basis functions to encode the interatomic distances:

\[
e_{AB} = \mathrm{RadialEmbedding}(d_{AB}), \tag{27}
\]

where the radial embedding is computed as

\[
\mathrm{RadialEmbedding}(r) = f_{\mathrm{cutoff}}(r) \cdot \begin{pmatrix} \phi_1(r) \\ \phi_2(r) \\ \vdots \\ \phi_{N_{\mathrm{basis}}}(r) \end{pmatrix}, \tag{28}
\]

with Bessel basis functions

\[
\phi_n(r) = \sqrt{\frac{2}{r_{\max}}} \cdot \frac{\sin(n \pi r / r_{\max})}{r}, \tag{29}
\]

and the polynomial cutoff function

\[
f_{\mathrm{cutoff}}(r) =
\begin{cases}
1 - \frac{(p+1)(p+2)}{2} \left( \frac{r}{r_{\max}} \right)^{p} + p(p+2) \left( \frac{r}{r_{\max}} \right)^{p+1} - \frac{p(p+1)}{2} \left( \frac{r}{r_{\max}} \right)^{p+2} & \text{if } r < r_{\max}, \\
0 & \text{if } r \geq r_{\max}.
\end{cases} \tag{30}
\]

The direction vector v_{AB} is used separately in the MACE layer for computing spherical harmonics and directional features that are essential for equivariant message passing.

B Extended background on Correlated Methods

B.1 Hartree-Fock

In the Hartree-Fock method, we assume that the many-electron wavefunction Φ can be represented by a simple product of one-electron wavefunctions, usually referred to as orbitals. Each orbital then contains exactly one electron and may extend over the whole molecule. However, because electrons are fermions, a simple product is not quite sufficient: when exchanging two electrons, the wavefunction should change sign. For that reason, the orbitals are arranged in a so-called Slater determinant, which ensures the correct sign behavior when exchanging particles. Physically, the Hartree-Fock approximation corresponds to treating each electron in the mean field generated by all the other electrons and nuclei, thereby neglecting the instantaneous electron-electron interactions that result in correlations between their positions. The Hamiltonian is an operator that describes the energetic contributions of all particles inside a molecule, including particle-particle interactions and their kinetic energy.
It can be written as

\[
\hat{H} = \hat{h} + \frac{1}{2} \sum_{i \neq j} \hat{r}_{ij}^{-1}, \tag{31}
\]

where the core Hamiltonian ĥ contains the kinetic energy of all electrons and the attraction between the fixed nuclei and all electrons, while r̂_{ij}⁻¹ is the Coulomb repulsion between electrons, written in atomic units. To obtain the total energy of the system, we take the integral of the many-electron wavefunction Φ with the Hamiltonian over all electrons and all space, which can be informally written (for a real wavefunction) as

\[
E = \int \Phi(x) \, \hat{H} \, \Phi(x) \, dx, \tag{32}
\]

where x denotes the coordinates of all the electrons, and the dependence on the nuclear positions R_A is implied for simplicity. To further simplify the notation, quantum chemists and physicists denote such integrals as an expectation value:

\[
E = \langle \Phi | \hat{H} | \Phi \rangle. \tag{33}
\]

We can then replace Φ by the expression assumed in the Hartree-Fock approximation to obtain the Hartree-Fock energy

\[
E_{\mathrm{HF}}[\{\psi_i\}] = \sum_i h_{ii} + \frac{1}{2} \sum_{ij} \langle ij \| ij \rangle, \tag{34}
\]

where h_{ii} = ⟨ψ_i|ĥ|ψ_i⟩ is a one-electron integral with the core Hamiltonian ĥ, and

\[
\langle ij \| ij \rangle = \langle \psi_i \psi_j | \hat{r}_{12}^{-1} | \psi_i \psi_j \rangle - \langle \psi_i \psi_j | \hat{r}_{12}^{-1} | \psi_j \psi_i \rangle \tag{35}
\]

is an antisymmetrized two-electron integral with the interelectronic Coulomb repulsion operator. We expand on the meaning of these integrals in the next section and focus on the Hartree-Fock method here. Each orbital ψ_p depends on the coefficients C as described in Equation 1.1. In the Hartree-Fock method, we aim to obtain the best possible orbitals by minimizing the Hartree-Fock energy as a function of the coefficients C. This minimization yields the canonical Hartree-Fock equations in Equation 1, which we repeat here for convenience:

\[
F(C)\, C = \varepsilon\, C. \tag{36}
\]

The Fock matrix F contains one- and two-electron integrals similar to the ones that appear in the energy and therefore depends on the shape of the orbitals and the coefficients C.
Because of this dependence, the Hartree-Fock equations have to be solved iteratively, successively diagonalizing F to obtain updated C until the above equation is obeyed, at which point the columns of C are eigenvectors of F(C) and the final Hartree-Fock energy can be calculated.

B.2 Integrals in quantum chemistry

As introduced above, integrals in quantum chemistry can be divided into two categories: one-electron integrals, generally denoted by

\[
h_{ij} = \langle \psi_i | \hat{h} | \psi_j \rangle, \tag{37}
\]

and two-electron integrals, denoted by

\[
\langle ij | ab \rangle = \langle \psi_i \psi_j | r_{12}^{-1} | \psi_a \psi_b \rangle, \tag{38}
\]

where, in general, the indices i, j, a, b can be different. Both of these integrals stem from integrals of many-electron wavefunctions over the Hamiltonian, of the form

\[
\int \Phi(x)\,\hat{H}\,\Phi'(x)\,dx = \langle \Phi | \hat{H} | \Phi' \rangle, \tag{39}
\]

as introduced in the previous section, but where now Φ and Φ′ are not necessarily the same. One-electron integrals only integrate over the coordinates of a single electron and involve one orbital on each side of the operator \(\hat{h}\). They can be written more explicitly as

\[
h_{ij} = \int \psi_i(r \,|\, \{R_A\})\,\hat{h}\,\psi_j(r \,|\, \{R_A\})\,dr \tag{40}
\]

In our case, each orbital is a combination of radial basis functions and spherical harmonics (see Equation 1.1), and the integral can be evaluated by standard quantum chemistry methods. Two-electron integrals integrate over the coordinates of two electrons and involve two orbitals on each side of the operator \(r_{12}^{-1}\). Often, the integral ⟨ij|ab⟩ appears paired with the integral −⟨ij|ba⟩, so for convenience the following antisymmetrized integral is introduced:

\[
\langle ij \| ab \rangle = \langle ij | ab \rangle - \langle ij | ba \rangle \tag{41}
\]

Once again, a two-electron integral can be written more explicitly as

\[
\langle ij | ab \rangle = \int \psi_i(r_1 \,|\, \{R_A\})\,\psi_j(r_2 \,|\, \{R_A\})\,\frac{1}{|r_1 - r_2|}\,\psi_a(r_1 \,|\, \{R_A\})\,\psi_b(r_2 \,|\, \{R_A\})\,dr_1\,dr_2, \tag{42}
\]

where we expanded the operator \(r_{12}^{-1}\) explicitly.
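As a concrete illustration of Eq. (41), the antisymmetrized integral is a simple index permutation of the two-electron integral tensor. The sketch below uses a random placeholder tensor standing in for real integrals, which would in practice come from a quantum chemistry package:

```python
import numpy as np

def antisymmetrize(V):
    """Form <ij||ab> = <ij|ab> - <ij|ba> from a physicist-notation
    two-electron integral tensor V[i, j, a, b]. V here is a hypothetical
    placeholder, not real molecular integrals."""
    return V - V.transpose(0, 1, 3, 2)

rng = np.random.default_rng(0)
V = rng.standard_normal((3, 3, 3, 3))  # placeholder <ij|ab>
W = antisymmetrize(V)

# Sanity checks: <ij||aa> vanishes, and W is antisymmetric in (a, b).
assert np.allclose(W[:, :, 0, 0], 0.0)
assert np.allclose(W, -W.transpose(0, 1, 3, 2))
```

The antisymmetry in the last two indices is exactly the pairing of ⟨ij|ab⟩ with −⟨ij|ba⟩ described above.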
Once again, all orbitals are composed of radial basis functions and spherical harmonics (see Equation 1.1), and the integrals are evaluated using standard quantum chemistry methods.

B.3 Coupled Cluster

The coupled cluster method [2, 49] expresses the exact electronic wavefunction as an exponential excitation of a reference Hartree-Fock wavefunction:

\[
|\Psi_{\mathrm{CC}}\rangle = e^{\hat{T}} |\Phi_{\mathrm{HF}}\rangle, \tag{43}
\]

where the cluster operator \(\hat{T} = \hat{T}_1 + \hat{T}_2 + \hat{T}_3 + \cdots\) generates single, double, triple, and higher excitations out of the reference state. Unlike configuration interaction (CI), which uses a linear expansion, this exponential form ensures size extensivity (the total energy scales correctly with the number of noninteracting subsystems) and includes higher-order excitation effects implicitly through products of lower-order terms (e.g., \(\tfrac{1}{2}\hat{T}_1^2\)). The coupled cluster equations are obtained by projecting the similarity-transformed Schrödinger equation,

\[
\bar{H} |\Phi_{\mathrm{HF}}\rangle = E_{\mathrm{CC}} |\Phi_{\mathrm{HF}}\rangle, \quad \text{with} \quad \bar{H} = e^{-\hat{T}} \hat{H} e^{\hat{T}}, \tag{44}
\]

onto the reference and excited determinants. Truncating \(\hat{T}\) at a given excitation rank defines a hierarchy of systematically improvable methods: coupled cluster with singles and doubles (CCSD) includes single and double excitations, CCSDT adds triples, CCSDTQ adds quadruples, and so on. CCSD is one of the most widely used correlated electronic-structure methods. In CCSD, the correlated wavefunction is written as

\[
|\mathrm{CCSD}\rangle = e^{\hat{T}_1 + \hat{T}_2} |\Phi_{\mathrm{HF}}\rangle, \tag{45}
\]

where

\[
\hat{T}_1 = \sum_{ia} t_i^a\, \hat{a}_a^\dagger \hat{a}_i, \tag{46}
\]

\[
\hat{T}_2 = \frac{1}{4} \sum_{ijab} t_{ij}^{ab}\, \hat{a}_a^\dagger \hat{a}_b^\dagger \hat{a}_j \hat{a}_i, \tag{47}
\]

\(|\Phi_{\mathrm{HF}}\rangle\) is the Hartree-Fock reference determinant, and \(\hat{T}_1\) and \(\hat{T}_2\) are the single- and double-excitation cluster operators. The unknown parameters of the method are the excitation amplitudes \(\{t_i^a, t_{ij}^{ab}\}\), which specify the coefficients of these operators.
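To make the bookkeeping concrete, here is a small sketch of the amplitude tensors together with a common MP2-style starting guess built from orbital-energy denominators. All orbital energies and integrals below are random placeholders, not real quantities:

```python
import numpy as np

# Toy system with nocc occupied and nvirt virtual orbitals.
nocc, nvirt = 2, 3
rng = np.random.default_rng(0)
eps_occ = np.sort(rng.uniform(-2.0, -0.5, nocc))   # placeholder occupied energies
eps_virt = np.sort(rng.uniform(0.5, 2.0, nvirt))   # placeholder virtual energies
eri = rng.standard_normal((nocc, nocc, nvirt, nvirt))  # placeholder <ij|ab>

# Singles t_i^a typically start at zero; doubles t_ij^ab are often
# initialized MP2-style as <ij|ab> / (e_i + e_j - e_a - e_b).
t1 = np.zeros((nocc, nvirt))
denom = (eps_occ[:, None, None, None] + eps_occ[None, :, None, None]
         - eps_virt[None, None, :, None] - eps_virt[None, None, None, :])
t2 = eri / denom

assert t1.shape == (nocc, nvirt)
assert t2.shape == (nocc, nocc, nvirt, nvirt)
```

With occupied energies below the virtual ones, every denominator is negative, so the guess is well defined.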
The amplitudes are determined by requiring that the projected Schrödinger equation is satisfied within the space of singly and doubly excited determinants. This leads to the nonlinear system of CCSD equations:

\[
E_{\mathrm{CCSD}} = \langle \Phi_{\mathrm{HF}} | \hat{H} | \mathrm{CCSD} \rangle, \tag{48}
\]

\[
0 = \langle \Phi_i^a | e^{-(\hat{T}_1 + \hat{T}_2)} \hat{H} | \mathrm{CCSD} \rangle, \tag{49}
\]

\[
0 = \langle \Phi_{ij}^{ab} | e^{-(\hat{T}_1 + \hat{T}_2)} \hat{H} | \mathrm{CCSD} \rangle, \tag{50}
\]

where \(|\Phi_i^a\rangle = \hat{a}_a^\dagger \hat{a}_i |\Phi_{\mathrm{HF}}\rangle\) denotes a singly excited determinant and \(|\Phi_{ij}^{ab}\rangle = \hat{a}_a^\dagger \hat{a}_b^\dagger \hat{a}_i \hat{a}_j |\Phi_{\mathrm{HF}}\rangle\) denotes a doubly excited determinant. Once a set of amplitudes is available, the CCSD energy can be evaluated directly as

\[
E_{\mathrm{CCSD}} = E_{\mathrm{HF}} + \sum_{ijab} \left( t_{ij}^{ab} + t_i^a t_j^b \right) \left( 2\langle ij|ab\rangle - \langle ij|ba\rangle \right). \tag{51}
\]

This expression highlights a central feature of CCSD: the amplitudes serve as the effective parameters of a highly structured, physics-informed model that maps molecular orbital integrals to correlated energies.

B.3.1 T1-transformed integrals

Evaluation of the projected equations in Eqs. (48)-(50) requires a large number of tensor contractions. To simplify the notation and reduce computational cost, it is common to introduce the so-called T1-transformed one- and two-electron integrals,

\[
\tilde{h}_{pq} = \sum_{rs} (\delta_{pr} - t_r^p)(\delta_{qs} + t_q^s)\, h_{rs}, \tag{52}
\]

\[
\widetilde{\langle pr|qs\rangle} = \sum_{tu} \sum_{mn} (\delta_{pt} - t_t^p)(\delta_{qu} + t_q^u)(\delta_{rm} - t_m^r)(\delta_{sn} + t_s^n)\, \langle tm|un\rangle. \tag{53}
\]

These transformed integrals incorporate the effect of the single-excitation amplitudes and allow the CCSD equations to be expressed in a compact form. Using this notation, the residual equations for the single and double amplitudes can be written as

\[
0 = \sum_{ckd} \left( 2 t_{ki}^{cd} - t_{ik}^{cd} \right) \widetilde{\langle ak|dc\rangle}
- \sum_{ckl} \left( 2 t_{kl}^{ac} - t_{lk}^{ac} \right) \widetilde{\langle kl|ic\rangle}
+ \sum_{ck} \left( 2 t_{ik}^{ac} - t_{ki}^{ac} \right) \left[ \tilde{h}_{kc} + \sum_l \left( 2\widetilde{\langle kl|cl\rangle} - \widetilde{\langle kl|lc\rangle} \right) \right]
+ \tilde{h}_{ai} + \sum_j \left( 2\widetilde{\langle aj|ij\rangle} - \widetilde{\langle aj|ji\rangle} \right), \tag{54}
\]

and

\[
\begin{aligned}
0 ={}& \widetilde{\langle ab|ij\rangle} + \sum_{cd} t_{ij}^{cd}\, \widetilde{\langle ab|cd\rangle} + \sum_{kl} t_{kl}^{ab} \left( \widetilde{\langle kl|ij\rangle} + \sum_{cd} t_{ij}^{cd}\, \widetilde{\langle kl|cd\rangle} \right) \\
&+ P_{ij}^{ab} \Bigg( - \sum_{ck} \left[ \frac{1}{2} t_{kj}^{bc} \left( \widetilde{\langle ka|ic\rangle} - \frac{1}{2} \sum_{dl} t_{li}^{ad}\, \widetilde{\langle kl|dc\rangle} \right) + t_{ki}^{bc} \left( \widetilde{\langle ka|jc\rangle} - \frac{1}{2} \sum_{dl} t_{lj}^{ad}\, \widetilde{\langle kl|dc\rangle} \right) \right] \\
&\quad + \frac{1}{2} \sum_{ck} \left( 2 t_{jk}^{bc} - t_{kj}^{bc} \right) \left[ 2\widetilde{\langle ak|ic\rangle} - \widetilde{\langle ak|ci\rangle} + \frac{1}{2} \sum_{dl} \left( 2 t_{il}^{ad} - t_{li}^{ad} \right) \left( 2\widetilde{\langle lk|dc\rangle} - \widetilde{\langle lk|cd\rangle} \right) \right] \\
&\quad + \sum_c t_{ij}^{ac} \left( \left[ \tilde{h}_{bc} + \sum_k \left( 2\widetilde{\langle bk|ck\rangle} - \widetilde{\langle bk|kc\rangle} \right) \right] - \sum_{dkl} 2 t_{kl}^{bd}\, \widetilde{\langle lk|dc\rangle} \right) \\
&\quad - \sum_k t_{ik}^{ab} \left[ \tilde{h}_{kj} + \sum_l \left( 2\widetilde{\langle kl|jl\rangle} - \widetilde{\langle kl|lj\rangle} \right) + \sum_{cdl} \left( 2 t_{lj}^{cd} - t_{jl}^{cd} \right) \widetilde{\langle kl|dc\rangle} \right] \Bigg), \tag{55}
\end{aligned}
\]

where \(P_{ij}^{ab}\) is the permutation operator that enforces the required antisymmetry, \(P_{ij}^{ab} A_{ij}^{ab} = A_{ij}^{ab} + A_{ji}^{ba}\). These expressions define the nonlinear mapping from amplitudes to residuals that must be driven to zero.

B.3.2 Solving the CCSD equations as a root-finding problem

From a numerical perspective, CCSD can be interpreted as solving a high-dimensional nonlinear system

\[
\Omega(t) = 0, \tag{56}
\]

where t is the vector containing all \(t_i^a\) and \(t_{ij}^{ab}\) amplitudes and Ω(t) denotes the collection of residual expressions given above. Standard quantum-chemistry implementations solve this system iteratively using a quasi-Newton procedure. At iteration n, the amplitudes are updated according to

\[
t^{(n+1)} = t^{(n)} - \varepsilon^{-1}\, \Omega(t^{(n)}), \tag{57}
\]

where ε is a diagonal approximation to the Jacobian matrix of the residual function. In practice, ε is constructed from orbital-energy differences and serves as a computationally inexpensive preconditioner.

B.3.3 Quasi-Newton solution algorithm

The CCSD solver can be viewed as a specialized fixed-point iteration with a physics-motivated preconditioner. The procedure can be summarized algorithmically as follows.

Algorithm 1 Quasi-Newton solution of the CCSD equations
1: Input: Hartree-Fock orbitals and integrals
2: Initialize amplitudes t^(0) (typically zeros or MP2 estimates)
3: Construct diagonal preconditioner ε^{-1} from orbital energies
4: for n = 0, 1, 2, . . .
until convergence do
5:   Evaluate residual vector Ω^(n) = Ω(t^(n))
6:   if ||Ω^(n)|| < τ then
7:     break    ▷ converged
8:   end if
9:   Update amplitudes: t^(n+1) = t^(n) − ε^{-1} Ω^(n)
10: end for
11: Output: Converged amplitudes t*

C Density matrix calculation

We can calculate any one-particle observable given the 1-RDM γ. However, calculating γ at the CCSD level of theory requires the Λ tensor, another T2-like object. A common workaround for testing purposes is expectation-value CCSD (XCCSD) [50]. In XCCSD(2), the 1-RDM is assembled without the Λ tensor from the occupied-occupied block:

\[
\gamma_{ij} = \delta_{ij} - \sum_a t_i^{a*} t_j^{a} - \frac{1}{2} \sum_{a,b,k} t_{ik}^{ab*} t_{jk}^{ab}, \tag{58}
\]

the virtual-virtual block:

\[
\gamma_{ab} = \sum_i t_i^{a*} t_i^{b} + \frac{1}{2} \sum_{i,j,c} t_{ij}^{ac*} t_{ij}^{bc}, \tag{59}
\]

and the occupied-virtual block:

\[
\gamma_{ia} = t_i^{a*} + \sum_{j,b} t_{ij}^{ab*} t_j^{b} \tag{60}
\]

D Molecular orbital localization

Canonical molecular orbitals (MOs), as defined in Eq. (1), are generally delocalized over the entire molecule. We can transform the occupied and virtual MOs by unitary rotations that mix orbitals only within their respective subspaces:

\[
\tilde{\psi}_i = \sum_{j \in \mathrm{occ}} (U_{\mathrm{occ}})_{ij}\, \psi_j, \qquad U_{\mathrm{occ}}^\dagger U_{\mathrm{occ}} = I, \tag{61}
\]

\[
\tilde{\psi}_a = \sum_{b \in \mathrm{virt}} (U_{\mathrm{virt}})_{ab}\, \psi_b, \qquad U_{\mathrm{virt}}^\dagger U_{\mathrm{virt}} = I. \tag{62}
\]

We start by localizing the occupied orbitals with intrinsic bond orbital (IBO) localization [69]. The IBO localization can be understood as a Pipek-Mezey (PM) localization [70], but with intrinsic atomic orbital (IAO) populations. Both PM and IBO localization optimize:

\[
L = \sum_{A \in \mathrm{atoms}} \sum_{i \in \mathrm{occ}} \left( N_A(i) \right)^m, \tag{63}
\]

where N_A(i) is the population of the i-th orbital on the A-th atom, and m is a power typically chosen to be 2 or 4. The population N_A(i) can be written as:

\[
N_A(i) = \sum_{\mu\nu \,\in\, \text{basis centered on } A} C_{i\mu} C_{i\nu}, \tag{64}
\]

where the labeling of the basis functions is the key difference between PM and IBO.
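The population-based objective of Eqs. (63)-(64) can be sketched directly. The coefficient matrix and atom assignment below are illustrative only, and the expression follows the simplified form in the text, which omits the atomic-orbital overlap matrix that appears in general PM formulations:

```python
import numpy as np

def localization_objective(C, atom_of_basis, m=4):
    """L = sum_A sum_i N_A(i)^m with N_A(i) = sum_{mu,nu on A} C[i,mu]*C[i,nu],
    following Eqs. (63)-(64). C has one row of MO coefficients per orbital;
    atom_of_basis maps each basis function to its atom (illustrative inputs)."""
    total = 0.0
    for A in set(atom_of_basis):
        cols = [mu for mu, a in enumerate(atom_of_basis) if a == A]
        # sum_{mu,nu on A} C[i,mu] C[i,nu] = (sum_{mu on A} C[i,mu])^2
        N_A = C[:, cols].sum(axis=1) ** 2
        total += float(np.sum(N_A ** m))
    return total

# Two orbitals in a two-function basis, one basis function per atom:
C = np.array([[1.0, 0.0],    # fully localized on atom 0
              [0.5, 0.5]])   # evenly delocalized over both atoms
L_val = localization_objective(C, atom_of_basis=[0, 1])
```

With m = 4, the fully localized orbital contributes 1 to the objective while the delocalized one contributes only 2 · 0.25^4 ≈ 0.008, illustrating how the functional separates localized from delocalized orbitals.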
IBO uses IAOs for the assignment of basis functions; see Ref. 69 for details. For the virtual orbital localization, we have designed a localization scheme resembling that of Subotnik and Head-Gordon [71], later referred to as VV-HV [72]. In this scheme, the valence virtuals are differentiated from the occupied orbitals by projecting onto a minimal basis, such as STO-3G or IAOs, which is then localized with IBO. This approach strongly resembles AVAS [73]. The leftover space of virtuals, i.e., the hard virtuals, is constructed by projecting onto the original basis, shell by shell, and Schmidt-orthogonalizing each shell. To recap, the localization scheme used is as follows: 1) IBO-localize the occupied orbitals (with m = 4); 2) project the virtuals onto the IAO basis, recovering valence virtuals, similar to AVAS, and localize the whole set with IBO; and finally 3) project the leftover virtual space, shell by shell, symmetrically orthogonalizing each shell and removing those virtuals from subsequent operations.

Localized orbitals form an alternative representation of the occupied and virtual Hartree-Fock subspaces, leaving all observables, including the energy, invariant. This representation offers improved chemical interpretability, with features such as bonds and lone pairs more easily identified. Moreover, it yields sparse Fock, density, two-electron integral, and, therefore, excitation-amplitude matrices. More importantly for us, we hypothesize that this representation facilitates better learning. To verify this intuition, we ablate localization and train a 64-feature model on canonical and localized orbitals on QM7. The results are in Table S1. We can see that localization indeed cuts the error significantly.

Table S1 Ablation of the orbital localization.
We train a smaller model with 64 irreps features in the transformer with canonical and localized orbitals. We see that localized orbitals reduce the error significantly.

Model              QM7
MōLe - canonical   0.27
MōLe - localized   0.16

E Architecture details

In this section we provide some more details about our MōLe architecture.

E.1 Padding

Before encoding the MO coefficients, they have to be padded with zeros so that all atoms have the same number of MO coefficients. In irreps string notation, features of the form ax0e + bx1e + cx2e + ..., where the multiplicities a, b, c depend on atom types and basis set, transform to kx0e + kx1e + kx2e + ..., with k being the maximum over all multiplicities of all atom types. The padded coefficients are then embedded into graphs by initializing the features of an equivariant graph neural network (see Fig. 1).

E.2 Layer Normalization

Let \(x \in \mathbb{R}^{(L_{\max}+1)^2 \times C}\) be an irreps feature with maximum degree \(L_{\max}\) and C channels on an atom. We denote its components by \(x^{(L)}_{m,k}\), where L is the degree, \(m \in \{-L, \ldots, L\}\) the order, and \(k \in \{1, \ldots, C\}\) the channel index. The separable layer norm normalizes the scalar part (L = 0) and the rest (L > 0) separately.

Scalar part (L = 0). We apply a standard layer normalization over channels to the scalar part \(x^{(0)}_k\):

\[
\mu^{(0)} = \frac{1}{C} \sum_{k=1}^{C} x^{(0)}_k, \qquad
\left( \sigma^{(0)} \right)^2 = \frac{1}{C} \sum_{k=1}^{C} \left( x^{(0)}_k - \mu^{(0)} \right)^2, \qquad
y^{(0)}_k = \gamma^{(0)}_k \frac{x^{(0)}_k - \mu^{(0)}}{\sigma^{(0)} + |\epsilon^{(0)}|}, \tag{65}
\]

with learnable parameters \(\gamma^{(0)}, \epsilon^{(0)} \in \mathbb{R}^C\).

Higher-degree part (L > 0). All higher degrees L ≥ 1 are normalized together as:

\[
\left( \sigma^{(L)} \right)^2 = \frac{1}{C} \sum_{k=1}^{C} \frac{1}{2L+1} \sum_{m=-L}^{L} \left( x^{(L)}_{m,k} \right)^2, \qquad
\left( \sigma^{>0} \right)^2 = \frac{1}{L_{\max}} \sum_{L=1}^{L_{\max}} \left( \sigma^{(L)} \right)^2, \qquad
y^{(L)}_{m,k} = \gamma^{(L)}_k \frac{x^{(L)}_{m,k}}{\sigma^{>0} + |\epsilon^{>0}|}, \tag{66}
\]

where each degree L has learnable scale parameters \(\gamma^{(L)}, \epsilon^{>0} \in \mathbb{R}^C\).
The output of the layer normalization block is then the concatenation over all degrees: \(y = \{y^{(0)}_k\}_{k=1}^{C} \cup \{y^{(L)}_{m,k}\}_{L \geq 1,\, -L \leq m \leq L,\, 1 \leq k \leq C}\). This construction is rotation-equivariant because it only rescales irrep blocks by scalar factors and does not mix orders m within a given L. It also does not change the signs, thereby preserving sign equivariance. The ϵ ensures that, if all coefficients on an atom are 0 for a specific MO, they are also 0 after the normalization, ensuring size extensivity.

F Visualization of results

F.1 Amplitude visualization

We visualize the output of the T2 amplitude in Fig. S1. Since the T2 amplitude is a four-index tensor, we only plot one slice. We see that our model successfully predicts the highly nontrivial structure of the amplitudes.

Figure S1 The prediction of MōLe and the ground truth for the slice \((t_{\mathrm{CCSD}} - t_{\mathrm{MP2}})^{a,b}_{i=2,j=2}\) of methanol. We see that MōLe correctly predicts the highly nontrivial correction between MP2 and CCSD.

G CCSD solver iteration numbers

We provide the average number of CCSD solver iterations necessary to converge a CCSD calculation with default MP2 and MōLe initializations, as well as the number of systems that did not converge at all, in Table 3. Clearly, the MōLe amplitudes serve as a high-quality initial guess that can even make systems converge which would not have converged with MP2.

H Hyperparameters

We train our MōLe models for 4 weeks on an H100 GPU using the Adam optimizer to minimize the mean squared error between the predicted and ground-truth amplitudes:

\[
\mathcal{L}\left( \{t_i^a\}, \{t_{i,\mathrm{GT}}^a\}, \{t_{ij}^{ab}\}, \{t_{ij,\mathrm{GT}}^{ab}\} \right)
= \sum_{ia} \left( t_i^a - t_{i,\mathrm{GT}}^a \right)^2 + \sum_{ijab} \left( t_{ij}^{ab} - t_{ij,\mathrm{GT}}^{ab} \right)^2.
\]

The hyperparameter details are given in Table S2. MACE is trained for 1000 epochs, which takes around 12 h on an L40S GPU, using the recommended settings.
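The amplitude training objective above reduces to a sum of squared differences over all amplitude elements. A minimal sketch (the actual training code may batch and reduce differently):

```python
import numpy as np

def amplitude_mse(t1, t1_gt, t2, t2_gt):
    """Sum of squared differences over singles and doubles amplitudes,
    mirroring the loss above; a sketch, not the authors' training code."""
    return float(np.sum((t1 - t1_gt) ** 2) + np.sum((t2 - t2_gt) ** 2))

# Identical predicted and ground-truth amplitudes give zero loss.
t1 = np.ones((2, 3))
t2 = np.zeros((2, 2, 3, 3))
assert amplitude_mse(t1, t1, t2, t2) == 0.0
```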
One important change we made is to switch from the default Adam to the AdamW optimizer, which greatly improved results for MACE on our data. The eSEN model is based on the 6.3-million-parameter small version, but with irreps higher than ℓ = 0, 1 removed. Like MACE, we train eSEN for 1000 epochs, which takes around 10 h on an L40S GPU.

Table S2 Model architecture and training hyperparameters used in our experiments.

  Model / Geometry:
    Basis set: def2-SVP
    Number of layers: 4
  Transformer irrep dimensions:
    Hidden irreps: 128x0e + 128x1o + 128x2e
    Edge irreps: 1x0e + 1x1o + 1x2e
  Radial / GNN block:
    Radial type: Bessel
    Number of Bessel functions: 10
    Polynomial cutoff order: 5
    Max radius: 4.0
    MACE correlation order: 3
  Attention block:
    Latent irreps: 32x0e + 32x1o + 32x2e
    Number of attention heads: 4
  Readout heads:
    t1 readout irreps: 16x0e + 16x1o + 16x2e
    t1 MLP neurons: 16
    Single → Pair irreps: 16x0e + 16x0o + 16x1e + 16x1o + 16x2e + 16x2o
    Pair → Quadruple irreps: 8x0e + 4x1e + 2x2e
    Pair → Quadruple MLP neurons: 8
    Pair → Quadruple layers: 1
  Training:
    Batch size: 1
    Optimizer: Adam
    Base learning rate: 10^-2
    Loss function: MSE
    Scheduler: StepLR
    Scheduler step: 24
    Scheduler γ: 0.5

Table S3 Model architecture and training hyperparameters for MACE.

  Model Architecture:
    Number of interactions: 3
    Hidden irreps: 128x0e + 128x1o
    Number of radial basis: 8
    Cutoff radius: 6.0 Å
  Training:
    Max epochs: 1000
    Batch size: 32
    Loss function: L1 (MAE) per atom
    Gradient clipping: 10
  Optimizer:
    Optimizer: AdamW
    Learning rate: 10^-3
    Weight decay: 10^-3
  LR Scheduler:
    Scheduler: Cosine Annealing + Warmup
    LR min: 10^-6
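For reference, a cosine-annealing-with-warmup schedule of the kind listed for MACE in Table S3 behaves roughly as sketched below. The base LR and LR min follow the table; the warmup length is a hypothetical choice, since only the schedule type and LR bounds are listed:

```python
import math

def cosine_warmup_lr(epoch, max_epochs=1000, base_lr=1e-3, lr_min=1e-6,
                     warmup_epochs=10):
    """Linear warmup to base_lr, then cosine decay to lr_min.
    base_lr and lr_min match Table S3; warmup_epochs is illustrative."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs  # linear warmup
    # cosine decay over the remaining epochs
    progress = (epoch - warmup_epochs) / max(1, max_epochs - warmup_epochs)
    return lr_min + 0.5 * (base_lr - lr_min) * (1.0 + math.cos(math.pi * progress))
```

The learning rate rises linearly during warmup, peaks at base_lr, and decays smoothly toward lr_min by the final epoch.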
Table S4 Model architecture and training hyperparameters for eSEN.

  Model Architecture:
    Backbone: eSEN-MD
    Number of layers: 4
    Hidden channels: 128
    Sphere channels: 128
    ℓ_max: 1
    m_max: 1
    Edge channels: 128
    Distance function: Gaussian
    Number of distance basis: 64
    Cutoff radius: 6.0 Å
    Norm type: RMS norm (SH)
  Training:
    Max epochs: 1000
    Max atoms per batch: 256
    Loss function: L1 (MAE) per atom
    Gradient clipping: 100
  Optimizer:
    Optimizer: AdamW
    Learning rate: 5 × 10^-4
    Weight decay: 10^-3
  LR Scheduler:
    Scheduler: Cosine
    Warmup factor: 0.2
    Warmup epochs: 0.01
    LR min factor: 0.01
