Understanding Machine-learned Density Functionals
Li Li,1 John C. Snyder,1,2 Isabelle M. Pelaschier,3,1 Jessica Huang,2 Uma-Naresh Niranjan,4 Paul Duncan,2 Matthias Rupp,5 Klaus-Robert Müller,6,7 and Kieron Burke2,1

1 Department of Physics and Astronomy, University of California, Irvine, CA 92697
2 Department of Chemistry, University of California, Irvine, CA 92697
3 Department of Physics, Vanderbilt University, Nashville, TN 37235, USA
4 Department of Computer Science, University of California, Irvine, CA 92697
5 Department of Chemistry, University of Basel, Klingelbergstr. 80, 4056 Basel, Switzerland
6 Machine Learning Group, Technical University of Berlin, 10587 Berlin, Germany
7 Department of Brain and Cognitive Engineering, Korea University, Anam-dong, Seongbuk-gu, Seoul 136-713, Korea

(Dated: June 7, 2018)

Kernel ridge regression is used to approximate the kinetic energy of non-interacting fermions in a one-dimensional box as a functional of their density. The properties of different kernels and methods of cross-validation are explored, and highly accurate energies are achieved. Accurate constrained optimal densities are found via a modified Euler-Lagrange constrained minimization of the total energy. A projected gradient descent algorithm is derived using local principal component analysis. Additionally, a sparse grid representation of the density can be used without degrading the performance of the methods. The implications for machine-learned density functional approximations are discussed.

PACS numbers: 31.15.E-, 31.15.X-, 02.60.Gf, 89.20.Ff

CONTENTS

I. Introduction
II. Theory and Background
   A. Model system
   B. Orbital-free DFT
   C. Data topology and representation
   D. The kernel trick and feature space
   E. Kernel ridge regression
III. Model selection
   A. Kernels
   B. Optimization of hyperparameters
IV. Results and Discussion
   A. Errors on exact densities
   B. Sparse grid
   C. Challenge of finding density
   D. Manifold reconstruction using principal component analysis
   E. Projected gradient descent algorithm
   F. Errors on constrained optimal densities
V. Conclusion
Acknowledgments
References

I. INTRODUCTION

Since the early days of quantum mechanics, it has been known that sufficiently accurate solutions of Schrödinger's equation for electrons and nuclei yield good predictions of the properties of solids and molecules [1]. But the Coulomb repulsion between electrons causes the computational cost of solving the Schrödinger equation to grow rapidly with the number of electrons, N [2]. However, as Hohenberg and Kohn proved in 1964 [3], the one-electron density may be used as the basic variable of quantum mechanics instead of the wavefunction, greatly reducing the complexity of the computational problem. This is called density functional theory (DFT) [4]. In principle, the mapping of the Schrödinger equation to one with the electron density is exact, but in practice, both the kinetic energy and the energy of the interaction between electrons must be approximated. In the original Thomas-Fermi theory [5, 6], a local density functional approximation to the kinetic energy is used. However, Thomas-Fermi theory proved unsuitable for chemical and solid-state applications, as it does not bind matter [7].
Shortly after the Hohenberg-Kohn theorems, Kohn and Sham (KS) [8] found a middle ground by mapping the many-body system onto a fictitious system of non-interacting electrons which reproduces the exact electron density. The main reason KS DFT became successful is that the kinetic energy of these non-interacting electrons is an excellent approximation to the many-body kinetic energy. Simple approximations to the interaction energy produce much greater accuracy and reliability compared with the standard orbital-free DFT schemes built on Thomas-Fermi theory. However, the accuracy of the results is still sensitive to the approximation of the exchange-correlation (XC) functional. In the past four decades, there has been extensive research into improving density functional XC approximations. Development of both empirical and non-empirical functionals requires great intuition built on years of experience, as well as painstaking trial and error [9-11].

Despite the great success KS DFT has enjoyed, its computational cost scales as O(N^3), which is much worse than the linear scaling of orbital-free DFT [12]. Thus, there continues to be strong interest in improving upon existing orbital-free approximations to the kinetic energy [12-14]. A sufficiently accurate approximation to T_S[n], the kinetic energy of KS electrons as a functional of the ground-state density n(r), would enable highly accurate orbital-free DFT calculations with the same accuracy as KS DFT at a fraction of the computational cost. For example, benchmark orbital-free DFT calculations are capable of treating millions of atoms in metals [15] or proteins in solvent [16]. Note that accuracy in T_S beyond that of current XC approximations would be unnecessary, since all standard orbital-free DFT schemes utilize the KS decomposition of the energy, so that standard XC approximations developed for KS DFT can be reused. However, since T_S is typically comparable to the total energy of the system [4], an unavoidable problem is that a useful kinetic energy (KE) functional calls for much stricter relative accuracy than XC functionals. Additionally, accurate functional derivatives are required, because one finds the ground-state density by solving an Euler equation with the approximate kinetic energy functional. Continued efforts have been made in this research direction, with some notable progress [17-28]. For a review of state-of-the-art orbital-free DFT functionals, we refer the reader to Ref. [12].

In DFT, functionals typically fall into two categories. Non-empirical functionals derived from first principles tend to work well across a broad range of systems, but may exhibit systematic errors in treating certain types of interactions. Semi-empirical functionals introduce parameters that are fitted to standard data sets, and are typically more accurate with less systematic error.

Recently, some of us applied machine learning (ML) in a completely new approach to approximating density functionals [29, 30]. In a proof of principle, kernel ridge regression was used to approximate the kinetic energy of non-interacting fermions confined to a 1d box as a functional of the electron density [29]. In that work, a modified orbital-free DFT scheme was able to produce highly accurate self-consistent densities and energies that were systematically improvable with additional training data.
ML algorithms are capable of learning high-dimensional patterns by non-linear interpolation between given data. These powerful methods have proved to be very successful in many applications [31], including medical diagnoses [32], stock market predictions [33], automated text categorization [34], and others. Recently, ML has been applied to quantum chemistry, including fast and accurate modeling of molecular atomization energies [35-37], optimizing transition state theory dividing surfaces [38], and calculating bulk crystal properties at high temperatures [39].

This new approach to density functional approximation suffers none of the typical challenges found in traditional approximations, but presents many new ones. First and foremost, ML is data-driven: reference calculations are needed to build a model for the KE functional. Since every iteration in a KS DFT calculation provides an electron density and its exact non-interacting KE, reference data is relatively easy to obtain. Additionally, the ML approximation (MLA) to the KE may have thousands or millions of parameters and satisfy none of the standard exact conditions in DFT, such as positivity, scaling, and exactness for a uniform electron gas. On the other hand, the form of the MLA is completely general and thus directly approximates the functional itself, suffering none of the typical issues that plague standard functionals starting from a local approximation. For example, some of us recently showed that an MLA for the KE has no problem accurately dissociating soft-Coulomb diatomics in 1d, a huge challenge for standard approximations [30]. However, kernel ridge regression is strictly a method of interpolation. An MLA can only be used on systems it was designed for.

In this paper, we explore the properties of the MLA derived in Ref. [29] in greater detail. In particular, we investigate the use of various kernels and their properties and the efficiency of various cross-validation methods. We discuss the issue of functional derivatives of the MLA in greater detail, and explain how a modified constraint to the standard Euler equation enables highly accurate self-consistent densities, or constrained optimal densities, to be found. Additionally, a projected gradient descent algorithm is derived using local principal component analysis in order to solve the modified Euler equation. Finally, we explore the use of a sparse grid representation of the electron density and its effects on the method.

II. THEORY AND BACKGROUND

Throughout this work, we consider only non-interacting same-spin fermions in one dimension. Thus, all electron densities n(x) are fully spin-polarized. Atomic units are used in symbolic equations, but energies are usually presented in kcal/mol.

A. Model system

Consider N non-interacting same-spin fermions subject to a smooth external potential in one dimension, with hard walls at x = 0 and x = 1. We restrict this study to a simple class of potentials, namely a sum of three Gaussian dips with varying heights, widths and centers:

    v(x) = -\sum_{j=1}^{3} a_j \exp\left( -\frac{(x - b_j)^2}{2 c_j^2} \right),    (1)

for x in [0, 1], and v(x) = infinity elsewhere.

FIG. 1. A few sample densities and their corresponding potentials, for N = 1.
The Hamiltonian for this system is simply H = T + V, where T = -(1/2) d^2/dx^2 and V = v(x). We solve the Schrödinger equation

    \left( -\frac{1}{2}\frac{\partial^2}{\partial x^2} + v(x) \right) \phi_j(x) = \epsilon_j\, \phi_j(x),    (2)

for the eigenvalues epsilon_j and orbitals phi_j(x). As our fermions are same-spin, each orbital phi_j(x) is singly occupied. Thus, the electron density is given by

    n(x) = \sum_{j=1}^{N} |\phi_j(x)|^2,    (3)

and the kinetic energy is

    T = \frac{1}{2} \sum_{j=1}^{N} \int_0^1 dx\, |\phi_j'(x)|^2.    (4)

A dataset is created by randomly sampling a_j in [1, 10], b_j in [0.4, 0.6], c_j in [0.03, 0.1], to generate 2000 different potentials. For each potential, the system is occupied with up to 4 fermions, and the exact densities and kinetic energies are computed. Numerically, the Schrödinger equation is solved by discretizing the density on a grid:

    x_j = (j - 1)/(N_G - 1), \qquad j = 1, \dots, N_G,    (5)

where Delta x = 1/(N_G - 1) is the grid spacing. Numerov's method [40], together with a shooting method, is used to solve for the eigenvalues and eigenfunctions of Eq. (2). For N_G = 500, the error in our reference kinetic energies is less than 10^-7. Fig. 1 gives a few sample densities and their corresponding potentials. The data used here is identical to that of Ref. [29]. The exact values of the parameters used in each sample are given in the supplementary information of Ref. [29]. Of the 2000 samples generated, the first half is reserved for training while the second half is reserved for testing (which we refer to as the test set).
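A minimal sketch of how such a dataset can be generated. It is only illustrative: it diagonalizes a dense finite-difference Hamiltonian instead of the Numerov-plus-shooting procedure described above (so its reference energies are less accurate than the 10^-7 quoted), and the function names, random seed, and sampling loop are ours, not the authors'.

```python
import numpy as np

N_G = 500                                   # grid points, Eq. (5)
x = np.linspace(0.0, 1.0, N_G)
dx = x[1] - x[0]

def gaussian_dip_potential(a, b, c):
    """Sum of three Gaussian dips, Eq. (1); the hard walls come from the box itself."""
    return -sum(ai * np.exp(-(x - bi) ** 2 / (2 * ci ** 2)) for ai, bi, ci in zip(a, b, c))

def solve_box(v, N):
    """Solve Eq. (2) by finite differences; return the density (Eq. 3) and KE (Eq. 4)."""
    lap = (np.diag(np.full(N_G - 2, -2.0)) +
           np.diag(np.ones(N_G - 3), 1) + np.diag(np.ones(N_G - 3), -1)) / dx ** 2
    H = -0.5 * lap + np.diag(v[1:-1])         # Dirichlet (hard-wall) boundaries
    eps, phi = np.linalg.eigh(H)
    phi = phi[:, :N] / np.sqrt(dx)            # normalize so that int |phi|^2 dx = 1
    n = np.zeros(N_G)
    n[1:-1] = np.sum(phi ** 2, axis=1)        # density, Eq. (3)
    T = np.sum(eps[:N]) - dx * np.sum(v * n)  # KE via E = sum_j eps_j = T + V
    return n, T

rng = np.random.default_rng(0)
densities, energies = [], []
for _ in range(100):                          # the paper uses 2000 potentials
    a = rng.uniform(1, 10, 3); b = rng.uniform(0.4, 0.6, 3); c = rng.uniform(0.03, 0.1, 3)
    n, T = solve_box(gaussian_dip_potential(a, b, c), N=1)
    densities.append(n); energies.append(T)
densities, energies = np.array(densities), np.array(energies)
```

Looping this over more potentials and N = 1 to 4 reproduces the structure, though not the exact numerics, of the dataset described above.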
B. Orbital-free DFT

In orbital-free DFT, T_S is approximated as a functional of n(x). For our model system with non-interacting fermions, the total energy is given as

    E_v = \min_n \left\{ T[n] + V[n] \right\},    (6)

for a given potential v(x). The potential energy is known exactly as a functional of n(x):

    V[n] = \int_0^1 dx\, n(x)\, v(x).    (7)

Via the variational principle, the ground-state density is found by the Euler-Lagrange constrained search

    \delta \left\{ E_v[n] - \mu \left( \int n(x)\, dx - N \right) \right\} = 0,    (8)

where the chemical potential mu is adjusted to produce the required particle number N. This becomes simply

    \frac{\delta T[n]}{\delta n(x)} = \mu - v(x).    (9)

The density that satisfies this equation, minimizing E_v[n] under the normalization constraint, is found self-consistently.

Given the exact functional T[n], solving Eq. (9) yields the exact ground-state density of the system. But in practice, T must be approximated. Let T~ be such an approximation, n(x) be the exact density, and n~(x) be the self-consistent density found with T~. There are two measures of the error of such an approximate T~ [41]. The first is to compute the functional-driven error Delta T_F = T~[n] - T[n], which is simply the error in the KE evaluated on the exact density. The second (and much more difficult) test is to insert T~ into Eq. (9), solve for the approximate density n~, and compute the resulting error in the total energy, Delta E = E~_v[n~] - E_v[n]. The density-driven error is then defined as Delta E_D = Delta E - Delta T_F [41]. This is the additional error incurred by the approximate density. In practice, a functional which only satisfies the first test is of little use, as the ground-state density itself must also be obtained from this approximation.

In orbital-free DFT, self-consistent results can be much worse than energies of KS densities, as inaccuracies in the functional derivative can cause large errors in the corresponding density. In the case of the KE functional for real systems, functional derivatives of traditional approximations can have singularities at the nuclei, making all-electron calculations very difficult, if not impossible, to converge [12]. Many of these problems can be avoided through the use of pseudopotentials [12, 28], but in general the solution of Eq. (9) is non-trivial.

As mentioned above, the simplest density functional approximation to T_S is the local approximation [4], which for spin-polarized densities in 1d is

    T^{\rm loc}[n] = \frac{\pi^2}{6} \int dx\, n^3(x).    (10)

For N = 1, the exact KE has the von Weizsäcker [17] form:

    T^{\rm W}[n] = \int dx\, \frac{n'(x)^2}{8\, n(x)}.    (11)

As was shown in Ref. [29], the local approximation does poorly. The mean absolute error (MAE) on the test set is 217 kcal/mol, and self-consistent results are even worse, at 1903 kcal/mol. A standard extension of the local approximation to a semi-local form is to add a fraction of T^W[n] to T^loc[n], forming a modified gradient expansion approximation. It was shown in Ref. [29] that this did little to improve upon the local approximation.
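As a quick illustration of Eqs. (10) and (11), the sketch below evaluates the local and von Weizsäcker functionals for a density on the uniform grid. The test density is the exact N = 1 ground state of the flat box, for which T_W is exact (T = pi^2/2 in atomic units); the function names and the small-density cutoff are our own choices.

```python
import numpy as np

def T_loc(n, dx):
    """Local approximation of Eq. (10) for spin-polarized densities in 1d."""
    return np.pi ** 2 / 6.0 * dx * np.sum(n ** 3)

def T_W(n, dx):
    """von Weizsaecker functional of Eq. (11); exact for N = 1."""
    dn = np.gradient(n, dx)
    mask = n > 1e-12                  # avoid 0/0 where the density vanishes at the walls
    return dx * np.sum(dn[mask] ** 2 / (8.0 * n[mask]))

x = np.linspace(0.0, 1.0, 500); dx = x[1] - x[0]
n = 2.0 * np.sin(np.pi * x) ** 2      # exact N = 1 density of the empty box
print(T_loc(n, dx), T_W(n, dx), np.pi ** 2 / 2)   # T_W reproduces pi^2/2; T_loc does not
```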
C. Data topology and representation

Typically in ML, the data has a finite representation. For example, in Ref. [35], molecular structures are represented by a Coulomb matrix and the model predicts atomization energies. In contrast, the electron density n(x) is a continuous function restricted to the domain [42]

    \mathcal{J}_N \equiv \left\{ n \;\middle|\; n(x) \ge 0,\; n^{1/2}(x) \in H^1(\mathbb{R}),\; \int n(x)\, dx = N \right\},    (12)

where H^1(R) is a Sobolev space. (Footnote 1: A Sobolev space W^{k,p}(R) is a vector space of functions with a norm that is a combination of L^p-norms of the function itself and its derivatives up to a given order k. It is conventional to write W^{1,2}(R) as H^1(R). f in H^1(R) means that f and its first-order derivative are in L^2.)

Although J_N is infinite-dimensional, in practice n(x) is expanded in a finite basis (with N_G basis functions). In this work, we use a real-space grid to represent the density, since our reference calculations are done using the same grid. We use the L^2 inner product and norm between densities n_i(x), n_j(x):

    \langle n_i, n_j \rangle = \int_{-\infty}^{\infty} dx\, n_i(x)\, n_j(x), \qquad \| n \| = \sqrt{\langle n, n \rangle}.    (13)

(In actual calculations, all densities are represented in a finite basis, and thus have a finite L^2-norm.) Since the ML algorithm is expressed in terms of this inner product, the results are independent of the specific representation used, as long as the basis is converged.

Even with a truncated basis, J_N is still high-dimensional, and applying ML to learn the KE of all densities in J_N would not be feasible. Fortunately, we are only interested in a subspace of J_N related to a specific class of potentials (e.g., Gaussian dips), which greatly reduces the variety of possible densities. In general, let the potential v(x) be parametrized by the parameters {p_1, ..., p_d}. We define the density manifold M_N, a subset of J_N, as the set of all densities that come from these potentials with a given particle number N. In general, M_N is a d-dimensional manifold. The training densities, n_j(x) for j = 1, ..., N_T, are sampled from M_N. In the present work, the external potential has 9 parameters, and thus d is at most 9.

D. The kernel trick and feature space

In finding the structure of low-dimensional data, it is often sufficient to optimize parametrized non-linear forms (e.g., using a polynomial to fit a sinusoid). For high-dimensional, nonlinear data this becomes increasingly difficult. In kernel-based machine learning, the approach is to transform the data itself non-linearly to a high-dimensional space known as feature space, such that the data becomes linear [31, 43-46]. Fig. 2 illustrates data points that lie on a circle in the Cartesian plane. As shown, the data becomes linear on transformation to polar coordinates, and linear regression can subsequently be used to fit the data. Transforming back to Cartesian coordinates recovers the non-linearity.

FIG. 2. Example of the non-linear transformation of data to feature space. (a) The data is non-linear (a circle) in Cartesian coordinates. The green dashed line is a linear fit to the data points (blue crosses). (b) When the data is transformed to feature space by x = rho cos(theta), y = rho sin(theta), the linear structure in the data is revealed (red solid line). (c) The model can be transformed back to the original space to give a non-linear fit of the data.

Let the data points belong to a vector space chi, also called input space, and let Phi : chi -> F be the map to feature space F. Assuming we wish to apply a linear method such as regression in feature space F, we note that regression can be expressed solely in terms of the inner product between feature vectors Phi(x) and Phi(y), where x, y are in chi. We define the kernel k such that

    k(x, y) = \langle \Phi(x), \Phi(y) \rangle.    (14)

The kernel can generally be thought of as a measure of similarity between data, but must satisfy Mercer's condition:

    \int\!\!\int k(x, y)\, g(x)\, g(y)\, dx\, dy \ge 0,    (15)

for all g(x) satisfying \int |g(x)|^2\, dx < \infty. Mercer's theorem [47] guarantees the existence of a feature space F, which is a reproducing kernel Hilbert space [48]. Since the linear algorithm in F may be expressed in terms of the kernel in Eq. (14), Phi need never be explicitly computed. This procedure is known as the kernel trick, and enables easy nonlinearization of all linear, scalar-product-based methods that can be expressed via an inner product [49].

E. Kernel ridge regression

Kernel ridge regression is a nonlinear version of regression with a regularization term to prevent overfitting [50]. Our MLA for the KE has the form

    T^{\rm ML}[n] = \sum_{j=1}^{N_T} \alpha_j\, k[n, n_j],    (16)

where N_T is the number of training densities, alpha_j are weights to be determined, n_j are training densities and k[n, n_j] is the kernel. The weights are found by minimizing the quadratic cost plus regularization

    C(\boldsymbol{\alpha}) = \sum_{j=1}^{N_T} \left( T^{\rm ML}[n_j] - T[n_j] \right)^2 + \lambda\, \boldsymbol{\alpha}^\top K \boldsymbol{\alpha},    (17)

where alpha = (alpha_1, ..., alpha_{N_T}), K is the kernel matrix with K_ij = k[n_i, n_j], and lambda is called the regularization strength. The second term penalizes weights with large magnitudes in order to prevent overfitting. (Footnote 2: The regularization term accounts for the possibility of noisy data (e.g., experimental data), and imposes certain smoothness conditions on the model (see [51]). Our reference data is deterministic and thus noise-free in this sense, but, because the precision of our calculations is limited, we may consider the numerical uncertainty to be noise.) Setting the gradient of Eq. (17) to zero, minimizing C(alpha) gives

    \boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{T},    (18)

where I is the identity matrix and T = (T[n_1], ..., T[n_{N_T}]).
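A minimal sketch of this regression, assuming the `densities` and `energies` arrays from the dataset sketch above. It uses the Gaussian kernel of Eq. (19) with the grid-approximated norm of Eq. (20), both defined just below, solves Eq. (18) for the weights, and predicts with Eq. (16); the function names and the brute-force broadcasting are ours.

```python
import numpy as np

def gaussian_kernel(A, B, sigma, dx):
    """k[n_i, n_j] = exp(-||n_i - n_j||^2 / 2 sigma^2), with Eq. (20) for the norm."""
    # pairwise squared distances; memory-hungry but fine for the small sets used here
    d2 = dx * ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def train_krr(densities, energies, sigma, lam, dx):
    """Solve (K + lambda I) alpha = T, Eq. (18)."""
    K = gaussian_kernel(densities, densities, sigma, dx)
    return np.linalg.solve(K + lam * np.eye(len(energies)), energies)

def predict_krr(n, densities, alpha, sigma, dx):
    """T_ML[n] = sum_j alpha_j k[n, n_j], Eq. (16)."""
    return gaussian_kernel(n[None, :], densities, sigma, dx)[0] @ alpha

# illustrative hyperparameters; in practice they come from cross-validation (Sect. III)
alpha = train_krr(densities, energies, sigma=2.0, lam=1e-12, dx=dx)
```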
The hyperparameters, which include the regularization strength lambda and the parameters of the kernel such as the length scale sigma, are found via cross-validation (see [36] and Sect. III).

The choice of the kernel will depend on the given data. Some kernels are designed to be generally robust and applicable (e.g., the Gaussian kernel), while others are designed for a specific type of data (see e.g. [31, 52, 53]). A good choice of kernel can reflect the characteristics of the data (see [54]). In Ref. [29], we chose the Gaussian kernel

    k[n_i, n_j] = \exp\left( -\| n_i - n_j \|^2 / 2\sigma^2 \right),    (19)

where sigma is the length scale. Since the density is represented on a uniform grid, the L^2-norm can be approximated by

    \| n_i - n_j \|^2 = \Delta x \sum_{l=1}^{N_G} \left( n_i(x_l) - n_j(x_l) \right)^2,    (20)

where x_l is given by the grid defined in Eq. (5). This approximation becomes exact as Delta x -> 0. (Footnote 3: Note that, in Ref. [29], the same representation for the density was used, but the densities were treated as vectors, so the standard Euclidean distance was used in the kernel. This is equivalent to the formulation here, except that our notation is more general now (e.g., Simpson's rule could be used to approximate the L^2-norm instead of a Riemann sum), and the length scale of the Gaussian kernel here is related to the scale of the kernel in Ref. [29] by a factor of sqrt(Delta x).) Fig. 3 shows the range and distribution of Euclidean distances between all pairs of densities, and of the KE of all densities, in the dataset with N = 1.

FIG. 3. (a) Normalized distribution of the Euclidean distance between all distinct pairs of densities in the dataset (2000 densities). The maximum distance between any pair is 0.9. (b) Histogram of the KE in the dataset. The vertical dashed line at 3093 kcal/mol is the ground-state energy of one fermion in a flat box of length 1.

Ordinary linear regression models frequently employ a bias term to account for the fact that the data might lie away from the origin. Without this term, the regression line is forced to go through the origin, causing a systematic error if the data does not. The bias term can be implemented directly, or by centering the samples such that their mean is zero. Fig. 4 illustrates the transformation to feature space for an example 1d data set and linear regression in feature space. If the data is centered in feature space, the bias term is unnecessary.

FIG. 4. (a) An example 1d noisy data set. (b) Transformation to feature space Phi(x). (c) Centering of data in feature space.
Here, we center the densities in feature space such that \sum_{j=1}^{N_T} \Phi(n_j) = 0. We define the centered map to feature space \tilde{\Phi}(n) = \Phi(n) - \sum_{j=1}^{N_T} \Phi(n_j)/N_T. The centered kernel is then [49]

    \tilde{k}[n, n'] = \langle \tilde{\Phi}(n), \tilde{\Phi}(n') \rangle
                     = k[n, n'] - \frac{1}{N_T} \sum_{j=1}^{N_T} \left( k[n', n_j] + k[n, n_j] \right) + \frac{1}{N_T^2} \sum_{i,j=1}^{N_T} k[n_i, n_j].    (21)

For simplicity, all equations given in this work assume that the data is centered (i.e., k = k~). In fact, kernels such as the Gaussian kernel of Eq. (19), whose induced reproducing kernel Hilbert space on a bounded domain is dense in the space of continuous functions on that domain, do not require centering [55].

III. MODEL SELECTION

A. Kernels

Model selection refers to the process of selecting a kernel and the corresponding hyperparameters. In kernel ridge regression, this includes the regularization strength lambda and the kernel parameters (e.g., in the Gaussian kernel, the length scale sigma). Table I lists some standard kernels. Radial basis function (RBF) kernels, which include the Gaussian, Cauchy, and Laplacian kernels, all behave similarly and tend to work for a broad range of problems. Other kernels work well for specific data structures [31, 52, 53] and regularization properties [49].

TABLE I. Standard kernels. The parameters sigma, theta, d are kernel parameters. The linear kernel has no parameters.

Kernel      k[n, n']
Gaussian    exp(-||n - n'||^2 / 2 sigma^2)
Cauchy      (1 + ||n - n'||^2 / sigma^2)^(-1)
Laplacian   exp(-||n - n'|| / 2 sigma)
Wave        (theta / ||n - n'||) sin(||n - n'|| / theta)
Power       ||n - n'||^d
Linear      <n, n'>

Fig. 5 shows the contours of the functional-driven MAE over the test set as a function of the regularization strength lambda and the kernel parameter sigma. We see that the qualitative behavior is similar for the Gaussian, Cauchy and Laplacian kernels. In the left region (where the contour lines are vertical), the length scale sigma is much smaller than the distance between neighboring training densities. Thus the RBF-type kernel functions centered at each training density have minimal overlap, yielding a poor approximation to the KE functional. The kernel matrix becomes nearly the identity, and the regularization lambda has negligible effect. On the right side of the contour plot, the length scale is comparable to the global scale of the data. In these regions, the kernel functions are slowly varying and do not have enough flexibility to fit the non-linearity in the data. The region with minimum MAE lies in the middle. The Gaussian and Cauchy kernels both give the same performance, with errors less than 1 kcal/mol in the middle region (enclosed by the dashed line), while the Laplacian kernel behaves poorly in comparison. This is likely due to the cusp in the form of the kernel, which cannot fit the smooth KE functional.

FIG. 5. Contour plots of the functional-driven MAE |Delta T_F| over the test set in kcal/mol for selected kernels with N_T = 100. The dashed line delineates the region where the model achieves chemical accuracy. Each gray dot gives the optimal choice of hyperparameters from a randomized 10-fold cross-validation. The black dot denotes the median over 40 repetitions. In the lower right region (i.e., small lambda and large sigma), the matrix inverse in Eq. (18) is numerically unstable due to the limited precision of the calculation.
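For concreteness, the kernels of Table I can be written as functions of the distance r = ||n - n'|| (the linear kernel uses the inner product instead). A small sketch with our own vectorized forms and a guarded r = 0 limit for the wave kernel:

```python
import numpy as np

def gaussian(r, sigma):  return np.exp(-r ** 2 / (2 * sigma ** 2))
def cauchy(r, sigma):    return 1.0 / (1.0 + r ** 2 / sigma ** 2)
def laplacian(r, sigma): return np.exp(-r / (2 * sigma))
def power(r, d):         return r ** d

def wave(r, theta):
    r = np.asarray(r, dtype=float)
    safe = np.where(r > 0, r, 1.0)                    # avoid division by zero at r = 0
    return np.where(r > 0, (theta / safe) * np.sin(safe / theta), 1.0)

def linear(n, n_prime, dx):
    return dx * np.dot(n, n_prime)                    # grid form of <n, n'>
```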
B. Optimization of hyperparameters

After picking a kernel family, the values of the hyperparameters must be chosen. Ideally, we select the hyperparameters such that the generalization error, which is the error not only on our training set but also on all future data, is minimal. The out-of-sample error must be estimated without looking at the test set (the test set is never touched during model selection, so that it can give a true test of the final performance of the model) [31, 36]. This procedure, known as cross-validation, is essential in model selection for preventing overoptimistic performance estimates (overfitting).

Various schemes for cross-validation exist [31, 36, 56], but all obey a basic principle: the available data is subdivided into three parts, the training, validation and test sets. The ML model is built from the training set and the hyperparameters are optimized by minimizing the error on the validation set (Fig. 6). The test set is never touched until the weights and hyperparameters have been determined. Then, and only then, can the generalization ability of the model be assessed with the test data ([31, 36], see also [57]). Typically, the data is shuffled to ensure its random distribution between the training and validation divisions. This can be repeated with different subdivisions. A few schemes, which will be analyzed for our kinetic energy functional estimation problem, are described below. For each scheme, a test set of 1000 samples is used to estimate the generalization error after the ML model is selected.

FIG. 6. Cartoon showing the relation of each data set in M_N. Each black dot represents a sample (a density and its corresponding KE). The training, validation and test sets are subsets of the full data set.

Simple cross-validation: The training data (N_T samples) is randomly divided into a training set of 70% and a validation set of 30%. The hyperparameters are optimized by minimizing the MAE on the validation set.

k-fold cross-validation (a minimal sketch is given after this list):
Step 1: The N_T training data is randomly divided into k bins.
Step 2: The j-th bin is used as the validation set and the remaining k - 1 bins as the training set. The model is built on the training set and the hyperparameters are selected by minimizing the MAE on the validation set.
Step 3: Repeat step 2 k times, such that all bins have been used as validation sets. We then have k models in total, and the final hyperparameters are selected as the median over all models.
Because the mean cross-validation error still depends on the initial random partitioning of the data, we repeat the procedure with different subdivisions [31].

Leave-one-out: Leave-one-out (LOO) is a special case of k-fold cross-validation with k = N_T, so that each bin contains only one sample.
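A sketch of the k-fold procedure above, assuming `densities`, `energies`, `dx` and the `train_krr`/`predict_krr` helpers from the earlier sketches. It differs from the text in one respect, noted in the comments: for brevity it picks the (sigma, lambda) pair that minimizes the mean validation MAE over the folds, rather than optimizing each fold separately and taking the median; the search grids are illustrative.

```python
import numpy as np

def kfold_select(densities, energies, dx, k=10, seed=0):
    idx = np.random.default_rng(seed).permutation(len(energies))
    bins = np.array_split(idx, k)                        # Step 1: k random bins
    best_mae, best_params = np.inf, None
    for sigma in np.logspace(-1, 2, 13):                 # illustrative search grids
        for lam in np.logspace(-16, -4, 13):
            maes = []
            for j in range(k):                           # Steps 2-3: rotate the validation bin
                val = bins[j]
                tr = np.concatenate([bins[i] for i in range(k) if i != j])
                a = train_krr(densities[tr], energies[tr], sigma, lam, dx)
                pred = np.array([predict_krr(n, densities[tr], a, sigma, dx)
                                 for n in densities[val]])
                maes.append(np.mean(np.abs(pred - energies[val])))
            # variant: minimize the mean validation MAE (the text takes the per-fold median)
            if np.mean(maes) < best_mae:
                best_mae, best_params = np.mean(maes), (sigma, lam)
    return best_params
```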
Typically, it is better to leave out as little data as possible, to exploit the statistical power in the data. Simple cross-validation is computationally expedient, but wasteful, since not all training data participates in the optimization. k-fold cross-validation is used in situations where data is very limited or expensive to collect. Leave-one-out is often used with limited data, and it becomes computationally intensive if N_T is large. k-fold cross-validation gives a good balance on all counts.

For the Gaussian kernel, Fig. 7 shows the MAE on the test set with the hyperparameters optimized with different cross-validation methods. With 120 training densities, all schemes give a similar MAE, despite the large variations in sigma and lambda. This means that multiple models exist that give comparable performance. As expected, the only variations in MAE occur for more limited data.

FIG. 7. Comparison of different cross-validation methods, including simple, 2-fold, 3-fold, 4-fold, 5-fold, 10-fold and leave-one-out. The mean absolute functional-driven error |Delta T_F| (in kcal/mol) is evaluated on the test set, and the error bars represent the standard deviation.

Fig. 5 shows how 10-fold cross-validation performs in selecting hyperparameters that generalize well to the test set, for a few kernels. The gray dots represent the optimal parameter choice for each repetition, and the black dot is the median over all repetitions. In this case, the global minimum of the MAE lies in a relatively flat basin. Each randomized cross-validation lies near the true minimum, indicating that the model generalizes well to the test set.

Finally, we use 10-fold cross-validation (repeated 40 times) to optimize the hyperparameters. Table II shows the optimal hyperparameters and functional-driven errors for the kernels listed in Table I. Some optimal values for the Gaussian kernel are listed in Table III. Detailed information with optimal values for the other kernels is given in the supplementary material.

TABLE II. The optimal hyperparameters found through 10-fold cross-validation and the MAE over the test set for various kernels with N = 1 and N_T = 100. The kernel parameter p refers to sigma for the Gaussian, Cauchy and Laplacian kernels, theta for the wave kernel and d for the power kernel. The linear kernel has no parameter. Errors are given in kcal/mol.

Kernel      lambda       p          |Delta T_F|   |Delta T_F|max
Gaussian    4.5e-14      1.6        0.13          3.4
Cauchy      7.8e-14      3.5        0.13          2.9
Laplacian   1.0e-15      3.6e+5     6.4           231
Linear      6.2e-1       -          53.1          380
Wave        4.5e-1       0.14       19.2          252
Power       1.0e-13      1.96       3.3           104

IV. RESULTS AND DISCUSSION

In the main work of this paper, we test in greater detail some of the methods that were introduced in Ref. [29], using only the Gaussian kernel, as it performs the best.

A. Errors on exact densities

In Table III, we evaluate our MLA, constructed using the first N_T training densities in our data set, on the exact densities of the test set and compute the errors Delta T_F = T^ML[n] - T[n]. The Gaussian and Cauchy kernels give the best performance. For the Gaussian kernel with N = 1, chemical accuracy is achieved (i.e., MAE less than 1 kcal/mol) at N_T = 60. Just as we saw in Ref. [29], the performance is systematically improvable with an increasing number of training densities. The Laplacian kernel gives a mean absolute error (MAE) of 6.9 kcal/mol at N_T = 100 (still better than the local approximation), which improves as N_T increases. On the other hand, the performance of the wave kernel does not improve as N_T increases (see supplemental information). This indicates that the form of the wave kernel is not flexible enough to fit the form of the KE functional.

B. Sparse grid

Note that the choice of N_G used in the reference calculations is needed to converge our reference energies and densities, but it may be larger than the grid needed to "converge" our ML functional. As the ML model depends only on the inner product between densities, it will typically converge much faster than, e.g., Numerov's method. To demonstrate this, we define a "sparse" grid, {x_{s(j-1)+1} | j = 1, ..., N_G/s}, using every s-th point in the grid (we only choose s such that N_G is divisible by s).
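A sketch of this sparse-grid test, assuming `densities`, `dx` and the KRR helpers from the earlier sketches; rescaling the spacing used inside Eq. (20) is the only change to the model.

```python
import numpy as np

def sparsify(densities, dx, s):
    """Keep every s-th grid point and rescale the spacing used in the kernel norm."""
    return densities[:, ::s], dx * s

# usage: retrain the same model on the sparse representation
# sparse_densities, sparse_dx = sparsify(densities, dx, s=50)   # N_G = 500 -> 10 points
# alpha_sparse = train_krr(sparse_densities, energies, sigma=2.0, lam=1e-12, dx=sparse_dx)
```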
Fig. 8 shows that the performance of the model is unaffected until N_G is reduced to about 10 grid points. The model is cross-validated each time, but the hyperparameters change only slightly. Thus, ML can accurately learn the KE functional with a far less complete basis than is required to accurately solve the Schrödinger equation. This is possible because we have restricted the learning problem to a simple type of potential with a limited range of possible densities and energies. The underlying dimensionality of the data is about 9, comparable to the number of parameters that determine the potential. The model needs only enough degrees of freedom in the representation of the density to distinguish between densities, but no more. Thus it is no coincidence that the minimum grid required is comparable to the dimensionality of the data (i.e., the dimensionality of the density manifold M_N).

However, we also need a sufficiently fine grid to compute the integral in Eq. (7) to the desired accuracy. In the problem shown here, the dimensionality of the data is relatively small, but it will increase for larger systems (e.g., real molecules with many degrees of freedom). In general, then, we need to consider both factors in choosing a suitable basis. But we may be able to use a basis that is more sparse than that of the reference data, which would greatly reduce the computational cost of the method.

FIG. 8. The effect of using a sparse grid to represent the density on the performance of the MLA, for N = 1, N_T = 100, with the Gaussian kernel. Here (a) |Delta T_F| = |T^ML[n] - T[n]| is the mean absolute functional-driven error of the MLA evaluated on the test set in kcal/mol, (b) |Delta T| = |T^ML[n~] - T[n]| gives the error of the KE evaluated on constrained optimal densities in kcal/mol, and (c) shows the corresponding re-cross-validated hyperparameters lambda and sigma. The MAE is completely unaffected as N_G is reduced, until approximately N_G = 10, when it jumps sharply.

TABLE III. Hyperparameters and errors measured over the test set using the Gaussian kernel, for different N and N_T. The regularization strength lambda and the length scale sigma of the Gaussian kernel are optimized with 10-fold cross-validation. The functional-driven error Delta T_F = T^ML[n] - T[n] is evaluated on the test set. Mean and max absolute errors are given in kcal/mol. Delta T = T^ML[n~] - T[n] gives the error in the KE evaluated on constrained optimal densities. Likewise, Delta E = E^ML[n~] - E[n].

                                  |Delta T_F|     |Delta T|       |Delta E|
N     N_T    lambda*10^14   sigma   Mean   Max     Mean   Max     Mean   Max
1     40     50.            4.2     1.9    30.     15     120     5.1    32
      60     10.            1.8     0.62   11.     3.0    19      0.66   4.4
      80     54.            1.5     0.23   3.1     1.1    11      0.44   2.6
      100    4.5            1.6     0.13   3.5     1.4    16      0.41   2.3
      150    1.2            1.3     0.06   1.0     0.81   5.1     0.27   1.9
      200    1.3            1.0     0.03   0.87    0.67   10.     0.28   1.6
2     60     60.            3.0     0.46   4.8     1.79   9.9     0.73   3.6
      100    1.0            2.2     0.14   1.7     1.25   5.0     0.44   2.5
3     60     6.0            5.8     0.31   3.9     1.03   5.0     0.82   6.5
      100    1.9            2.5     0.13   1.7     1.11   8.3     0.59   3.8
4     60     0.6            14      0.46   5.4     2.44   9.5     0.93   6.3
      100    1.4            2.7     0.08   2.6     1.12   9.8     0.63   5.0
1-4   400    1.7            2.2     0.12   3.0     1.28   12.6    0.52   5.1

C. Challenge of finding density

Thus far, we have focused on the performance of the MLA evaluated on exact densities (i.e., the functional-driven errors). However, in order for a functional to be useful, it must also predict the ground-state density. As discussed previously, an accurate functional derivative is necessary in order to solve Eq. (9) and yield an accurate density. The functional derivative of our MLA is given by

    \frac{\delta T^{\rm ML}[n]}{\delta n(x)} = \sum_{j=1}^{N_T} \alpha_j\, \frac{\delta k[n, n_j]}{\delta n(x)},    (22)

where, for the Gaussian kernel,

    \frac{\delta k[n, n_j]}{\delta n(x)} = \left( n_j(x) - n(x) \right) k[n, n_j] / \sigma^2.    (23)
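A sketch of Eqs. (22) and (23) on the grid, assuming `densities`, `alpha`, `sigma` and `dx` from the KRR sketches above (the function name is ours):

```python
import numpy as np

def mla_functional_derivative(n, densities, alpha, sigma, dx):
    """delta T_ML / delta n(x) = sum_j alpha_j (n_j(x) - n(x)) k[n, n_j] / sigma^2."""
    diff = densities - n[None, :]
    k = np.exp(-dx * (diff ** 2).sum(axis=1) / (2.0 * sigma ** 2))   # k[n, n_j], Eq. (19)
    return (diff * (alpha * k)[:, None]).sum(axis=0) / sigma ** 2
```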
In Fig. 10, we plot the functional derivative of our model compared with the exact derivative. The model displays a highly inaccurate functional derivative, with a huge amount of apparent "noise", as was found in Ref. [29]. What is the source of this noise? In general, if the underlying dimensionality of the data is much less than the dimensionality of J_N (which in this case is essentially infinite), ML will be unable to capture the functional derivative. The functional derivative contains information on how the KE changes along any direction, but ML cannot learn this, because it only has information in the directions in which it has data (i.e., along M_N). Fig. 11 illustrates the problem: standard minimization techniques will rapidly exit the "interpolation" region in which the MLA is expected to be accurate. The MLA is only given information about how the KE changes along the density manifold M_N. In the many dimensions orthogonal to M_N, the MLA produces an inaccurate derivative (each of these dimensions produces a large relative error, since no data exists in these directions; the sum over many dimensions creates a large total error in the functional derivative). A standard gradient descent will quickly venture off of M_N into regions of J_N where the model is guaranteed to fail. Fig. 9 shows the deviation of the self-consistent density if the search is not constrained to M_N.

FIG. 9. The first few steps in a standard gradient descent solving the Euler equation, Eq. (9), using our MLA for the KE with N_T = 100, starting from a sample training density. The dashed line shows the exact self-consistent solution. The noise in the bare functional derivative quickly causes large corresponding errors in the density.

FIG. 10. The functional derivative of our MLA (green) cannot reproduce the exact derivative (blue dot-dashed) evaluated at the ground-state density, because this information is not contained in the data. However, both agree when projected onto the tangent of the data manifold M_N at n (black and red dashed). Shown for N = 1, for (a) N_T = 40 and (b) N_T = 100, for a typical test sample.

FIG. 11. Cartoon illustrating the difficulty in solving for the self-consistent density with our MLA. Pictured are the density manifold M_N (curved solid line), the training densities n_j in M_N (black circles), and the exact self-consistent density n~ (red square). Here g is a functional that is identically zero on M_N and positive elsewhere; thus M_N is defined implicitly by g[n] = 0. The shaded area, called the interpolation region, shows where the MLA is accurate. The solution of Eq. (9) via exact gradient descent is given by the red dashed line, which becomes unstable and soon leaves the shaded area.

To fix this, we further constrain the minimization in Eq. (9) to stay on M_N. The Euler-Lagrange minimization for the ground-state density can be expressed as

    \delta \left\{ E[n] - \zeta\, g[n] \right\} = 0,    (24)

where g is any functional that is zero on M_N and positive elsewhere; thus g[n] = 0 implicitly defines the density manifold M_N. Since any n in M_N satisfies the normalization condition, the previous constraint is no longer necessary. Because the minimizing density (i.e., the ground-state density) is in M_N and thus satisfies the constraint g[n] = 0, Eq. (24) gives the same solution as Eq. (9). Essentially, we have vastly reduced the domain of the search from J_N to M_N. To avoid confusion, we call the minimizing density of this equation the constrained optimal density.
It may be solved self-consistently in the same sense as solving the standard Euler equation. However, the g[n] which exactly gives the density manifold is unknown. In the next section, we develop an approximation which attempts to reconstruct the density manifold from the training densities.

D. Manifold reconstruction using principal component analysis

Our aim is to reconstruct M_N locally around a given density n(x), which is assumed to be on the density manifold. A simple approach is to approximate M_N as locally linear, using principal component analysis (PCA) to determine the tangent space empirically from the training densities. This will work as long as there are enough training densities covering the density manifold.

First, we define a weighted average density around the density n:

    \bar{n}(x) = \frac{1}{\Omega} \sum_{j=1}^{N_T} \omega_j\, n_j(x).    (25)

This generalized average is weighted by a function omega(||n - n'||) that depends only on the distance from n(x) to n'(x), with omega_j = omega(||n - n_j||) and Omega = sum_{j=1}^{N_T} omega_j. (Note that n'(x) here denotes another density n' evaluated at x, not the derivative of n with respect to x.) The locality of the method comes from the choice of omega. For standard PCA, the choice is omega(r) = theta(R - r), where theta is the Heaviside function and R is the distance from n to the m-th nearest training density. This equally weights the nearest m training densities, and ignores all other training densities. This choice was used in Ref. [29]. Here, we choose a slightly smoother weighting function:

    \omega(r) = (1 - r/R)\, \theta(R - r).    (26)

Next, PCA is performed by spectral analysis of the empirical covariance operator [58], based on the weighted average around n(x). We define the centered neighborhood by n~_j(x) = n_j(x) - n-bar(x). In this problem, densities are represented on a grid with N_G = 500 points, so let n = (n(x_1), ..., n(x_{N_G}))^T be the vector representation of n(x). The covariance matrix Gamma (an N_G x N_G matrix) is

    \Gamma = \frac{1}{\Omega} \sum_{j=1}^{N_T} \omega_j\, \tilde{\mathbf{n}}_j \tilde{\mathbf{n}}_j^\top,    (27)

with eigendecomposition

    \Gamma \mathbf{u}_j = \lambda_j \mathbf{u}_j.    (28)

The eigenvalues are ordered such that lambda_j > lambda_{j+1}. The eigenvectors u_j are called principal components (PCs), and give the directions of maximum variance in the data.
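A sketch of this local PCA, following Eqs. (25)-(28) and, for later use, the projection operator of Eq. (29); `densities` and `dx` are assumed from the earlier sketches, and the defaults m = 20 and d = 4 mirror values discussed below.

```python
import numpy as np

def local_pca_projector(n, densities, m=20, d=4, dx=1.0):
    """Weighted local PCA around density n; returns the projector P and weighted mean."""
    dist = np.sqrt(dx * ((densities - n[None, :]) ** 2).sum(axis=1))
    nearest = np.argsort(dist)[:m]                 # m nearest training densities
    R = dist[nearest].max()                        # distance to the m-th nearest density
    w = 1.0 - dist[nearest] / R                    # weighting function, Eq. (26)
    w /= w.sum()                                   # absorbs the normalization Omega
    nbar = w @ densities[nearest]                  # weighted average density, Eq. (25)
    centered = densities[nearest] - nbar           # centered neighborhood
    Gamma = (w[:, None] * centered).T @ centered   # covariance matrix, Eq. (27)
    lam, U = np.linalg.eigh(Gamma)                 # eigendecomposition, Eq. (28)
    U = U[:, ::-1][:, :d]                          # d leading principal components
    return U @ U.T, nbar                           # projection operator, Eq. (29)
```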
We define the variance lost in keeping d PCs as

    \eta = 1 - \sum_{j=1}^{d} \lambda_j \Big/ \sum_{j=1}^{N_G} \lambda_j.

In this case, there is little to no variance in directions orthogonal to the tangent space of M_N, and maximum variance in directions aligned with the tangent space. Thus, the first d PCs form a basis for the tangent space, where d is the dimensionality of the density manifold (and tangent space). The projection operator onto this basis is

    P[n] = \sum_{j=1}^{d} \mathbf{u}_j \mathbf{u}_j^\top.    (29)

The tangent space found by PCA is given by

    T^{\rm PCA}[n] = \left\{ n' \;\middle|\; (1 - P[n])(n' - \bar{n}) = 0 \right\}.    (30)

Finally, we choose the PCA approximation to the constraint g[n] in Eq. (24) as the squared distance from n to the tangent plane T^PCA[n]:

    g^{\rm PCA}[n] = \| (1 - P[n])\, \tilde{n} \|^2,    (31)

where n~ = n - n-bar. The PCA-approximate density manifold M^PCA is then defined implicitly by g^PCA[n] = 0. The process is illustrated in Fig. 12. In the next section we develop a projected gradient descent method to solve Eq. (24).

FIG. 12. Cartoon showing the density manifold M_N (curved line) contained in J_N, and the training densities n_j, j = 1, ..., N_T (black circles). Also shown are the density n in M_N (blue square) and the PCA approximation to the tangent space of M_N at n, T^PCA(n) (dashed line), here built from m = 3 neighbors. This tangent plane is a local approximation to M_N.

E. Projected gradient descent algorithm

For a given ML approximation to the KE functional,

    E^{\rm ML}[n] = T^{\rm ML}[n] + V[n],    (32)

the algorithm to minimize the functional in Eq. (24) and find a constrained optimal density is as follows (see Fig. 13). Choose an initial guess for the density, n_0 in M_N (e.g., a training density), and iterate:

1. Evaluate the functional derivative

    \frac{\delta E^{\rm ML}[n]}{\delta n(x)} = \frac{\delta T^{\rm ML}[n]}{\delta n(x)} + v(x)    (33)

at n = n_t.

2. Compute the local PCA projection operator P[n_t] from Eq. (29).

3. Project the functional derivative onto the tangent space (see Fig. 13), and take a step:

    n'_t(x) = n_t(x) - \epsilon\, \hat{P}[n_t] \left. \frac{\delta E^{\rm ML}[n]}{\delta n(x)} \right|_{n = n_t},    (34)

where epsilon is a constant such that 0 < epsilon <= 1. If convergence is unstable, reduce epsilon, trading speed of convergence for stability.

4. To ensure that the constraint remains satisfied, subtract the component orthogonal to the local tangent plane, built from the (weighted) mean of the training densities in the local neighborhood:

    n_{t+1}(x) = n'_t(x) - (1 - \hat{P}[n'_t])(n'_t - \bar{n}[n'_t]).    (35)

We iterate these steps until convergence is achieved (a schematic implementation is sketched after the figure caption below). We measure convergence by setting a maximum number of iterations and a tolerance threshold: if the total energy difference between iterations falls below the tolerance within the maximum number of iterations, the density is considered converged. If no solution is found, epsilon is reduced.

FIG. 13. Schematic of the projected gradient descent. The functional derivative is projected onto the tangent space of the data manifold M_N at n_t (dashed line). Next, a step is taken along the projected functional derivative to n'_t in the direction of lower energy. Finally, g[n] is minimized orthogonal to the tangent space to ensure that the minimization stays on M_N.
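A hedged sketch of this loop, reusing `mla_functional_derivative`, `local_pca_projector` and `predict_krr` from the earlier sketches; `v` is the external potential on the grid, and the step size, iteration limit and convergence test are illustrative choices, not the authors' settings.

```python
import numpy as np

def constrained_minimize(n0, v, densities, alpha, sigma, dx,
                         eps=1e-3, m=20, d=4, max_iter=500, tol=1e-10):
    n, E_old = n0.copy(), np.inf
    for _ in range(max_iter):
        # step 1: functional derivative of the total energy, Eq. (33)
        grad = mla_functional_derivative(n, densities, alpha, sigma, dx) + v
        # step 2: local PCA projection operator at n_t, Eq. (29)
        P, _ = local_pca_projector(n, densities, m=m, d=d, dx=dx)
        # step 3: projected gradient step, Eq. (34)
        n_prime = n - eps * (P @ grad)
        # step 4: correction back toward the local tangent plane, Eq. (35)
        P2, nbar2 = local_pca_projector(n_prime, densities, m=m, d=d, dx=dx)
        n = n_prime - (n_prime - nbar2) + P2 @ (n_prime - nbar2)
        # convergence test on the total energy
        E = predict_krr(n, densities, alpha, sigma, dx) + dx * np.sum(v * n)
        if abs(E - E_old) < tol:
            break
        E_old = E
    return n, E
```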
F. Errors on constrained optimal densities

With this new constrained minimization procedure via a projected gradient descent, we solve for the constrained optimal density for each test sample. We report the errors in the total energy and KE relative to the exact density in Table III. In general, we expect these errors to be worse than those of the MLA evaluated on exact densities, by roughly a factor of 10. However, errors on constrained optimal densities decrease at the same rate with more training data, so an accuracy of 1 kcal/mol in the KE is achieved with 150 training samples for N = 1, now on constrained optimal densities. Additionally, errors are of similar magnitude for multiple particles. In the last row of Table III, we combine the training data from each N (100 training densities per N value) into one model. This combined MLA gives roughly the same error as each individual model. This is because, due to the locality of the Gaussian kernel, the training densities from each N are well separated (orthogonal in feature space) and the individual models are unaffected.

In the projected gradient descent, there are two PCA parameters that must be chosen: m, the number of nearest neighbors, and d, the number of PCs used to form the projection operator. Fig. 14 shows the MAE evaluated on the constrained optimal densities and the variance lost eta as a function of the number of PCs d, with m = 20. The MAE decreases initially as d increases, as more PCs capture the local structure of the density manifold. As can be seen, d = 4 or 5 gives an optimal reconstruction of the tangent space of the manifold. As d increases further, the noise that was removed is re-introduced into the projection, causing the gradient descent algorithm to fail. For d = 7, many of the constrained searches do not converge. Table IV reports the errors of the model evaluated on constrained optimal densities for N_T = 40 and N_T = 100, giving a rough optimization of the PCA parameters. Although the potential which generates M_N has 9 parameters in this case, we observe that the optimal choice of d is only 4. This is because the data used to build the model covers only a small fraction of M_N. If we do not sample all relevant directions on M_N, then the model cannot learn the functional derivative in those directions. The PCA projection will compensate by removing those directions.

FIG. 14. The MAE, |Delta T| = |T^ML[n~] - T[n]|, evaluated on 100 constrained optimal densities (in kcal/mol), compared with the variance lost eta, as a function of the number of PCs d in the PCA projection, with m = 20 nearest neighbors.

TABLE IV. The error in the KE (in kcal/mol) evaluated on constrained optimal densities using 100 densities for testing, with N = 1, for (a) N_T = 40 and (b) N_T = 100. The percentage of converged optimal densities is given in parentheses. Here m is the number of nearest-neighbor densities used in PCA and d is the number of PCs used in the projection.

(a) N_T = 40
d \ m     10           20          30
2         12 (98)      15 (100)    24 (100)
3         12 (100)     16 (100)    22 (100)
4         12 (98)      15 (100)    25 (100)
5         23000 (18)   130 (27)    - (0)

(b) N_T = 100
d \ m     10          20          30          40
3         4.1 (99)    3.2 (100)   2.7 (99)    2.8 (100)
4         1.7 (100)   1.4 (100)   1.4 (100)   1.7 (100)
5         1.6 (100)   1.3 (100)   1.5 (100)   2.0 (100)
6         1.7 (93)    2.1 (100)   1.7 (100)   2.2 (100)
Thus, the effectiveness of our method depends on the sampling of the manifold.

V. CONCLUSION

In this work, we have explored in much greater detail the methods presented in Ref. [29], in which ML methods were used to directly approximate the KE of a quantum system as a functional of the electron density, and this functional was used in a modified orbital-free DFT scheme to obtain highly accurate self-consistent densities and energies.

We used a simple model as a proof of principle, to investigate how standard methods from ML can be applied to DFT. In particular, we tested a variety of standard kernels used in ML, and found that the Gaussian kernel gives the lowest errors (the Cauchy kernel also achieves similar performance). All cross-validation schemes that were tested gave similar predictions for hyperparameters that achieved low generalization error on the test set. Our results highlight the importance of an appropriate choice of kernel, as some of the kernels tested gave strikingly bad performance. With the construction of the L^2 norm that was used in the kernels, the method is basis-set independent (as long as a complete basis is used). However, the ML method is capable of learning accurate KEs using a sparse grid (i.e., an incomplete basis). Using a sparse representation for the density without losing accuracy would speed up calculations further. These results warrant further exploration and will be the subject of future work.

We explained the origin of the noise in the functional derivative and developed a constrained search over the density manifold via a modified Euler equation, effectively projecting out the noise. We also introduced a local approximation to the manifold using PCA, and solved for constrained optimal densities using a projected gradient descent algorithm. This worked well for our prototype system, yielding highly accurate constrained optimal energies and densities.

ACKNOWLEDGMENTS

The authors acknowledge support from NSF Grant No. CHE-1240252 (JS, KB). KRM thanks the BK21 Plus Program by NRF Korea, the DFG and the Einstein Foundation. Correspondence to Li Li and K.-R. Müller.

[1] Paul Adrien Maurice Dirac, "Quantum mechanics of many-electron systems," Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character 123, 714-733 (1929).
[2] Walter Kohn, "Nobel lecture: Electronic structure of matter - wave functions and density functionals," Reviews of Modern Physics 71, 1253-1266 (1999).
[3] P. Hohenberg and W. Kohn, "Inhomogeneous electron gas," Phys. Rev. B 136, 864-871 (1964).
[4] R. M. Dreizler and E. K. U. Gross, Density Functional Theory: An Approach to the Quantum Many-Body Problem (Springer, 1990).
[5] Llewellen H. Thomas, "The calculation of atomic fields," in Mathematical Proceedings of the Cambridge Philosophical Society, Vol. 23 (Cambridge Univ. Press, 1927) pp. 542-548.
[6] Enrico Fermi, "Eine statistische Methode zur Bestimmung einiger Eigenschaften des Atoms und ihre Anwendung auf die Theorie des periodischen Systems der Elemente," Zeitschrift für Physik 48, 73-79 (1928).
[7] Edward Teller, "On the stability of molecules in the Thomas-Fermi theory," Reviews of Modern Physics 34 (1962).
[8] W. Kohn and L. J. Sham, "Self-consistent equations including exchange and correlation effects," Phys. Rev. 140, A1133-A1138 (1965).
[9] A. D. Becke, "Density-functional exchange-energy approximation with correct asymptotic behavior," Phys. Rev. A 38, 3098-3100 (1988).
[10] Chengteh Lee, Weitao Yang, and Robert G. Parr, "Development of the Colle-Salvetti correlation-energy formula into a functional of the electron density," Phys. Rev. B 37, 785-789 (1988).
[11] John P. Perdew, Kieron Burke, and Matthias Ernzerhof, "Generalized gradient approximation made simple," Phys. Rev. Lett. 77, 3865-3868 (1996).
[12] V. V. Karasiev and S. B. Trickey, "Issues and challenges in orbital-free density functional calculations," Computer Physics Communications 183, 2519-2527 (2012).
[13] Valentin V. Karasiev, Randy S. Jones, Samuel B. Trickey, and Frank E. Harris, "Recent advances in developing orbital-free kinetic energy functionals," in New Developments in Quantum Chemistry, edited by José Luis Paz and Antonio J. Hernández (Transworld Research Network, Kerala, India, 2009) pp. 25-54.
[14] Fabien Tran and Tomasz A. Wesolowski, "Link between the kinetic- and exchange-energy functionals in the generalized gradient approximation," International Journal of Quantum Chemistry 89, 441-446 (2002).
[15] Linda Hung and Emily A. Carter, "Accurate simulations of metals at the mesoscale: Explicit treatment of 1 million atoms with quantum mechanics," Chemical Physics Letters 475, 163-170 (2009).
[16] Miroslav Hodak, Wenchang Lu, and J. Bernholc, "Hybrid ab initio Kohn-Sham density functional theory/frozen-density orbital-free density functional theory simulation method suitable for biological systems," The Journal of Chemical Physics 128, 014101 (2008).
[17] C. F. v. Weizsäcker, "Zur Theorie der Kernmassen," Zeitschrift für Physik 96, 431-458 (1935).
[18] Y. A. Wang and E. A. Carter, "Orbital-free kinetic-energy density functional theory," in Theoretical Methods in Condensed Phase Chemistry, edited by S. D. Schwartz (Kluwer, NY, 2000).
[19] V. V. Karasiev, D. Chakraborty, and S. B. Trickey, "Progress on new approaches to old ideas: Orbital-free density functionals," in Many-Electron Approaches in Physics, Chemistry, and Mathematics, edited by L. Delle Site and V. Bach (Springer Verlag, Kluwer, NY, to appear).
[20] V. V. Karasiev, R. S. Jones, S. B. Trickey, and Frank E. Harris, "Properties of constraint-based single-point approximate kinetic energy functionals," Phys. Rev. B 80, 245120 (2009).
[21] V. V. Karasiev, R. S. Jones, S. B. Trickey, and Frank E. Harris, "Erratum: Properties of constraint-based single-point approximate kinetic energy functionals [Phys. Rev. B 80, 245120 (2009)]," Phys. Rev. B 87, 239903 (2013).
[22] Fabien Tran and Tomasz A. Wesolowski, "Link between the kinetic- and exchange-energy functionals in the generalized gradient approximation," International Journal of Quantum Chemistry 89, 441-446 (2002).
[23] E. Chacón, J. E. Alvarellos, and P. Tarazona, "Nonlocal kinetic energy functional for nonhomogeneous electron systems," Phys. Rev. B 32, 7868-7877 (1985).
[24] P. García-González, J. E. Alvarellos, and E. Chacón, "Nonlocal kinetic-energy-density functionals," Phys. Rev. B 53, 9509-9512 (1996).
[25] P. García-González, J. E. Alvarellos, and E. Chacón, "Nonlocal symmetrized kinetic-energy density functional: Application to simple surfaces," Phys. Rev. B 57, 4857-4862 (1998).
[27] Yan Alexander Wang, Niranjan Govind, and Emily A. Carter, "Orbital-free kinetic-energy density functionals with a density-dependent kernel," Phys. Rev. B 60, 16350–16358 (1999).
[28] Junchao Xia, Chen Huang, Ilgyou Shin, and Emily A. Carter, "Can orbital-free density functional theory simulate molecules?" The Journal of Chemical Physics 136, 084102 (2012).
[29] John C. Snyder, Matthias Rupp, Katja Hansen, Klaus-Robert Müller, and Kieron Burke, "Finding density functionals with machine learning," Phys. Rev. Lett. 108, 253002 (2012).
[30] John C. Snyder, Matthias Rupp, Katja Hansen, Leo Blooston, Klaus-Robert Müller, and Kieron Burke, "Orbital-free bond breaking via machine learning," J. Chem. Phys. 139, 224104 (2013).
[31] Klaus-Robert Müller, Sebastian Mika, Gunnar Rätsch, Koji Tsuda, and Bernhard Schölkopf, "An introduction to kernel-based learning algorithms," IEEE Trans. Neural Networks 12, 181–201 (2001).
[32] Igor Kononenko, "Machine learning for medical diagnosis: history, state of the art and perspective," Artificial Intelligence in Medicine 23, 89–109 (2001).
[33] Wei Huang, Yoshiteru Nakamori, and Shou-Yang Wang, "Forecasting stock market movement direction with support vector machine," Computers & Operations Research 32, 2513–2522 (2005).
[34] Fabrizio Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys (CSUR) 34, 1–47 (2002).
[35] Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Müller, and O. Anatole von Lilienfeld, "Fast and accurate modeling of molecular atomization energies with machine learning," Phys. Rev. Lett. 108, 058301 (2012).
[36] Katja Hansen, Grégoire Montavon, Franziska Biegler, Siamac Fazli, Matthias Rupp, Matthias Scheffler, O. Anatole von Lilienfeld, Alexandre Tkatchenko, and Klaus-Robert Müller, "Assessment and validation of machine learning methods for predicting molecular atomization energies," Journal of Chemical Theory and Computation 9, 3404–3419 (2013), http://pubs.acs.org/doi/pdf/10.1021/ct400195d.
[37] Grégoire Montavon, Matthias Rupp, Vivekanand Gobre, Alvaro Vazquez-Mayagoitia, Katja Hansen, Alexandre Tkatchenko, Klaus-Robert Müller, and O. Anatole von Lilienfeld, "Machine learning of molecular electronic properties in chemical compound space," New Journal of Physics 15, 095003 (2013).
[38] Zachary D. Pozun, Katja Hansen, Daniel Sheppard, Matthias Rupp, Klaus-Robert Müller, and Graeme Henkelman, "Optimizing transition states via kernel-based machine learning," The Journal of Chemical Physics 136, 174101 (2012).
[39] Albert P. Bartók, Mike C. Payne, Risi Kondor, and Gábor Csányi, "Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons," Phys. Rev. Lett. 104, 136403 (2010).
[40] E. Hairer, S. P. Nørsett, and G. Wanner, Solving Ordinary Differential Equations I: Nonstiff Problems (Springer, New York, 1993).
[41] Min-Cheol Kim, Eunji Sim, and Kieron Burke, "Understanding and reducing errors in density functional calculations," Phys. Rev. Lett. 111, 073003 (2013).
[42] Elliott H. Lieb, "Density functionals for Coulomb systems," in Inequalities (Springer, 2002) pp. 269–303.
[43] V. N. Vapnik, The Nature of Statistical Learning Theory (Springer Verlag, New York, 1995).
[44] B. Schölkopf, S. Mika, C. J. C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. J. Smola, "Input space versus feature space in kernel-based methods," IEEE Transactions on Neural Networks 10, 1000–1017 (1999).
[45] G. Montavon, M. Braun, T. Krueger, and K.-R. Müller, "Analyzing local structure in kernel-based learning: Explanation, complexity, and reliability assessment," IEEE Signal Processing Magazine 30, 62–74 (2013).
[46] Bernhard Schölkopf and Alexander Smola, Learning with Kernels (MIT Press, Cambridge, 2002).
[47] James Mercer, "Functions of positive and negative type, and their connection with the theory of integral equations," Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 209, 415–446 (1909).
[48] Nachman Aronszajn, "Theory of reproducing kernels," Transactions of the American Mathematical Society 68, 337–404 (1950).
[49] B. Schölkopf, A. Smola, and K.-R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation 10, 1299–1319 (1998).
[50] Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. (Springer, New York, 2009).
[51] A. J. Smola, B. Schölkopf, and K.-R. Müller, "The connection between regularization operators and support vector kernels," Neural Networks 11, 637–649 (1998).
[52] Matthias Rupp, Ewgenij Proschak, and Gisbert Schneider, "Kernel approach to molecular similarity based on iterative graph similarity," J. Chem. Inf. Model. 47, 2280–2286 (2007).
[53] Alexander Zien, Gunnar Rätsch, Sebastian Mika, Bernhard Schölkopf, Thomas Lengauer, and K.-R. Müller, "Engineering support vector machine kernels that recognize translation initiation sites," Bioinformatics 16, 799–807 (2000).
[54] M. L. Braun, J. M. Buhmann, and K.-R. Müller, "On relevant dimensions in kernel feature spaces," Journal of Machine Learning Research 9, 1875–1908 (2008).
[55] Tomaso Poggio, Sayan Mukherjee, Ryan Rifkin, Alexander Rakhlin, and Alessandro Verri, b, Tech. Rep. AI Memo 2001-011, CBCL Memo 198 (Massachusetts Institute of Technology, 2001).
[56] S.-I. Amari, N. Murata, K.-R. Müller, M. Finke, and H. H. Yang, "Asymptotic statistical theory of overtraining and cross-validation," IEEE Transactions on Neural Networks 8, 985–996 (1997).
[57] S. Lemm, B. Blankertz, T. Dickhaus, and K.-R. Müller, "Introduction to machine learning for brain imaging," NeuroImage 56, 387–399 (2011).
[58] Christoph Bregler and Stephen M. Omohundro, Surface Learning with Applications to Lipreading (International Computer Science Institute, 1994).