Physics-Aware Neural Implicit Solvers for multiscale, parametric PDEs with applications in heterogeneous media


Authors: Matthaios Chatzopoulos, Phaedon-Stelios Koutsourelakis

Matthaios Chatzopoulos (a), Phaedon-Stelios Koutsourelakis (a,b,*)

(a) Technical University of Munich, Professorship of Data-driven Materials Modeling, School of Engineering and Design, Boltzmannstr. 15, 85748 Garching, Germany
(b) Munich Data Science Institute (MDSI - Core member), Garching, Germany
(*) Corresponding author. Email addresses: matthaios.chatzopoulos@tum.de (Matthaios Chatzopoulos), p.s.koutsourelakis@tum.de (Phaedon-Stelios Koutsourelakis)

ABSTRACT

We propose Physics-Aware Neural Implicit Solvers (PANIS), a novel, data-driven framework for learning surrogates for parametrized Partial Differential Equations (PDEs). It consists of a probabilistic learning objective in which weighted residuals are used to probe the PDE and provide a source of virtual data, i.e. the actual PDE never needs to be solved. This is combined with a physics-aware implicit solver that consists of a much coarser, discretized version of the original PDE, which provides the requisite information bottleneck for high-dimensional problems and enables generalization in out-of-distribution settings (e.g. different boundary conditions). We demonstrate its capability in the context of random heterogeneous materials where the input parameters represent the material microstructure. We extend the framework to multiscale problems and show that a surrogate can be learned for the effective (homogenized) solution without ever solving the reference problem. We further demonstrate how the proposed framework can accommodate and generalize several existing learning objectives and architectures while yielding probabilistic surrogates that can quantify predictive uncertainty.

Keywords: Random Heterogeneous Materials · Data-driven · Probabilistic surrogate · Deep Learning · Machine Learning · High-Dimensional Surrogates · Virtual Observables

1 Introduction

Parametric PDEs appear in a wide variety of problems of engineering relevance, and their repeated solution under different parametric values in the context of many-query applications represents a major computational roadblock in achieving analysis and design objectives. Perhaps one of the most challenging applications, which lies at the core of this investigation, is encountered in the context of (random) heterogeneous media in which microstructural details determine their macroscopic properties [1]. These are found in a multitude of engineering applications, such as aligned and chopped fiber composites, porous membranes, particulate composites, cellular solids, colloids, microemulsions, and concrete [1]. Their microstructural properties can vary, most often randomly, at multiple length scales [2]. Capturing this variability requires, in general, very high-dimensional representations and very fine discretizations, which in turn imply a significant cost for each solution of the governing PDEs in order to predict their response [3]. Being able to efficiently obtain accurate solutions under varying microstructures represents a core challenge that can enable the solution of various forward analysis problems such as uncertainty quantification [4, 5]. More importantly, however, it is of relevance in the context of inverse design, where one attempts to identify (families of) microstructures that achieve extremal or target properties [6].
While several different tools come into play, data-driven strategies, to which our contribution belongs, have risen to prominence in recent years [7, 8], as in many cases they have produced high-throughput, forward-model surrogates which are essential for inverting the microstructure-to-property link [9].

One way to categorize pertinent methods is based on the learning objectives employed, which range from purely data-based to physics-informed. In the first category belong methods that rely on labeled data, i.e. input-output pairs [10, 11, 12, 13, 14], and cast the problem as a supervised learning task. Associated models and tools have reached a high level of maturity and sophistication in the machine learning community and exhibit comparatively faster and more stable training [15]. If the models employed, which usually take the form of a Deep Neural Network (DNN), are agnostic of the physical problem, their predictive accuracy is generally dependent on the amount of training data. However, unlike typical machine learning applications, e.g. in language or vision, where data is abundant and inexpensive, in parametric PDEs data acquisition is de facto expensive, and the reduction of the requisite number of PDE solves is one of the primary objectives [16].

On the other side of the spectrum lie physics-informed learning protocols [17, 18, 19]. These employ the PDE itself, usually in the form of collocation-type residuals, in the training loss, although energy functionals have also been utilized [20, 21]. As such, they do not require any PDE solves and exhibit better generalization performance, but empirical results have shown that the use of labeled data can improve their performance [20]. In the works of [17], [22], [23], [21], [24], [25], the loss function used for training the model consisted of both labeled data and residuals of the governing PDE. A detailed taxonomy [26] and review regarding the application of deep, physics-informed models for solving pertinent problems can be found in [15]. The relative weights of the residual-based, data-driven and regularization terms in the loss are generally ad hoc and fine-tuned through trial and error. Furthermore, no quantification of the informational content of the residuals employed is directly available, and, in general, a large number of collocation points is needed.

Another aspect of the problem that plays a pivotal role is the dimension of the parametric input. In cases where this is low to medium, generic, traditional architectures, such as Gaussian Processes [27], exhibit good results. Dimensionality reduction offers an avenue for overcoming difficulties in high-dimensional problems, and several such strategies have been developed [28, 29, 30, 31]. Since, in general, dimensionality reduction is performed in a separate step, if the lower-dimensional features identified do not affect the response/properties, their utility is limited [32]. Furthermore, most of the attention has been directed towards reducing the dimensionality of the entities involved [33] rather than reducing, or better yet, coarse-graining, the physical models [34].
Recent advances in deep learning [35] have triggered a big effort towards the utilization of Deep Neural Networks for solving such problems efficiently [36, 37, 38]. Combining DNNs with dimensionality-reduction techniques has also been employed in problems involving heterogeneous media, as in [39, 40, 41]. When the input is infinite-dimensional (i.e. a function), one of the latest and most promising advancements are the so-called Neural Operators [42], which attempt to approximate directly the map to the PDE solution, i.e. a function-to-function mapping between Banach spaces. Typical examples include the Deep Operator Network (DeepONet) [11] (and its physics-informed version in [18]), the Graph Neural Operator (GNO) [12], the Fourier Neural Operator (FNO) [13, 43], the Wavelet Neural Operator [44] and the Convolutional Neural Operator [45]. While such methods promise resolution-independent results, they sample the input function on a given grid and employ a DNN to parametrize the PDE solution. As shown in [46], this can lead to systematic bias and aliasing errors, which could be overcome by using truncated (i.e. finite-dimensional) Chebyshev or Fourier series for both domain and co-domain.

We note finally that most deep-learning-centered strategies have been based on transferring or improving NN architectures that parametrize the input-output map. Fewer efforts have attempted to incorporate physical inductive biases [47] in those architectures by e.g. endowing them with invariance/equivariance properties when those are available [48]. Despite notable progress, very few models are capable of producing probabilistic predictive estimates [49, 21, 50, 51]. This can be achieved by learning/approximating the posterior (given labeled data) of the model parameters, which is a difficult task given the millions or even billions of parameters that DNNs possess. Much fewer efforts attempt to learn directly a predictive posterior of the solution [21]. We believe, nevertheless, that this is very important for any data-driven scheme (even more so for over-parametrized models), given that only a finite number of data points can be employed and, in fact, reducing the requisite information extracted from the PDE, in the form of input-output pairs or otherwise, is one of the main objectives. Information loss that can give rise to predictive uncertainty also arises due to the processes of dimensionality reduction or model compression/reduction, which should be quantified and propagated in the estimates.

In this work, we revisit the way information from the governing PDE is ingested, as well as the architecture employed to capture the sought input-output map [20]. In section 2, we propose Physics-Aware Neural Implicit Solvers (PANIS), a novel, data-driven framework which consists of a probabilistic learning objective in which weighted residuals (section 2.1) are used to probe the PDE and provide a source of virtual data [23], i.e. the actual PDE never needs to be solved (section 2.2). The formulation enables us to recast the problem as one of probabilistic inference (section 2.3). We show that selecting a few of those residuals at random at each iteration is sufficient to learn a probabilistic surrogate.
While, as we discuss, various existing DNN-based architectures can be readily integrated into this framework, we advocate instead a physics-aware implicit solver (section 2.4) that consists of a (much) coarser, discretized version of the original PDE, which provides the requisite information bottleneck for high-dimensional problems and enables generalization in out-of-distribution settings (e.g. different boundary conditions). We extend the framework to multiscale problems by proposing mPANIS (section 3) and show that a surrogate can be learned for the effective (homogenized) solution without ever solving the reference problem [52]. We further demonstrate how the proposed framework can accommodate and generalize several existing learning objectives and architectures while yielding probabilistic surrogates that can quantify predictive uncertainty (sections 2.5, 3.3). We substantiate its capability in the context of random heterogeneous materials where the input parameters represent the material microstructure (section 4). Apart from comparative estimates with Physics-Informed Fourier Neural Operators (PINO) [43] and assessing performance with respect to the dimension of the parametric input vector, we pay special attention to problems that involve predictions under extrapolative settings, i.e. different microstructures or boundary conditions as compared to the ones used in training. In such problems, and despite being extremely lightweight (e.g. it has up to three orders of magnitude fewer training parameters as compared to PINO), we show that (m)PANIS can produce very accurate predictive estimates.

2 Methodology

We are concerned with parametric PDEs which can generally be written as:

$$\mathcal{L}(c(s; x), u(s)) = 0, \quad s \in \Omega \quad \text{((non)linear differential operator)}$$
$$\mathcal{B}(c(s; x), u(s)) = 0, \quad s \in \partial\Omega \quad \text{(boundary-conditions operator)} \tag{1}$$

where $s$ denotes space, $\Omega \subset \mathbb{R}^{d_s}$ the problem domain, $c(s; x)$ denotes an input function parametrized by $x \in \mathcal{X} \subset \mathbb{R}^{d_x}$ (e.g. microstructure or material property field) and $u(s)$ the sought solution to the boundary value problem above. (The solution $u(s)$ implicitly depends on the parametric input $x$, which we omit in order to simplify the notation.) In the following, we adopt a discretized representation of the solution, i.e. $u(s) \to y$, where $y \in \mathcal{Y} \subset \mathbb{R}^{d_y}$. We assume that the adopted discretization is fine enough to provide an acceptable approximation to the actual solution, which, in general, implies that $\dim(y) \gg 1$. The vector $y$ could represent coefficients with respect to a given set of functions $\{\eta_i(s)\}_{i=1}^{d_y}$ (e.g. FE shape functions, Radial Basis Functions, Fourier, Wavelet):

$$u_y(s) = \sum_{i=1}^{d_y} y_i \, \eta_i(s) \tag{2}$$

or the weights/biases of a neural network. While emphasis has recently been directed to neural operators that approximate maps between infinite-dimensional Banach spaces, we note that, as demonstrated in [46], spectral neural operators can be developed by employing truncated (i.e. finite-dimensional) Chebyshev or Fourier series (i.e. when the $\eta_i(s)$ in Equation (2) take the form of Chebyshev polynomials or sines/cosines), which can be superior in overcoming systematic bias caused by aliasing errors.

We propose Physics-Aware Neural Implicit Solvers (PANIS), a framework that casts the problem of developing surrogates for parametrized PDEs as one of probabilistic inference and which relies on the use of:

• weighted residuals as virtual data (i.e.
we never use labeled data in the form of solution pairs $(x, y)$), which are ingested probabilistically with the help of a virtual likelihood, and

• an implicit solver at the core of the probabilistic surrogate constructed, which provides the requisite information bottleneck in order to deal with high-dimensional problems (i.e. when $d_x, d_y \gg 1$) and to generalize in out-of-distribution settings.

The output of PANIS is a probabilistic surrogate, which will be expressed in the form of a density $q(y | x)$ that can predict the (discretized) solution for any input $x$. We note that the predictive uncertainty in this case is not due to aleatoric stochasticity in the governing equations (although such cases can be accommodated as well) but is of an epistemic nature, due to the use of partial, and generally small, data in constructing this surrogate, as explained in the sequel.

2.1 Strong and Weak Form of a Boundary-Value Problem

We view the governing PDE as an infinite source of information, which we probe with the help of weighted residuals. These constitute the (virtual) data with which we attempt to train our surrogate. In order to illustrate this concretely and clearly, we use as an example a linear, elliptic PDE. The methodological elements can then be generalized to other types of PDEs. In particular, we consider the boundary value problem:

$$\nabla \cdot (-c(s; x) \nabla u(s)) = g, \quad s \in \Omega$$
$$u(s) = u_0, \quad s \in \Gamma_u$$
$$-c(s; x) \nabla u(s) \cdot n = q_0, \quad s \in \Gamma_q = \partial\Omega - \Gamma_u \tag{3}$$

where $c(s; x)$ is a (random) conductivity/permeability/diffusivity field which depends on the parameters $x$. We denote by $\Gamma_u$ and $\Gamma_q$ the parts of the boundary where Dirichlet and Neumann conditions are respectively defined. The symbol $n$ indicates the outward unit normal vector, while $u_0$ and $q_0$ denote the prescribed values.

We employ candidate solutions $u_y$ which are represented by the finite-dimensional $y \in \mathbb{R}^{d_y}$ as discussed above and which are assumed to satisfy the Dirichlet BCs a priori (this requirement is consistent with traditional derivations but can be readily relaxed or removed in our formulation). These are combined with weighting functions $w(s) \in \mathcal{W}$ which are assumed to be zero on the Dirichlet boundary, i.e. $w(s)|_{s \in \Gamma_u} = 0$. By employing integration by parts [53], for each such $w(s)$ we obtain the weighted residual:

$$r_w(y, x) = 0 = \int_{\Gamma_q} w \, q_0 \, d\Gamma + \int_{\Omega} \nabla w \cdot c(s; x) \nabla u_y \, ds - \int_{\Omega} w \, g \, ds. \tag{4}$$

We note that, depending on the choice of the weight functions $w$, (at least) six methods (i.e. collocation, sub-domain, least-squares, (Petrov-)Galerkin, moments) arise as special cases [53]. As will become apparent in the sequel, the weighted residuals are used as data sources and not in the traditional sense, i.e. to derive a discretized system of equations for the approximate solution of the problem. Hence, we would be at liberty to consider alternate or even non-symmetric versions with respect to $u_y$ and $w$, as for example expressions with lower-order derivatives of the candidate solution $u_y$, which are obtained by applying further integration by parts [54].

2.2 Virtual Observables and Virtual Likelihood

Rather than solving the governing equations multiple times for different values of $x$ in order to obtain a labeled training dataset (i.e. pairs of $(x, y)$), we treat weighted residuals as in Equation (4) as virtual observables [23].
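Since a weighted residual requires only domain integrals of the candidate solution, it can be evaluated with standard quadrature. The following is a minimal, self-contained sketch (ours, not the authors' code) of evaluating Equation (4) on a regular grid with the trapezoidal rule, assuming pure Dirichlet conditions so that the $\Gamma_q$ term vanishes; all field definitions are illustrative toys.

```python
import numpy as np

n = 128                                      # integration grid (the paper uses 128 x 128)
s = np.linspace(0.0, 1.0, n)
S1, S2 = np.meshgrid(s, s, indexing="ij")

c = np.where(np.sin(4 * np.pi * S1) > 0, 1.0, 0.1)   # toy binary permeability field
u = np.sin(np.pi * S1) * np.sin(np.pi * S2)          # toy candidate solution u_y
w = S1 * (1 - S1) * S2 * (1 - S2)                    # weight function, zero on the boundary
g = 100.0 * np.ones((n, n))                          # source term

du1, du2 = np.gradient(u, s, s)              # finite-difference proxy for grad(u_y)
dw1, dw2 = np.gradient(w, s, s)

# Eq. (4) with Gamma_q empty: r_w = int_Omega grad(w) . c grad(u_y) ds - int_Omega w g ds
integrand = dw1 * c * du1 + dw2 * c * du2 - w * g
r_w = np.trapz(np.trapz(integrand, s, axis=1), s, axis=0)
print(f"weighted residual r_w = {r_w:.4e}")
```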
In particular, given $N$ distinct weighting functions $w_j$ and the corresponding residuals $r_{w_j}(y, x)$, we assume that their values $\hat{r}_j$ have been virtually observed and are equal to $0$. The virtual observables $\hat{\mathcal{R}} = \{\hat{r}_j = 0\}_{j=1}^N$ imply a virtual likelihood which is assumed to be of the form:

$$p(\hat{\mathcal{R}} \mid y, x) = \prod_{j=1}^N p(\hat{r}_j = 0 \mid y, x) = \prod_{j=1}^N \frac{\lambda}{2} \, e^{-\lambda |r_{w_j}(y, x)|}. \tag{5}$$

The role of the hyper-parameter $\lambda$ is significant, as it controls how the likelihood of a pair $(x, y)$ decays if the corresponding residuals $r_{w_j}(y, x)$ are different from zero. A reasonable guideline is to set $\lambda^{-1}$ equal to the tolerance value selected in a deterministic iterative solver. We note that alternative forms of the virtual likelihood are also possible (e.g. Gaussian), but the form above was found to yield better results in the numerical experiments performed. The role of the virtual observables/likelihood is to provide information from the governing equations without having to solve them. We discuss in the sequel how this virtual likelihood can be used to learn a surrogate and how this can be carried out by evaluating a few residuals at a time (i.e. without ever solving the weak form).

2.3 Learning Surrogates as Probabilistic Inference

Let $p(x)$ be a density on the parametric input. In the case of a random input, this is directly furnished by the problem definition. For deterministic problems, this could simply be a uniform density over the domain of possible values of $x$. We complement this with a prior density $p(y | x)$. In the ensuing numerical experiments, a vague, uninformative such prior was selected. We note that its role is not to provide a good approximation of the sought output $y$ given an input $x$, as this is to be achieved by the posterior discussed later. The combination of the aforementioned prior densities with the virtual likelihood of Equation (5) in the context of Bayes' rule leads to the joint posterior $p(x, y \mid \hat{\mathcal{R}})$ as follows:

$$p(x, y \mid \hat{\mathcal{R}}) = \frac{p(\hat{\mathcal{R}} \mid x, y) \, p(y \mid x) \, p(x)}{p(\hat{\mathcal{R}})}. \tag{6}$$

We emphasize that the posterior above is defined on the joint space of inputs $x$ and outputs $y$. Apart from the effect of the prior, it assigns the highest probability to pairs of $x$ and $y$ that achieve $0$ residuals, as is the case when $(x, y)$ is a solution pair of the boundary value problem.

Since the posterior is generally intractable, we advocate a Variational Inference scheme [55], i.e. we seek an approximation from a parameterized family of densities $q_\psi(x, y)$ by minimizing the Kullback-Leibler divergence from the exact posterior above. If $\psi$ denotes the tunable parameters, this is equivalent to maximizing the Evidence Lower BOund (ELBO) $\mathcal{F}(\psi)$ to the log-evidence $\log p(\hat{\mathcal{R}})$ [56], i.e.:

$$\log p(\hat{\mathcal{R}}) = \log \int p(\hat{\mathcal{R}} \mid x, y) \, p(y \mid x) \, p(x) \, dy \, dx \;\geq\; \left\langle \log \frac{p(\hat{\mathcal{R}} \mid x, y) \, p(y \mid x) \, p(x)}{q_\psi(x, y)} \right\rangle_{q_\psi(x, y)} = \mathcal{F}(\psi), \tag{7}$$

where the brackets $\langle \cdot \rangle_q$ indicate an expectation with respect to $q$. Substitution of Equation (5) into the expression above leads to:

$$\mathcal{F}(\psi) = N \log \frac{\lambda}{2} - \lambda \sum_{j=1}^N \left\langle |r_{w_j}(y, x)| \right\rangle_{q_\psi(x, y)} + \left\langle \log \frac{p(y \mid x) \, p(x)}{q_\psi(x, y)} \right\rangle_{q_\psi(x, y)}$$
$$= N \log \frac{\lambda}{2} - \lambda \sum_{j=1}^N \left\langle |r_{w_j}(y, x)| \right\rangle_{q_\psi(x, y)} - KL\!\left(q_\psi(x, y) \,\|\, p(y \mid x) \, p(x)\right). \tag{8}$$
Apart from the first term, which is irrelevant when $\lambda$ is fixed, we note that the objective above promotes $q_\psi$ that minimize (on average) the absolute values of the $N$ residuals considered, while simultaneously minimizing the KL divergence from the prior density adopted.

Remarks:

• The KL-divergence term in Equation (8) can be thought of as a regularization or penalty term, which nevertheless arises naturally in the Bayesian framework adopted. In other physics-informed schemes that have been proposed [43, 57, 54], ad hoc parameters are generally selected as relative weights of these two terms [58]. Similarly, alternative forms of virtual likelihoods can give rise to the exact expressions found in competitive schemes; e.g. the use of a Gaussian likelihood would yield the sum of the squares of the $N$ residuals in the ELBO.

• Furthermore, such schemes generally employ collocation-type residuals in the physics-informed term [59, 18, 43], which in our formulation arise as a special case when the weight functions considered take the form of Dirac deltas, i.e. $w_j(s) = \delta(s - s_j)$, where the $s_j$ correspond to the collocation points.

• More importantly, by finding the optimal $q_\psi$ we obtain a probabilistic surrogate that is capable of providing probabilistic predictions of the solution $y$ for any input $x$. Since the form of $q_\psi$ is critical to the accuracy of the proposed scheme, we defer detailed discussions to Section 2.4. As a final note, we mention that if a degenerate $q_\psi$ is selected in the form of a Dirac delta, then one obtains a deterministic surrogate, as are the majority of the ones proposed in the literature [60, 11, 25].

The evaluation and, more importantly, the maximization of the ELBO $\mathcal{F}(\psi)$ requires the computation of the expectations, with respect to $q_\psi$, of the absolute values of the $N$ weighted residuals appearing in the virtual likelihood. We note that the evaluation of a weighted residual involves the computation of an integral over the problem domain (see e.g. Equation (4)) but not the solution of the PDE. These integrations can be carried out using deterministic or (quasi-)Monte-Carlo numerical integration schemes [61, 62]. Furthermore, if the weight functions have support on a subset of the problem domain $\Omega$, additional efficiencies can be achieved in the numerical integration. Independently of the particulars, it is readily understood that as $N$ increases, so does the information we extract from the governing PDE, but so does the computational effort. In order to improve the computational efficiency of these updates, we propose a Monte Carlo approximation of the pertinent term in the ELBO in Equation (8). In particular:

$$\sum_{j=1}^N \left\langle |r_{w_j}(y, x)| \right\rangle_{q_\psi(x, y)} \approx \frac{N}{M} \sum_{m=1}^M \left\langle |r_{w_{j_m}}(y, x)| \right\rangle_{q_\psi(x, y)}, \quad \text{where } j_m \sim \mathcal{C}at\!\left(N, \tfrac{1}{N}\right). \tag{9}$$

In essence, we sample at random and with equal probability $M < N$ weight functions out of the $N$ and use them to approximate the sum. The specifics of the set of $N$ weight functions that is subsampled are described in subsection 4.1. As $M$ increases, the Monte Carlo error decreases, but the computational effort increases. Since the Monte Carlo estimator is unbiased even for $M = 1$, evaluating a single residual is sufficient to guarantee convergence (see Algorithm 1).
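A minimal sketch of the unbiased estimator in Equation (9), assuming the per-residual evaluation is wrapped in a callable; the function names and the toy residual vector below are ours, not the paper's.

```python
import torch

def mc_residual_term(residual_fn, y, x, N: int, M: int) -> torch.Tensor:
    """Unbiased estimate of sum_j |r_{w_j}(y, x)| using M uniformly drawn indices."""
    idx = torch.randint(0, N, (M,))                       # j_m ~ Cat(N, 1/N)
    vals = torch.stack([residual_fn(int(j), y, x).abs() for j in idx])
    return (N / M) * vals.sum()                           # rescale by N/M

# toy usage: pretend residual j is simply the j-th entry of a fixed vector
r = torch.randn(289)
estimate = mc_residual_term(lambda j, y, x: r[j], None, None, N=289, M=100)
print(estimate, r.abs().sum())   # the estimate fluctuates around the exact sum
```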
In combination with Equation (8) (we omit the first term, which is independent of $\psi$), this yields the following Monte Carlo approximation to the ELBO:

$$\mathcal{F}(\psi) \approx -\lambda \frac{N}{M} \sum_{m=1}^M \left\langle |r_{w_{j_m}}(y, x)| \right\rangle_{q_\psi(x, y)} + \left\langle \log \frac{p(y \mid x) \, p(x)}{q_\psi(x, y)} \right\rangle_{q_\psi(x, y)}, \quad \text{where } j_m \sim \mathcal{C}at\!\left(N, \tfrac{1}{N}\right). \tag{10}$$

We discuss in subsection 2.4, and in more detail in Appendix C, how gradients of $\mathcal{F}$ can be computed and used in order to carry out the ELBO maximization. We also note that if a parameterized prior, say $p_\theta(y | x)$, were to be adopted, then the optimal $\theta$ could also be identified by maximizing the corresponding ELBO [21]. While such structured or informative priors could be highly beneficial in accelerating inference and improving predictive accuracy, they are not explored in this paper.

Finally, we note that even though the method described above for subsampling weighting functions and residuals is simple and efficient, other, more intelligent procedures could lead to weighting functions of superior informational content. One such idea would be to start with a null set of residuals (i.e. the prior) and successively add one residual at a time (i.e. selecting a new $w$), based e.g. on maximizing the KL divergence between the current posterior (or its approximation) and the posterior with the additional residual. Another scheme could involve the parameterization of $w$ (e.g. by some parameters $\phi$) and, subsequently, the iterative maximization of the ELBO w.r.t. the parameters $\psi$, followed by a minimization of the ELBO w.r.t. the parameters $\phi$ in every optimization cycle (i.e. a saddle-point search). We defer a more thorough investigation of these possibilities, as well as their effect in reducing the computational effort, to future work.

2.4 Form of the Approximating Density $q_\psi$ and Implicit Solvers

In selecting the form of $q_\psi$, and without loss of generality, we postulate:

$$q_\psi(x, y) = q_\psi(y \mid x) \, q(x) = q_\psi(y \mid x) \, p(x) \tag{11}$$

i.e. we use the given density of the parametric input and attempt to approximate the input-to-output map with $q_\psi(y | x)$, where $\psi$ denotes the tunable parameters. We note that alternate decompositions might be advisable in the context of inverse problems but are not discussed in this work. We note also that a deterministic formulation can be readily adopted by selecting a degenerate density in the form of a Dirac delta, i.e.:

$$q_\psi(y \mid x) = \delta(y - \mu_\psi(x)). \tag{12}$$

The sought function $\mu_\psi(x)$ in this case can take the form of any of the proposed NN-based architectures, such as Fully Connected Neural Networks (FCNNs), Convolutional Neural Networks (CNNs) [20], Deep Operator Networks (DeepONets) [11], Graph Neural Operators (GNOs) [12] or Fourier Neural Operators (FNOs) [42], in which case $\psi$ would correspond to the associated NN parameters. While all these architectures have produced good results in a variety of problems, it is easily understood that as the variability and the dimension of the input $x$ (as well as of the sought output $y$) increase, so must the complexity of the aforementioned NN-based schemes. At a practical level, it is not uncommon that one must train millions or billions of parameters $\psi$ [20, 11, 18]. This situation is much more pronounced in multiscale problems, which are discussed in Section 3, where input and output de facto imply very high-dimensional representations.
More importantly, as such interpolators are agnostic of the physical context, they face significant problems in generalizing [11, 13]. For example, if out-of-distribution predictions are sought for a significantly different input $x$ as compared to the ones used in training, then the predictive accuracy can drop significantly [18]. In addition, and in the context of PDEs in particular, if one were to seek predictions under a different set of boundary conditions than the ones employed during training, these predictions would most likely be off by a large margin (see Section 4).

Figure 1: Schematic illustration of the proposed, physics-aware architecture for parametrizing the mean solution $\mu_\psi(x)$ as described in subsection 2.4.

One could claim that enlarged datasets, bigger models, or progress in pertinent hardware could alleviate or overcome these difficulties. Nevertheless, none of these strategies addresses the fundamental, underlying question, which is how to find the requisite information bottlenecks that lead to drastic dimensionality reductions and lightweight surrogates, and how to achieve this in a problem-aware manner that is capable of producing accurate predictions under extrapolative settings [49].

In view of these objectives, and following previous work [49, 21], we propose incorporating coarse-grained (CG) versions of the governing PDE as an implicit solver [63, 64] in the surrogate constructed. We elaborate on this in the subsequent sections and, more importantly, we extend it to multiscale problems, which pose a unique set of challenges. For the purposes of the Variational Inference proposed, we employ a multivariate Gaussian of the form:

$$q_\psi(y \mid x) = \mathcal{N}(y \mid \mu_\psi(x), \Sigma_\psi) \tag{13}$$

the mean and covariance of which are discussed in the sequel.

In the context of materials problems, and as in the model formulation of Equation (3) where the inputs $x$ parametrize a potentially multiscale material property field, we consider a (much) coarser, discretized version (see e.g. Figure 1) of the governing PDE, where $X, Y$ denote the respective discretized input and output. By construction, their dimension is much lower as compared to that of their reference, fine-grained counterparts $x, y$. As such, the solution of the associated (non)linear algebraic equations to obtain the solution $Y$ for any input $X$ is much faster. As we have shown in [49], the $X$ could be thought of as some effective material properties which serve as the requisite information bottleneck of the problem. Furthermore, it is possible that a different (discretized) PDE could be used as a stencil for the coarse-grained model. We complement the implicit, differentiable, and vectorized input-output map $Y(X)$ of the CG model with two additional components:

• a fine-to-coarse input map, denoted by $X_{\psi_x}(x)$, that we express with a shallow CNN parameterized by $\psi_x$. This consists merely of 2 to 3 convolution and deconvolution layers, as well as a few average-pooling layers. To improve stability we employ batch normalization wherever required, while the only kind of activation function used was Softplus. The number of feature maps of each layer is limited, and the number of total trainable parameters is three orders of magnitude lower compared to other state-of-the-art models used on the same problems (more details in Appendix A).
• a coarse-to-fine output map, denoted by $y(Y)$, that attempts to reconstruct the reference, fine-grained (FG) output $y$ given the CG output $Y$. In our illustrations, this map was fixed (i.e. it had no trainable parameters), which has the desirable effect of forcing $X$ to attain physically meaningful values in the context of materials problems. In particular, a linear map of the form $y = A Y$ was used, where the matrix $A$ was pre-computed so as to minimize the mean-square error in the resulting solution fields (more details in Appendix B). A very important benefit of the use of this implicit solver and of the associated map is the direct imposition of boundary conditions on the CG model, i.e. on $Y$, which are automatically transferred to $y$. In this manner, as we demonstrate in the numerical illustrations, our model can generalize under varying boundary conditions.

The composition of these three functions leads to:

$$\mu_\psi(x) = A \, Y\!\left(X_{\psi_x}(x)\right). \tag{14}$$

In the formulation employed in the ensuing numerical experiments, the sole trainable parameters are associated with the inner layer, i.e. the fine-to-coarse map from $x$ to $X$. More complex formulations involving trainable parameters in the other two layers could be readily accommodated. We note that all three layers are differentiable, and each evaluation requires the solution of the CG model. Differentiation of $\mu_\psi$ with respect to $\psi_x$, as needed for training, requires a backward pass, which can be computed with Automatic Differentiation [65, 66] and/or adjoint formulations of the governing equations of the CG model [67, 68].

With regards to the covariance matrix $\Sigma_\psi$ in Equation (13), and given the high dimension of $y$, we advocate a low-rank approximation of the form:

$$\Sigma_\psi = L_y L_y^T + \sigma_y^2 I_{d_y}, \tag{15}$$

where $L_y$ is a rectangular matrix of dimension $d_y \times d_y'$ (with $d_y' \ll d_y$) and $\sigma_y^2 \in \mathbb{R}^+$ is a scalar. The first term is responsible for introducing correlation between the entries of the solution vector, whereas the second term accounts for the residual, global uncertainty in the solution. The low-rank approximation ensures that the number of unknown parameters in $L_y$ grows linearly with the dimension of $y$. In summary, the vector of trainable parameters $\psi$ consists of:

$$\psi = \left\{\psi_x, L_y, \sigma_y^2\right\}. \tag{16}$$

Maximization of the ELBO $\mathcal{F}$ with respect to $\psi$ is carried out in the context of Stochastic Variational Inference (SVI) [69, 55]. This entails Monte Carlo approximations of the intractable expectations appearing in the gradient $\nabla_\psi \mathcal{F}$ and Stochastic Gradient Ascent (SGA). In our experiments, we used ADAM for the SGA because of its adaptive learning rates, momentum-based updates, and its effectiveness as an optimization algorithm [70, 71].
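The key practical requirement on the implicit solver is that the map $X \to Y$ be differentiable, so that $\nabla_\psi \mathcal{F}$ can be backpropagated through the CG solve. A hedged sketch of this pattern in PyTorch, with a toy diagonal "stiffness" standing in for a real coarse FEM assembly:

```python
import torch

d_X = d_Y = 81   # coarse input/output dimensions (illustrative)

def coarse_solver(X: torch.Tensor) -> torch.Tensor:
    # toy "assembly": an SPD system K(X) Y = F; a real implementation would
    # assemble the coarse FEM stiffness from the effective properties X
    K = torch.diag(torch.nn.functional.softplus(X)) + 0.1 * torch.eye(d_Y)
    F = torch.ones(d_Y)
    return torch.linalg.solve(K, F)          # autograd differentiates through the solve

X = torch.randn(d_X, requires_grad=True)     # would come from the fine-to-coarse CNN
Y = coarse_solver(X)
Y.sum().backward()                           # dY/dX is available for ELBO training
print(X.grad.shape)                          # torch.Size([81])
```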
Critical to achieving Monte Carlo estimators with low error is the use of the reparametrization trick [72, 73], which for $y$ sampled from $q(y | x)$ implies:

$$y = \mu_{\psi_x}(x) + L_y \varepsilon_1 + \sigma_y \varepsilon_2, \quad \varepsilon_1 \sim \mathcal{N}\!\left(0, I_{d_y'}\right), \quad \varepsilon_2 \sim \mathcal{N}\!\left(0, I_{d_y}\right), \tag{17}$$

where $\dim(\varepsilon_1) = d_y'$ and $\dim(\varepsilon_2) = d_y$. In Algorithm 1 we summarize the main algorithmic steps.

Algorithm 1 PANIS Training Algorithm
1: Select $\lambda$, $N$, $M$; initialize $\psi \leftarrow \psi_0$
2: Generate a set of $N$ weighting functions  ▷ see subsection 2.3
3: $\ell \leftarrow 0$  ▷ initialize iteration counter
4: while $\mathcal{F}_\psi$ not converged do
5:   Draw $M$ weight functions at random out of the $N$  ▷ see Equation (9)
6:   Draw $x \sim p(x)$
7:   Generate $y \leftarrow \mu_{\psi_x}(x) + L_y \varepsilon_1 + \sigma_y \varepsilon_2$, with $\varepsilon_1 \sim \mathcal{N}(0, I_{d_y'})$, $\varepsilon_2 \sim \mathcal{N}(0, I_{d_y})$
8:   Approximate $\mathcal{F}_\psi$  ▷ Monte Carlo approximation according to Equation (10)
9:   Estimate $\nabla_\psi \mathcal{F}_\psi$  ▷ backpropagation
10:  $\psi_{\ell+1} \leftarrow \psi_\ell + \rho^{(\ell)} \odot \nabla_\psi \mathcal{F}_\psi$  ▷ SGA using ADAM for determining the learning rate $\rho^{(\ell)}$
11:  $\ell \leftarrow \ell + 1$

2.5 PANIS - Posterior Predictive Estimates

Upon convergence of the aforementioned SVI scheme and the identification of the optimal $\psi$ parameters, the approximate posterior $q_\psi(y | x)$ in Equation (13) can be used for posterior predictive estimates. For each test input $x$, samples from $q_\psi(y | x)$ can readily be drawn as in Equation (17). We note that for each such sample, the solution of the CG model (i.e. the computation of $Y$ for the corresponding $X_{\psi_x}(x)$) would be required. For the representation of $u_y$ in Equation (2) (which is linear with respect to $y$), posterior predictive estimates of the continuous PDE solution $u_y(s)$ can readily be obtained, as detailed in Algorithm 2, by making use of $\mu_\psi(x)$ in (14) and $\Sigma_\psi$ in (15).

Algorithm 2 Posterior predictive estimates of the solution (Equation (2)) for a test input $x$
1: Compute $u_\mu(s) \leftarrow \sum_{i=1}^{N_\eta} \mu_\psi(x)_i \, \eta_i(s)$  ▷ posterior mean of solution field
2: Compute $u_\sigma(s) \leftarrow \sqrt{\sum_{i,j=1}^{N_\eta} \eta_i(s) \, \Sigma_{\psi,ij} \, \eta_j(s)}$  ▷ posterior st. deviation of solution field
3: $u_U(s) \leftarrow u_\mu(s) + 2 u_\sigma(s)$  ▷ upper uncertainty bound field
4: $u_L(s) \leftarrow u_\mu(s) - 2 u_\sigma(s)$  ▷ lower uncertainty bound field

In order to comparatively assess the performance of the proposed framework, for each of the problems in the numerical illustrations we consider a validation dataset $\mathcal{D}_v = \{x^{(j)}, u^{(j)}\}_{j=1}^{N_v}$ consisting of $N_v$ input-solution pairs, which were obtained by using a conventional FEM solver on a fine mesh. Each vector $u^{(j)}$ refers to the discretized version of $u^{(j)}(s)$, the specifics of which are described in Section 4. The following metrics were computed:

Coefficient of Determination $R^2$: The coefficient of determination is a widely used metric [74] to quantify the accuracy of point estimates, and it is defined as follows:

$$R^2 = 1 - \frac{\sum_{j=1}^{N_v} \|u^{(j)} - u_\mu^{(j)}\|_2^2}{\sum_{j=1}^{N_v} \|u^{(j)} - \bar{u}\|_2^2}, \tag{18}$$

where $u_\mu^{(j)}$ is the mean discretized solution field as predicted by the predictive posterior of our trained model for each $x^{(j)}$ in $\mathcal{D}_v$, $\bar{u} = \frac{1}{N_v} \sum_{j=1}^{N_v} u^{(j)}$ is the sample mean over the validation dataset, and $\|\cdot\|_2$ denotes the $L_2$-norm over the problem domain $\Omega$. The maximum score is $R^2 = 1$, which is attained when the predictive mean estimates coincide with the true solutions. It is noted that negative $R^2$ values are also possible, since the second term is merely weighted (not bounded) by the empirical variance of the validation dataset.

Relative $L_2$ Error $\epsilon$: The relative $L_2$ error is commonly used as an evaluation metric for comparisons between state-of-the-art surrogate models for stochastic PDEs [20, 75, 18]. It is defined as:

$$\epsilon = \frac{1}{N_v} \sum_{j=1}^{N_v} \frac{\|u^{(j)} - u_\mu^{(j)}\|_2}{\|u^{(j)}\|_2}. \tag{19}$$

The overwhelming majority of competitive methods produce deterministic predictions, i.e. point estimates, which prevents a direct comparison with the predictive posterior $q_\psi(y | x)$ obtained by our framework. Nevertheless, in the illustrations of Section 4, we report upper/lower posterior uncertainty bounds obtained as detailed in Algorithm 2.
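For completeness, a small, self-contained sketch (ours, not the authors' code) of the two validation metrics of Equations (18)-(19); `u_true` and `u_pred` are assumed to hold the $N_v$ reference and predictive-mean solutions row-wise.

```python
import torch

def r2_score(u_true: torch.Tensor, u_pred: torch.Tensor) -> torch.Tensor:
    """Eq. (18): 1 - sum_j ||u_j - u_mu_j||^2 / sum_j ||u_j - u_bar||^2."""
    u_bar = u_true.mean(dim=0, keepdim=True)     # sample mean over the dataset
    ss_res = ((u_true - u_pred) ** 2).sum()
    ss_tot = ((u_true - u_bar) ** 2).sum()
    return 1.0 - ss_res / ss_tot

def rel_l2_error(u_true: torch.Tensor, u_pred: torch.Tensor) -> torch.Tensor:
    """Eq. (19): mean over the dataset of the per-sample relative L2 error."""
    num = torch.linalg.norm(u_true - u_pred, dim=1)
    den = torch.linalg.norm(u_true, dim=1)
    return (num / den).mean()

u_true, u_pred = torch.randn(100, 4096), torch.randn(100, 4096)
print(r2_score(u_true, u_pred), rel_l2_error(u_true, u_pred))
```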
3 Learning Solutions to Multiscale PDEs

As mentioned in section 1, a challenging case in which data-driven (physics-informed or not) methods struggle is multiscale problems. In the context of heterogeneous media, such problems naturally arise as the scale of variability of the underlying microstructure becomes smaller. Capturing this would, in general, require very high-dimensional input $x$ and output $y$ vectors, and while existing neural architectures could be employed without alterations, they would require bigger and deeper nets (i.e. more trainable parameters) as well as, in general, more training data.

In multiscale PDEs, we are interested in finding an effective or homogenized model that captures the macroscale components of the solution. Classical homogenization techniques rely on very special features, such as periodicity at the fine scale [1]. Several numerical homogenization techniques [76] have been developed which consistently upscale fine-scale information which, although small in magnitude, can have a significant effect on the macroscale solution [77]. In the context of weighted residuals, setting these fine-scale components of the solution to zero would lead to significant errors in the coarse-scale components [78, 79].

While several data-driven strategies have been proposed for learning such effective models [80, 81, 82, 83], to the best of our knowledge all of these rely on labeled data, i.e. pairs of inputs $x$ and outputs/solutions $y$ of the multiscale PDE. In the following, we propose multiscale PANIS (mPANIS), an extension of the previously described framework that can learn such effective models without ever solving the multiscale PDE, but solely by computing weighted residuals (and their derivatives). The enabling component remains the physics-aware, implicit solver at the core of our surrogate, which can be learned from a small number of inputs and corresponding residuals.

3.1 Model Extension for Multiscale PDEs (mPANIS)

We additively decompose the solution-output $y$ as:

$$y = y_c + y_f, \tag{20}$$

where $y_c$ accounts for the coarse-scale features we are interested in and $y_f$ for the fine-scale fluctuations that are not of interest. In the context of the probabilistic models advocated, such a task would entail learning $q_\psi(y_c | x)$ and $q_\psi(y_f | x)$. On the basis of the previous single-scale model, we employ a $q_\psi(y_c | x)$ of the form:

$$q_\psi(y_c \mid x) = \mathcal{N}(y_c \mid \mu_\psi(x), \Sigma_\psi), \tag{21}$$

where, as in Equation (14), we make use of the implicit solver and the associated fine-to-coarse $X_{\psi_x}(x)$ and coarse-to-fine $A Y$ maps (see section 2.4), such that $\mu_\psi(x) = A \, Y(X_{\psi_x}(x))$. Hence, the mean of $y_c$ lives on the subspace spanned by the columns of the $A$ matrix, and its reduced coordinates are determined by the output $Y$ of the implicit solver. We also introduce a learnable correlation on this subspace by adopting a $d_y \times d_y$ covariance matrix $\Sigma_\psi = A S_\psi A^T$, where the inner $d_Y \times d_Y$ covariance matrix $S_\psi$ is:

$$S_\psi = L_Y L_Y^T + \sigma_Y^2 I_{d_Y}. \tag{22}$$

The rectangular matrix $L_Y$ of dimension $d_Y \times d_Y'$ (with $d_Y' < d_Y$) captures the principal directions along which correlation appears, and the scalar $\sigma_Y \in \mathbb{R}^+$ the residual uncertainty. We emphasize that, in this manner, the posterior uncertainty of $y_c$ will be concentrated in the subspace spanned by the columns of the $A$ matrix.
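A minimal sketch of the low-rank coarse-space covariance of Equation (22) and a reparametrized draw from $\mathcal{N}(0, S_\psi)$; pushing the draw through $A$, as in Equation (29) below, yields the $y_c$ perturbation. The dimensions, rank, and initialization below are illustrative assumptions.

```python
import torch

d_Y, d_r = 81, 8                                  # coarse dimension and rank (illustrative)
L_Y = torch.randn(d_Y, d_r, requires_grad=True)
log_sigma_Y = torch.zeros((), requires_grad=True) # optimize log(sigma_Y) for positivity

def sample_coarse_perturbation() -> torch.Tensor:
    """Reparametrized draw from N(0, S_psi) with S_psi = L_Y L_Y^T + sigma_Y^2 I."""
    eps1 = torch.randn(d_r)
    eps2 = torch.randn(d_Y)
    return L_Y @ eps1 + torch.exp(log_sigma_Y) * eps2

# pushing the draw through the coarse-to-fine map A perturbs y_c on span(A):
A = torch.randn(4096, d_Y)                        # stand-in for the pre-computed map
delta_y_c = A @ sample_coarse_perturbation()
```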
Given the aforementioned representation of $y_c$, the fine-scale features $y_f$ in Equation (20) are expressed as:

$$y_f = A_\perp y_f', \tag{23}$$

where the columns of the matrix $A_\perp$ span the orthogonal complement of the subspace spanned by the columns of $A$ above. This matrix can be readily computed, and its use ensures that $y_f$ complements $y_c$ in representing the solution $y$. Given the very high dimension of $y_f'$, which is commensurate with that of the full solution $y$ and which would pose significant, if not insurmountable, challenges in learning $q_\psi(y_f' | x)$, we propose the following form, which is justified by the ability of the $y_c$-part of the model to learn from very few inputs $x$. In particular, consider an empirical approximation of the input density $p(x)$ consisting of $K$ samples, i.e.:

$$p(x) = \frac{1}{K} \sum_{k=1}^K \delta(x - x_k). \tag{24}$$

We propose employing an approximate posterior:

$$q_\psi(y_f' \mid x_k) = \delta(y_f' - y_{f,k}'). \tag{25}$$

This implies that for each atom $x_k$ in Equation (24), we make use of a distinct Dirac delta, centered at an (unknown) $y_{f,k}'$. The corresponding $\{y_{f,k}'\}_{k=1}^K$ become part of the vector of unknown parameters $\psi$, which are determined by maximizing the ELBO. While the dimension of each $y_{f,k}'$ can be high, the number $K$ of $x_k$-samples can be small, due to the capacity of the proposed model architecture to learn from small data. In summary, the vector of parameters $\psi$ that needs to be found by maximizing the corresponding ELBO (see the subsequent section) in the multiscale version of our framework consists of:

$$\psi = \left\{\psi_x, L_Y, \sigma_Y^2, \{y_{f,k}'\}_{k=1}^K\right\}. \tag{26}$$

3.2 Algorithmic Implementation

Given the representation of the solution $y$ as in Equation (20), the latent variables to be inferred in the Variational Inference framework advocated are now $y_c$ and $y_f'$ (Equation (23)). The associated weighted residuals, which are used as virtual observables, can readily be re-expressed with respect to $y_c$ and $y_f'$ as $\{r_{w_j}(y_c, y_f', x)\}_{j=1}^N$. In combination with priors $p(y_c | x)$, $p(y_f' | x)$ (as in the single-scale case, uninformative priors of the form $\mathcal{N}(0, \sigma^2 I)$ with $\sigma^2 = 10^{16}$ were used here), we obtain, analogously to Equation (8) and omitting constant terms, the following ELBO:

$$\mathcal{F}(\psi) = -\lambda \sum_{j=1}^N \left\langle |r_{w_j}(y_c, y_f', x)| \right\rangle_{q_\psi(x, y_c, y_f')} + \left\langle \log \frac{p(y_c \mid x) \, p(y_f' \mid x) \, p(x)}{q_\psi(x, y_c, y_f')} \right\rangle_{q_\psi(x, y_c, y_f')}. \tag{27}$$

The particular form of $q_\psi(y_f' | x)$ in Equation (25), in combination with the empirical approximation of $p(x)$ in Equation (24), yields the following result for the first term in the ELBO:

$$\sum_{j=1}^N \left\langle |r_{w_j}(y_c, y_f', x)| \right\rangle_{q_\psi(x, y_c, y_f')} = \frac{1}{K} \sum_{k=1}^K \sum_{j=1}^N \left\langle |r_{w_j}(y_c, y_{f,k}', x_k)| \right\rangle_{q_\psi(y_c \mid x_k)}. \tag{28}$$

This lends itself naturally to Monte Carlo approximations, both in terms of the index $k$ (i.e. by subsampling only a subset of the $K$ atoms $x_k$ in Equation (24)) and in terms of the index $j$, i.e. over the considered residuals, as explained in Equation (9). Expectations with respect to $q(y_c | x)$ are also approximated by Monte Carlo in combination with the reparametrization trick [72, 73], which, according to Equation (21), can make use of $y_c$ samples generated as follows:

$$y_c = A\left(Y\!\left(X_{\psi_x}(x)\right) + L_Y \varepsilon_1 + \sigma_Y \varepsilon_2\right), \quad \varepsilon_1 \sim \mathcal{N}\!\left(0, I_{d_Y'}\right), \quad \varepsilon_2 \sim \mathcal{N}(0, I_{d_Y}), \tag{29}$$

where $\dim(\varepsilon_1) = d_Y'$ and $\dim(\varepsilon_2) = d_Y$. Details about the ELBO and the computation of its gradient with respect to $\psi$ are contained in Appendix C. In Algorithm 3, we summarize the basic steps involved.
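The matrix $A_\perp$ introduced in Equation (23) can be computed once, e.g. via a full QR decomposition of $A$; a hedged sketch with illustrative dimensions and a random stand-in for $A$:

```python
import torch

d_y, d_Y = 4096, 81
A = torch.randn(d_y, d_Y, dtype=torch.float64)   # stand-in for the coarse-to-fine map

Q, _ = torch.linalg.qr(A, mode="complete")       # Q is a d_y x d_y orthonormal matrix
A_perp = Q[:, d_Y:]                              # remaining columns span the complement

# sanity check: every column of A_perp is orthogonal to every column of A
ortho = A.T @ A_perp
print(torch.allclose(ortho, torch.zeros_like(ortho), atol=1e-8))
```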
Remarks:

• One of the common issues in pertinent formulations is the enforcement of Dirichlet boundary conditions [20]. In our framework, these could be enforced a priori in $y$, although, depending on the representation of the solution, this is not always straightforward. A more elegant strategy would be to introduce the BCs at certain boundary points as actual observables through an additional likelihood (e.g. a Gaussian), which would ensure that the posterior on $y$ satisfies them (up to some prescribed precision). The disadvantage of both of these procedures is that the approximate posterior, either $q_\psi(y | x)$ or $q_\psi(y_c | x)$, will also (at least approximately) enforce these Dirichlet BCs. Hence, it would be useless in predictive settings where the PDE is the same but the Dirichlet BCs are different. In our formulation, Dirichlet BCs are enforced at the level of the implicit solver (i.e. on $Y$) and are transferred to $y$ through the appropriately fine-tuned linear map $y = A Y$ (see Appendix B). In the multiscale setting, where $y_c$ inherits these Dirichlet BCs as described above, we set the corresponding $y_f$ (or $y_f'$ in Equation (23)) equal to $0$ a priori. As we demonstrate in the numerical illustrations, this enables us to produce extremely accurate predictions under BCs not encountered during training.

• The hyperparameter $\lambda$ appearing in the ELBO, which controls the tightness of the tube that is implicitly defined in the joint $(x, y)$ space around the zero-residual manifold, offers a natural way to temper computations. One can envision starting with a small $\lambda$, which would correspond to an easier posterior to approximate, as the residuals could be far from $0$ (in the limit $\lambda = 0$, the posterior coincides with the prior). Subsequently, this could be increased, potentially adaptively, until the target value is reached, in order to attain posteriors (and approximations thereof) that are more tightly concentrated around the solution manifold. We defer investigations in this direction to future work.

• Adaptive learning strategies could also be employed in terms of the samples $\{x_k\}_{k=1}^K$ employed during training. This is especially important in the multiscale setting, as the dimension of the parameters $\{y_{f,k}'\}_{k=1}^K$ that need to be fine-tuned is proportional to $K$. We defer investigations along this direction to future work.

3.3 mPANIS - Posterior Predictive Estimates

Upon convergence of the aforementioned SVI scheme and the identification of the optimal $\psi$ parameters, the approximate posterior $q_\psi(y_c | x)$ in Equation (21), i.e. the one accounting for the coarse-scale features of the full solution, can be used for posterior predictive estimates. For each test $x$, samples from $q_\psi(y_c | x)$ can readily be drawn as in Equation (29). We note that for each such sample, the solution of the CG model (i.e. the determination of $Y$ for the corresponding $X_{\psi_x}(x)$) would be required.

Algorithm 3 mPANIS Training Algorithm
Given a fixed set of $K$ random inputs $\{x_1, x_2, \ldots, x_K\}$:
1: Select $\lambda$, $N$, $M$; initialize $\psi \leftarrow \psi_0$
2: Construct a set of $N$ weighting functions  ▷ see subsection 2.3
3: $\ell \leftarrow 0$  ▷ initialize iteration counter
4: while $\mathcal{F}_\psi$ not converged do
5:   Draw $M$ weight functions at random  ▷ see Equation (9)
6:   Draw a subset of the $K$ atoms $x_k$: $x_{k_s} \sim p(x)$, $k_s < K$  ▷ see Equation (24)
7:   Generate $y_{c,k_s} \leftarrow A\left(Y(X_{\psi_x}(x_{k_s})) + L_Y \varepsilon_1 + \sigma_Y \varepsilon_2\right)$, with $\varepsilon_1 \sim \mathcal{N}(0, I_{d_Y'})$, $\varepsilon_2 \sim \mathcal{N}(0, I_{d_Y})$
8:   $y \leftarrow y_{c,k_s} + A_\perp y_{f,k_s}'$
9:   Approximate $\mathcal{F}_\psi$  ▷ Monte Carlo approximation according to Equation (27)
10:  Estimate $\nabla_\psi \mathcal{F}_\psi$  ▷ backpropagation
11:  $\psi_{\ell+1} \leftarrow \psi_\ell + \rho^{(\ell)} \odot \nabla_\psi \mathcal{F}_\psi$  ▷ SGA using ADAM for determining the learning rate $\rho^{(\ell)}$
12:  $\ell \leftarrow \ell + 1$

Algorithm 4 mPANIS Posterior Predictive Estimates of the coarse-scale solution (Equation (2)) for a test input $x$
1: Compute $u_{c,\mu}(s) \leftarrow \sum_{i=1}^{N_\eta} \left(A \, Y(X_{\psi_x}(x))\right)_i \eta_i(s)$  ▷ posterior mean of solution field
2: Compute $u_{c,\sigma}(s) \leftarrow \sqrt{\sum_{i,j=1}^{N_\eta} \eta_i(s) \, \Sigma_{\psi,ij} \, \eta_j(s)}$  ▷ posterior st. deviation of solution field
3: $u_{c,U}(s) \leftarrow u_{c,\mu}(s) + 2 u_{c,\sigma}(s)$  ▷ upper uncertainty bound field
4: $u_{c,L}(s) \leftarrow u_{c,\mu}(s) - 2 u_{c,\sigma}(s)$  ▷ lower uncertainty bound field

In order to comparatively assess the performance of the proposed framework in the context of multiscale problems, we consider a validation dataset $\mathcal{D}_v = \{x^{(j)}, u^{(j)}\}_{j=1}^{N_v}$ consisting of $N_v$ input and discretized-solution pairs, which were obtained by using a conventional FEM solver on a fine mesh. From these we extract the coarse-scale part $u_c^{(j)} = \sum_{i=1}^{d_y} y_{c,i}^{(j)} \eta_i$ of each $u^{(j)}$ by projecting onto the subspace spanned by the columns of the $A$ matrix (i.e. $y_c^{(j)} = (A^T A)^{-1} A^T y^{(j)}$), where the $\eta_i$ are the discretized equivalents of the shape functions described in Equation (2). Although the fine-scale part is thereby neglected, this is sufficient to approximate the solution well enough, since proper uncertainty bounds quantify the error. The detailed process for making predictions in the multiscale setup is shown in Algorithm 4. Other than neglecting the influence of the fine-scale part, the procedure and the validation metrics are identical to the ones described in Section 2.5.

4 Numerical Illustrations

This section contains several numerical illustrations of the performance of the proposed (m)PANIS framework on linear and nonlinear PDEs. It is compared with the state-of-the-art method of Physics-Informed Fourier Neural Operators (PINO) [43]. In parallel, components of the model generically described in the previous sections are specified, and additional implementation details are provided. Our objectives are to illustrate:

• the comparative performance under varying dimensions of the parametric input (predictive accuracy);
• the very low number of training parameters of the proposed architectures in relation to competitors (lightweight);
• the ability of the proposed framework to generalize, i.e. to provide accurate predictions in out-of-distribution settings as well as under different boundary conditions (generalization);
• the ability to capture the low-frequency components of the solution in multiscale PDEs (multiscale).

The general implementation of PANIS and the dedicated coarse-grained solver mentioned in subsection 2.4 is based on PyTorch [84]. Training was conducted on an H100 Nvidia GPU and takes ~10 min for the ELBO to converge (see Figure 2) in the case of PANIS and ~2195 min in the case of PINO.
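A small sketch of the least-squares projection $y_c = (A^T A)^{-1} A^T y$ used above to extract the coarse-scale part of the reference solutions; the stand-in matrices are illustrative.

```python
import torch

d_y, d_Y = 4096, 81
A = torch.randn(d_y, d_Y, dtype=torch.float64)   # stand-in coarse-to-fine map
y = torch.randn(d_y, dtype=torch.float64)        # stand-in reference FEM solution

# reduced coordinates of the projection, solved stably via least squares
y_red = torch.linalg.lstsq(A, y.unsqueeze(-1)).solution.squeeze(-1)
y_c = A @ y_red                                  # coarse-scale part, lives in span(A)
y_f = y - y_c                                    # fine-scale remainder
# the remainder is orthogonal to span(A) by construction
print(torch.allclose(A.T @ y_f, torch.zeros(d_Y, dtype=torch.float64), atol=1e-8))
```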
The validation dataset in all the subsequent illustrations consists of $N_v = 100$ input-solution pairs. Reference solutions were obtained using FEniCS [85] and a uniform finite element mesh with 57600 nodal points.

Figure 2: Convergence of the ELBO $\mathcal{F}$ for PANIS, when trained on microstructures with $VF = 50\%$ and BCs $u_0 = 0$ (left). Illustration of 100 and 500 randomly selected RBF-based weight functions $w_j$ according to Algorithm 1 (middle, right). Each one of these $w_j$'s corresponds to a single weighted residual.

A GitHub repository containing the associated code and illustrative examples will become available upon publication at https://github.com/pkmtum/PANIS.

4.1 High-Dimensional Darcy Problem

We begin with the Darcy flow equation:

$$\nabla \cdot (-c(s; x) \nabla u(s)) = f, \quad s \in \Omega = [0, 1] \times [0, 1]$$
$$u(s) = u_0, \quad s \in \Gamma_D \tag{30}$$

as this PDE has been considered by the majority of state-of-the-art methods [13, 43, 75], and the respective code is generally available for producing the following comparative results. In the following, we assume that $f = 100$, $u_0 = 0$, and the input $x$ parametrizes the permeability field $c(s; x)$ as described in the sequel.

Input permeability $c(s; x)$: The permeability fields are generated as cut-offs of an underlying Gaussian field. In particular, we consider a zero-mean Gaussian field $G(s; x) = \sum_{i=1}^{d_x} \sqrt{\lambda_i} \, x_i \, v_i(s)$, where the $x_i \sim \mathcal{N}(0, 1)$ determine the input vector $x$, and $\lambda_i$, $v_i(s)$ are the eigenvalues and eigenfunctions of the covariance kernel $k(s, s') = \exp\left(-\frac{\|s - s'\|^2}{2 l^2}\right)$, where $l$ determines the length scale [20]. By controlling the number of terms $d_x$, we can control the dimension of the parametric input. We generate $c(s; x)$ corresponding to a binary medium as follows:

$$c(s; x) = \begin{cases} 1, & \text{if } G(s; x) \geq t \quad \text{(phase 1)} \\ \frac{1}{CR}, & \text{if } G(s; x) < t \quad \text{(phase 2)} \end{cases} \tag{31}$$

where the cut-off threshold is set as $t = \Phi^{-1}(VF)$, with $VF$ the prescribed volume fraction of phase 1 and $\Phi$ the standard Gaussian CDF (e.g. for $VF = 0.5$, $t = 0$). The parameter $CR$ defines the contrast ratio between the properties of the phases, which has a significant impact on the medium's response as well as on the construction of pertinent surrogates. In general, the higher $CR$ is, the more higher-order statistics affect the response [1]. We employed $CR = 10$ in the subsequent illustrations, although much smaller values are generally selected [75, 86].

As mentioned in subsection 2.4, the first layer of the PANIS architecture (see Equation (14)) involves the fine-to-coarse map, denoted by $X_{\psi_x}(x)$, that is realized as a shallow CNN (Appendix A). We note at this point that the input of the CNN is the conductivity field $c(s; x)$ (i.e. implicitly $x$) obtained from Equation (31), and the output are the effective properties $X$. Some intuition about the roles of $c(s; x)$ and $X$ can be drawn from Figure 3, which visualizes some indicative pairs of $c(s; x)$ and $X$. For each full conductivity field $c(s; x)$, the model learns a coarse-grained equivalent $X$ which, when combined with the rest of the model components, scores best in terms of the weighted residuals. In general, areas with high/low $c(s; x)$ yield areas with high/low $X$-values respectively, although the ranges can differ.
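A hedged sketch of the binary-medium generator of Equation (31), using a numerical Karhunen-Loeve expansion of the Gaussian field on a coarse grid; the grid size below is an illustrative choice, not the paper's exact setup.

```python
import numpy as np
from scipy.stats import norm

n, l, d_x, VF, CR = 32, 0.05, 64, 0.5, 10.0   # illustrative sizes (paper: 128 x 128)
grid = np.linspace(0.0, 1.0, n)
S1, S2 = np.meshgrid(grid, grid, indexing="ij")
pts = np.stack([S1.ravel(), S2.ravel()], axis=1)

# squared-exponential covariance k(s, s') = exp(-||s - s'||^2 / (2 l^2))
d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2.0 * l**2))

lam, V = np.linalg.eigh(K)                     # numerical KL eigenpairs (ascending)
lam, V = lam[::-1][:d_x], V[:, ::-1][:, :d_x]  # keep the d_x leading terms

x = np.random.randn(d_x)                       # x_i ~ N(0, 1), the parametric input
G = (V * np.sqrt(np.clip(lam, 0.0, None))) @ x # G(s; x) = sum_i sqrt(lam_i) x_i v_i(s)

t = norm.ppf(VF)                               # cut-off threshold t = Phi^{-1}(VF)
c = np.where(G >= t, 1.0, 1.0 / CR).reshape(n, n)   # two-phase field of Eq. (31)
```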
Trial solutions and weighting functions: The solution $u$ is represented as in Equation (2) with:

$$\eta_i(s) = \exp\left(-\frac{\|s - s_{i,0}\|^2}{2 \, \Delta l^2}\right) \tag{32}$$

In total, $d_y = 4096$ (i.e. $\dim(y) = 4096$) functions were used. The center points $s_{i,0}$ were located on a regular $64 \times 64$ grid over the problem domain. As mentioned earlier, BCs are enforced indirectly through the implicit solver (i.e. through $Y$ in Equation (14)).

Figure 3: Indicative pairs of the full permeability field $c(s; x)$ (generated for $l = 0.05$ as described in subsection 4.1) and the learned coarse-grained input $X$.

For the population of $N$ weighting functions $w_j(s)$ that define the corresponding weighted residuals in Equation (5), the same functions $\eta_j(s)$ above were employed, with the following differences between PANIS and mPANIS:

• For PANIS, $N = 289$ and the center points $s_{i,0}$ were located on a regular $17 \times 17$ grid.
• For mPANIS, $N = 4096$ and the center points $s_{i,0}$ were located on a regular $64 \times 64$ grid.

These were multiplied by the function $\tau(s) = s_1 (1 - s_1) s_2 (1 - s_2)$, which is merely used to enforce that they are zero on the Dirichlet boundary (other such functions could be readily employed), i.e.:

$$w_j(s) = \tau(s) \, \eta_j(s). \tag{33}$$

We note that, in the framework proposed, the selection of the weighting-function space is detached from the trial solutions and the discretization of $u$. Naturally, and as each $w_j$ furnishes different amounts of information about the solution, an intelligent selection could significantly impact the efficiency of the method, but this is not investigated in this work. Numerical integrations over the problem domain for the evaluation of the associated weighted residuals (Equation (4)) are performed using the trapezoidal rule on a regular $128 \times 128$ grid.

Coarse-Grained Implicit Solver: The coarse-grained implicit solver, which lies at the center of the proposed architecture, is obtained by employing a regular, linear finite element discretization grid of $16 \times 16$ triangular pairs for PANIS and $8 \times 8$ for mPANIS. Each of the resulting $512$ (or $128$) triangular elements is assumed to have a constant permeability, which is summarily represented by the vector $X$ (Equation (14)). The solution of the corresponding discretized equations gives rise to the nodal values, which are represented by the vector $Y$ (Equation (14)). A summary of the most important dimensions of the models considered is contained in Table 1.

Table 1: Most important dimensions of PANIS, mPANIS and PINO.

Variable                   PANIS     mPANIS     PINO
x                          1024      1024       —
y                          4096      4096       —
X                          289       81         —
Y                          289       81         —
(model parameters) ψ       7956      415112     13140737
ψ_x                        5065      12801      —
Wall Clock Time (min)      ≈ 10      ≈ 807      ≈ 2195
Resolution of c(s; x)      128       128        128
Resolution of u(s)         128       128        128
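Before turning to the comparisons, a minimal sketch of the RBF trial functions of Equation (32) and the boundary-masked weighting functions of Equation (33); setting the length scale $\Delta l$ equal to the center-grid spacing is our assumption, not the paper's stated choice.

```python
import torch

n_c = 17                                   # 17 x 17 grid of centers (PANIS, N = 289)
dl = 1.0 / (n_c - 1)                       # length scale = grid spacing (assumption)
cs = torch.linspace(0.0, 1.0, n_c)
C1, C2 = torch.meshgrid(cs, cs, indexing="ij")
centers = torch.stack([C1.ravel(), C2.ravel()], dim=1)    # (289, 2)

def eta(s: torch.Tensor) -> torch.Tensor:
    # Gaussian RBFs eta_i(s) = exp(-||s - s_{i,0}||^2 / (2 dl^2)), Eq. (32)
    d2 = ((s[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return torch.exp(-d2 / (2.0 * dl**2))

def w(s: torch.Tensor) -> torch.Tensor:
    # w_j(s) = tau(s) eta_j(s), with tau = s1(1-s1)s2(1-s2) vanishing on the boundary
    tau = s[:, 0] * (1 - s[:, 0]) * s[:, 1] * (1 - s[:, 1])
    return tau[:, None] * eta(s)

pts = torch.rand(5, 2)
print(w(pts).shape)                        # torch.Size([5, 289])
```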
We note that PINO gives better predictions when used in a hybrid mode, i.e., when both data-driven and physics-informed terms appear in the loss function [43]. In the remainder of the paper, we focus on purely physics-informed versions for a fair comparison. Nevertheless, we have conducted one comparison between PANIS and the purely data-driven FNO (i.e., the data-based version of PINO) in Figure 4, to demonstrate the competitiveness of PANIS even against well-established data-driven methods such as FNO.

The performance of the two models (PANIS and PINO) is similar for in-distribution predictions, as can be seen in Figure 5. The mean prediction in both cases is close to the ground truth, as reflected in the coefficient of determination R². However, PANIS provides uncertainty bounds for the predicted solution, in contrast to PINO, which yields only point estimates of the PDE solution. When both models are queried in out-of-distribution conditions (i.e., a different volume fraction VF), PANIS fully retains its predictive accuracy, in contrast to PINO, which struggles, as seen in Table 2. The latter reports the R² score for various VFs. Due to the physics-aware, implicit solver that lies at the center of its architecture, PANIS is capable of retaining its predictive accuracy for every VF examined, as opposed to PINO, whose accuracy decreases significantly as the volume fraction VF deviates notably from the one used for training, i.e., VF = 50%.

4.1.2 Comparison for Different Dimensions of the Parametric Input x

In addition, we examined the comparative performance of the aforementioned models for different dimensions d_x of the parametric input x, which affects the permeability field as explained in subsection 4.1. As can be seen from Table 3, although the predictive accuracy of both models decreases as the parametric dimension increases, it remains very high even when the parametric input dimension is 1024. We emphasize, however, that the number of training parameters ψ for PANIS is O(10^4), in contrast to PINO, which employs approximately O(10^7) (Table 1). The use of the coarse-grained implicit solver reduces complexity and leads to comparably good results with approximately three orders of magnitude fewer parameters.

d_x (dimension of x)    R² (PINO)    R² (PANIS)
64                      0.988        0.982
256                     0.987        0.979
1024                    0.985        0.971

Table 3: Coefficient of determination R² for PINO and PANIS for various parametric input dimensions, when trained on VF = 50%, u_0 = 0.

4.1.3 Comparison for Different Boundary Conditions

In the following illustrations, we focus on a challenging test of the generalization capability of any data-driven scheme. In particular, we report the accuracy of the trained models when called upon to make predictions under boundary conditions that differ from the ones used to train the models. For this purpose, we train both models under the same conditions as in Equation (30), i.e., u_0 = 0, and we test them on two very different Dirichlet boundary conditions.

Figure 4: Predictive accuracy of FNO and PANIS when trained and tested on microstructures with VF = 50%. The right column shows a one-dimensional slice of the solution along the vertical line from (0.5, 0) to (0.5, 1.0). The shaded blue area corresponds to ±2 posterior standard deviations, computed as described in Algorithm 2.
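For reference, the two accuracy metrics used throughout Tables 2-6 can be computed as follows. Since Equation (19) is not reproduced in this section, this assumes the standard definitions of R² and of the relative L2 error ϵ:

```python
import numpy as np

def r2_score(u_true, u_pred):
    """Coefficient of determination R^2 (the closer to 1, the better)."""
    ss_res = np.sum((u_true - u_pred) ** 2)
    ss_tot = np.sum((u_true - np.mean(u_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def rel_l2_error(u_true, u_pred):
    """Relative L2 error: eps = ||u_pred - u_true|| / ||u_true||."""
    return np.linalg.norm(u_pred - u_true) / np.linalg.norm(u_true)
```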
The first corresponds merely to an offset of the BCs, i.e., u_0 = 10. The comparison is conducted in terms of the relative L2 error ϵ (Equation (19)), as shown in Table 4. We observe that PINO fails to predict that the solution simply needs to be shifted by 10, which leads to a high ϵ-value. In contrast, PANIS retains its predictive accuracy intact. The second boundary condition considered deviates much more from the u_0 = 0 used for training; in particular, we employed:

$$u_0(s_1, s_2) = \begin{cases} 10 + 5 \sin\!\left( \frac{\pi}{2} s_2 \right), & \text{if } s_1 = 0, \ 0 \leq s_2 \leq 1, \\ 10 + 5 \sin\!\left( \frac{\pi}{2} (s_1 + 1) \right), & \text{if } s_2 = 1, \ 0 \leq s_1 \leq 1, \\ 10 - 5 \sin\!\left( \frac{\pi}{2} (s_2 + 1) \right), & \text{if } s_1 = 1, \ 0 \leq s_2 \leq 1, \\ 10 - 5 \sin\!\left( \frac{\pi}{2} s_1 \right), & \text{if } s_2 = 0, \ 0 \leq s_1 \leq 1. \end{cases} \tag{34}$$

Indicative predictions provided by both models for this non-trivial case are presented in Figure 6. In Table 4, we observe that, as in the previous case and due to the embedded physics, the performance of PANIS is not affected by the change of boundary conditions, whereas the predictions of PINO deteriorate drastically.

Boundary Condition Type      ϵ (PINO)    ϵ (PANIS)
u_0 = 0 (training)           0.0381      0.0589
u_0 = 10                     0.4581      0.0368
u_0 as in Equation (34)      0.4763      0.0347

Table 4: Comparison of the relative L2 error ϵ (the smaller, the better) when tested under different boundary conditions.

4.2 Non-Linear Poisson Equation

In this subsection, we assess the performance of the proposed model on a non-linear PDE. We note that no adaptations or changes are needed, as the sole conduit of information is the weighted residuals. As a result, the training algorithm merely requires the computation of the residuals and their derivatives with respect to the model parameters (obtained with automatic differentiation). We consider the same conservation law, which, in terms of the flux vector q, can be written as:

$$\nabla \cdot q = f, \quad s \in \Omega = [0,1] \times [0,1] \tag{35}$$

complemented by a non-linear constitutive law of the form:

$$q(s) = -c(s;x)\, e^{\alpha (u(s) - \bar{u})}\, \nabla u(s) \tag{36}$$

and the Dirichlet boundary condition u = u_0, s ∈ Γ_D, as before. The field c(s;x) is defined as described in subsection 4.1, and α, ū are two additional scalar parameters (assumed to be independent of s; the values α = 0.05 and ū = 5 were used in the subsequent numerical results). We note that α (and, secondarily, ū) controls the degree of non-linearity. For α = 0, Equation (35) takes the same form as Equation (30), and as α increases, so does the nonlinearity of the governing PDE.

The most significant adjustment pertains to the implicit solver that lies at the center of the proposed architecture. Apart from being lower-dimensional and more efficient to solve, one is at liberty to embed in it as much physical insight and complexity as desired [49]. In this work, we adopt the same discretization explained before and a constitutive law of the same nonlinear form as in Equation (36), where, in each element of the coarse-grained model, instead of c(s;x) we have the learnable parameters denoted by X (i.e., α and ū retain the same values). We note that one could adopt a different form for the constitutive law or make the corresponding α, ū learnable as well.
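The exponential constitutive law of Equation (36) is straightforward to evaluate pointwise; a short PyTorch sketch (the names and array shapes are our conventions, not the authors') is:

```python
import torch

def nonlinear_flux(u, c, grad_u, alpha=0.05, u_bar=5.0):
    """Flux of Equation (36): q = -c(s;x) * exp(alpha * (u - u_bar)) * grad u.

    u      : (...,)   solution values at the evaluation points
    c      : (...,)   permeability values at the same points
    grad_u : (..., 2) spatial gradient of u
    alpha = 0 recovers the linear Darcy law of Equation (30).
    """
    return -(c * torch.exp(alpha * (u - u_bar))).unsqueeze(-1) * grad_u
```

The same form serves both the reference residuals and the coarse-grained solver, with the learnable X taking the place of c in the latter.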
Figure 6: Comparison between PANIS and PINO when trained on VF = 50%, u_0 = 0 and tested on VF = 50%, u_0 as described in Equation (34).

Volume Fraction (VF)    Boundary Condition Type      ϵ (PANIS)
10 %                    u_0 = 0 everywhere           0.0541
20 %                    u_0 = 0 everywhere           0.0714
30 %                    u_0 = 0 everywhere           0.0770
40 %                    u_0 = 0 everywhere           0.0854
50 % (training)         u_0 = 0 everywhere           0.0904
60 %                    u_0 = 0 everywhere           0.0966
70 %                    u_0 = 0 everywhere           0.1025
80 %                    u_0 = 0 everywhere           0.1054
90 %                    u_0 = 0 everywhere           0.1022
50 %                    u_0 = 10 everywhere          0.0449
50 %                    u_0 as in Equation (34)      0.0442

Table 5: Relative L2 error ϵ between PANIS and the ground truth of the non-linear PDE for various out-of-distribution cases.

In order to solve the corresponding non-linear system of equations, i.e., to determine Y(X), we employed Newton's iterative method. This was implemented in a fully vectorized way, such that PyTorch can backpropagate through the computational graph to obtain the gradients needed for estimating the derivatives of the ELBO. In order to further expedite the computations, we initialized the model parameters to the ones found for the linear case, i.e., α = 0 (subsection 4.1), and subsequently increased α gradually per iteration until the final value of α = 0.05 was reached. This provides a natural tempering mechanism that can expedite and stabilize convergence. In addition, and if this is of interest, it enables one to automatically obtain surrogates for intermediate values of the constitutive parameter α, which could be used for predictive purposes or sensitivity analysis (see the sketch after Figure 7).

In Figure 7, we show the predictive performance of PANIS for two indicative samples. The predictive accuracy remains as high as previously, demonstrating that the proposed framework can also be used effectively for non-linear, parametric PDEs. In out-of-distribution settings, such as those obtained by altering the volume fraction or the boundary conditions (see subsections 4.1.1 and 4.1.3), PANIS retains its prediction accuracy intact, as can be seen in Table 5.

Figure 7: Predictions obtained with PANIS for the non-linear PDE of Equation (35) and microstructures generated with l = 0.25, VF = 50% and u_0 = 0.
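A minimal sketch of how such a differentiable Newton solve with α-tempering might be organized is given below. Here `make_residual` is a hypothetical factory returning the coarse-grained residual R(Y; X, α) for a given α; the authors' actual implementation is additionally vectorized over samples, which this sketch omits.

```python
import torch

def newton_solve(residual_fn, Y0, n_iter=20, tol=1e-10):
    """Newton iteration Y <- Y - J^{-1} R(Y). The Jacobian is assembled with
    automatic differentiation (create_graph=True) so that PyTorch can later
    backpropagate through the converged solution Y(X)."""
    Y = Y0.clone()
    for _ in range(n_iter):
        R = residual_fn(Y)
        if R.norm() < tol:
            break
        J = torch.autograd.functional.jacobian(residual_fn, Y, create_graph=True)
        Y = Y - torch.linalg.solve(J, R)
    return Y

def tempered_solve(make_residual, Y0, alphas=(0.0, 0.0125, 0.025, 0.0375, 0.05)):
    """Tempering in alpha: start from the linear problem (alpha = 0) and raise
    alpha gradually, warm-starting each solve at the previous solution."""
    Y = Y0
    for a in alphas:
        residual_a = make_residual(a)   # hypothetical factory: R(Y; X, alpha=a)
        Y = newton_solve(residual_a, Y)
    return Y
```

The intermediate solutions along the tempering path are exactly the surrogates for intermediate α-values mentioned above.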
4.3 Multiscale PDE

In this subsection, we demonstrate the multiscale version of the proposed framework (i.e., mPANIS), as described in Section 3. To this end, we consider the PDE of Equation (30) and material property fields obtained for l = 0.05 (instead of the l = 0.25 used in the previous examples). As a result, c(s;x) exhibits finer-scale fluctuations (see the indicative examples shown in Figure 3). For all the subsequent illustrations presented in subsection 4.3, we set λ = 10^2.2, N = 4096 and M = 1500 (see Algorithm 3), while the number of x-samples used in training was K = 100.

In Figure 8, we show the results obtained by PANIS and mPANIS for indicative microstructures (similar results are observed for any such input microstructure), as well as the R² score over the validation dataset. One readily notes that, despite the fact that the true solution is dominated by low-frequency components, PANIS fails to capture them because it ignores the finer-scale fluctuations. As explained in Section 3, these might be small, but they have a significant effect on the corresponding weighted residuals. In contrast, under mPANIS, where such fluctuations are accounted for, one is able to capture the low-frequency components of the solution, as well as to obtain probabilistic predictive estimates.

Figure 8: Predictions of PANIS and mPANIS for two indicative inputs (l = 0.05, VF = 50% and u_0 = 0).

In Figure 9, we compare the predictive accuracy of mPANIS with PINO. We note that PINO attempts to resolve the full-order solution. Despite the fact that mPANIS provides probabilistic solutions, employs roughly two orders of magnitude fewer parameters, and has a training time that is ≈ 3 times lower (wall-clock time on the same machine; see Table 1), the two models exhibit similar accuracy when compared to the full-order solution. With regard to the number of training parameters, and as noted in Section 3, this depends on K, i.e., the number of x-samples used in training. We had argued therein that the model architecture is such that it saturates with a small K. The validity of this hypothesis is investigated in Figure 10, which depicts the evolution of the two error metrics employed as a function of K. One can see that both stabilize for K < 100, beyond which little improvement can be expected.

Figure 9: Comparison between PANIS and PINO for conductivity fields with l = 0.05, VF = 50% and u_0 = 0 (in-distribution predictions).

Figure 10: Predictive accuracy of mPANIS in terms of R² and the relative L2 error ϵ with respect to the number K of atoms x_k (Equation (24)). We observe that mPANIS saturates with fewer than ∼50 such atoms.

Volume Fraction (VF)    Boundary Condition Type      ϵ (mPANIS)    ϵ (PINO)
10 %                    u_0 = 0 everywhere           0.2240        0.5710
20 %                    u_0 = 0 everywhere           0.1793        0.4593
30 %                    u_0 = 0 everywhere           0.1339        0.3321
40 %                    u_0 = 0 everywhere           0.1131        0.2049
50 % (training)         u_0 = 0 everywhere           0.1134        0.1029
60 %                    u_0 = 0 everywhere           0.1112        0.1078
70 %                    u_0 = 0 everywhere           0.1450        0.1315
80 %                    u_0 = 0 everywhere           0.2156        0.1920
90 %                    u_0 = 0 everywhere           0.3438        0.9850
50 %                    u_0 = 10 everywhere          0.0634        0.5115
50 %                    u_0 as in Equation (34)      0.0671        0.5225

Table 6: Comparison of the relative L2 error ϵ of mPANIS and PINO with respect to the ground truth for various out-of-distribution cases and l = 0.05.

Finally, with regard to out-of-distribution predictions, i.e., for different VFs or different BCs, mPANIS generally outperforms PINO, and in most of the cases reported in Table 6 by a quite significant margin, despite being much more lightweight, as explained above.

5 Conclusions

We have proposed a novel probabilistic, physics-aware framework for constructing surrogates of parametric PDEs. The most notable and novel characteristics of (m)PANIS are:

• it is based on a physics-inspired architecture involving a coarse-grained, implicit solver that introduces a valuable information bottleneck, which significantly limits the number of trainable parameters and leads to robust generalization;
• it efficiently learns the map between the parametric input and the solution of the PDE by using randomized weighted residuals as informational probes;
the trained surrogate provides fully probabilistic predictions that reflect epistemic and model uncertainties;
• it can successfully learn the dominant low-frequency components of the solution in challenging, multiscale PDEs by relying solely on the information from the weighted residuals;
• it yields probabilistic predictions of similar accuracy to PINO and outperforms it in various out-of-distribution cases, while being much more lightweight.

The framework introduced opens up a multitude of possibilities, in terms of both methodological aspects and potential applications, the most important of which we enumerate below. The first pertains to the weighting functions employed. As discussed in section 2.3, weighted residuals are used as probes that extract information from the governing PDE. Naturally, and depending on the problem's particulars, their informational content can vary drastically, but it can nevertheless be quantified, within the probabilistic framework adopted, through the ELBO. One can therefore envision drastically improving the efficiency of the method by adaptively selecting weighting functions/residuals that maximize the information gain. Another opportunity for adaptive learning arises in the context of mPANIS and the empirical approximation of the input density through the K atoms x_k (Section 3). As mentioned therein, the number of training parameters that capture the fine-scale fluctuations of the solution, i.e., $\{y'_{f,k}\}_{k=1}^K$, is proportional to K. One could therefore envision progressively increasing K by adaptively incorporating new x_k's (and the associated y'_{f,k}) on the basis of their ELBO contribution, i.e., the information they furnish. On the methodological front, we finally note the possibility of using a tempering scheme, by progressively increasing λ (section 2.2), or alternative forms of the virtual likelihoods, which could accelerate convergence.

In terms of applications, an important extension involves dynamical problems. While weighted residuals can again be used to incorporate the PDE into the learning objectives, their dependence on time poses challenges, as well as offering several possibilities in terms of enforcement, using, e.g., a (linear) multistep method or space-time finite elements [87]. Finally, the most important application pertains to inverse design, i.e., the identification of microstructures x that achieve target or extremal properties. The inexpensive surrogate developed, particularly in the multiscale setting, can significantly accelerate this search, not only because it can yield accurate predictions faster than the PDE solver [88], but also because, through the use of the latent variables (X, Y in our case), one can obtain derivatives of the response. Due to the discrete nature of the microstructure x (and irrespective of the cost of the PDE solver), these derivatives are not available in the original formulation of the problem.

References

[1] Salvatore Torquato and Henry W. Haslach Jr. Random heterogeneous materials: microstructure and macroscopic properties. Applied Mechanics Reviews, 55(4):B62–B63, 2002.
[2] George Stefanou, Dimitrios Savvas, and Panagiotis Metsis. Random material property fields of 3D concrete microstructures based on CT image reconstruction. Materials, 14(6):1423, 2021.
[3] Yuksel C. Yabansu, Patrick Altschuh, Johannes Hötzer, Michael Selzer, Britta Nestler, and Surya R. Kalidindi. A digital workflow for learning the reduced-order structure-property linkages for permeability of porous membranes. Acta Materialia, 195:668–680, 2020.
[4] Jitesh H. Panchal, Surya R. Kalidindi, and David L. McDowell. Key computational modeling issues in integrated computational materials engineering. Computer-Aided Design, 45(1):4–25, 2013.
[5] Raymundo Arróyave and David L. McDowell. Systems approaches to materials design: past, present, and future. Annual Review of Materials Research, 49(1):103–126, 2019.
[6] Xian Yeow Lee, Joshua R. Waite, Chih-Hsuan Yang, Balaji Sesha Sarath Pokuri, Ameya Joshi, Aditya Balu, Chinmay Hegde, Baskar Ganapathysubramanian, and Soumik Sarkar. Fast inverse design of microstructures via generative invariance networks. Nature Computational Science, 1(3):229–238, 2021.
[7] Ankit Agrawal and Alok Choudhary. Perspective: Materials informatics and big data: realization of the "fourth paradigm" of science in materials science. APL Materials, 4(5):053208, 2016.
[8] Surya R. Kalidindi. Hierarchical materials informatics: novel analytics for materials data. Elsevier, 2015.
[9] Stefano Curtarolo, Gus L. W. Hart, Marco Buongiorno Nardelli, Natalio Mingo, Stefano Sanvito, and Ohad Levy. The high-throughput highway to computational materials design. Nature Materials, 12(3):191–201, 2013.
[10] Zijiang Yang, Yuksel C. Yabansu, Dipendra Jha, Wei-keng Liao, Alok N. Choudhary, Surya R. Kalidindi, and Ankit Agrawal. Establishing structure-property localization linkages for elastic deformation of three-dimensional high contrast composites using deep learning approaches. Acta Materialia, 166:335–345, 2019.
[11] Lu Lu, Pengzhan Jin, and George Em Karniadakis. DeepONet: Learning nonlinear operators for identifying differential equations based on the universal approximation theorem of operators. arXiv preprint arXiv:1910.03193, 2019.
[12] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differential equations. arXiv preprint arXiv:2003.03485, 2020.
[13] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895, 2020.
[14] Huaiqian You, Quinn Zhang, Colton J. Ross, Chung-Hao Lee, and Yue Yu. Learning deep implicit Fourier neural operators (IFNOs) with applications to heterogeneous material modeling. Computer Methods in Applied Mechanics and Engineering, 398:115296, 2022.
[15] George Em Karniadakis, Ioannis G. Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. Physics-informed machine learning. Nature Reviews Physics, 3(6):422–440, 2021.
[16] P. S. Koutsourelakis, N. Zabaras, and M. Girolami. Special issue: Big data and predictive computational modeling. Journal of Computational Physics, 321:1252–1254, 2016.
[17] Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Physics informed deep learning (part I): Data-driven solutions of nonlinear partial differential equations. arXiv preprint, 2017.
[18] Sifan Wang, Hanwen Wang, and Paris Perdikaris. Learning the solution operator of parametric partial differential equations with physics-informed DeepONets. Science Advances, 7(40):eabi8605, 2021.
[19] Arnaud Vadeboncoeur, Ieva Kazlauskaite, Yanni Papandreou, Fehmi Cirak, Mark Girolami, and Omer Deniz Akyildiz. Random grid neural processes for parametric partial differential equations. In International Conference on Machine Learning, pages 34759–34778. PMLR, 2023.
[20] Yinhao Zhu, Nicholas Zabaras, Phaedon-Stelios Koutsourelakis, and Paris Perdikaris. Physics-constrained deep learning for high-dimensional surrogate modeling and uncertainty quantification without labeled data. Journal of Computational Physics, 394:56–81, 2019.
[21] Maximilian Rixner and Phaedon-Stelios Koutsourelakis. A probabilistic generative model for semi-supervised training of coarse-grained surrogates and enforcing physical constraints through virtual observables. Journal of Computational Physics, 434:110218, 2021.
[22] Yibo Yang and Paris Perdikaris. Adversarial uncertainty quantification in physics-informed neural networks. Journal of Computational Physics, 394:136–152, 2019.
[23] Sebastian Kaltenbach and Phaedon-Stelios Koutsourelakis. Incorporating physical constraints in a deep probabilistic machine learning framework for coarse-graining dynamical systems. Journal of Computational Physics, 419:109673, 2020.
[24] Maziar Raissi and George Em Karniadakis. Hidden physics models: Machine learning of nonlinear partial differential equations. Journal of Computational Physics, 357:125–141, 2018.
[25] Bing Yu et al. The deep Ritz method: a deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1):1–12, 2018.
[26] Sung Wook Kim, Iljeok Kim, Jonghwan Lee, and Seungchul Lee. Knowledge integration into deep learning in dynamical systems: an overview and taxonomy. Journal of Mechanical Science and Technology, 35:1331–1342, 2021.
[27] Ilias Bilionis and Nicholas Zabaras. Multi-output local Gaussian process regression: Applications to uncertainty quantification. Journal of Computational Physics, 231(17):5718–5746, 2012.
[28] Alfio Quarteroni, Andrea Manzoni, and Federico Negri. Reduced basis methods for partial differential equations: an introduction, volume 92. Springer, 2015.
[29] Jan S. Hesthaven, Gianluigi Rozza, Benjamin Stamm, et al. Certified reduced basis methods for parametrized partial differential equations, volume 590. Springer, 2016.
[30] Bernard Haasdonk. Reduced basis methods for parametrized PDEs: a tutorial introduction for stationary and instationary problems. Model Reduction and Approximation: Theory and Algorithms, 15:65, 2017.
[31] Adam P. Generale and Surya R. Kalidindi. Reduced-order models for microstructure-sensitive effective thermal conductivity of woven ceramic matrix composites with residual porosity. Composite Structures, 274:114399, 2021.
[32] Sepideh Hashemi, Baskar Ganapathysubramanian, Stephen Casey, Ji Su, and Surya R. Kalidindi. Feature engineering for microstructure-property mapping in organic photovoltaics. arXiv:2111.01897 [cond-mat], 2021.
[33] Surya R. Kalidindi. Feature engineering of material structure for AI-based materials knowledge systems. Journal of Applied Physics, 128(4):041103, 2020.
[34] Bin Wen and Nicholas Zabaras. A multiscale approach for model reduction of random microstructures. Computational Materials Science, 63:269–285, 2012.
[35] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[36] Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
[37] Justin Sirignano and Konstantinos Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339–1364, 2018.
[38] Zhen Li and Zuoqiang Shi. Deep residual learning and PDEs on manifold. arXiv preprint arXiv:1708.05115, 2017.
[39] Yibo Yang and Paris Perdikaris. Conditional deep surrogate models for stochastic, high-dimensional, and multi-fidelity systems. Computational Mechanics, 64:417–434, 2019.
[40] Yinhao Zhu and Nicholas Zabaras. Bayesian deep convolutional encoder-decoder networks for surrogate modeling and uncertainty quantification. Journal of Computational Physics, 366:415–447, 2018.
[41] Shaoxing Mo, Yinhao Zhu, Nicholas Zabaras, Xiaoqing Shi, and Jichun Wu. Deep convolutional encoder-decoder networks for uncertainty quantification of dynamic multiphase flow in heterogeneous media. Water Resources Research, 55(1):703–728, 2019.
[42] Nikola Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Learning maps between function spaces. arXiv preprint arXiv:2108.08481, 2021.
[43] Zongyi Li, Hongkai Zheng, Nikola Kovachki, David Jin, Haoxuan Chen, Burigede Liu, Kamyar Azizzadenesheli, and Anima Anandkumar. Physics-informed neural operator for learning partial differential equations. arXiv preprint arXiv:2111.03794, 2021.
[44] Tapas Tripura and Souvik Chakraborty. Wavelet neural operator: a neural operator for parametric partial differential equations. arXiv preprint arXiv:2205.02191, 2022.
[45] Bogdan Raonic, Roberto Molinaro, Tim De Ryck, Tobias Rohner, Francesca Bartolucci, Rima Alaifari, Siddhartha Mishra, and Emmanuel de Bézenac. Convolutional neural operators for robust and accurate learning of PDEs. Advances in Neural Information Processing Systems, 36, 2024.
[46] V. S. Fanaskov and I. V. Oseledets. Spectral neural operators. Doklady Mathematics, 108(2):S226–S232, 2023.
[47] Miles Cranmer, Alvaro Sanchez-Gonzalez, Peter Battaglia, Rui Xu, Kyle Cranmer, David Spergel, and Shirley Ho. Discovering symbolic models from deep learning with inductive biases. arXiv:2006.11287 [astro-ph, physics:physics, stat], 2020.
[48] Jonas Köhler, Leon Klein, and Frank Noé. Equivariant flows: exact likelihood generative learning for symmetric densities. In International Conference on Machine Learning, pages 5361–5370. PMLR, 2020.
[49] Constantin Grigo and Phaedon-Stelios Koutsourelakis. A physics-aware, probabilistic machine learning framework for coarse-graining high-dimensional systems in the Small Data regime. Journal of Computational Physics, 397:108842, 2019.
[50] Shailesh Garg and Souvik Chakraborty. Variational Bayes deep operator network: A data-driven Bayesian solver for parametric differential equations. arXiv preprint, 2022.
[51] Arnaud Vadeboncoeur, Ömer Deniz Akyildiz, Ieva Kazlauskaite, Mark Girolami, and Fehmi Cirak. Fully probabilistic deep models for forward and inverse problems in parametric PDEs. Journal of Computational Physics, 491:112369, 2023.
[52] B. Van Bavel, Y. Zhao, M. G. R. Faes, D. Vandepitte, and D. Moens. Efficient quantification of composite spatial variability: A multiscale framework that captures intercorrelation. Composite Structures, 323:117462, 2023.
[53] B. A. Finlayson, editor. The Method of Weighted Residuals and Variational Principles, with Application in Fluid Mechanics, Heat and Mass Transfer, Volume 87. Academic Press, New York, 1972.
[54] Ehsan Kharazmi, Zhongqiang Zhang, and George E. M. Karniadakis. hp-VPINNs: Variational physics-informed neural networks with domain decomposition. Computer Methods in Applied Mechanics and Engineering, 374:113547, 2021.
[55] John Paisley, David Blei, and Michael Jordan. Variational Bayesian inference with stochastic search. arXiv preprint arXiv:1206.6430, 2012.
[56] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
[57] Tim De Ryck, Ameya D. Jagtap, and Siddhartha Mishra. Error estimates for physics-informed neural networks approximating the Navier–Stokes equations. IMA Journal of Numerical Analysis, 44(1):83–119, 2024.
[58] Aditi Krishnapriyan, Amir Gholami, Shandian Zhe, Robert Kirby, and Michael W. Mahoney. Characterizing possible failure modes in physics-informed neural networks. Advances in Neural Information Processing Systems, 34:26548–26560, 2021.
[59] Maziar Raissi, Paris Perdikaris, and George E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
[60] Mohammad Amin Nabian and Hadi Meidani. A deep neural network surrogate for high-dimensional random partial differential equations. arXiv preprint, 2018.
[61] William J. Morokoff and Russel E. Caflisch. Quasi-Monte Carlo integration. Journal of Computational Physics, 122(2):218–230, 1995.
[62] Yaohua Zang, Gang Bao, Xiaojing Ye, and Haomin Zhou. Weak adversarial networks for high-dimensional partial differential equations. Journal of Computational Physics, 411:109409, 2020.
[63] Filipe de Avila Belbute-Peres, Thomas Economon, and Zico Kolter. Combining differentiable PDE solvers and graph neural networks for fluid flow prediction. In ICML, pages 2402–2411, 2020.
[64] Kiwon Um, Robert Brand, Yun Raymond Fei, Philipp Holl, and Nils Thuerey. Solver-in-the-loop: Learning from differentiable physics to interact with iterative PDE-solvers. Advances in Neural Information Processing Systems, 33:6111–6122, 2020.
[65] Michael Bartholomew-Biggs, Steven Brown, Bruce Christianson, and Laurence Dixon. Automatic differentiation of algorithms. Journal of Computational and Applied Mathematics, 124(1-2):171–190, 2000.
[66] Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1–43, 2018.
[67] Antony Jameson. Aerodynamic shape optimization using the adjoint method. Lectures at the Von Karman Institute, Brussels, 2003.
[68] Evangelos M. Papoutsis-Kiachagias and Kyriakos C. Giannakoglou. Continuous adjoint methods for turbulent flows, applied to shape and topology optimization: industrial applications. Archives of Computational Methods in Engineering, 23(2):255–299, 2016.
[69] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 2013.
[70] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[71] Yingjie Tian, Yuqi Zhang, and Haibin Zhang. Recent advances in stochastic gradient descent in deep learning. Mathematics, 11(3):682, 2023.
[72] Tommi Vatanen, Tapani Raiko, Harri Valpola, and Yann LeCun. Pushing stochastic gradient towards second-order methods: backpropagation learning with transformations in nonlinearities. In Neural Information Processing: 20th International Conference, ICONIP 2013, Daegu, Korea, November 3-7, 2013, Proceedings, Part I, pages 442–449. Springer, 2013.
[73] Durk P. Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. Advances in Neural Information Processing Systems, 28, 2015.
[74] Dabao Zhang. A coefficient of determination for generalized linear models. The American Statistician, 71(4):310–316, 2017.
[75] Lu Lu, Xuhui Meng, Shengze Cai, Zhiping Mao, Somdatta Goswami, Zhongqiang Zhang, and George Em Karniadakis. A comprehensive and fair comparison of two neural operators (with practical extensions) based on FAIR data. Computer Methods in Applied Mechanics and Engineering, 393:114778, 2022.
[76] Weinan E. Principles of Multiscale Modeling. Cambridge University Press, 2011.
[77] Grigoris Pavliotis and Andrew Stuart. Multiscale Methods: Averaging and Homogenization. Springer Science & Business Media, 2008.
[78] Enrique Sanchez-Palencia and André Zaoui. Homogenization techniques for composite media. Homogenization Techniques for Composite Media, 272, 1987.
[79] Marc G. D. Geers, Varvara G. Kouznetsova, Karel Matouš, and Julien Yvonnet. Homogenization methods and multiscale modeling: nonlinear problems. Encyclopedia of Computational Mechanics, Second Edition, pages 1–34, 2017.
[80] Zhen Zhang, Yeonjong Shin, and George Em Karniadakis. GFINNs: GENERIC formalism informed neural networks for deterministic and stochastic dynamical systems. arXiv:2109.00092 [cs, math], 2021.
[81] Quercus Hernandez, Alberto Badias, Francisco Chinesta, and Elias Cueto. Thermodynamics-informed graph neural networks. arXiv preprint arXiv:2203.01874, 2022.
[82] Filippo Masi and Ioannis Stefanou. Multiscale modeling of inelastic materials with thermodynamics-based artificial neural networks (TANN). Computer Methods in Applied Mechanics and Engineering, 398:115190, 2022.
[83] Elias Cueto and Francisco Chinesta. Thermodynamics of learning physical phenomena. Archives of Computational Methods in Engineering, 30(8):4653–4666, 2023.
[84] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
[85] Martin S. Alnaes, Jan Blechta, Johan Hake, August Johansson, Benjamin Kehlet, Anders Logg, Chris N. Richardson, Johannes Ring, Marie E. Rognes, and Garth N. Wells. The FEniCS project version 1.5. Archive of Numerical Software, 3, 2015.
[86] Kaushik Bhattacharya, Bamdad Hosseini, Nikola B. Kovachki, and Andrew M. Stuart. Model reduction and neural networks for parametric PDEs. The SMAI Journal of Computational Mathematics, 7:121–157, 2021.
[87] Serge Dumont, Franck Jourdan, and Tarik Madani. 4D remeshing using a space-time finite element method for elastodynamics problems. Mathematical and Computational Applications, 23(2):29, 2018.
[88] Maximilian Rixner and Phaedon-Stelios Koutsourelakis. Self-supervised optimization of random material microstructures in the small-data regime. npj Computational Materials, 8(1):1–11, 2022.
[89] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010.

A Shallow CNN Employed for X_{ψ_x}(x)

The details of the CNN used in the case of PANIS are shown in Table 7, and those for mPANIS in Table 8. The only activation function used is the Softplus function. Activations are applied after each convolution/deconvolution layer and before the batch-normalization layers, except for the last deconvolution layer. The weights of the convolution/deconvolution layers are initialized with a Xavier uniform distribution [89].

Layers                          Feature Maps    Height × Width    Number of Parameters
Input                           —               129 × 129         —
Convolution layer (k3s2p1)      8               65 × 65           80
Batch normalization             —               65 × 65           16
Average pooling layer (k2s2)    —               32 × 32           —
Convolution layer (k3s1p1)      24              32 × 32           1752
Batch normalization             —               32 × 32           48
Average pooling layer (k2s2)    —               16 × 16           —
Deconvolution layer (k4s1p1)    8               17 × 17           3080
Batch normalization             —               17 × 17           16
Deconvolution layer (k3s1p1)    —               17 × 17           73
In total                        40              —                 5065

Table 7: Layers of the CNN used in PANIS.

Layers                          Feature Maps    Height × Width    Number of Parameters
Input                           —               129 × 129         —
Convolution layer (k3s1p1)      8               129 × 129         80
Batch normalization             —               129 × 129         16
Average pooling layer (k4s4)    —               32 × 32           —
Convolution layer (k3s1p1)      16              32 × 32           1168
Batch normalization             —               32 × 32           32
Average pooling layer (k2s2)    —               16 × 16           —
Convolution layer (k3s1p1)      32              16 × 16           4640
Batch normalization             —               16 × 16           64
Average pooling layer (k2s2)    —               8 × 8             —
Deconvolution layer (k3s1p1)    16              8 × 8             4624
Batch normalization             —               8 × 8             32
Deconvolution layer (k4s1p1)    8               9 × 9             2056
Batch normalization             —               9 × 9             16
Deconvolution layer (k3s1p1)    —               9 × 9             73
In total                        80              —                 12801

Table 8: Layers of the CNN used in mPANIS.

B Coarse-to-Fine Output Map y(Y)

In the coarse-grained (CG) model, the solution field in space, i.e., u_CG(s), is represented as:

$$u_{CG}(s) = \sum_{J=1}^{N} Y_J H_J(s), \tag{37}$$

where N = dim(Y) and the H_J are the corresponding shape/basis functions. If, e.g., the s_J are the nodal points and the H_J the usual FE shape functions, then u_CG(s_J) = Y_J. In the reference, fine-grained (FG) model, the solution field u(s) is represented as in Equation (2):

$$u(s) = \sum_{j=1}^{n} y_j \eta_j(s), \tag{38}$$

where n = dim(y) and the η_j are the corresponding shape/basis functions. The mapping between Y and y is selected so that:

$$u_{CG}(s) \approx u(s). \tag{39}$$

One way to quantify the difference/error is, e.g.:

$$E = \frac{1}{2} \int_{\Omega} \big( u_{CG}(s) - u(s) \big)^2 \, ds. \tag{40}$$

One readily finds that minimizing E is equivalent to:

$$y = a^{-1} B Y, \tag{41}$$

where:

• a is an n × n matrix with entries $a_{jk} = \int_{\Omega} \eta_j(s)\, \eta_k(s)\, ds$;
• B is an n × N matrix with entries $B_{kJ} = \int_{\Omega} \eta_k(s)\, H_J(s)\, ds$.

The matrix A = a^{-1} B in Equation (14) fully defines the coarse-to-fine map and can be pre-computed. As a result, there are no parameters to fine-tune, which forces X (on which Y depends) to attain the physical meaning discussed in section 2.4.
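Since A = a^{-1} B can be pre-computed, the following NumPy sketch assembles it from tabulated basis values; the array shapes and the choice of quadrature weights are our assumptions, not prescribed by the paper.

```python
import numpy as np

def coarse_to_fine_map(eta, H, w):
    """Pre-compute A = a^{-1} B of Equation (41) by L2 projection.

    eta : (n, P) values of the fine basis functions eta_j at P quadrature points
    H   : (N, P) values of the coarse shape functions H_J at the same points
    w   : (P,)   quadrature weights (e.g. trapezoidal weights on a regular grid)
    """
    a = (eta * w) @ eta.T           # a_jk = int eta_j eta_k ds
    B = (eta * w) @ H.T             # B_kJ = int eta_k H_J ds
    return np.linalg.solve(a, B)    # y = A @ Y reproduces u_CG in the fine basis
```

Because A is fixed once the bases are chosen, no parameters of this map are trained, consistent with the remark above.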
C ELBO Terms

The computation of the gradient of F(ψ) with respect to ψ is based on the automatic-differentiation capabilities of PyTorch [84, 65]. After each term of the ELBO has been computed, PyTorch automatically back-propagates through the respective graph to obtain the gradient with respect to the training parameters. Regarding the ELBO used for training PANIS, by substituting Equation (11) into Equation (10), we obtain the following simplified version of the latter:

$$\mathcal{F}(\psi) \approx -\frac{\lambda N}{M} \sum_{m=1}^{M} \Big\langle \big| r_{w_{j_m}}(y, x) \big| \Big\rangle_{q_\psi(x,y)} + \big\langle \log p(y|x) \big\rangle_{q_\psi(x,y)} - \big\langle \log q_\psi(y|x) \big\rangle_{q_\psi(x,y)}, \quad j_m \sim \mathrm{Cat}\!\left(N, \tfrac{1}{N}\right). \tag{42}$$

Given that we have access to R samples $\{x_r\}_{r=1}^R$ and $\{y_r\}_{r=1}^R$ from the approximate posterior described in Algorithm 1, we approximate the first two terms by Monte Carlo. The second term involves the uninformative multivariate Gaussian prior $\mathcal{N}(y \,|\, 0, \sigma^2 I)$, where σ² = 10^16 (see subsections 2.3 and 3.2). The last term is the entropy of q_ψ(y|x), which can be calculated in closed form, since q_ψ(y|x) is a multivariate Gaussian distribution. Thus, Equation (42) can be written (up to a constant) as:

$$\mathcal{F}(\psi) \approx -\frac{\lambda N}{M R} \sum_{m=1}^{M} \sum_{r=1}^{R} \big| r_{w_{j_m}}(y_r, x_r) \big| - \frac{1}{2 R \sigma^2} \sum_{r=1}^{R} y_r^T y_r + \frac{1}{2} \log \det(\Sigma_\psi), \quad j_m \sim \mathrm{Cat}\!\left(N, \tfrac{1}{N}\right). \tag{43}$$

In the case of mPANIS, described in subsection 3.2, the same procedure was followed, starting from Equations (27) and (28) and taking into account that:

• the priors p(y_c|x) and p(y'_f|x) have the same form as the prior p(y|x) in the case of PANIS mentioned previously;
• the entropy term resulting from the degenerate density $\delta\big(y'_f - y'_{f,k}\big)$ (see Equation (25)) has zero contribution to the ELBO.

By subsampling a subset $\{x_r\}_{r=1}^R \subset \{x_k\}_{k=1}^K$ and the respective subsets $\{y_{c,r}\}_{r=1}^R$, $\{y'_{f,r}\}_{r=1}^R$, the equation with which estimates of the ELBO (up to a constant) are computed in the case of mPANIS is:

$$\mathcal{F}(\psi) \approx -\frac{\lambda N}{M R} \sum_{m=1}^{M} \sum_{r=1}^{R} \big| r_{w_{j_m}}(y_{c,r}, y'_{f,r}, x_r) \big| - \frac{1}{2 R \sigma^2} \sum_{r=1}^{R} \left( y_{c,r}^T y_{c,r} + y'^{\,T}_{f,r} y'_{f,r} \right) + \frac{1}{2} \log \det(\Sigma_\psi), \quad j_m \sim \mathrm{Cat}\!\left(N, \tfrac{1}{N}\right). \tag{44}$$

As one can observe from Equations (43) and (44), when σ is large (as in the cases employed), the second term of the ELBO is approximately equal to zero, and it could thus be ignored entirely.
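For concreteness, a Monte Carlo estimate of Equation (43) could be organized as in the sketch below. This is a sketch under our own naming assumptions; the evaluation of the residuals themselves (and the mPANIS variant of Equation (44)) is omitted.

```python
import torch

def elbo_estimate(abs_residuals, y_samples, Sigma_psi, lam, N, sigma2=1e16):
    """Monte Carlo estimate of Equation (43), up to an additive constant.

    abs_residuals : (M, R) tensor of |r_{w_{j_m}}(y_r, x_r)| for M subsampled
                    weighting functions and R posterior samples
    y_samples     : (R, dim_y) samples y_r from q_psi(y | x)
    Sigma_psi     : (dim_y, dim_y) covariance of the Gaussian q_psi(y | x)
    """
    M, R = abs_residuals.shape
    data_term = -(lam * N / (M * R)) * abs_residuals.sum()
    prior_term = -(0.5 / (R * sigma2)) * (y_samples ** 2).sum()  # ~0 for large sigma2
    entropy_term = 0.5 * torch.logdet(Sigma_psi)                 # Gaussian entropy, up to const.
    return data_term + prior_term + entropy_term
```

As noted above, for σ² = 10^16 the prior term is numerically negligible, and the estimate is dominated by the residual and entropy terms.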
