Fully symmetric kernel quadrature
Authors: Toni Karvonen, Simo Särkkä
Abstract. Kernel quadratures and other kernel-based approximation methods typically suffer from prohibitive cubic time and quadratic space complexity in the number of function evaluations. The problem arises because a system of linear equations needs to be solved. In this article we show that the weights of a kernel quadrature rule can be computed efficiently and exactly for up to tens of millions of nodes if the kernel, integration domain, and measure are fully symmetric and the node set is a union of fully symmetric sets. This is based on the observations that in such a setting there are only as many distinct weights as there are fully symmetric sets and that these weights can be solved from a linear system of equations constructed out of row sums of certain submatrices of the full kernel matrix. We present several numerical examples that show the feasibility, both for a large number of nodes and in high dimensions, of the developed fully symmetric kernel quadrature rules. The most prominent of the fully symmetric kernel quadrature rules we propose are those that use sparse grids.

Key words. Numerical integration, kernel quadrature, Bayesian quadrature, reproducing kernel Hilbert spaces, fully symmetric sets, sparse grids

AMS subject classifications. 46E22, 47B32, 60G15, 65C05, 65C50, 65D30, 65D32

1. Introduction. Let $\Omega$ be a subset of $\mathbb{R}^d$, $\mu$ a measure on $\Omega$, and $f \colon \Omega \to \mathbb{R}$ a function that is integrable with respect to $\mu$. Computation of the integral $\mu(f) := \int_\Omega f \, \mathrm{d}\mu$ is a recurring problem in applied mathematics and statistics. In most cases, this integral has no readily available analytical form and one must resort to a quadrature rule (or, occasionally, a cubature rule if $d > 1$) for its approximation.
A quadrature rule $Q$ is a linear functional of the form

$Q(f) := \sum_{i=1}^{n} w_i f(x_i) \approx \int_\Omega f \, \mathrm{d}\mu,$

where $x_i \in \mathbb{R}^d$ are the nodes and $w_i \in \mathbb{R}$ are the weights. The nodes and weights are often chosen so that the quadrature approximation is exact whenever the integrand is a low-degree polynomial [12, 11]; such methods are called classical or polynomial quadrature rules in this article (we reserve the term Gaussian for rules that use $n$ nodes to integrate polynomials up to degree $2n - 1$ exactly). Another possibility is to use Monte Carlo or quasi-Monte Carlo methods [8].

Here we study kernel quadrature rules that are, for arbitrary fixed nodes, optimal in the reproducing kernel Hilbert space (RKHS) induced by a user-specified positive-definite kernel. In this setting, optimality is measured in terms of the worst-case error (or, equivalently, the average-case error [55, 45]). Kernel quadrature rules go back at least to the work of Larkin [32, 33] in the 1970s. Lately, these rules have been a subject of renewed interest because they can be used for numerical integration on scattered data sets [3, 57] and they carry a probabilistic interpretation as posterior means for Gaussian processes assigned to the integrand [48]. The probabilistic interpretation, equivalent to the RKHS formulation we use, is interesting because it allows for statistical modelling of error in numerical integration and is one of the main motivators behind this article.

(Submitted to arXiv on March 18, 2017. Revised on October 11, 2017 and on January 6, 2018. Accepted for publication in SIAM Journal on Scientific Computing on January 3, 2018. Funding: This work was supported by the Aalto ELEC Doctoral School as well as Academy of Finland projects 266940 and 273475. Author affiliation: Department of Electrical Engineering and Automation, Aalto University, Espoo, Finland (toni.karvonen@aalto.fi, simo.sarkka@aalto.fi).)
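As a concrete instance of the definition above, here is a minimal Python sketch (the use of NumPy's Gauss–Legendre routine and the test polynomial are illustrative choices, not taken from the paper): a Gaussian rule with $n = 3$ nodes integrates polynomials up to degree $2n - 1 = 5$ exactly with respect to the Lebesgue measure on $[-1, 1]$.

```python
import numpy as np

# A classical "Gaussian" rule in the sense above: n nodes are exact for
# polynomials of degree up to 2n - 1 (Lebesgue measure on [-1, 1]).
nodes, weights = np.polynomial.legendre.leggauss(3)  # n = 3

def quad(f):
    # Q(f) = sum_i w_i f(x_i)
    return float(np.dot(weights, f(nodes)))

# Degree 4 <= 2*3 - 1 = 5, so the rule is exact: int_{-1}^{1} x^4 dx = 2/5.
print(quad(lambda x: x ** 4))  # 0.4 up to floating-point rounding
```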
The above topics, including the probabilistic interpretation, are reviewed in Section 2.

An advantage of kernel quadrature rules is that the nodes are not prescribed, as opposed to polynomial quadrature rules (polynomial rules with arbitrary nodes could probably be developed along the lines of de Boor and Ron interpolation [13, 39]). The flexibility comes with the price of having to solve the weights from a linear system of $n$ equations, a task of cubic time and quadratic space complexity. It would not be practical to tabulate the weights beforehand as they depend on the kernel, integration domain, and measure. Partially due to the computational cost, only low numbers of nodes have been used in kernel quadrature, and much of the literature is concerned with optimal selection of the nodes. See [32, 48, 49, 37, 58] for some optimal node configurations, [46] for an algorithm to generate such nodes in one dimension, and [5, 4] for other non-optimal alternatives. Efficient computation of the optimal nodes in higher dimensions is an open problem and not the topic of this article. Instead, we want to find nodes for which the weights can be computed easily and fast.

There is not much work on extending the applicability of kernel quadrature to integration problems where it is necessary to use a large number of nodes due to high dimensionality or a high desired level of accuracy. O'Hagan [48, 49] has proposed some computationally beneficial product grid (the number of nodes grows quickly in dimension and when the grid is refined) and simplex (only $d + 1$ nodes) designs that are too inflexible to be of much use in many situations. The most exciting work is by Oettershagen [46], who has recently shown that the standard approach to sparse grid quadrature can be used to achieve quadratic time complexity.
Furthermore, several fast and approximate kernel-based methods have been developed in the scattered data approximation, statistics, and machine learning literature (a compendium can be found in, e.g., [6, Supplement C]). However, the accuracy of quadrature rules is often strongly dependent on the weights having been computed exactly, and approximate weights also give rise to some philosophical objections if they are to be used for statistical modelling of the error of an integral approximation.

In this article we show that if certain structure is imposed on the node set, then the kernel quadrature weights can be computed exactly and in a very simple manner. Our approach is based on fully symmetric sets [18, 19], which are point sets that can be obtained from a given vector through permutations and sign changes of its coordinates. In Section 3 we show that some symmetry assumptions on $\Omega$ and $\mu$ (see Assumption 3.4) lead, for a large class of kernels that includes all isotropic kernels, to tractable computation of the weights if the node set is a union of fully symmetric sets. Depending on the dimension, the weights can be computed for sets of this type that contain up to tens of millions of nodes. The crucial observation under our assumptions is that there are only as many distinct weights as there are fully symmetric sets making up the node set. The fully symmetric kernel quadratures we construct exhibit the following advantageous properties:
– The algorithm (see Section 3.4) for exact computation of the weights is exceedingly simple and easy to implement.
– If there are $J$ fully symmetric sets containing $n$ nodes in total, only $Jn$ kernel evaluations are needed. In all situations we can envision, $J$ is at most a few hundred while $n$ can, as mentioned, go up to millions (Section 5.4 contains an example where $J = 832$ and $n = 15{,}005{,}761$).
The weights are solved from a system of $J$ linear equations and only a $J \times J$ matrix needs to be stored.
– The node selection scheme remains quite flexible and the number of nodes does not grow too fast with the dimension (see Equation (3.2) and Table 3.1), as happens when, for example, full Cartesian grids are used. The smallest non-trivial fully symmetric sets contain $2d$ points.

In Section 4 we discuss a number of possible ways of selecting the fully symmetric sets. Sparse grids [7], popular in polynomial-based high-dimensional quadrature, are maybe the most obvious and promising choice. For kernel quadratures that use Clenshaw–Curtis sparse grids [41] we also provide some theoretical convergence guarantees in Theorem 4.2. Even though kernel quadrature rules on sparse grids can be efficiently constructed without the use of fully symmetric sets [46], our approach appears to be computationally competitive. In any case, sparse grids serve as a straightforward node selection scheme for showcasing that our algorithm indeed works.

The fast algorithm for computing the weights is not easily extended to fitting of the kernel parameters that often have a considerable effect on the accuracy of the integral approximation. Our experiments show that fully symmetric kernel quadratures are feasible, but we have to resort to ad hoc solutions for fitting the kernel parameters or know them beforehand. Development of efficient fitting procedures is left for future research. This is discussed in Section 5.1.

Finally, it is worth remarking that this article is not the first instance of fully symmetric sets being used in conjunction with kernel quadrature.
Arguably the simplest non-trivial fully symmetric kernel quadrature rule (this rule appears briefly in Section 4.3) has seen use in approximate filtering of non-linear systems [58, 50, 51], but without an efficient weight computation algorithm.

2. Kernel quadrature. This section reviews the basics of quadrature rules in reproducing kernel Hilbert spaces. We also briefly discuss connections to probabilistic modelling of numerical algorithms. See [32, 6, 46] for proofs and additional references. Standard references on reproducing kernel Hilbert spaces are [1, 2].

2.1. Quadrature in reproducing kernel Hilbert spaces. A kernel $k \colon \Omega \times \Omega \to \mathbb{R}$ is said to be positive-definite if the $n \times n$ kernel Gram matrix $[K]_{ij} := k(x_i, x_j)$ is positive-definite for every $n \geq 1$ and any distinct $x_1, \ldots, x_n \in \Omega$. Every continuous positive-definite kernel defines a unique reproducing kernel Hilbert space $\mathcal{H}$ of functions $f \colon \Omega \to \mathbb{R}$ through the properties (i) $k(\cdot, x) \in \mathcal{H}$ for every $x \in \Omega$ and (ii) pointwise evaluations of any $f \in \mathcal{H}$ can be represented in terms of an inner product with the kernel: $\langle f, k(\cdot, x) \rangle_{\mathcal{H}} = f(x)$. The latter of these is called the reproducing property.

The integral operator $\mu$ and the quadrature rule $Q$ are bounded linear functionals on $\mathcal{H}$ under the non-restrictive assumption $\int_\Omega \sqrt{k(x, x)} \, \mathrm{d}\mu(x) < \infty$. The worst-case error (WCE) $e(Q)$ of a quadrature rule $Q$ is defined in terms of the dual norm

(2.1) $e(Q) := \| \mu - Q \|_{\mathcal{H}^*} = \sup_{\|f\|_{\mathcal{H}} \leq 1} |\mu(f) - Q(f)|,$

which can also be written as $e(Q) = \| \mu[k(\cdot, x)] - Q[k(\cdot, x)] \|_{\mathcal{H}}$. Why this is a reasonable measure of the error of the quadrature rule is apparent after an application of the reproducing property and the Cauchy–Schwarz inequality:

$|\mu(f) - Q(f)| = |\langle f, \mu[k(\cdot, x)] - Q[k(\cdot, x)] \rangle_{\mathcal{H}}| \leq e(Q) \|f\|_{\mathcal{H}}$ for $f \in \mathcal{H}$.
That is, if the integrand belongs to the RKHS, convergence in the usual sense of diminishing integration error is implied by convergence to zero of the WCE. The relationship between the kernel and its induced RKHS is further discussed in Section 4.4, where we also provide two convergence theorems for the worst-case error.

The quadrature rule that, for arbitrary fixed distinct nodes $X = \{x_1, \ldots, x_n\}$, minimises the worst-case error (2.1) is called the kernel quadrature rule and denoted by $Q_k$. This rule is unique and the optimal weights $w = (w_1, \ldots, w_n)$ can be solved from

(2.2) $\begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix} \begin{pmatrix} w_1 \\ \vdots \\ w_n \end{pmatrix} = \begin{pmatrix} k_\mu(x_1) \\ \vdots \\ k_\mu(x_n) \end{pmatrix},$

where the matrix on the left-hand side is the kernel Gram matrix $K$ (positive-definite, and hence non-singular), the right-hand side is denoted $k_\mu(X)$, and $k_\mu(x) := \int_\Omega k(x, x') \, \mathrm{d}\mu(x')$ is the kernel mean, an object of much independent interest [38]. The kernel quadrature rule and its worst-case error are

(2.3) $Q_k(f) = \sum_{i=1}^{n} [K^{-1} k_\mu(X)]_i f(x_i) = y^\mathsf{T} K^{-1} k_\mu(X)$ and $e(Q_k)^2 = \int_\Omega \int_\Omega k(x, x') \, \mathrm{d}\mu(x) \, \mathrm{d}\mu(x') - k_\mu(X)^\mathsf{T} K^{-1} k_\mu(X) = \mu(k_\mu) - Q_k(k_\mu),$

where $y = (f(x_1), \ldots, f(x_n))$. An optimal kernel quadrature rule minimises the worst-case error also over the node set (of fixed cardinality). Such rules cannot at the moment be constructed efficiently in dimensions larger than one. We discuss structurally constrained versions in Section 4.3.

2.2. Probabilistic interpretation. The probabilistic interpretation of kernel quadrature as Bayesian quadrature is a part of the emergent field of probabilistic numerical computing [14, 49, 25, 9], the origins of which can be traced at least back to the work of Larkin [33]. This interpretation is a major motivator behind the present article.
In Bayesian quadrature, the integrand $f$ is typically modelled as a Gaussian process [47, 53] (prompting the alternative term Gaussian process quadrature) with the covariance kernel $k$. With the node locations $x_1, \ldots, x_n$ and function evaluations $f(x_1), \ldots, f(x_n)$ considered the "data" $\mathcal{D}$, the posterior $f \mid \mathcal{D}$ is a Gaussian process with the mean and covariance

$\mathbb{E}[f(x) \mid \mathcal{D}] = y^\mathsf{T} K^{-1} k(X, x)$ and $\mathbb{C}[f(x), f(x') \mid \mathcal{D}] = k(x, x') - k(x, X)^\mathsf{T} K^{-1} k(x', X),$

where $[k(x, X)]_i = k(x, x_i)$. Because $\mu$ is a linear operator, this induces the Gaussian posterior distribution $\mu(f) \mid \mathcal{D}$ on the integral, with the mean $\mathbb{E}[\mu(f) \mid \mathcal{D}]$ and variance $\mathbb{V}[\mu(f) \mid \mathcal{D}]$ that turn out to be precisely $Q_k(f)$ and $e(Q_k)^2$ from the preceding section. The worst-case error can therefore be interpreted as a measure of numerical uncertainty over the integral approximation and then exploited in, for instance, uncertainty quantification and allocation of limited computational resources in computational pipelines [9]. Clear expositions of this probabilistic viewpoint on numerical integration are [48, 37, 6], and the methodology is quite popular in machine learning (see, e.g., [52, 24]).

3. Fully symmetric kernel quadrature. This is the main section of the article. We introduce fully symmetric sets and their connection to multivariate quadrature rules, and prove our main result, Theorem 3.6, on the computational benefits of doing kernel quadrature with node sets that are unions of fully symmetric sets.

Fig. 3.1: Examples of fully symmetric sets in two and three dimensions. Left: the fully symmetric sets $[0, 0]$, $[1, 0]$, and $[1.2, 0.8]$ in $\mathbb{R}^2$. Right: the fully symmetric set $[1, 0.5, 0.2]$, which consists of 48 elements, in $\mathbb{R}^3$.

Table 3.1: Cardinalities, as computed from Equation (3.2), of fully symmetric sets generated by m = 1, ...
, 9 distinct non-zero generators for dimensions d = 2, ..., 9.

m \ d       2        3        4          5           6           7            8             9
1           4        6        8          10          12          14           16            18
2           8        24       48         80          120         168          224           288
3                    48       192        480         960         1,680        2,688         4,032
4                             384        1,920       5,760       13,440       26,880        48,384
5                                        3,840       23,040      80,640       215,040       483,840
6                                                    46,080      322,560      1,290,240     3,870,720
7                                                                645,120      5,160,960     23,224,320
8                                                                             10,321,920    92,897,280
9                                                                                           185,794,560

3.1. Fully symmetric sets. A fully symmetric set is a point set in $\mathbb{R}^d$ that is obtained from a given vector through permutations and sign changes of its coordinates. Let $\Pi_d$ be the set of all permutations $q = (q_1, \ldots, q_d)$ of the integers $1, \ldots, d$ and $S_d$ the set of all vectors of the form $s = (s_1, \ldots, s_d)$ with each $s_i$ either $1$ or $-1$. Then, given $d$ non-negative scalars $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_d)$ called generators, the point set

(3.1) $[\boldsymbol{\lambda}] = [\lambda_1, \ldots, \lambda_d] := \bigcup_{q \in \Pi_d} \bigcup_{s \in S_d} \{ (s_1 \lambda_{q_1}, \ldots, s_d \lambda_{q_d}) \} \subset \mathbb{R}^d$

is the fully symmetric set generated by the generator vector $\boldsymbol{\lambda}$. With $m$ the number of non-zero generators, $m_0$ the number of zero generators (i.e. $m = d - m_0$), and $m_1, \ldots, m_l$ the multiplicities of the distinct non-zero generators so that $\sum_{i=1}^{l} m_i = m$, the cardinality of the fully symmetric set (3.1) is

(3.2) $\#[\lambda_1, \ldots, \lambda_d] = \dfrac{2^m d!}{m_0! \, m_1! \cdots m_l!}.$

An alternative way of writing Equation (3.1) is via permutation matrices as $[\boldsymbol{\lambda}] = \bigcup_P \{P \boldsymbol{\lambda}\}$, where the union is over all $d \times d$ permutation and sign change matrices $P$. These are matrices that have on each row and column exactly one element that is either $1$ or $-1$ while the rest are zero. Any element of a fully symmetric set can be obtained from any other via a linear transformation by an appropriate permutation and sign change matrix. How Equation (3.1) works and what the resulting point sets look like is illustrated in two and three dimensions in Example 3.2 and Figure 3.1.
Note that all elements of a fully symmetric set are equidistant from the origin, which is to say that if $x \in [\lambda_1, \ldots, \lambda_d]$, then $\|x\|_2 = \|\boldsymbol{\lambda}\|_2 = (\lambda_1^2 + \cdots + \lambda_d^2)^{1/2}$. We also need the concept of a fully symmetric function.

Definition 3.1. A function $f \colon \Omega \to \mathbb{R}$ is fully symmetric if it is constant on every fully symmetric set. That is, with $\boldsymbol{\lambda}$ any generator vector, it holds that $f(x) = f(x')$ for any $x, x' \in \Omega \cap [\boldsymbol{\lambda}]$. Alternatively, $f(Px) = f(x)$ for any $x \in \Omega$ and any permutation and sign change matrix $P$ such that $Px \in \Omega$.

Example 3.2. In $\mathbb{R}^3$, the non-zero and distinct generators $\lambda_1$ and $\lambda_2$ generate the fully symmetric set

$[\lambda_1, \lambda_2, 0] = \{ (\lambda_1, \lambda_2, 0), (-\lambda_1, \lambda_2, 0), (\lambda_1, -\lambda_2, 0), (-\lambda_1, -\lambda_2, 0), (\lambda_2, \lambda_1, 0), (-\lambda_2, \lambda_1, 0), (\lambda_2, -\lambda_1, 0), (-\lambda_2, -\lambda_1, 0),$
$(0, \lambda_1, \lambda_2), (0, -\lambda_1, \lambda_2), (0, \lambda_1, -\lambda_2), (0, -\lambda_1, -\lambda_2), (0, \lambda_2, \lambda_1), (0, -\lambda_2, \lambda_1), (0, \lambda_2, -\lambda_1), (0, -\lambda_2, -\lambda_1),$
$(\lambda_1, 0, \lambda_2), (-\lambda_1, 0, \lambda_2), (\lambda_1, 0, -\lambda_2), (-\lambda_1, 0, -\lambda_2), (\lambda_2, 0, \lambda_1), (-\lambda_2, 0, \lambda_1), (\lambda_2, 0, -\lambda_1), (-\lambda_2, 0, -\lambda_1) \},$

which has $2^2 \times 3! / (1! \times 1! \times 1!) = 24$ elements. In terms of permutation matrices, the element $(-\lambda_1, 0, \lambda_2)$ is

$\begin{pmatrix} -\lambda_1 \\ 0 \\ \lambda_2 \end{pmatrix} = \begin{pmatrix} -1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} \lambda_1 \\ \lambda_2 \\ 0 \end{pmatrix}.$

The method we have used to generate fully symmetric sets out of user-specified generator vectors is detailed in Algorithm 1 in Section 3.4. There are many other possibilities; we do not claim that the one presented is the optimal implementation.

3.2. Fully symmetric quadrature rules. The notation $f[\boldsymbol{\lambda}] = f[\lambda_1, \ldots, \lambda_d]$ stands for the sum of evaluations of $f$ at the points of the fully symmetric set:

$f[\boldsymbol{\lambda}] := \sum_{x \in [\boldsymbol{\lambda}]} f(x).$
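Equation (3.1) and the cardinality formula (3.2) are easy to check by brute force. The following Python sketch is illustrative only (Algorithm 1 in Section 3.4 is a more careful construction): it enumerates all permutations and sign changes of a generator vector and compares the resulting cardinality against (3.2), using the set of Example 3.2.

```python
import itertools
from math import factorial
from collections import Counter

def fully_symmetric_set(gen):
    """All permutations and sign changes of the generator vector, Equation (3.1)."""
    pts = set()
    for perm in itertools.permutations(gen):
        for signs in itertools.product((1, -1), repeat=len(gen)):
            pts.add(tuple(s * g for s, g in zip(signs, perm)))
    return pts

def cardinality(gen):
    """Closed form (3.2): 2^m d! / (m_0! m_1! ... m_l!)."""
    d = len(gen)
    m = sum(1 for g in gen if g != 0)       # number of non-zero generators
    denom = 1
    for mult in Counter(gen).values():      # multiplicities, zeros included
        denom *= factorial(mult)
    return 2 ** m * factorial(d) // denom

# Example 3.2: [lambda_1, lambda_2, 0] in R^3 has 24 elements.
print(len(fully_symmetric_set((1.0, 0.5, 0.0))))   # 24
print(cardinality((1.0, 0.5, 0.0)))                # 24
```

The brute-force enumeration visits $2^d d!$ candidate points, so it is only reasonable in low dimensions; the point of Table 3.1 is that the sets themselves remain much smaller than this when generators repeat or vanish.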
A fully symmetric quadrature rule is a quadrature rule of the form

$Q(f) = \sum_{\boldsymbol{\lambda} \in \Lambda} w_{\boldsymbol{\lambda}} f[\boldsymbol{\lambda}] = \sum_{\boldsymbol{\lambda} \in \Lambda} w_{\boldsymbol{\lambda}} \sum_{x \in [\boldsymbol{\lambda}]} f(x),$

where $\Lambda$ is a given finite collection of distinct generator vectors $\boldsymbol{\lambda}$. Such a rule uses only $\#\Lambda$ distinct weights, each often corresponding to a very large number of nodes. In Theorem 3.6 we establish conditions under which a kernel quadrature rule is fully symmetric. This yields significant computational savings because only $\#\Lambda$ (instead of $n = \sum_{\boldsymbol{\lambda} \in \Lambda} \#[\boldsymbol{\lambda}]$) distinct weights need to be computed.

Fully symmetric quadrature rules are prominent among classical polynomial quadrature rules, with work on them going back to [35, 36]. To the best of our knowledge, the most general and efficient constructions have been given by Genz [18] for the uniform distribution on a square and by Genz and Keister [19] for Gaussians on the whole real space (a case studied also in [34]). See, for example, the review [11] for more examples and discussion of the highly related invariant theory. To achieve a high algebraic order of precision, the classical fully symmetric quadrature rules rely on symmetry of the underlying measure and advantageous properties of polynomials when integrated with respect to such measures. In contrast to the kernel quadrature rules we are about to construct, the aforementioned rules do not permit free selection of the fully symmetric sets that are to be used. Many of the popular sparse grid rules are also fully symmetric [44, 43]. We make use of this fact in Section 4.2 for the construction of sparse grid kernel quadrature rules whose weights can be computed efficiently.

3.3. Fully symmetric kernels. We can now introduce the class of kernels that this article is concerned with, as well as the necessary assumptions on the integration domain and measure.

Definition 3.3.
Suppose that $Px \in \Omega$ for any $x \in \Omega$ and any permutation and sign change matrix $P$. A kernel $k$ is fully symmetric if $k(Px, Px') = k(x, x')$ for any such matrix $P$ and any $x, x' \in \Omega$.

This class of kernels includes (i) isotropic kernels, (ii) products of an isotropic kernel $k_1$ of the form $k(x, x') = \prod_{i=1}^{d} k_1(|x_i - x'_i|)$, and (iii) sums of an isotropic kernel $k_1$ of the form $k(x, x') = \sum_{i=1}^{d} k_1(|x_i - x'_i|)$. Some polynomial kernels¹ of the form $k(x, x') = \sum_{i=1}^{p} P_i(x) P_i(x')$ for suitable multivariate polynomials $P_i$ are also fully symmetric. For example, the selection $P_1 \equiv 1$ and $P_i(x) = x_{i-1}^2$ for $i = 2, \ldots, p = d + 1$ results in a fully symmetric kernel. See [58, 30] for some results on how quadrature rules for such kernels are related to classical quadrature rules.

Assumption 3.4. We assume that
(i) the integration domain $\Omega \subset \mathbb{R}^d$ is invariant under permutations and sign changes of the coordinates of its elements; that is, $\Omega = P\Omega = \{P\omega : \omega \in \Omega\}$ for any permutation and sign change matrix $P$;
(ii) the measure $\mu$ is fully symmetric in the sense that its density $f_\mu$ (with respect to the Lebesgue measure) is a fully symmetric function;
(iii) the kernel $k$ is positive-definite and fully symmetric.

This assumption holds, for example, for $\Omega = [-1, 1]^d$ equipped with the uniform measure and $\Omega = \mathbb{R}^d$ equipped with the Gaussian measure, as well as for many other cases of interest. The numerical examples in Section 5 are for these two cases.

Lemma 3.5. The kernel mean $k_\mu$ is fully symmetric under Assumption 3.4.

Proof. Let $x$ and $x'$ be elements of the same fully symmetric set. That is, $x' = Px$ for some permutation and sign change matrix $P$. A change of variables yields

$k_\mu(x') = \int_\Omega k(Px, z) f_\mu(z) \, \mathrm{d}z = \int_\Omega |\det P| \, k(Px, Pz) f_\mu(Pz) \, \mathrm{d}z = \int_\Omega k(x, z) f_\mu(z) \, \mathrm{d}z = k_\mu(x),$

where we have used the facts that $|\det P| = 1$, $k(Px, Pz) = k(x, z)$, and $f_\mu(Pz) = f_\mu(z)$.

3.4.
Fully symmetric kernel quadrature. Let $[\boldsymbol{\lambda}_1], \ldots, [\boldsymbol{\lambda}_J]$ be distinct fully symmetric sets generated by $\boldsymbol{\lambda}_1, \ldots, \boldsymbol{\lambda}_J \in \mathbb{R}^d$. If the node set $X$ is the union $X = \cup_{j=1}^{J} [\boldsymbol{\lambda}_j]$ of these fully symmetric sets, then a kernel quadrature rule using this node set is a fully symmetric quadrature rule in the sense of Section 3.2. Furthermore, its $J$ distinct weights can be computed extremely efficiently when compared to naively solving the linear system (2.2) of $\#X = n$ equations. This is formalised in the following theorem. Figure 3.2 illustrates the simplified weight computation process in the case of a node set that is a union of three fully symmetric sets.

¹ Strictly speaking, these kernels are not positive-definite, as the kernel matrix does not remain non-singular for an arbitrary number of distinct points. See also Remark 3.7.

Fig. 3.2: Illustration of Theorem 3.6 for a node set that is a union of three fully symmetric sets: one containing one element and two containing four elements. All row sums of the matrices $K_{ij}$, defined in Equation (3.5), are equal to $S_{ij}$.

Theorem 3.6. Suppose that $\Omega$, $\mu$, and $k$ satisfy Assumption 3.4. If the node set $X$ is a union of $J$ distinct fully symmetric sets $[\boldsymbol{\lambda}_1], \ldots, [\boldsymbol{\lambda}_J]$, then the kernel quadrature rule $Q_k$ is fully symmetric:

$Q_k(f) = \sum_{j=1}^{J} w_{\boldsymbol{\lambda}_j} f[\boldsymbol{\lambda}_j].$

Furthermore, the $J$ weights
$w_{\boldsymbol{\lambda}_1}, \ldots, w_{\boldsymbol{\lambda}_J}$ corresponding to the fully symmetric sets can be solved from the non-singular linear system of $J$ equations

(3.4) $\begin{pmatrix} S_{11} & \cdots & S_{1J} \\ \vdots & \ddots & \vdots \\ S_{J1} & \cdots & S_{JJ} \end{pmatrix} \begin{pmatrix} w_{\boldsymbol{\lambda}_1} \\ \vdots \\ w_{\boldsymbol{\lambda}_J} \end{pmatrix} = \begin{pmatrix} k_\mu(\boldsymbol{\lambda}_1) \\ \vdots \\ k_\mu(\boldsymbol{\lambda}_J) \end{pmatrix}, \quad \text{where } S_{ij} = \sum_{x \in [\boldsymbol{\lambda}_j]} k(x_i, x) \text{ for any } x_i \in [\boldsymbol{\lambda}_i].$

Proof. Let $X = \cup_{j=1}^{J} [\boldsymbol{\lambda}_j]$ be ordered such that all elements of a single fully symmetric set appear consecutively and the fully symmetric sets themselves are in ascending order in terms of their index $j$. We denote $n_i = \#[\boldsymbol{\lambda}_i]$ and enumerate each fully symmetric set as $[\boldsymbol{\lambda}_i] = \{x^i_1, \ldots, x^i_{n_i}\}$.

By Lemma 3.5, the kernel mean is fully symmetric and, consequently, the kernel mean vector $k_\mu(X) \in \mathbb{R}^n$, $n = n_1 + \cdots + n_J$, is

$k_\mu(X) = (k_\mu([\boldsymbol{\lambda}_1]), \ldots, k_\mu([\boldsymbol{\lambda}_J])),$

where $k_\mu([\boldsymbol{\lambda}_j]) = (k_\mu(\boldsymbol{\lambda}_j), \ldots, k_\mu(\boldsymbol{\lambda}_j)) \in \mathbb{R}^{n_j}$. That is, $k_\mu(X)$ contains only $J$ distinct elements that occur in blocks of $n_j$. Consider then the kernel matrix $K$ that can be partitioned into $J^2$ submatrices $K_{ij}$ of dimensions $n_i \times n_j$, each containing all the kernel evaluations $k(x^i, x^j)$ for $x^i \in [\boldsymbol{\lambda}_i]$ and $x^j \in [\boldsymbol{\lambda}_j]$:

(3.5) $K = \begin{pmatrix} K_{11} & \cdots & K_{1J} \\ \vdots & \ddots & \vdots \\ K_{J1} & \cdots & K_{JJ} \end{pmatrix}, \quad \text{where } K_{ij} = \begin{pmatrix} k(x^i_1, x^j_1) & \cdots & k(x^i_1, x^j_{n_j}) \\ \vdots & \ddots & \vdots \\ k(x^i_{n_i}, x^j_1) & \cdots & k(x^i_{n_i}, x^j_{n_j}) \end{pmatrix}.$

Any row of any submatrix $K_{ij}$ can be obtained from any other of its rows by a permutation of the elements of the row. To confirm this, consider any distinct rows $p, p' \leq n_i$ of $K_{ij}$. There exists a permutation and sign change matrix $P$ such that $x^i_{p'} = P x^i_p$ because fully symmetric sets are closed under such transformations. Note that $P$ is non-singular and its inverse $P^{-1}$ is also a permutation and sign change matrix.
Since the kernel is fully symmetric, for any $l \leq n_j$ we have

$k(x^i_p, x^j_l) = k(P x^i_p, P x^j_l) = k(x^i_{p'}, P x^j_l),$

where $P x^j_l \in [\boldsymbol{\lambda}_j]$. This means that for every $l$ there is an element on the row $p'$ that equals the $l$th element of the $p$th row. That is, the rows are permutations of each other. Consequently, the row sums $S_{ij} := \sum_{x \in [\boldsymbol{\lambda}_j]} k(x_i, x)$ of $K_{ij}$ do not depend on the choice of $x_i \in [\boldsymbol{\lambda}_i]$.

Consider the $J \times J$ matrix $[S]_{ij} = S_{ij}$ composed of the row sums defined above. Then the equation $Sa = b$ for some vectors $a, b \in \mathbb{R}^J$ implies that

$\begin{pmatrix} K_{11} & \cdots & K_{1J} \\ \vdots & \ddots & \vdots \\ K_{J1} & \cdots & K_{JJ} \end{pmatrix} \begin{pmatrix} \mathbf{a}_1 \\ \vdots \\ \mathbf{a}_J \end{pmatrix} = \begin{pmatrix} \mathbf{b}_1 \\ \vdots \\ \mathbf{b}_J \end{pmatrix},$

where $\mathbf{a}_i = (a_i, \ldots, a_i) \in \mathbb{R}^{n_i}$ and $\mathbf{b}_i = (b_i, \ldots, b_i) \in \mathbb{R}^{n_i}$, because

$b_i = \sum_{j=1}^{J} a_j S_{ij} = \sum_{j=1}^{J} a_j \sum_{l=1}^{n_j} [K_{ij}]_{pl} = \sum_{j=1}^{J} a_j \sum_{l=1}^{n_j} k(x^i_p, x^j_l)$

for every $p \leq n_i$. The matrix $S$ is non-singular, for if it were singular there would exist a non-zero vector $a \in \mathbb{R}^J$ such that $Sa = 0$. But by the above argument this would imply that $K$ is singular, which is not the case because the kernel $k$ is positive-definite. All this implies that if $(w_{\boldsymbol{\lambda}_1}, \ldots, w_{\boldsymbol{\lambda}_J})$ is the unique solution to the linear system of equations (3.4), then $w = (\mathbf{w}_{\boldsymbol{\lambda}_1}, \ldots, \mathbf{w}_{\boldsymbol{\lambda}_J}) \in \mathbb{R}^n$, where $\mathbf{w}_{\boldsymbol{\lambda}_j} = (w_{\boldsymbol{\lambda}_j}, \ldots, w_{\boldsymbol{\lambda}_j}) \in \mathbb{R}^{n_j}$, must be the solution to $Kw = k_\mu(X)$. That is, the weights for the nodes in each fully symmetric set are equal and the kernel quadrature rule is fully symmetric. This concludes the proof.

Remark 3.7. Theorem 3.6 also applies to kernels whose kernel matrix is positive-definite only for every collection of $m \leq p$ distinct points for some $p > 0$, provided that the total number $n$ of nodes does not exceed $p$. The polynomial kernels briefly mentioned in Section 3.3 are examples of such kernels.
The full algorithm for fully symmetric kernel quadrature is presented in high-level pseudocode in Algorithm 1 below. We expect that $J$, the number of fully symmetric sets, is rarely more than a few hundred (the example in Section 5.4 has $J = 832$, but this results in $n \approx 15{,}000{,}000$), so solving the weights from the linear system (3.4) is not a computational bottleneck. Instead, it is usually the $Jn$ kernel evaluations that take the most time.

Algorithm 1: Fully symmetric kernel quadrature

Construct the fully symmetric sets
1. Select $J$ distinct generator vectors $\boldsymbol{\lambda}_j \in \mathbb{R}^d$ with non-negative elements. For each $j = 1, \ldots, J$, construct the fully symmetric set $[\boldsymbol{\lambda}_j]$ as follows:
2. Sort $\boldsymbol{\lambda}_j$ in descending order.
3. Identify the unique non-zero elements $u \in \mathbb{R}^{d_u}$, $d_u \leq d$, of $\boldsymbol{\lambda}_j$ and their multiplicities $m \in \mathbb{N}^{d_u}$. Denote $\Sigma m = \sum_{l=1}^{d_u} m_l$. That is, $\boldsymbol{\lambda}_j = (e_{u_1}, \ldots, e_{u_{d_u}}, \mathbf{0}_{(d - \Sigma m) \times 1}) \in \mathbb{R}^d$, where $e_{u_l} = (u_l, \ldots, u_l) \in \mathbb{R}^{m_l}$ for $l = 1, \ldots, d_u$.
4. Construct all $d_a$ possible vectors $a^i \in \mathbb{N}^{d_u}$ such that $a^i_l \leq m_l$ for each $l = 1, \ldots, d_u$.
5. Set $[\boldsymbol{\lambda}_j] = \emptyset$. For each $i = 1, \ldots, d_a$:
6. Construct the vector $\boldsymbol{\lambda}^i_j = (e^s_{u_1}, \ldots, e^s_{u_{d_u}}, \mathbf{0}_{(d - \Sigma m) \times 1}) \in \mathbb{R}^d$, where, for each $l = 1, \ldots, d_u$, the first $a^i_l$ elements of $e^s_{u_l} \in \mathbb{R}^{m_l}$ are $-u_l$ and the rest are $u_l$. The vector $\boldsymbol{\lambda}^i_j$ essentially corresponds to one possible sign combination $(s_1 \lambda_{j,1}, \ldots, s_d \lambda_{j,d})$ in Equation (3.1).
7. Compute the collection $U$ of all unique permutations of $\boldsymbol{\lambda}^i_j$ and append it to the fully symmetric set: $[\boldsymbol{\lambda}_j] = [\boldsymbol{\lambda}_j] \cup U$.

Compute the kernel quadrature weights
8. Construct an empty matrix $S \in \mathbb{R}^{J \times J}$. For each $(i, j) \in \{1, \ldots, J\}^2$:
9. Select any $x_i \in [\boldsymbol{\lambda}_i]$ and set $[S]_{ij} = \sum_{x \in [\boldsymbol{\lambda}_j]} k(x_i, x)$.
10. Solve the $J$ distinct weights
$(w_{\boldsymbol{\lambda}_1}, \ldots, w_{\boldsymbol{\lambda}_J})$ from the linear system of equations $S w_{\boldsymbol{\lambda}} = b$, where $b_l = k_\mu(\boldsymbol{\lambda}_l)$.

Compute the quadrature approximation
11. Compute $Q_k(f) \approx \int_\Omega f \, \mathrm{d}\mu$ as

$Q_k(f) = \sum_{j=1}^{J} w_{\boldsymbol{\lambda}_j} f[\boldsymbol{\lambda}_j] = \sum_{j=1}^{J} w_{\boldsymbol{\lambda}_j} \sum_{x \in [\boldsymbol{\lambda}_j]} f(x).$

4. Selection of the fully symmetric sets. This section presents three different approaches for constructing the node set as a union of fully symmetric sets. Of these, the sparse grids of Section 4.2 are the most promising alternative. We also discuss convergence properties of some of the kernel quadrature rules we construct in Section 4.4. We expect there to exist many other competitive schemes, as one of the main advantages of fully symmetric kernel quadrature is that there are no restrictions on selecting the generator vectors.

4.1. Random generators. Arguably the simplest approach, both conceptually and algorithmically, is to draw a number of generator vectors randomly from the underlying distribution. However, unless additional constraints are enforced, all the generators will be distinct and non-zero, resulting in unrealistic numbers of integrand evaluations needed if $d \geq 6$, as seen from Table 3.1. One could heuristically set some generators to zero to reduce the number of nodes, but it is not entirely clear how this should be done. In any case, for at least $d < 6$, the random generator approach seems realistic. We call this method the fully symmetric kernel Monte Carlo (FSKMC). Theorem 4.1 provides theoretical convergence guarantees and Section 5.3 demonstrates that the FSKMC can also work in practice. Nevertheless, the method does not seem very promising, as it appears that a large number of random generator vectors, and thus an even larger number of nodes, is required to capture the underlying distribution.
This approach bears some similarity to the stochastic radial and spherical integration rules developed in [20, 21]. These rules are less flexible, due to the usual constraints of integrating low-degree polynomials exactly, and more involved in their implementation.

4.2. Sparse grids. An iterated quadrature rule of degree $m$ based on a regular Cartesian product grid requires $m^d$ nodes, a number that quickly becomes impractically large. Sparse grids, which originate in the work of Smolyak [56], are "sparsified" product sets widely used in numerical integration [41, 23, 43, 44]. See also the general survey by Bungartz and Griebel [7] and [26] for a wealth of financial applications. Recently, Oettershagen [46] has shown that the standard approach to sparse grids is also applicable to fast computation of the weights of kernel quadrature rules. This approach is different from ours, which is based on identifying the fully symmetric sets a sparse grid is a union of, and it is specific to sparse grids. Other sparse grid based kernel methods appear in [17, 22, 15, 59]. The construction of sparse grids that we present in this section is not the most general possible, as we work in the fully symmetric framework. More general constructions are contained in some of the aforementioned references.

We assume that $\Omega = [-a, a]^d$ for a possibly infinite $a > 0$. Let $X_1 = \{0\}$ and $X_i \subset X_{i+1} \subset [-a, a]$ for $i \geq 1$ be finite, nested and symmetric (i.e. if $x \in X_i$, then $-x \in X_i$) point sets. Then the sparse grid of level $q \geq 1$ is the set
$$H(q, d) := \bigcup_{|\boldsymbol{\alpha}| = d + q} X_{\alpha_1} \times \cdots \times X_{\alpha_d},$$
where $\boldsymbol{\alpha} \in \mathbb{N}^d$ is a $d$-dimensional multi-index with the elements $\alpha_i = \alpha(i)$ and $|\boldsymbol{\alpha}| = \alpha_1 + \cdots + \alpha_d$. Note that the largest $X_i$ that is needed for a sparse grid of level $q$ is $X_{q+1}$.
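The definition of $H(q, d)$ translates directly into a few lines of code. As a sketch (Python; `cc_set` and `sparse_grid` are our illustrative names), enumerating the union over $|\boldsymbol{\alpha}| = d + q$ with the nested Clenshaw–Curtis sets defined later in this subsection reproduces the node count 705 for $H(7, 2)$ quoted in Figure 4.1, far below the $129^2 = 16{,}641$ nodes of the corresponding full grid:

```python
import math
from itertools import product

def cc_set(i):
    """Nested Clenshaw-Curtis set X_i on [-1, 1]: X_1 = {0} and, for i > 1,
    X_i has m_i = 2^(i-1) + 1 points (Section 4.2(i))."""
    if i == 1:
        return (0.0,)
    m = 2 ** (i - 1) + 1
    pts = [-math.cos(math.pi * j / (m - 1)) for j in range(m)]
    # snap the midpoint -cos(pi/2) to exactly 0 so the sets nest exactly in floating point
    return tuple(0.0 if abs(x) < 1e-15 else x for x in pts)

def sparse_grid(q, d):
    """H(q, d): union of X_{a_1} x ... x X_{a_d} over all |alpha| = d + q."""
    nodes = set()
    for alpha in product(range(1, q + 2), repeat=d):
        if sum(alpha) == d + q:
            nodes.update(product(*(cc_set(a) for a in alpha)))
    return nodes

print(len(sparse_grid(7, 2)))  # 705 nodes, versus 129**2 = 16,641 for the full grid
```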
As the basis sets $X_i$ are nested and symmetric, it is fairly easy to see that $H(q, d)$ is a union of fully symmetric sets and can be written explicitly as such:
$$H(q, d) = \bigcup_{\substack{|\boldsymbol{\alpha}| = d + q \\ \alpha_i \geq \alpha_{i+1}}} \; \bigcup_{\boldsymbol{\pi} \in \Pi_d} X_{\alpha(\pi_1)} \times \cdots \times X_{\alpha(\pi_d)} = \bigcup_{\substack{|\boldsymbol{\alpha}| = d + q \\ \alpha_i \geq \alpha_{i+1}}} \; \bigcup_{\boldsymbol{\pi} \in \Pi_d} \; \bigcup_{\mathbf{s} \in S_d} \; \bigcup_{\substack{\boldsymbol{\lambda} \in \mathbb{R}^d,\ \lambda_j \geq 0 \\ \lambda_j \in X_{\alpha(\pi_j)}}} (s_1 \lambda_1, \dots, s_d \lambda_d) = \bigcup_{\substack{|\boldsymbol{\alpha}| = d + q \\ \alpha_i \geq \alpha_{i+1}}} \big\{ [\lambda_1, \dots, \lambda_d] : \lambda_j \in X_{\alpha_j} \text{ and } \lambda_j \geq 0 \text{ for } j = 1, \dots, d \big\},$$
where the restriction $\alpha_i \geq \alpha_{i+1}$ eliminates a large number of permutations that would otherwise be duplicated when generating the fully symmetric sets. That is, Theorem 3.6 applies to sparse grids. We call the resulting kernel quadrature rules the sparse grid kernel quadrature rules.

We are left with the selection of the nested point sets $X_i$. In polynomial-based sparse grid quadrature rules these sets come coupled with univariate quadrature rules whose weights are used to construct the final sparse grid weights, but we are under no such restrictions. An obvious idea for selecting $X_i$ would be to sequentially minimise the worst-case error in one dimension, provided that the kernel is one for which this makes sense, for example any of the three examples of fully symmetric kernels given in Section 3.3. Discussion of different sequential kernel quadrature methods (usually known as sequential Bayesian quadratures) can be found in, for example, [10, 27, 24, 5]. A different approach appears in [46]. However, owing to difficulties in setting the kernel length-scale, we do not employ this selection scheme. Instead, we use (i) the Clenshaw–Curtis point sets, rather standard in the sparse grid literature, for $\Omega = [-1, 1]^d$ and the uniform measure and (ii) nested sets formed out of Gauss–Hermite nodes in the Gaussian case:

(i) For $i > 1$ and $m_i = 2^{i-1} + 1$, the nested Clenshaw–Curtis sets are $X_i = \{x_{i,1}, \dots, x_{i,m_i}\}$ with
$$x_{i,j} = -\cos\left(\frac{\pi (j - 1)}{m_i - 1}\right) \in [-1, 1].$$
The points $x_{i,j}$ are the roots and extrema of Chebyshev polynomials. The corresponding sparse grid kernel quadrature rule is called Clenshaw–Curtis sparse grid kernel quadrature (CCSGKQ). Numerical results for this kernel quadrature are given in Section 5.4 and convergence for sufficiently smooth functions is the topic of Theorem 4.2.

(ii) In the Gauss–Hermite sparse grid kernel quadrature (GHSGKQ) we use the classical Gauss–Hermite nodes, which are the roots of the Hermite polynomials
$$H_p(x) = (-1)^p \exp\big(x^2/2\big) \frac{\mathrm{d}^p}{\mathrm{d}x^p} \exp\big(-x^2/2\big).$$
Given a level $q$, we generate the $2q + 1$ symmetric roots of $H_{2q+1}$ and for $i = 1, \dots, q + 1$ select $X_i$ = the $2i - 1$ smallest roots by absolute value. The number of nodes, in terms of the level $q$, grows significantly slower than with Clenshaw–Curtis sparse grids. A numerical experiment involving a financial problem is given in Section 5.5. As is usual for quadrature rules on the whole of $\mathbb{R}^d$, there are no theoretical convergence guarantees for the GHSGKQ. Because $H(q, d)$ is not a subset of $H(q + 1, d)$ for the Gauss–Hermite points, these grids are only suitable for cases where the number of nodes that can be used is determined beforehand, based on, for example, the computational budget available. Note that these grids are completely different from several other sparse grids in the literature that use nodes of Gaussian quadrature rules.

Fig. 4.1: Examples of sparse grids with Clenshaw–Curtis and Gauss–Hermite nodes. The numbers of nodes in the sparse grids are 705 (upper left, $H(7, 2)$, Clenshaw–Curtis), 1,073 (upper right, $H(6, 3)$, Clenshaw–Curtis), 265 (lower left, $H(11, 2)$, Gauss–Hermite), and 1,561 (lower right, $H(10, 3)$, Gauss–Hermite).
Compare these to the cardinalities of the corresponding full grids, which are $129^2 = 16{,}641$; $65^3 = 274{,}625$; $23^2 = 529$; and $21^3 = 9{,}261$. Four sparse grids based on these two point sequences are depicted in Figure 4.1. There is a large array of other possibilities available in the literature. For example, Gerstner and Griebel [23] use Gauss–Patterson nodes and Genz and Keister [19] have developed a nested version of the Gauss–Hermite rule. The rule (ii) can also be trivially extended to other integration domains and measures if a different sequence of orthogonal polynomials is used (e.g. Legendre or Chebyshev polynomials on $[-1, 1]$).

4.3. Worst-case error minimisation with respect to the generators. The third methodology for choosing the fully symmetric sets is that of principled worst-case error minimisation. Suppose that one, based on, for example, the number of nodes desired, fixes a number of generators of a fully symmetric kernel quadrature rule to zero or sets some equality constraints. Then the worst-case error $e(Q_k)$ of the kernel quadrature rule can be minimised with respect to the generator vectors obeying these constraints. Especially in higher dimensions, this is a task vastly simpler than trying to minimise the error over a node set of unconstrained geometry. Optimal kernel quadrature rules under certain structural constraints have previously been experimented with at least by O'Hagan [48, 49]. As a simplistic and somewhat arbitrary example, suppose that one desires a good fully symmetric kernel quadrature rule having about 80 nodes in $\Omega \subset \mathbb{R}^3$. A rule of the form
$$(4.1) \qquad Q_k^{\boldsymbol{\lambda}}(f) = w_1 f[0, 0, 0] + w_2 f[\lambda_1, \lambda_2, \lambda_3] + w_3 f[\lambda_3, \lambda_3, \lambda_3] + w_4 f[\lambda_4, \lambda_4, \lambda_5]$$
has $1 + 48 + 8 + 24 = 81$ nodes if the generators $\boldsymbol{\lambda} = (\lambda_1, \dots, \lambda_5)$ are distinct and non-zero.
Optimal generators $\boldsymbol{\lambda}^* = (\lambda_1^*, \lambda_2^*, \lambda_3^*, \lambda_4^*, \lambda_5^*)$ in the sense of minimal worst-case error would then be
$$\boldsymbol{\lambda}^* = \operatorname*{arg\,min}_{\boldsymbol{\lambda} \in \Omega,\ \lambda_i > 0} e\big(Q_k^{\boldsymbol{\lambda}}\big).$$
That is, $Q_k^{\boldsymbol{\lambda}^*}$ has the smallest WCE among all rules of the form (4.1). The optimal generators cannot in general be solved analytically, nor does this minimisation problem appear to be convex. In some very simple cases minimisation is trivial. Consider rules of the form $Q_k^{\lambda}(f) = w_1 f[0, \dots, 0] + w_2 f[\lambda, 0, \dots, 0]$ in $\Omega \subset \mathbb{R}^d$. If the kernel is Gaussian with length-scale $\ell$ and $\mu$ the standard Gaussian measure (see Section 5.2), the task of finding the optimal generator $\lambda^*$ reduces to
$$\lambda^* = \operatorname*{arg\,min}_{\lambda > 0} e\big(Q_k^{\lambda}\big) = \operatorname*{arg\,max}_{\lambda > 0} \left[ \frac{e^{-\lambda^2 / (2(1 + \ell^2))} - e^{-\lambda^2 / (2\ell^2)}}{1 - e^{-\lambda^2 / \ell^2}} \right].$$
That is, the optimal generator is dimension-independent and can be easily computed. However, with increasing dimension and constant length-scale this results in a negative weight for the origin, which often impairs numerical stability. It is somewhat questionable whether such a rule is actually "good" (removing the origin of course yields positive weights). We do not attempt to construct efficient fully symmetric rules using the technique described above in this article. The topic, alongside node selection for sparse grids via sequential WCE minimisation, briefly discussed in Section 4.2, is left for future research.

4.4. Convergence analysis. In this section we provide convergence theorems for the fully symmetric kernel Monte Carlo and the Clenshaw–Curtis sparse grid kernel quadrature. The theorems are straightforward corollaries of some well-known results in the literature. For stating the results, we need to introduce the following three standard function classes. We assume that $\Omega = [-a, a]^d$ and $a < \infty$ in this section.
The general principle behind the convergence results for kernel quadrature is that the rates obtained are at least as good as those for any other method using the same nodes, provided that the integrand belongs to the RKHS induced by the kernel. This should be quite clear from the definition of kernel quadrature. With $\boldsymbol{\alpha} \in \mathbb{N}^d$ a multi-index and $f : \Omega \to \mathbb{R}$ a sufficiently smooth function, the derivative operator is $D^{\boldsymbol{\alpha}} f = \partial^{|\boldsymbol{\alpha}|} f / \partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}$. By $\boldsymbol{\alpha} \leq r$ we mean that $\alpha_1, \dots, \alpha_d \leq r$. The function classes we need are:

(i) The Sobolev space $W_2^r$ is a Hilbert space defined as
$$W_2^r := \big\{ f \in L^2(\mu) : D^{\boldsymbol{\alpha}} f \in L^2(\mu) \text{ exists for all } |\boldsymbol{\alpha}| \leq r \big\}$$
with the norm $\|f\|_{W_2^r} = \sum_{|\boldsymbol{\alpha}| \leq r} \|D^{\boldsymbol{\alpha}} f\|_{L^2(\mu)}$.

(ii) The class $C^r$ is the class of functions that have bounded derivatives:
$$C^r := \big\{ f : \Omega \to \mathbb{R} : \|D^{\boldsymbol{\alpha}} f\|_\infty < \infty \text{ for all } |\boldsymbol{\alpha}| \leq r \big\}.$$
This space is equipped with the norm $\|f\|_{C^r} = \max\{\|D^{\boldsymbol{\alpha}} f\|_\infty : |\boldsymbol{\alpha}| \leq r\}$.

(iii) The class $F^r$ is the class of functions that have bounded mixed derivatives:
$$F^r := \big\{ f : \Omega \to \mathbb{R} : \|D^{\boldsymbol{\alpha}} f\|_\infty < \infty \text{ for all } \boldsymbol{\alpha} \leq r \big\}.$$
This space is equipped with the norm $\|f\|_{F^r} = \max\{\|D^{\boldsymbol{\alpha}} f\|_\infty : \boldsymbol{\alpha} \leq r\}$.

For relations of the above function classes to reproducing kernel Hilbert spaces induced by different kernels, see for example [60, Chapter 10]. Two norms $\|\cdot\|_1$ and $\|\cdot\|_2$ of an arbitrary vector space are norm-equivalent if there are positive constants $C_1$ and $C_2$ such that $C_1 \|x\|_1 \leq \|x\|_2 \leq C_2 \|x\|_1$ for all elements $x$ of the vector space. Recall from Section 2 that $\mathcal{H}$ is the RKHS induced by the kernel $k$. Then $\mathcal{H}$ is norm-equivalent to the Sobolev space $W_2^r$ if $r > d/2$ and the Fourier transform $F(\boldsymbol{\omega})$ of the kernel $k$ decays at the rate $(1 + \|\boldsymbol{\omega}\|^2)^{-r}$. This holds if the kernel is, for example, of the Matérn class with $\nu \geq r - 1/2$. In a similar manner, $\mathcal{H}$ is norm-equivalent to $F^r$ if the kernel is a product of one-dimensional Matérn kernels.
The following convergence theorems are simple consequences of results available in the literature. In these theorems $Q_{k,n}$ stands for an $n$-point kernel quadrature rule with its type specified by a superscript. Extensions to the misspecified setting analysed in [28, 29] may be possible.

Theorem 4.1 (Convergence of FSKMC). Let $\Omega = [-a, a]^d$ with $a < \infty$. If $\mathcal{H}$ is norm-equivalent to the Sobolev space $W_2^r$ with $r > d/2$, then
$$(4.2) \qquad \mathbb{E}\, e\big(Q_{k,n}^{\mathrm{FSKMC}}\big) = O\big(n^{-r/d + \varepsilon}\big)$$
for any $\varepsilon > 0$. The expectation above is with respect to the joint distribution of the random generator vectors.

Proof. The worst-case error is decreasing in the number of nodes, so we know that $e(Q_{k,n}^{\mathrm{FSKMC}}) \leq e(Q_{k,n}^{\mathrm{GEN}})$, where the rule $Q_{k,n}^{\mathrm{GEN}}$ uses only the $J$ generator vectors as its nodes. The rate (4.2) is realised by the regular kernel Monte Carlo [6, Theorem 1] under the norm-equivalence assumption. Therefore,
$$\mathbb{E}\, e\big(Q_{k,n}^{\mathrm{FSKMC}}\big) \leq \mathbb{E}\, e\big(Q_{k,n}^{\mathrm{GEN}}\big) = O\big(J^{-r/d + \varepsilon}\big) = O\big(n^{-r/d + \varepsilon}\big)$$
because there is a dimension-dependent upper bound $2^d d!$ (see Equation (3.2)) on the number of nodes one fully symmetric set can contain.

It is clear that the above rate is extremely crude with respect to $d$, as the dimension also enters through the multiplicative factor $2^d d!$. It is likely that this factor can be eliminated or diminished with more careful analysis.

Theorem 4.2 (Convergence of CCSGKQ). Let $\Omega = [-a, a]^d$ with $a < \infty$ and let $\mu$ be the uniform measure on $\Omega$. If $\mathcal{H}$ is norm-equivalent to $C^r$, then
$$(4.3) \qquad e\big(Q_{k,n}^{\mathrm{CCSGKQ}}\big) = O\big(n^{-r/d} (\log n)^{(d-1)(r/d + 1)}\big).$$
If $\mathcal{H}$ is norm-equivalent to $F^r$, then
$$(4.4) \qquad e\big(Q_{k,n}^{\mathrm{CCSGKQ}}\big) = O\big(n^{-r} (\log n)^{(d-1)(r + 1)}\big).$$

Proof. The rates (4.3) and (4.4) hold for the standard Clenshaw–Curtis sparse grid quadrature [41, 42] if the worst-case error (2.1) is over the unit balls of $C^r$ and $F^r$, respectively.
Because kernel quadrature rules have minimal worst-case errors in the induced RKHS among all quadrature rules with fixed nodes, the convergence rates follow from the assumptions of norm-equivalence.

5. Numerical examples and computational aspects. This section contains three numerical examples for the fully symmetric kernel Monte Carlo, the Clenshaw–Curtis sparse grid kernel quadrature, and the (modified) Gauss–Hermite sparse grid kernel quadrature, as well as discussion of some computational aspects. The examples and algorithms, implemented in MATLAB, are available at https://github.com/tskarvone/fskq. Numerous classical sparse grid quadrature methods for MATLAB are implemented in the Sparse Grid Interpolation Toolbox [31]; parts of our code make use of this toolbox. We emphasise that the examples are not meant to demonstrate superiority of fully symmetric kernel quadrature over other numerical integration methods. Comparisons to other methods are merely to show that fully symmetric kernel quadratures can achieve roughly comparable accuracy. Rather, we aim to show that fully symmetric sets make it possible to apply kernel quadrature rules to large-scale and high-dimensional situations that have been out of the scope of these quadrature rules before.

5.1. Choosing the length-scale parameter. Accuracy of any approximation based on a stationary kernel is heavily dependent on the length-scale parameter $\ell > 0$, whose effect is via $k_\ell(\mathbf{x} - \mathbf{x}') = k((\mathbf{x} - \mathbf{x}')/\ell)$; see for example the Gaussian kernel (5.1). Choosing in some sense the best value of this parameter efficiently is an important topic of research both in the scattered data approximation literature [54, 16] and in statistics and machine learning [53, Chapter 5]. See also [6, Section 4.1] for discussion in the context of kernel quadrature.
Unfortunately, we have not been able to come up with a way to exploit the fully symmetric structure of the node set in any of the existing parameter fitting methods, such as marginal likelihood maximisation or cross-validation. Consequently, in large-scale applications that go beyond the limits of naive methods based on inverting the kernel matrix, one has to resort to ad hoc techniques to fit the length-scale. In the examples below we either use few enough nodes that naive computations are possible (Section 5.3), integrate a function whose length-scale is known beforehand (Section 5.4), or set the length-scale somewhat heuristically (Section 5.5). We recognise that the lack of a principled method for choosing the length-scale is a significant shortcoming and hope to fix this in the future.

When the length-scale is changed, the interpretation of the worst-case error as an indicator of accuracy of the quadrature rule is confounded, because the RKHS norm $\|\cdot\|_{\mathcal{H}}$ depends on the length-scale. This occurs in Sections 5.3 and 5.5. Nevertheless, if one follows the paradigm presented in Section 2.2, the WCE still carries a meaningful probabilistic interpretation as the integral posterior standard deviation (STD). As such, it is plotted in all the examples. However, one should not draw too many conclusions from these plots, as we have not made any effort to fit the kernel scale parameter (i.e. the constant multiplier of the kernel).

5.2. Closed-form kernel means. In kernel quadrature, one needs to be able to evaluate the kernel mean $k_\mu(\mathbf{x}_i) = \int_\Omega k(\mathbf{x}_i, \mathbf{x}) \,\mathrm{d}\mu(\mathbf{x})$ at the nodes $\mathbf{x}_i$. A number of kernel–measure pairs that yield tractable kernel means are tabulated in [6]. It is also possible to evaluate the kernel mean numerically [57].
In fact, when fully symmetric sets are used, numerical evaluation may be quite viable, as the kernel mean needs to be evaluated only at each generator vector instead of each node. All our examples use the standard Gaussian kernel
$$(5.1) \qquad k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2}\right)$$
with length-scale $\ell > 0$ and unit scale parameter (this parameter only affects the magnitude of the worst-case error). The integration domain $\Omega$ and measure $\mu$ are either (i) the whole of $\mathbb{R}^d$ and the standard Gaussian measure with the density $\varphi(\mathbf{x}) = (2\pi)^{-d/2} \exp(-\|\mathbf{x}\|^2 / 2)$ or (ii) the hypercube $[-1, 1]^d$ and the (normalising) uniform measure.

Fig. 5.1: Numerical integration of the function (5.2) with respect to the standard normal distribution. The upper figure shows integral approximations by the kernel Monte Carlo (KMC) and the fully symmetric kernel Monte Carlo (FSKMC) as a function of the number $J$ of fully symmetric sets and the total number $n$ of Monte Carlo samples for ten realisations. Lower figures display the worst-case errors (standard deviations) and the length-scales fitted. Each fully symmetric set contains 48 nodes (see Table 3.1). The underlying black line is the value of the integral, which is approximately 0.389. The generator vectors for the FSKMC have been generated independently of the KMC samples. Both methods use the same length-scale, which has been fit using the MC samples of the KMC.
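Both of these kernel means admit simple closed forms, stated next, that reduce to products of one-dimensional integrals. As a sanity check, the following sketch (Python, using the standard library's `math.erf`; the function names are ours) compares the closed forms against brute-force trapezoidal quadrature in $d = 1$:

```python
import math

ELL = 0.8  # length-scale of the Gaussian kernel (5.1)

def k(x, y, ell=ELL):
    return math.exp(-(x - y) ** 2 / (2 * ell ** 2))

def kmean_gauss(x, ell=ELL):
    """Kernel mean for the standard Gaussian measure on R (d = 1 case)."""
    return math.sqrt(ell ** 2 / (1 + ell ** 2)) * math.exp(-x ** 2 / (2 * (1 + ell ** 2)))

def kmean_unif(x, ell=ELL):
    """Kernel mean for the normalising uniform measure on [-1, 1] (d = 1 case)."""
    c = math.sqrt(math.pi * ell ** 2 / 8)
    return c * (math.erf((x + 1) / (ell * math.sqrt(2)))
                - math.erf((x - 1) / (ell * math.sqrt(2))))

def trapz(f, a, b, n=40000):
    """Composite trapezoidal rule for the brute-force comparison."""
    h = (b - a) / n
    return h * (0.5 * f(a) + 0.5 * f(b) + sum(f(a + i * h) for i in range(1, n)))

x = 0.3
phi = lambda t: math.exp(-t * t / 2) / math.sqrt(2 * math.pi)
num_gauss = trapz(lambda t: k(x, t) * phi(t), -10, 10)
num_unif = 0.5 * trapz(lambda t: k(x, t), -1, 1)
print(abs(num_gauss - kmean_gauss(x)) < 1e-8,
      abs(num_unif - kmean_unif(x)) < 1e-8)  # prints: True True
```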
In the former case the kernel mean and its integral (needed for computing the WCE using Equation (2.3)) are
$$k_\mu(\mathbf{x}) = \int_{\mathbb{R}^d} k(\mathbf{x}, \mathbf{x}') \varphi(\mathbf{x}') \,\mathrm{d}\mathbf{x}' = \left(\frac{\ell^2}{1 + \ell^2}\right)^{d/2} \exp\left(-\frac{\|\mathbf{x}\|^2}{2(1 + \ell^2)}\right), \qquad \mu(k_\mu) = \int_{\mathbb{R}^d} k_\mu(\mathbf{x}) \varphi(\mathbf{x}) \,\mathrm{d}\mathbf{x} = \left(\frac{\ell^2}{2 + \ell^2}\right)^{d/2},$$
and in the latter
$$k_\mu(\mathbf{x}) = 2^{-d} \int_{[-1,1]^d} k(\mathbf{x}, \mathbf{x}') \,\mathrm{d}\mathbf{x}' = \left(\frac{\pi \ell^2}{8}\right)^{d/2} \prod_{i=1}^d \left[ \operatorname{erf}\left(\frac{x_i + 1}{\ell \sqrt{2}}\right) - \operatorname{erf}\left(\frac{x_i - 1}{\ell \sqrt{2}}\right) \right],$$
$$\mu(k_\mu) = 2^{-d} \int_{[-1,1]^d} k_\mu(\mathbf{x}) \,\mathrm{d}\mathbf{x} = \left(\frac{\pi \ell^2}{8}\right)^{d/2} \left( \sqrt{\frac{2\ell^2}{\pi}} \big(e^{-2/\ell^2} - 1\big) + 2 \operatorname{erf}\left(\frac{\sqrt{2}}{\ell}\right) \right)^d,$$
where $\operatorname{erf}(x) = \pi^{-1/2} \int_{-x}^{x} \exp(-t^2) \,\mathrm{d}t$ is the standard error function.

5.3. Example 1: Random generators. Our first example is just a proof of concept to demonstrate that the fully symmetric kernel Monte Carlo (FSKMC) from Section 4.1 indeed works (though not necessarily that well). We numerically integrate the non-radial function
$$(5.2) \qquad f(\mathbf{x}) = \exp\big(\sin^2(5\|\mathbf{x}\|) - (x_1^2 + 0.5 x_2^2 + 2 x_3^4)\big)$$
over $\mathbb{R}^3$ with respect to the standard normal distribution. Results for the kernel Monte Carlo (KMC) [52, 6], in which the nodes to be used in kernel quadrature are drawn randomly, and the fully symmetric kernel Monte Carlo are presented in Figure 5.1. For both the KMC and the FSKMC, the kernel length-scale was fit by the method of maximum likelihood (see [53, Chapter 5]) using the MC samples of the KMC. We have also experimented with fitting the FSKMC length-scale using the randomly generated fully symmetric sets, but in this case the fitted length-scale was markedly larger and the integral approximations much worse. It is clear that the FSKMC fares worse. However, the FSKMC has a tremendous advantage in computational scalability in the number of nodes.
In general, when the number of nodes exceeds some tens of thousands, kernel quadrature methods based on naively solving the weights from the linear system (2.2), such as the KMC, become infeasible due to their cubic time and quadratic space complexity. In contrast, fully symmetric kernel quadratures such as the FSKMC remain feasible: only $Jn$ kernel evaluations (recall that $J \leq n$ is the number of fully symmetric sets) and the solution of a linear system of $J$ equations are required, as opposed to the $n^2$ evaluations and system of $n$ equations of naive methods. For instance, using the FSKMC with 1,000 fully symmetric sets (i.e. 48,000 nodes) in this example would require 48,000,000 kernel evaluations and solving a linear system of 1,000 equations, neither of which is a computational challenge, while the KMC weights for 48,000 nodes cannot be computed on a standard computer. The difference becomes even more pronounced in higher dimensions, where fully symmetric sets contain significantly more points. The next two examples demonstrate the superior scalability of fully symmetric kernel quadratures.

5.4. Example 2: A priori known length-scale. This simple example demonstrates that sparse grid kernel quadrature based on fully symmetric sets is numerically stable and consistent, and works well for an extremely large number of nodes. We work in the domain $\Omega = [-1, 1]^d$, $d = 11$, equipped with the normalising uniform measure. The integrand is
$$(5.3) \qquad f(\mathbf{x}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}_f\|^2}{2\ell_f^2}\right)$$
with $\ell_f = 0.8$, where $\mathbf{x}_f$ is a vector of 11 evenly spaced points on the interval $[0.2, 0.5]$ (with the endpoints included). The integral we seek to approximate is
$$2^{-d} \int_{[-1,1]^d} f(\mathbf{x}) \,\mathrm{d}\mathbf{x} = \left(\frac{\pi \ell_f^2}{8}\right)^{d/2} \prod_{i=1}^d \left[ \operatorname{erf}\left(\frac{x_{f,i} + 1}{\ell_f \sqrt{2}}\right) - \operatorname{erf}\left(\frac{x_{f,i} - 1}{\ell_f \sqrt{2}}\right) \right] \approx 0.0392.$$
We use the Gaussian kernel with $\ell = \ell_f = 0.8$ and the Clenshaw–Curtis sparse grid kernel quadrature (CCSGKQ).
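This reference value can be evaluated directly from the error-function product above; a quick sketch (Python, for illustration only):

```python
import math

d, ell_f = 11, 0.8
# x_f: 11 evenly spaced points on [0.2, 0.5], endpoints included
x_f = [0.2 + 0.3 * i / (d - 1) for i in range(d)]

val = (math.pi * ell_f ** 2 / 8) ** (d / 2)
for xi in x_f:
    val *= (math.erf((xi + 1) / (ell_f * math.sqrt(2)))
            - math.erf((xi - 1) / (ell_f * math.sqrt(2))))

print(val)  # approximately 0.0392, the reference value quoted above
```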
Results for the relative error $|\mu(f) - Q_k(f)|/\mu(f)$ and the kernel worst-case error (or standard deviation) are shown in Figure 5.2 for the levels $q = 1, \dots, 9$, the last of them corresponding to a total of 15,005,761 nodes. Table 5.1 contains a breakdown of the computational times required. We also display results for the kernel Monte Carlo using up to 12,000 nodes. For this many nodes, the time taken by the CCSGKQ is negligible while the KMC is noticeably slowing down.

Fig. 5.2: Relative error $|\mu(f) - Q_k(f)|/\mu(f)$ (upper) and the worst-case error (lower; standard deviation) for integration of the function (5.3) on the 11-dimensional hypercube with the kernel Monte Carlo quadrature (KMC) and the Clenshaw–Curtis sparse grid kernel quadrature (CCSGKQ). The number of nodes for the KMC varied from 1,000 to 12,000 (with increments of 1,000) and the CCSGKQ used levels from 1 to 9. These corresponded to 23; 265; 2,069; 12,497; 63,097; 280,017; 1,129,569; 4,236,673; and 15,005,761 nodes and 2, 4, 8, 17, 36, 79, 172, 379, and 832 fully symmetric sets.

Table 5.1: Computational times in seconds for the KMC (left) and the CCSGKQ (right) in Example 2. The columns indicate the time taken by kernel evaluations (kernel), computing the weights from a linear system of equations (weights), and constructing the fully symmetric sets (FSS). The MATLAB code was run on a laptop sporting an Intel Core i5-6300 2.40 GHz processor and 8 GB of RAM.

KMC:
  n     kernel   weights
  1k    0.08     0.01
  2k    0.15     0.07
  3k    0.32     0.20
  4k    0.54     0.47
  5k    0.79     0.85
  6k    1.17     1.42
  7k    1.49     2.21
  8k    1.92     3.20
  9k    2.43     4.47
  10k   2.98     6.03
  11k   3.62     8.21
  12k   4.45     10.61

CCSGKQ:
  J / n        kernel    weights   FSS
  2 / 23       < 0.01    < 0.001   0.07
  4 / 265      < 0.01    < 0.001   0.07
  8 / 2k       < 0.01    < 0.001   0.04
  17 / 12k     0.02      < 0.001   0.04
  36 / 63k     0.14      < 0.001   0.10
  79 / 280k    1.26      0.003     0.27
  172 / 1.1m   10.91     0.004     0.82
  379 / 4.2m   90.41     0.004     2.63
  832 / 15m    760.00    0.072     8.61

It is seen that for similar numbers of nodes the two methods are roughly equivalent. When the level, and consequently the number of nodes, increases, the CCSGKQ becomes more accurate, which shows that the nodes are selected well enough and that the weights are being computed correctly. For the highest levels 8 and 9, the sparse grids consisted of 379 and 832 fully symmetric sets, or 4,236,673 and 15,005,761 nodes, resulting in $379 \times 4{,}236{,}673 = 1{,}605{,}699{,}067$ and $832 \times 15{,}005{,}761 = 12{,}484{,}793{,}152$ kernel evaluations needed to compute the weights. It is not possible to compute the KMC weights for this many nodes.

5.5. Example 3: Zero coupon bonds. This example demonstrates that fully symmetric kernel quadrature rules are also feasible in high dimensions. We use the toy example from [40] (see also [26, Section 6.1]) that is concerned with pricing zero coupon bonds through simulation of a discretised stochastic differential equation model. The model is convenient for our purposes, as there is a closed-form solution that serves as a baseline and finer discretisations correspond to higher integration dimensions. Consider the stochastic differential equation (going by the name Vasicek model)
$$\mathrm{d}r(t) = \kappa \big(\theta - r(t)\big) \,\mathrm{d}t + \sigma \,\mathrm{d}W(t),$$
where $W(t)$ is the standard Brownian motion and $\kappa$, $\theta$, and $\sigma$ are positive parameters. We want to solve this SDE at the time $t = T$. The Euler–Maruyama discretisation with the uniform step size $\Delta t = T/d$ is
$$r_k = r_{k-1} + \kappa (\theta - r_{k-1}) \Delta t + \sigma x_k, \qquad k = 1, \dots, d,$$
where $x_k \sim \mathrm{N}(0, \Delta t)$ are independent and $r_0$ is a free parameter.
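The resulting expectation admits the closed-form solution (5.5) stated below. As a sketch (Python, with the parameter values of this section; `bond_price` is our illustrative name), evaluating the closed form via the recursion $\beta_k = 1 + (1 - \kappa \Delta t) \beta_{k-1}$ gives baseline values $P(0, T) \approx 0.81$, consistent with the caption of Figure 5.3:

```python
import math

kappa, theta, sigma = 0.1817303, 0.0825398957, 0.0125901
r0, T = 0.021673, 5.0

def bond_price(d):
    """Closed form (5.5) for the discretised zero coupon bond price."""
    dt = T / d
    beta = [0.0]  # beta_0, unused
    for k in range(1, d + 1):
        # beta_k = sum_{i=1}^{k} (1 - kappa*dt)^(i-1), computed recursively
        beta.append(1.0 + (1.0 - kappa * dt) * beta[-1])
    gamma = sum(beta[k] * kappa * theta * dt - (beta[k] * sigma * dt) ** 2 / 2
                for k in range(1, d))
    return math.exp(-(gamma + beta[d] * r0) * T / d)

for d in (10, 100, 300):
    print(d, bond_price(d))  # values close to 0.81 for all d (cf. Figure 5.3)
```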
The quantity we are interested in is the Gaussian integral
$$(5.4) \qquad P(0, T) := \mathbb{E} \exp\left(-\Delta t \sum_{k=0}^{d-1} r_k\right) = \exp(-\Delta t\, r_0) \int_{\mathbb{R}^{d-1}} \exp\left(-\Delta t\, f\big(\sqrt{\Delta t}\, \mathbf{x}\big)\right) \varphi(\mathbf{x}) \,\mathrm{d}\mathbf{x},$$
where $f(\mathbf{x}) = \sum_{k=1}^{d-1} r_k$. This integral admits the closed-form solution
$$(5.5) \qquad P(0, T) = \exp\left(-(\gamma + \beta_d r_0) \frac{T}{d}\right)$$
with $\beta_k = \sum_{i=1}^k (1 - \kappa \Delta t)^{i-1}$ and $\gamma = \sum_{k=1}^{d-1} \big(\beta_k \kappa \theta \Delta t - (\beta_k \sigma \Delta t)^2 / 2\big)$. As can be seen, the number $d$ of discretisation steps controls the integration dimension, which is $d - 1$. In the integration experiment, we set
$$\kappa = 0.1817303, \quad \theta = 0.0825398957, \quad \sigma = 0.0125901, \quad r_0 = 0.021673, \quad T = 5.$$
These values are equal to those used in [40, 26]. We consider numerical integration of (5.4) for $d = 10, \dots, 300$ using the Gauss–Hermite sparse grid kernel quadrature (GHSGKQ) with $q = 2$. We use the Gaussian kernel with the somewhat heuristic choice $\ell = d$ of the length-scale. The central node (i.e. the origin) tended to have a fairly large negative weight, so it was removed to improve numerical stability. Results for the relative error and the worst-case error (standard deviation) are depicted in Figure 5.3. For comparison, we have also included a Monte Carlo estimate (the KMC is feasible only for dimensions somewhat less than 100, so it was excluded). The results show that the GHSGKQ is able to maintain an accuracy that is generally better than that of the standard MC. This indicates that fully symmetric kernel quadratures have potential also in very high dimensions. Note that the dimension-adaptive methods used in [26] would be more accurate in this example. It is probable that fully symmetric kernel quadratures could be combined with these methods.

Fig. 5.3: Relative error and the worst-case error (standard deviation) of the Gauss–Hermite sparse grid kernel quadrature (GHSGKQ) for the zero coupon bond setup of Section 5.5. The value of the integral (5.4), as computed from Equation (5.5), is between 0.81 and 0.815 for all dimensions. Also depicted is an integral estimate by the standard Monte Carlo using the same number of points as the GHSGKQ in each dimension.

6. Conclusions and discussion. We introduced fully symmetric kernel quadrature rules and showed that their weights can be computed exactly with a very simple algorithm under some assumptions on the integration domain, measure, and kernel. We also proposed using sparse grids in conjunction with this algorithm and provided some simple theoretical convergence analysis for Clenshaw–Curtis sparse grids. In the schemes presented, the nodes can be selected in a comparatively flexible manner. Three numerical experiments demonstrated that the approach is sound and can cope both with a very large number of nodes and with high-dimensional domains.

Even with the tremendous computational simplifications provided by the fully symmetric sets, kernel quadrature rules remain computationally more demanding than most classical quadrature rules. In the end, the decision on which method to use is highly dependent on the computational complexity of evaluating the integrand. Extremes where the rules we have developed are not necessarily useful are easy to identify: (i) in Section 5.4 it is clearly absurd that an integrand as cheap to evaluate as the kernel is evaluated 15 million times while 12 billion kernel evaluations are used to compute the weights, whereas (ii) when the integrand, being for example a complex computer simulation, is very expensive, the computational overhead from non-symmetric and likely more accurate kernel quadrature rules is going to be negligible.
Consequently, we believe that the method presented in this article is best suited for "moderately" expensive integrands when high accuracy is required or probabilistic modelling of uncertainty in the integral estimate is desired. This is of course somewhat ambiguous. Precise (and useful) analysis is complicated by, among other things, the facts that we do not know how the accuracy of a fully symmetric kernel quadrature rule compares to that of a non-symmetric one (e.g. Theorem 4.1 is only about rates, not the associated constant coefficients) and that it is difficult to account for the value one places on the uncertainty measure (nor is it easy to decide how much one should value this measure to begin with).

Besides what was discussed in the preceding paragraph, there is a number of topics that could be pursued in the future. These include:
– Developing principled methods for choosing the kernel length-scale for large-scale problems.
– A proper probabilistic approach to large-scale integration problems. We anticipate that much can be gained in pursuing this direction.
– As discussed in Sections 4.2 and 4.3, there is much room for improvement via optimisation of the fully symmetric sets.
– Rows of the submatrices $K_{ij}$ in Equation (3.5) typically contain several non-distinct elements. Minor computational improvements might be possible.

Acknowledgements. We thank François-Xavier Briol, Jon Cockayne, Chris Oates, Jens Oettershagen, and Filip Tronarp for discussion and constructive comments. Suggestions by the anonymous reviewers helped to improve many parts of the article.

REFERENCES

[1] N. Aronszajn, Theory of reproducing kernels, Transactions of the American Mathematical Society, 68 (1950), pp. 337–404.
[2] A. Berlinet and C. Thomas-Agnan, Reproducing Kernel Hilbert Spaces in Probability and Statistics, Springer, 2004.
[3] A. Yu.