Bayesian treed Gaussian process models with an application to computer modeling

Ba y esian T reed Gaussian Pro cess Mo dels with a n Application to Computer Mo deling Rob ert B. Gra macy and Herb ert K. H. Lee ∗ Abstract Motiv ated b y a computer exp e riment for the design of a r oc ke t b o oster, this pap er explores nonstationary mo d eli ng metho dolo gies that couple stationary Gaussian pro cesses with treed partitioning. P artitioning is a simple but eﬀectiv e metho d for d ea ling with n o nstationarit y . Th e metho dolo gical dev elopmen ts and statist ical computing details whic h mak e this approac h eﬃcien t are describ ed in detail. In addition to p ro vidin g an analysis of the ro c k et b o oster sim ulator, our approac h is demonstrated to b e eﬀectiv e in other aren a s. Key w ords: computer sim ulator, recursiv e p a rtitioning, nons t ationary spatial mo del, nonp a rametric r e gression, h et eroscedasticit y ∗ Rob ert Gr amacy is Lecturer, Statistical Lab oratory , Univ ersity of Ca m bridge, UK (Ema il: bo bb y@statslab.ca m.a c.uk) and Herber t Lee is Asso ciate Profess or, Department of Applied Ma the- matics and Statistics, Univ ersity of Ca lifornia, Santa Cr uz, CA 950 64 (Email: herbie@ams.uc s c.edu). The author s would like to thank William Macready for or iginating the co llabora tion with NASA and for his help with the pro ject, Thomas P ulliam and E dw ard T ejnil for their help with the NASA da ta, T amara Bro deric k fo r her ca reful reading and detailed comments, and the editor, asso ciate editor, and t wo anonymous referees for their helpful comments and suggestions. This work w as partially supp orted by N ASA awards 08008 -002-011- 0 00 and SC 2003028 NAS2-03144, Sandia Nationa l Labs gra nt 4 9 6420, and National Science F oundation gra n ts DMS 02337 10 and 0504851 . 1 1 In tro ducti on Muc h of mo dern engineering design is no w done through computer mo deling, whic h is b oth faster and more cost-eﬀectiv e than building small-scale mo dels , particularly in the earlier stages of design w hen more scop e for ch anges is desired. As design pro ce eds, increasingly sophisticated sim ulat o rs may b e created. Our w ork here w as motiv ated b y a sim ulator of a prop osed ro c k et b o oster. NASA relies hea vily on sim ulators for design, as wind tunnel exp erime nts are quite expensiv e and still not fully realistic of the range of ﬂigh t exp eriences . In particular, one of the hig hly critical p oin ts in t ime for a ro c k et b o o s ter is the momen t that it re-en ters the atmosphere. Suc h conditions are diﬃcult to recreate in a wind tunnel, and it is obviously imp ossib le to r un a standard ph ysical exp eriment. Th us to learn a bout the b eha vior of the prop osed ro c ke t b o oster, NASA uses computer sim ulation. Sim ulators can in v olve large amounts o f ph ysical mo delin g, requiring the n umeri- cal solution of complex systems of diﬀeren tial equations. The NASA sim ulator w a s no exception, typically requiring b et w een ﬁv e and t w en t y hours for a single run. Th us, NASA was in terested in creating a statistical mo del o f t he sim ulator itself, an emulator or surr o gate mo del , in the terminology of computer mo deling. The standard appro ac h in the literature for em ulation is to mo del the sim ulator output with a stationary smo oth Gaussian pro cess (GP) (Sacks et al., 19 8 9 ; Kennedy and O’Hagan, 2001; Santne r et al., 2003). How ev er, this appro a c h pro v ed to b e inadequate for the NASA data. In partic- ular, w e were faced with problems of nonstationarity , heteroscedasticit y , and the size of the dataset. Th us we in tro duce here an expansion of GPs, based on the idea o f Bay esian partition mo dels (Chipman et al., 2002; Denison et al., 2002), which is able to address these issues. GPs are conceptually straightforw a rd, can easily accommo date prior knowle dge in the for m of co v ariance functions, and can return estimates of predictiv e conﬁdence, whic h 2 w ere desired by NASA. How ev er, w e highlight three disadv an tages of the standa r d fo rm of a GP whic h w e had to confron t on this dataset, and expect to encoun ter on a wide range of other applications. Fir s t, inference on the GP scales p o orly with the n um b er of data p oin ts, N , ty pically r equiring computing time in O ( N 3 ) f or calculating inv erses of N × N co v a riance matrices. Second, GP mo dels are usually stationary in that the same co v aria nc e structure is used throughout the en tire input space, whic h ma y b e to o strong of a mo deling assumption. Third, the estimated predictiv e error (as opp ose d to the predictiv e mean v alue) of a statio nary mo del do es not directly dep end on the lo cally observ ed response v alues. Rather, the predictiv e error at a p oin t dep ends only on the lo cations of the nearby o bs erv at io ns a nd on a global measure o f error that uses all of the discrepancies b et w een observ ations and predictions without regard for their distance from the p oin t of interes t (b ecause of the stationarity assumption). (Section 4.3 prov ides more details, in pa r ticu lar note tha t Eq. (12) do es not dep end on z .) In man y real-w orld spatial and stochastic problems, such a unifor m mo deling of uncertain t y will not b e desirable. Instead, some regions of the space will tend to exhibit larger v a riabilit y than others. On the other ha nd, fully no ns tationa r y Ba y esian GP mo dels (e.g., Higdon et al., 1999; Sc hmidt and O’Hag a n , 2003) can b e diﬃcult to ﬁt, and are not computationally tractable for more than a relativ ely small n um b er of datap oin ts. F urther discussion of nonstationary mo dels is deferred until the end o f Section 3.2. All of these shortcomings can b e addressed by partitioning the input space in to re- gions, and ﬁtting separate stationary G P mo dels within eac h region (e.g., Kim et al., 2005). Partitioning pro vides a relativ ely straightforw ard mechanis m f or creating a no n- stationary mo de l, a nd can ameliorate some of the computational demands b y ﬁtting mo dels to less data. A Bay esian mo del av eraging approach allo ws for the explicit es- timation of predictiv e uncertain ty , whic h can no w v ary b ey o nd the constraints o f a stationary mo del. Finally , an R pack age with implemen tations of all of the mo dels dis- 3 cussed in this pap er is a v a ilable at http://www. cran.r-project.org/web/packages/tgp/index.html . W e note tha t b y partitioning, w e do not hav e a n y theoretical guarantee of contin uity in the ﬁtted func- tion. How eve r, as w e demonstrate in sev eral examples, Ba y esian mo del a v eraging yields mean ﬁtted functions that are t ypically quite smo oth in practice, giving ﬁts that are indistinguishable f rom contin uous functions except when the data call fo r the contrary . Indeed the abilit y t o accurately mo del p ossible discon tinuities is a side b eneﬁt of this approac h. The rest of the pap er is or g anize d as follows. Section 2 describ es the motiv ating data in further detail. Section 3 provid es some bac kground material. Section 4 com bines stationary GPs and treed partit io ning to create tree d G Ps , implemen ting a tractable nonstationary mo del fo r no npa r a me tric regression. In Section 5 we return to the analysis of the ro c ke t b oo ster data, and in Section 6 w e conclude with some discussion. 2 The Langley Glid e-Bac k Bo oster Sim ulation The Langley Glide-Bac k Bo oste r (LGBB) is a proposed ro c ket bo oster under design at NASA. Standard ro c k et b o osters are created to b e reusable, assisting in the launc h pro cess and then parac h uting bac k to the Earth after their f uel has b een exhausted. Their return path is planned so that they fall in to the o cean, where they can b e reco v ered and reused. The LGBB represen ts a new direction in b o oster design, as it w ould hav e wings and a tail, lo oking somewhat s imilar to the space sh uttle. The ide a is that it w ould gracefully glide back down, rather than plummeting into the o cean. The dev elopmen t o f the b oo s ter is b eing done primarily through the use of computer sim ulators. The particular mo del (Rog e rs et al., 20 0 3 ) with whic h we w ere in v olve d is based on computational ﬂuid dynamics simulators that n umerically solv e the relev a nt in viscid Euler equations ov er a mesh of 1.4 million cells. Eac h run of the Euler solv er for 4 a given set of parameters can t ak e 5-20 hours on the NASA computers. The sim ulator is theoretically deterministic, but the solv er is t ypically started with random initial conditions and do es not alw a ys n umerically conv erge. There is a n automated c hec k for con v ergence whic h is mostly a cc urate, but some runs are mark ed as accepted despite their false con v ergence, or they con v erge to a clearly inferior lo cal mode. F or those runs that fail the automated con v ergence c hec k, the solv er is restarted at a diﬀe rent set of randomly c hosen initial conditions. Our NASA collab orators ha ve commen ted that input conﬁgurations arbit r a rily close to one ano the r can fail t o achiev e the same estimated con v ergence, ev en af t er satisfying the same stopping criterion. Th us neither simple interpolation of the data no r a Ga us sian pro cess mo del without a n error term will b e adequate, as smo othing will b e neces sary to reduce the impact of the ina ccurate runs. The sim ulator mo dels the forces felt by the ve hicle at the momen t it is r e- e ntering the atmosphere. As a f ree b o dy in space, there are six degrees of freedom, so the six relev an t forces a r e lift, drag, pitc h, side-force, y a w, and roll. F or this pro j ect, the in terest fo cused just on the lift force, a s that is the most imp ortan t one for k eeping a ve hicle aloft. The inputs to the sim ulator are the sp ee d of the v ehicle at re-en try (measured by Mac h n um b er), the angle of attac k (the alpha angle), and the sideslip ang le ( t he b eta angle). Th us the pr ima r y goal is to mo del the lift force as a function of sp eed , alpha, and b eta. The sideslip angle is quan tized in the exp erimen ts, so it is run only at six particular lev els. Sp eed ranges from Mac h 0 t o 6, and the angle o f attac k, alpha, v aries from negativ e ﬁv e to thirty degrees. The sim ulator w as run at 3041 lo cations, ov er a com bination of three hand-designed grids. The ﬁrst grid w as relativ ely coarse and w as equally spaced ov er the whole region o f in terest. Two success iv ely ﬁner grids on smaller regions primarily around Mac h one w ere further run, as the initial run sho w ed that the most in teresting part of the input space was generally aro und t he sound barrier. This 5 mak es sense b ecause the physic s in the sim ulator comes from t w o completely diﬀeren t regimes, a subsonic regime for sp eeds less than Mac h one, and a sup ersonic regime for sp ee ds greater than Mac h one. What happ ens close to and along the b oundary is the most diﬃcult part of the sim ula tion. Mach (speed) 1 2 3 4 5 6 alpha (angle of attack) 0 10 20 30 lift 0.0 0.5 1.0 1.5 lift=f(mach,alpha,beta=0,) Mach (speed) 1 2 3 4 5 6 alpha (angle of attack) 0 10 20 30 lift 0.0 0.5 1.0 lift=f(mach,alpha,beta=0.5) Mach (speed) 1 2 3 4 5 6 alpha (angle of attack) 0 10 20 30 lift 0.0 0.5 1.0 lift=f(mach,alpha,beta=1) Mach (speed) 1 2 3 4 5 6 alpha (angle of attack) 0 10 20 30 lift 0.0 0.5 1.0 lift=f(mach,alpha,beta=2) Mach (speed) 1 2 3 4 5 6 alpha (angle of attack) 0 10 20 30 lift 0.0 0.5 1.0 lift=f(mach,alpha,beta=3) Mach (speed) 1 2 3 4 5 6 alpha (angle of attack) 0 10 20 30 lift 0.0 0.5 1.0 lift=f(mach,alpha,beta=4) Figure 1: Interpolatio n of lift b y sp eed and angle of a ttac k for all sideslip lev els. Note that for lev els 0.5 and 3 (cen t er), Mac h ranges only in (1 , 5) and (1 . 2 , 2 . 2). The upp er-left plot in Figure 1 sho ws an in terp olation of the sim ulator output for the lift surface as a function of sp eed and angle of attac k, when the sideslip angle is zero. The primary feature of t his plot is the large ridg e whic h app ears at Mac h one and larger angles of attac k. The transition from subsonic to sup ersonic is a sharp one, and it is not clear whether one would w an t to use a contin uous mo del or t o in tro duce a discon tin uit y . While m uch of the surface is quite smo oth, parts of the surface, particularly a round Mac h one, are less smo oth. So it is clear that the standard computer mo deling assumption 6 of a stationary pro cess will not work we ll here. W e will need a metho d that allo ws for a no ns tationa r y form ulation, yet tha t can still pro duce uncertain ty estimates, and is computationally feasible to ﬁt on a dataset of this size. One other feature of the data that appears in Figure 1 is the issue of n umerical con v ergence. In the upp er-left corner of the upp er-left plot (high angle of attac k, low sp ee d), there is a single spik e that lo oks out o f place. Our collab orators at NASA b eliev e this to b e a result of a false con v ergence b y the simu lator , so w e would wan t our surrogate mo del to smo oth this one p oin t out. This stands in con trast to most computer mo deling problems, where o ne w ants to in terp olate the deterministic sim ulator without smo othing. Here w e require smo othing to comp ensate f o r problems with the sim ulator. The other plots of Figure 1 sho w the issue of f alse conv ergence more strongly for other sideslip settings. In the cen ter plots (lev els one-half and three), there are noisy depressions in the surface for mo derate sp eed and high angle of a t tac k. Because this feature is not seen in the other plots by sideslip angle, one may suspect that this region could b e sho wing mor e nume rical instabilit y than signal. Th us, there is a need to com bine information across the lev els in order to smo oth out n umerical problems with the simulator. Not e that b ec ause no subsonic inputs were sampled for these slices, the ridge around Mac h one do es no t app ear in these tw o plots. F or sideslip lev els of one, tw o, and four the surface again app ears to b e most in ter- esting ar o und Mac h one. But instead of a clean ridge at lev els one and four it is noisy , esp ecially at high angles of attack. It is not clear if this v ariability is due to false con- v ergence of the sim ulator, inadequate co v erage in the design, or if the b oundary really is this complicated. The NASA scien tists ha v e p ostulated tha t the instabilities are more lik ely to b e nume rical, rat her than structural, but w e will w an t our surrogate mo del to capture this uncertain t y . Also of concern are the deviations from the smo oth trend at high sp ee ds (particularly 7 for lev el four), with up w ard deviations at lo w angles of atta c k and do wn w ard deviations at hig h a ng le s. Again, it is suspected that these a re the result of false n umerical con v er- gence of the sim ulator, but we cannot rule out a priori that the phys ical system itself b ecomes unstable at higher sp eeds. So w e desire smo othing, but with an appro pr ia te lo cal estimate of uncertain ty . Fitting with a single stationary GP w ould giv e uncer- tain ty estimates tha t were fairly uniform across the space, b ecause of t he assumption of stationarit y . Thus we turn t o a partitioning a pproac h. Understanding the mean surface is imp ortan t for the engineers for sev eral reasons. First, they ma y disco v er p oten tial pro blems with the design, whic h could lead to struc- tural c hanges in the design. Second, they will need to determine the optimal ﬂigh t conditions so they can pla n the ﬂight tra jectories. Third, they need to b e able to make con tingency plans in cas e problems arise during a mission. If some of the stabilizing ro c k ets fail and the v ehicle m ust re-en ter at a n unplanned angle or sp eed, they will need to b e able to map out its new t ra jectory and adjust the pro cess as necessary . The en- gineers are interes ted in not just the mean surface, but also the uncertain t y asso ciated with prediction, b ecause t hese uncertainties a r e not constant across the surface. 3 Related w ork Our approac h to nonparametric and semiparametric nonstationary mo deling com bines standard GPs and treed partitioning within the con text of Bay esian hierarc hical mo del- ing and mo del av eraging. W e a s sume that the reader is familiar with the ba sic concepts of Ba y esian inference via Mark o v chain Mon te Carlo (e.g., Gilks et al., 1996). An in tro- duction to GPs and treed pa rtition mo deling follows . 8 3.1 Stationary Gaussian Pro cesses A common sp eciﬁcation of sto c hastic pro cess es for spatial data, of whic h the stationary Gaussian pro ces s (G P) is a particular case, sp eciﬁes that mo del o utputs (resp onses) z , depend on m ultiv ariate inputs (explanatory v ariables) x , as z ( x ) = β ⊤ f ( x ) + w ( x ) where β are linear trend co eﬃcien ts, w ( x ) is a zero mean random pro cess with cov ariance C ( x , x ′ ) = σ 2 K ( x , x ′ ), K is a correlation mat r ix, and β is indep enden t of w in the prior. Lo w-order p olynomials are sometimes used instead of the simple linear mean β ⊤ f ( x ), o r the mean pro ces s is sp eciﬁed generically , often as ξ ( x , β ) or ξ ( x ). F ormally , a G auss ian pro cess is a collection of random v ariables Z ( x ) indexed b y x , ha ving a join tly Gaussian distribution for any ﬁnite subset of indices ( Ste in , 1999). It is sp eciﬁed b y a mean function µ ( x ) = E  Z ( x )  and a correlatio n function K ( x , x ′ ) = 1 σ 2 E  [ Z ( x ) − µ ( x )][ Z ( x ′ ) − µ ( x ′ )] ⊤  . W e a s sume that the corr elat io n function can b e written in the form K ( x j , x k | g ) = K ∗ ( x j , x k ) + g δ j,k . (1) where δ · , · is the Kronec ker delta function and K ∗ is a prop er underlying parametric correlation function. The g term in Eq. (1) is referred to as the n ugget. It m ust alw a ys b e p ositiv e ( g > 0), and it serv es tw o purp oses. First, it prov ides a mec hanism fo r in tro ducing measuremen t error in to the sto c hastic pro ces s. It arises when considering a mo del of the form Z ( x ) = ξ ( x , β ) + w ( x ) + η ( x ) , where w ( · ) is a pro cess with cor r elat io ns go v erned b y K ∗ , and η ( · ) is simply Gaussian noise. Sec ond, the n ugget helps preve nt K from becoming nu merically singular. Not a tional con v enience and conceptual congruence motiv ates referral to K a s a correlat io n matrix, ev en thoug h the n ugget term ( g ) forces K ( x i , x i ) > 1. The re is a n isomorphic mo del sp eciﬁcation wherein K depicts pro p er correlations. Under b oth sp eciﬁcations K ∗ do es indeed deﬁne a v alid correlation matrix K ∗ . The correlation functions K ∗ ( · , · ) are t ypically sp ec iﬁed through a lo w dimensional 9 parametric structure, whic h also guaran tees that they are symmetric and p ositiv e semi- deﬁnite. Here w e f ocus on the p o w er family , although o ur metho ds are clearly extensible to other families, suc h as t he Mat´ ern class (Mat´ ern, 1 986). F urther discussion of corre- lation structures can b e found in Abrahamsen (19 9 7 ) or St ein (19 9 9 ). The p o w er family of correlation functions includes the simple isotropic parameterization K ∗ ( x j , x k | d ) = exp  − || x j − x k || p 0 d  , (2) where d > 0 is a single r a nge parameter and p 0 ∈ (0 , 2] determines the smo othness of the pro cess. Th us the correlation of tw o p oin ts dep e nds only on the Euclidean distance || x j − x k || b et w een t hem. A straightforw ar d enhancemen t to the isotropic p o w er family is to employ a separate ra nge parameter d i in eac h dime nsion ( i = 1 , . . . , m X ). The resulting correlation function is still stationary , but no lo ng e r isotropic: K ∗ ( x j , x k | d ) = exp ( − m X X i =1 | x ij − x ik | p 0 d i ) . (3) 3.2 T reed P artitioning Man y spatial mo deling problems require more ﬂexibilit y than is oﬀered by a stationary GP . One w ay to ac hiev e a more ﬂexible, nonstatio nary , pro cess is to use a partition mo del—a meta-mo del which divides up the input space and ﬁts diﬀeren t base mo dels to data independen tly in eac h region. T r eed partitio ning is one p ossible approac h. T reed partition mo dels typically divide up the input space b y making binar y splits on the v alue of a single v ariable (e.g., x 1 > 0 . 8) so that partition b oundaries are parallel to co ordinate axes. P artitioning is recursiv e, so eac h new partition is a sub-partition of a previous one. F or example, a ﬁrst partitio n ma y divide the space in half b y whether the ﬁrst v ariable is ab o v e or b elo w its midp oin t. The second partition will then divide only the space b elo w (or ab o ve) the midp oin t of the ﬁrst v ariable, so tha t there are now 10 three partitions (not four). Since v ariables ma y be revisited, there is no loss of generality b y using binary splits, as m ultiple splits on the same v ariable will b e equiv alent to a non-binary split. In eac h part ition (leaf of the tr ee), an indep ende nt mo del is applied. Classiﬁcation and Regression T rees (CAR T) (Breiman et al., 19 84 ) are an example of a treed partition mo del. CAR T, whic h ﬁts a constan t surface in eac h leaf, has b ec ome p opular b ecause of its ease of use, clear in terpretation, and abilit y to provide a go o d ﬁt in man y cases. The Bay esian approach is straigh tforward to apply to CAR T ( Chipman et al., 1998; Denison et al., 1 998), pro vided that one can sp ecify a meaningful prior for the size of the tree. W e follow Chipman et al. (1998) who sp ecify the prior thro ugh a tree-generating pro cess and enforce a minim um amount of da t a in or de r to infer the pa r a me ters in eac h partition. Starting with a n ull tree (all data in a single region), a leaf no de η ∈ T , represen ting a region of the input space, splits with proba bility a (1 + q η ) − b , where q η is the depth of η ∈ T and a a nd b are parameters c hosen t o give an appropriate size and spread to the distribution of trees. F urther details a re a v a ila ble in the Chipman et al. pap ers. F or our mo dels, w e ha v e fo und tha t default v alues of a = 0 . 5 and b = 2 often w ork w ell in practice, although in an y particular problem prior kno wledge ma y call for ot he r v alues. The prio r for the splitting pro cess inv olves ﬁrst c ho osing the splitting dimension u from a discrete unifo rm, and then the split lo cation s is c hosen uniformly from a subse t o f the lo cations X in the u th dimension. In tegrating out dep endence o n the tr ee structure T can b e a c complished via Rev ersible-Jump (RJ) MCMC as further described in Section 4.2.2. Chipman et al. (2002) generalize Ba y esian CAR T to create the Bay esian treed linear mo del (LM) b y ﬁtting hierarc hical LMs a t the leav es of the tree. In Section 4 w e generalize further by prop osing to ﬁt stationary GPs in eac h of the lea v es of the tree. This approach b ears some similarit y to that of Kim et al. (2005), who ﬁt separate GPs in 11 eac h elemen t of a V oro no i tessellation. The treed G P approach is b etter geared tow ard problems with a smaller n um b er of distinct partitions, leading to a simpler o v erall mo del. V o r onoi tessellations allow an intricate partitioning of the space, but hav e the trade-oﬀ of added complexit y and can pro duce a ﬁnal mo del that is diﬃcult to in terpret. The tessellation approac h also has the b eneﬁt of not b eing restricted to axis-aligned partitions (although in some cases, simple transformations suc h as r o tating the data will suﬃce to allo w axis-aligned part it io ns ). A nice r e view of Bay esian partition mo deling is pro vided b y Denison et al. (2002). Other approac hes to nonstationary mo deling include those whic h use spatial defor- mations and pro cess conv olutions. The idea b ehind the spatial deformation approach is to map nonstationary inputs in the original, geographical, space in to a disp ersion space wherein the pro ce ss is stationary . Sampson and Guttorp (1 992) use thin-plate spline mo dels and m ultidimensional scaling (MDS) to construct the mapping. Damian et al. (2001) explore a similar metho dology fr o m a Bay esian p erspectiv e. Schm idt and O’Hagan (2003) also take the Ba y esian approac h, but put a Gaussian pro cess prior on the map- ping. The pro cess con v olution approach (Higdon et al., 1999; F uen tes, 200 2 ; Pac iorek , 2003) pr o ceeds b y a llo wing the con v olution k ernels to v a ry smo othly in parameterization as an unknow n f unction of their spatial lo cation. A common theme among suc h nonsta- tionary mo dels is the in tro duction of meta-structure whic h ratchets up the ﬂexibilit y of the mo del, ratc heting up the computational demands as we ll. This is in stark contrast to the treed approac h that introduces a structural mec ha nism, the tree T , that a ctually reduces the computational burden relat ive to the base mo del, e.g., a GP , b ecaus e smaller correlation matrices are inv erted. A k ey diﬀerence is that these alt e rnative approaches strictly enforce con tin uit y of t he pro cess, whic h requires muc h more eﬀort than the treed approac h. 12 4 T reed Gaussian pro ce ss mo dels Extending the partitioning ideas of Chipman et al. (1998, 2002) for simple Ba yes ian treed mo dels, w e ﬁt stationa r y GP mo dels with linear trends indep enden tly within each of R regio ns , { r ν } R ν =1 , depicted at the leav es of the tree T , instead of constan t (199 8) or linear (2002) mo dels. The tree is av eraged out by in tegrating o v er p oss ible trees, using RJ-MCMC (Richardson and G ree n, 1 9 97 ), with the tree prior sp eciﬁed thro ugh a tree-generating pro cess. Prediction is conditioned on the tree structure, and is av eraged o v er in the p osterior to get a f ull a cc ounting of uncertaint y . 4.1 Hierarc hical Mo del A tree T recursiv ely partitions the input space into in to R non-ov erlapping regions: { r ν } R ν =1 . Each region r ν con tains data D ν = { X ν , Z ν } , consisting of n ν observ ations. Let m ≡ m X + 1 b e n um b er of cov ariates in the design (input) matr ix X plus a n in tercept. F or eac h region r ν , the hierarc hical generative GP mo del is Z ν | β ν , σ 2 ν , K ν ∼ N n ν ( F ν β ν , σ 2 ν K ν ) , β 0 ∼ N m ( µ , B ) β ν | σ 2 ν , τ 2 ν , W , β 0 ∼ N m ( β 0 , σ 2 ν τ 2 ν W ) τ 2 ν ∼ I G ( α τ / 2 , q τ / 2) , (4) σ 2 ν ∼ I G ( α σ / 2 , q σ / 2) , W − 1 ∼ W (( ρ V ) − 1 , ρ ) , with F ν = ( 1 , X ν ), a nd W is an m × m matr ix . The N , I G , and W are t he (Multiv ari- ate) Normal, In v erse-Gamma, and Wishart distributions, respectiv ely . Hyp erparameters µ , B , V , ρ, α σ , q σ , α τ , q τ are treated as known. The model (4) sp ec iﬁes a m ultiv ariate normal lik eliho o d with linear trend co eﬃcien ts β ν , v aria nc e σ 2 ν , a nd n ν × n ν correlation matrix K ν . The co eﬃcien ts β ν are b eliev ed to hav e come from a common unkno wn mean β 0 and region-sp eciﬁc v a riance σ 2 ν τ 2 ν . There is no explicit mec hanism in the mo del (4) to ens ure that the pro cess near the b oundary of tw o adjacen t regions is contin u- ous a c ross the partitions depicted by T . Ho w ev er, the mo del can capture smo othness 13 through mo del a v eraging, as will b e discussed in Section 4.3. In our work with mo dels for phys ical pro cess es, w e f req uen tly encounter problems with phase transitions where the resp o ns e surface is not smoo th at the b oundary b et wee n distinct ph ysical regimes, suc h as subsonic vs. sup ersonic ﬂigh t of the ro c ket b o oster, so w e view the abilit y to ﬁt a discon tinuous surface as a feature of this mo del. The GP correlation s tructure K ν ( x j , x k ) = K ∗ ν ( x j , x k ) + g ν δ j,k generating K ν for eac h partition r ν tak es K ∗ ν to b e from the isotropic p o w er family (2) , or separable p ow er family (3), with a ﬁxed p o w er p 0 , but unkno wn (random) ra ng e and n ugget parameters. Ho w ev er, since most o f the follo wing discussion holds for K ∗ ν generated b y other families, as w ell as for unkno wn p 0 , w e shall refer to the correlation parameters indirectly via the resulting correlation matrix K , or function K ( · , · ). F or example, p ( K ν ) can represen t either of p ( d ν , g ν ) or p ( d ν , g ν ), etc. Priors that enco de a preference for a mo del with a nonstationary global co v ariance structure are c hosen for pa ramete rs to K ∗ ν and g ν . In particular, w e prop ose a mixture of Gammas prior for d : p ( d, g ) = p ( d ) × p ( g ) = p ( g ) × 1 2 [ G ( d | α = 1 , β = 20) + G ( d | α = 10 , β = 1 0 )] . (5) It giv es roughly equal mass t o small d represen ting a p opulation of GP pa r ame teriza- tions for w a vy surfa ces, and a separate p opulation for those whic h are quite smoo th or appro ximately linear. W e take the prior for g to b e Exp( λ ). Alternativ ely , one could enco de the prior as p ( d , g ) = p ( d | g ) p ( g ) and then use a reference prio r (Berger et al., 2001) for p ( d | g ). W e prefer t he more delib erate mixture sp eciﬁcation b oth b ecause of its mo deling implications, as w ell as its abilit y to interface w ell with limiting linear mo dels (Gramacy and Lee, 2008). It may also b e sensible to deﬁne the prio r for { K , σ 2 , τ 2 } ν hierarc hically , dep ending on parameters γ (not indexed by ν ), similar to ho w the p opulation of β ν parameters is giv en a common prior in terms of W and β 0 in (4). 14 4.2 Estimation The data D ν = { X , Z } ν are used to up date the GP parameters θ ν ≡ { β , σ 2 , K , τ 2 } ν , for ν = 1 , . . . , R . Conditiona l on the tree T , the full set of parameters is denoted as θ = θ 0 ∪ S R ν =1 θ ν , where θ 0 = { W , β 0 , γ } denotes upp er-lev el parameters fro m the hierarc hical prior that are also up dated. Samples f r o m the p oste rior distribution of θ are g a there d using Marko v c hain Mon te Carlo (MCMC) b y ﬁrst conditio ning on the hierarc hical prio r pa r ame ters θ 0 and dra wing θ ν | θ 0 for ν 1 , . . . , ν R , and then θ 0 is dra wn as θ 0 | S R ν =1 θ ν . Section 4.2.1 giv es the details. All para meters can be sampled with Gibbs steps, except tho s e that parameterize the cov ariance function K ( · , · ), e.g., { d, g } ν , whic h require Metrop olis-Hastings (MH) dra ws. Section 4.2.2 sho ws ho w RJ-MCMC is used to gather samples from the jo in t p osterior of ( θ , T ) by alternately drawing θ |T and T | θ . 4.2.1 GP parameters given a t ree ( T ) Conditional conj ug acy allow s Gibbs sampling fo r most parameters. F ull deriv ations of the fo llowing equations are av ailable in Gramacy (2005). The linear regression parameters β ν and prior mean β 0 b oth ha v e m ultiv a r ia te normal full conditiona ls : β ν | rest ∼ N m ( ˜ β ν , σ 2 ν V ˜ β ν ), a nd β 0 | rest ∼ N m ( ˜ β 0 , V ˜ β 0 ), where V ˜ β ν = ( F ⊤ ν K − 1 ν F ν + W − 1 /τ 2 ν ) − 1 , ˜ β ν = V ˜ β ν ( F ⊤ ν K − 1 ν Z ν + W − 1 β 0 /τ 2 ν ) , (6) V ˜ β 0 =  B − 1 + W − 1 P R ν =1 ( σ ν τ ν ) − 2  − 1 , ˜ β 0 = V ˜ β 0  B − 1 µ + W − 1 P R ν =1 β ν ( σ ν τ ν ) − 2  . The linear v ariance parameter τ 2 follo ws an in v erse-gamma: τ 2 ν | rest ∼ I G (( α τ + m ) / 2 , ( q τ + b ν ) / 2) , where b ν = ( β ν − β 0 ) ⊤ W − 1 ( β ν − β 0 ) /σ 2 ν . The linear mo del co v a riance matr ix W f ollo ws an inv erse-Wishart: W − 1 | rest ∼ W m  ( ρ V + V ˆ W ) − 1 , ρ + R  , where V ˆ W = R X ν =1 1 ( σ ν τ ν ) 2 ( β ν − β 0 )( β ν − β 0 ) ⊤ . 15 Analytically in tegrating o ut β ν and σ 2 ν giv es a marginal p osterior for K ν and improv es mixing of the Mark o v c hain (Berger et al., 20 0 1 ). p ( K ν | Z ν , β 0 , W , τ 2 ) =  | V ˜ β ν | (2 π ) − n ν | K ν || W | τ 2 m  1 2  q σ 2  α σ / 2 Γ  1 2 ( α σ + n ν )   1 2 ( q σ + ψ ν )  ( α σ + n ν ) / 2 Γ  α σ 2  p ( K ν ) , (7) where ψ ν = Z ⊤ ν K − 1 ν Z ν + β ⊤ 0 W − 1 β 0 /τ 2 − ˜ β ⊤ ν V − 1 ˜ β ν ˜ β ν . (8) Eq. (7) can be used to iterativ ely obtain dra ws for the parameters of K ν ( · , · ) via Metrop olis- Hastings (MH), or as part of the acceptance rat io for prop osed mo diﬁcations to T [see Section 4.2.2]. An y h yp erparameters to K ν ( · , · ), e.g., parameters to priors for { d, g } ν of the isotro pic p o w er family , would also require MH draws. The conditional distribution of σ 2 ν with β ν in tegrated out allows G ibbs sampling: σ 2 ν | Z ν , d ν , g , β 0 , W ∼ I G (( α σ + n ν ) / 2 , ( q σ + ψ ν ) / 2) . (9) 4.2.2 T ree ( T ) In tegrating out dep endenc e on the tree structure ( T ) is accomplished b y RJ-MCMC. W e augmen t the tr ee op erations of Chipman et al. (1998)— gr ow, prune, change, swap — w ith a rotate op eration. A change op eration prop oses moving an existing split-p oin t { u , s } to either the next greater or lesser v alue of s ( s + or s − ) along the u th column of X . This is accomplished b y sampling s ′ uniformly from the set { u ν , s ν } ⌈ R/ 2 ⌉ ν =1 × { + , −} whic h causes the MH acceptance ratio for cha nge to reduce t o a simple lik eliho o d ra tio since parameters θ r in regions r b elow the split-p oin t { u, s ′ } are held ﬁxed. A swap op eration prop oses c hanging the order in whic h tw o adja ce nt parent-c hild (in ternal) no des split up the inputs. An inte rnal parent-c hild no de pair is pick ed at ran- dom from the tree and their splitting rules are sw app ed. Ho w ev er, sw aps on paren t-c hild in ternal no des whic h split on the same v ariable cause problems b e cause a c hild region 16 b elo w b oth paren ts b ecomes empty after the op eration. If instead a r otate op eration from Binary Searc h T rees (BSTs) is p erformed, the prop osal will almost alw a ys accept. Rotations are a w a y of adjusting the conﬁguration and heigh t of a BST without viola t ing the BST prop ert y , as used, e.g., b y r e d-black tr e es (Cormen et al., 19 9 0 ). X [: , 1] < 5 X [: , 1] ≥ 5 X [: , 1] < 3 X [: , 1] ≥ 3 T 1 T 2 T 3 { 1 , 3 } { 1 , 5 } { 1 , 3 } X [: , 1] < 3 X [: , 1] ≥ 3 T 1 { 1 , 5 } X [: , 1] < 5 T 2 X [: , 1] ≥ 5 T 3 rotate T rotated: T : rotate (right) Figure 2: Rotating on the same v ariable, where T 1 , T 2 , T 3 are arbitrary sub-trees . In the con text of a Ba y esian MCMC tree prop osal, rotations encourage b etter mixing of the Marko v c hain b y pro viding a more dynamic set of candidate no des for pruning, thereb y he lping escape local minima in the margina l posterior of T . Figure 2 show s an example of a success ful righ t-rota tion where a sw ap w ould pro duce an empt y no de (at the curren t lo cation of T 2 ). Since the partit io ns at the leav es remain unch anged, the lik eliho od ratio of a prop osed ro t a te is alw ay s 1 . The only “active ” par t of the MH acceptance r a tio is the prior on T , preferring trees of minimal depth. Still, calculating the acceptance ra tio for a r otate is non- trivial b ecause the depth of t w o of its sub-trees c hange. Figure 2 show s a r igh t- r o tate, where no des in T 1 decrease in depth, while those in T 3 increase. The opp osite is true for left-rotation. If I = { I i , I ℓ } is the set of no des (in ternals and leav es) of T 1 and T 3 , b efore rotation, whic h increase in depth after rotat io n, and D = { D i , D ℓ } are those whic h decrease in depth, then the MH acceptance rat io for a righ t-rotate is 17 p ( T ∗ ) p ( T ) = p ( T ∗ 1 ) p ( T ∗ 3 ) p ( T 1 ) p ( T 3 ) = Q η ∈ I i a (2 + q η ) − b Q η ∈ I ℓ [1 − a (2 + q η ) − b ] Q η ∈ I i a (1 + q η ) − b Q η ∈ I ℓ [1 − a (1 + q η ) − b ] × Q η ∈ D i aq − b η Q η ∈ D ℓ [1 − aq − b η ] Q η ∈ D i a (1 + q η ) − b Q η ∈ D ℓ [1 − a (1 + q η ) − b ] . The MH acceptance ratio for a left- rotate is analogous. Gr ow and prune op erations are complex b ecaus e they add or remo v e partitions, c hanging the dimension of the parameter space. The ﬁrst step fo r either op eration is to uniformly select a leaf no de (fo r g r ow ), o r the paren t of a pair of leaf no des (for prune ). When a new region r is added, new pa ramete rs { K ( · , · ) , τ 2 } r m ust b e prop osed, and when a region is take n a w a y the parameters mus t be absorb ed b y the parent regio n, or discarded. When ev a luating t he MH a cceptance ratio the linear mo del par amete rs { β , σ 2 } r are in tegrated out (7). One of the newly gro wn childre n is unifor mly c hosen t o receiv e the correlation function K ( · , · ) of its paren t, essen tially inheriting a blo c k fr o m its paren t’s correlation matrix. T o ensure that the r esulting Mark ov c hain is ergo dic and rev ersible, the other new sibling draw s its K ( · , · ) from the prior thus giving a unit y Jacobian term in the RJ-MCMC. Note that gr ow op erations a r e the only place where priors are used as prop osals; random-w alk pro p osals are used elsewhere [see Section 4.4]. Symmetrically , prune op erations randomly select parameters from K ( · , · ) for the consolidated no de fro m one of the c hildren b eing absorb ed. Af t er accepting a gr ow or prune , σ 2 r can b e dra wn from its marg inal po s terior, with β r in tegrated out (9), follo w ed b y dra ws for β r and t he rest of the para me ters in the r th region. Let { X , Z } b e the da t a at the new paren t no de η at depth q η , and { X 1 , Z 1 } and { X 2 , Z 2 } b e the partitioned c hild data at depth q η + 1 created b y a new split { u, s } . Also, let P b e the set of prunable no des of T , and G the set of grow able no des. If P ′ are the prunable no des in T ′ —after the (successful) gr ow at η —and the paren t o f η w as prunable in T , then |P ′ | = |P | . Otherwise |P ′ | = |P | + 1. The MH ratio for g r ow is: 18 |G | |P ′ | a (1 + q η ) − b (1 − a (2 + q η ) − b ) 2 1 − a (1 + q η ) − b p ( K 1 , | Z 1 , β 0 , τ 2 1 , W ) p ( K 2 | Z 2 , β 0 , τ 2 2 , W ) p ( K | Z , β 0 , τ 2 , W ) q ( K 2 ) (10) assuming that K 1 w as randomly chose n to receiv e the parameterization of its pa r ent, K , and that the new parameters for K 2 are prop osed according t o q . The prune op eration is analogous. Note that for the p osteriors p ( K | Z , β 0 , τ 2 , W ), p ( K 1 | Z 1 , β 0 , τ 2 1 , W ) and p ( K 2 | Z 2 , β 0 , τ 2 2 , W ), the “constan t” terms in (7) are required b ecaus e they do not o ccur the same n um b er of times in the numerator a nd denominator. 4.3 T reed GP Predictio n Prediction under the ab o ve GP mo del is straigh tforward (Hjort and Omre, 1994). Con- ditional on the cov ariance structure, t he predicted v alue of z ( x ∈ r ν ) is normally dis- tributed with mean and v ariance ˆ z ( x ) = E ( Z ( x ) | data , x ∈ r ν ) = f ⊤ ( x ) ˜ β ν + k ν ( x ) ⊤ K − 1 ν ( Z ν − F ν ˜ β ν ) , (11) ˆ σ ( x ) 2 = V ar( z ( x ) | data , x ∈ r ν ) = σ 2 ν [ κ ( x , x ) − q ⊤ ν ( x ) C − 1 ν q ν ( x )] , (12) where C − 1 ν = ( K ν + τ 2 ν F ν WF ⊤ ν ) − 1 q ν ( x ) = k ν ( x ) + τ 2 ν F ν Wf ( x ) (13) κ ( x , y ) = K ν ( x , y ) + τ 2 ν f ⊤ ( x ) Wf ( y ) with f ⊤ ( x ) = (1 , x ⊤ ), a nd k ν ( x ) is a n ν − v ector with k ν,j ( x ) = K ν ( x , x j ), f o r all x j ∈ X ν . Conditional on a particular tree, T , the p o s terior predictiv e surface describ ed in Eqs. (11 – 12) is discon tinuous across the partition b oundaries of T . How ev er, in the aggregate of samples collected from the join t p osterior distribution of ( T , θ ), the mean tends to smo oth out near lik ely partition b oundaries as t he tree op erations gr ow , prune, change , and swap in tegrate ov er trees and GPs with larger p osterior probabilit y . Uncer- tain ty in the p osterior for T tra nslates in to hig he r p osterior predictiv e uncertain t y near region b oundaries. When the data actually indicate a non-smo oth pro cess, the treed GP 19 retains the ﬂexibilit y necessary to mo del discontin uities. When the da t a are consisten t with a contin uous pro ce ss, as in the motorcycle data example in Section 4.5, the treed GP ﬁt s are almo st indistinguishable from con tin uous. 4.4 Implemen tation The treed GP mo del is co ded in a mixture o f C and C++ , using C++ for the tree struc- ture and C for the GP at eac h leaf of T . The C co de can in terface with either stan- dard platf o rm-speciﬁc Fortran BLAS/Lapack libra r ies for t he necessary linear algebra routines, or link t o those automatically conﬁgured for fast ex ecution on a v ariet y of platforms via the ATLAS library (Whaley and P etitet, 2 004). T o improv e usabilit y , the routines ha v e b een wrapp ed up in an intuitiv e R interface, and are a v ailable on CRAN ( R Dev elopmen t Core T eam, 2 004) at http://www. cran.r-project.org/web/packages/tgp/index.html as a pac k age called tgp . It is useful to ﬁrst t r a ns late and re- scale the input dataset ( X ) so that it lies in a n ℜ m X dimensional unit cub e. This mak es it easier to construct prior distributions for the width parameters to the correlation function K ( · , · ). Conditioning on T , prop osals for all parameters whic h require MH sampling ar e take n from a uniform “sliding windo w” cen tered around the lo cation of the last accepted setting. F or example, a prop osed a new n ugget parameter g ν to the correlatio n function K ( · , · ) in region r ν w ould g o as g ∗ ν ∼ Unif (3 g ν / 4 , 4 g ν / 3). Calculating the forward a nd bac kw a r d prop osal probabilities for the MH a cc eptance ratio is straightforw a r d. After conditioning on ( T , θ ), prediction can be parallelized b y using a pro ducer- consumer mo del. This allo ws t he use of PThreads in order to take adv an tage of m ultiple pro cess ors, and get sp ee d-ups of at least a factor of tw o, whic h is helpful as m ulti- pro cess or machines b ecome commonplace. P arallel sampling of the p osterior of θ |T for 20 eac h of the { θ ν } R ν =1 is also p ossible. 4.5 Illustration In this section w e illustrate t he treed GP mo del on t he Motorcycle Acciden t Dataset (Silv erman, 19 85), a classic dataset used in recen t literature (e.g., Rasm ussen and Ghahramani, 2002) to demonstrate the success of nonstationary mo dels. The data set consists of mea- suremen ts of the acceleration of the head of a motorcycle rider, wh ich w e attempt to predict as a f unction of time in the ﬁrst moments after an impact. In addition to suggest- ing a mo del with a nonstationar y cov ariance structure, there is input-dep enden t noise (a.k.a., heterosc edasticit y). T o ke ep things simple in this illustratio n, the isotr o pic p o w er family (2) correlatio n function ( p 0 = 2) is c hosen fo r K ∗ ( · , ·| d ) with range parameter d , com bined with n ugget g to form K ( · , ·| d, g ). 10 20 30 40 50 −100 −50 0 50 GP, accel mean and error time accel 10 20 30 40 50 −100 −50 0 50 treed GP LLM, accel mean and error time accel Figure 3: Motorcycle Dataset, ﬁt ( left ) by a stationary pro cess and right b y our nonsta- tionary mo del. Figure 3 show s the data and the ﬁts giv en b y b oth a stationary GP (left) and the treed GP mo del ( righ t), along with 90% credible interv als. F or the treed GP , v ertical lines illustrate a t ypical treed partition T . Notice that t he stationary GP is completely unable to capture the heteroscedasticit y , and that the large v ariability in the central region driv es b oth ends to b e more wiggly (in particular, the t r a ns ition from the ﬂat left initial region requires an upw ard curv e b efore descending). In contrast, the treed GP 21 clearly reﬂects the diﬀering lev els o f uncertaint y , as w ell as allowing a ﬂatter ﬁt t o the initial segmen t and a smo other ﬁt to the ﬁnal segmen t. 20 ,000 MCMC rounds yielded an a v erage of 3.11 partitions in T . These results diﬀer from those of Rasmus sen & Ghahramani (2002). In part ic ular, the error-ba r s they rep ort for the leftmost regio n seem to o large relative to the other regions. They use what they call a n “inﬁnite mixture of GP exp erts ” whic h is a Diric hlet pro cess mixture of GPs. They rep ort that the p osterior distribution uses b et w een 3 and 10 experts t o ﬁt this data s et, with ev en 10 - 15 experts still ha ving considerable p osterior mass, although there are “roughly” three regions. Contrast this with the treed GP mo del whic h almost alwa ys partitions into three regions, o ccas ionally four, rarely t w o. On speed grounds, the treed GP is also a winner. Running the mixture of GP exp e rts mo del using a total of 11,000 MCMC rounds, discarding the ﬁrst 1 ,000, to ok ro ughly one hour on a 1 GHz P en tium. Allow ing treed G P to use 25,00 0 MCMC ro und s, discarding the ﬁrst 5,000, tak es ab out 3 minutes on a 1.8 GHz Athalon. W e note t ha t the mean ﬁtted function in the right plot in Figure 3 is essen tia lly that of a con tinuous function. F igure 4 sho ws examples o f the ﬁts from individual MCMC iterations that are ev en tually av eraged. While the individual partition mo dels are ty pically dis contin uous, it is clear f rom F ig ure 3 that the mean ﬁtted function is w ell-b eha ved. 4.6 Limiting linear mo dels In some cases, a GP may not b e needed within a partit ion, and a muc h simpler mo del, suc h as a linear mo del, may suﬃce. In particular, b ecause of the linear mean function in our implemen tation of t he GP , the standard linear mo del can b e seen as a limiting case. The linear model is then more parsimonious, as w ell as m uc h more computationally eﬃcien t. Use of a mo del-switc hing prior allo ws practical implemen tation. More details 22 −100 0 50 Accel 10 20 30 40 50 −100 0 50 Time Accel 10 20 30 40 50 Time 10 20 30 40 50 Time Figure 4: Fits of the motorcycle acciden t data f rom individual MCMC iteratio ns. are a v ailable in Gramacy and Lee ( 2 008). The v alue of such an approach can b e seen from the ﬁt on the rig h t side of Figure 3. The leftmost partition lo oks quite ﬂa t, and so could b e ﬁt just as well with a line ra the r than a GP . The cen ter partition clearly requires a GP ﬁt. Th e rightmost partition lo oks mostly linear, and w ould give a p osterior whic h is a mix o f a GP and linear mo del. Indeed, Figure 4 sho ws examples o f the individual ﬁts, and the leftmost section is nearly alw a ys ﬂat, the rightmost section is often but not alw ay s ﬂat, and the cen ter section is t ypically curve d but ev en there it can b e essen tially piecewise linear (the range par a me ter d is estimated to b e lar ge, giving a nearly linear ﬁt). Replacing the full GP with a linear mo del in a partition greatly reduces the computational resources required to up date the mo del in that partition. T ree d and non-treed Gaussian pro cess with j um ps to the limiting linear mo del are implemen ted in the tgp pac k age on CRAN, and w e tak e adv a n tage of the full form ulation in o ur analyses herein. 23 5 Ro c k e t Bo ost er Mo del Results W e ﬁt our tr eed GP mo del to the ro c ket b o oster data using ten indep enden t RJ-MCMC c hains with 15,000 MCMC rounds each. The ﬁrst 5 ,000 w ere discarded a s burn-in, and ev ery ten th thereafter w as treated as a sample from the p osterior distribution π ( T , θ | Z ). In total, 10,00 0 samples w ere sa v ed. On a single 3 .2 GHz Xeon processor this to ok mach 1 2 3 4 5 6 alpha 0 10 20 30 lift 0.0 0.5 1.0 1.5 lift mean, with (beta) fixed to (0) mach 1 2 3 4 5 6 alpha 0 10 20 30 lift 0.0 0.5 1.0 lift mean, with (beta) fixed to (0.5) mach 1 2 3 4 5 6 alpha 0 10 20 30 lift 0.0 0.5 1.0 lift mean, with (beta) fixed to (1) mach 1 2 3 4 5 6 alpha 0 10 20 30 lift 0.0 0.5 1.0 lift mean, with (beta) fixed to (2) mach 1 2 3 4 5 6 alpha 0 10 20 30 lift 0.0 0.5 1.0 lift mean, with (beta) fixed to (3) mach 1 2 3 4 5 6 alpha 0 10 20 30 lift 0.0 0.5 1.0 lift mean, with (beta) fixed to (4) Figure 5: P osterior pr edictive mean surfaces of lift for all sideslip angles. Note that for lev els 0.5 and 3 (cen ter), Mach ranges only in (1 , 5) a nd (1 . 2 , 2 . 2) ab out 60 hours. On the same machine , using the same (tuned) linear algebra librar ies, in v erting a single 3041 × 3041 matrix takes ab out 17 sec onds, so obtaining the same n um b er of samples from a stationary GP would ha v e tak en a minim um of 708 hours. This is a gross underestimate b ec ause it assumes only one in v erse is needed p er MCMC round. Moreo ver, it do es not coun t an y of the O ( n 2 ) op erations lik e determinan ts of K 24 1 2 3 4 5 6 −5 0 5 10 15 20 25 30 lift quantile diff (error), with (beta) fixed to (0) mach alpha 0 1 2 3 4 5 6 −5 0 5 10 15 20 25 30 lift quantilde diff (error), with (beta) fixed to (0.5) mach alpha 1 2 3 4 5 6 −5 0 5 10 15 20 25 30 lift quantile diff (error), with (beta) fixed to (1) mach alpha 1 2 3 4 5 6 −5 0 5 10 15 20 25 30 lift quantile diff (error), with (beta) fixed to (2) mach alpha 0 1 2 3 4 5 6 −5 0 5 10 15 20 25 30 lift quantilde diff (error), with (beta) fixed to (3) mach alpha 1 2 3 4 5 6 −5 0 5 10 15 20 25 30 lift quantile diff (error), with (beta) fixed to (4) mach alpha Figure 6: P osterior predictiv e v ariance surfaces of lift for all six sideslip angles. Dots sho w the lo cations of experimen tal runs. Dark er shades a r e hig her v alues. (assuming a f actorized K w as sa v ed in computing K − 1 ) or multiplications lik e ZK − 1 Z in (8) , nor do e s it factor in the time nee ded to sample from the p osterior pred ictiv e distribution. Figures 5 & 6 summarize the p osterior predictiv e distribution for the lift r e sp onse for each of the six lev els of sideslip a ng le . Figure 5 con tains plots of the ﬁtted mean lift surface b y sp ee d and angle of attac k, and Figure 6 plots a measure of the estimated predictiv e uncertain t y giv en b y the diﬀerence in 95% and 5% quan tiles of samples from the p osterior predictiv e distribution. The treed GP w orks w ell here. Most of the space is nice ly smo o th, with the sharp transition at Mac h one a lso w ell-mo deled. Most of the p oten tia l false con v ergences hav e b een smo othed out. But the estimated v ariability reﬂects b oth increased v a riabilit y where the f unction is c hanging rapidly (e.g., near Mac h one, particularly f o r higher sideslip lev els) and esp ecially where there are issues of p ossible false numeric al conv ergence. Note tha t the uncertain t y is not that high near Mac h one a t sideslip lev el zero b ec ause of the large num b er of samples tak en in that region. W e also note that the increased uncertain ty seen in the top row s around Mac h 25 three and hig he r angles of attack is due to t he noisy de pression area in the data for sideslip lev el of one-half. Figure 7 sho ws the MAP treed partitions ˆ T found during 0 1 2 3 4 5 6 −5 0 5 10 15 20 25 30 lift: treed partitions (beta=0) mach alpha Figure 7: MAP treed partitions ˆ T f o r the lift resp onse at sideslip lev el zero. MCMC for the slice of sideslip lev el zero. Notice the aggressiv e partitioning near Mac h one due to the regime shift b et w een subsonic and sup ersonic sp eeds. Extra partitioning at lo w sp eeds and large angles of attac k address the singularit y o utlined in Figure 1, and near Mac h three due to the n umerical instabilities at sideslip lev el o ne- ha lf. 1 2 3 4 0.6 0.8 1.0 1.2 1.4 Lift Mean & Quantiles: (Alpha,Beta)=(25,0) Mach Lift Figure 8: A slice of the mean ﬁt with error bars as a function of Mach with alpha ﬁxed to 25 a nd b eta ﬁxed t o zero. Figure 8 sho ws the mean ﬁt and a 90% credible inte rv a l for once slice of predicting lift, with Mach on the x -a x is and considering o nly angle of at tac k equal to 25 and 26 slideslip angle equal to 0 (i.e., this is one slice from the upp er left plot in Fig ure 5; this plot is from ﬁtting the whole dataset, but w e o nly plot o ne slice for visibilit y). The k ey item to note is that t he ﬁt is essen tia lly con tin uous. The plot is made up only of p oints at ﬁtted v alues, no in terp olation or lines hav e b een used. In con trast, Fig ure 9 shows 0.6 1.0 1.4 Lift 1 2 3 4 0.6 1.0 1.4 Mach Lift 1 2 3 4 Mach 1 2 3 4 Mach Figure 9: Slices of the p osterior predictiv e mean from individual MCMC iterations for the LGBB data with alpha ﬁxed to 2 5 and b eta ﬁxed to zero. examples of the treed G P ﬁts from individual MCMC iterations, whic h often ha v e clear discon tin uities from the partit io ning structure. Th us as is t ypical, our mean ﬁtted v alues are quite smo oth, b ecause they are an a v erage, ev en though the individual compo ne nts of the mean may not b e con tinuous . T o measure go o dnes s of ﬁt w e t ypically rely on qualita tiv e visual barometers. F or example, t race s are used to asses mixing in the Mark o v c hain, and p o s terior predictiv e slices and pro jections are insp ected, a s describ ed ab o v e. F or a mor e quantitativ e as- sessme nt w e follow the suggestion of Gelfand (1995) and use 10-fold cross v alidation. P osterior predictiv e quan tiles are obtained fo r the input lo cations held-o ut of each fold, and the prop ortion of held-out resp onses that fall within the 90% predictiv e in terv al is 27 recorded. F o r the LGBB data w e found a pro portion of 0.96 using the treed GP LLM mo del. Th us our mo del ﬁts w ell, and if an ything, our predictiv e in terv als are sligh tly wider than necess ary , so we app ear to b e fully accoun ting for uncertaint y . 6 Conclus ion W e dev elop ed the treed Gaussian pro cess mo del for the rock et bo oster computer ex- p erimen t, but it also has a wide range of uses as a simple and eﬃcien t metho d for nonstationary mo deling. A fully Bay esian treatmen t o f the tr eed G P mo del w as laid out, t reating the hierarc hical parameterization of the correlation function K ( · , · ) as a mo dular comp o ne nt, easily replaced b y a diﬀeren t family of correlations. The limiting linear model parameterization of the GP can b e b oth useful and accessible in terms of Bay esian p osterior estimation and prediction, resulting in a uniquely nonstationary , semiparametric, tractable, and highly accurate mo del that contains the Bay esian treed linear mo del as a sp ecial case. W e b eliev e that a la rge con tribution of t he tr e ed GP will b e in the do main of se- quen tial design of computer experimen ts (San tner et al., 2003; Grama cy et al., 200 4 ). Empirical evidence suggests that man y computer experimen ts con ta in muc h linearity , as we hav e seen with lar g e regions of the space fo r the ro c k et b o oster sim ulator. The Ba y esian treed G P pro vides a full p osterior predictive distribution (particularly a non- stationary and thus regio n-speciﬁc estimate of predictiv e v ariance) whic h can b e used to w ards activ e learning in the input domain. Exploitation of these c hara c teristics should lead to an eﬃcien t framew ork fo r the adaptive exploration of computer exp erimen t pa- rameter spaces. 28 References Abrahamsen, P . (1997). “A Review of Gaussian Random Fields and Correlation F unc - tions.” T ech . Rep. 917, Norw egia n Computing Cen ter, Bo x 114 Blindern, N- 0314 Oslo, Norw a y . Berger, J. O., de Oliv eira, V., and Sa ns´ o, B. (2001) . “Ob jectiv e Bay esian Analysis o f Spatially Correlated Data.” Journal of the A meric an S tatistic al Asso cia t ion , 96, 456, 1361–137 4. Breiman, L., F riedman, J. H., Olshen, R., and Stone, C. (1984). Classiﬁc ation and R e gr ession T r e es . Belmont, CA: W adsw o rth. Chipman, H., G eorge, E., and McCullo c h, R. (199 8 ). “Bay esian CAR T Mo del Search (with discussion).” Journal of the Americ an Statistic al Asso ciation , 93, 935–9 6 0. — (2002). “Bay esian T reed Mo dels.” Machine L e arning , 48, 303–324. Cormen, T. H., Leiserson, C. E., and Rive st, R. L. (1990). Intr o d uction to Algorithms . The MIT EE & CS Series. MIT Press/McGra w Hill. Damian, D ., Sa mp son, P . D., and Gutto rp, P . (2001). “Bay esian Estimation of Semipara- metric Nonstationary Spatial Co v ar ia nc e Structure.” Envir onme t rics , 1 2 , 161–178 . Denison, D., Adams, N., Holmes, C., and Hand, D . (2002). “Ba y esian P artition Mo d- elling.” C o mputat ional Statistics and Data Analysis , 38, 475–485. Denison, D., Mallick , B., and Smith, A. (1998) . “A Bay esian CAR T Algorithm.” Biometrika , 85, 363–37 7. F uen tes, M. (2002). “Sp ectral Metho ds fo r Nonstationary Spatial Pro cesses.” Biometrika , 89, 197–21 0. 29 Gelfand, A. (1995). “Mo del Determination Using Sampling- Base d Metho ds.” In Markov Chain Monte Carl o In Pr a ctic e , eds. S. R . W. Gilks and D. Spiegelhalter, 145–1 6 1. London: C hapman Hall. Gilks, W., Ric hardson, S., and Spiegelhalter, D. (1996 ). Markov Ch a in Monte Carlo in Pr actic e . London: Chapman & Hall. Gramacy , R. B. (2005). “Bay esian T reed G aus sian Pro ces s Mo dels.” Ph.D. thesis, Univ ersit y of California, Sa n ta Cruz. Gramacy , R. B. and Lee, H. K. H. (2008). “Gaussian Pro cess es and Limiting Linear Mo dels.” Computational Statistics and Data A nalysis , 53, 123–136. Gramacy , R. B., L ee, H. K. H., and Macready , W. (200 4 ). “P a ramete r Space Exploration With Gaussian Pro cess T rees.” In ICML , 353–360. Omnipress & A CM Digital Library . Higdon, D ., Sw all, J., and Kern, J. (1999). “Non-Stationary Spatial Mo delin g.” In Bayesian Statistics 6 , eds. J. M. Bernardo, J. O. Berger, A. P . Da wid, and A. F. M. Smith, 761–768. Oxford Univ ersity Press. Hjort, N. L. and Omre, H. (19 94). “T opics in Spatial Statistics.” Sc andinavian Journal of Statistics , 21, 289–357. Kennedy , M. and O’Hagan, A. (2001). “Bay esian Calibration of Computer Mo dels (with discussion).” Journal of the R oyal Statistic al So ciety, Series B , 6 3, 425–46 4 . Kim, H.-M., Mallic k, B. K., and Holmes, C. C. (2005). “Analyzing Nonstat io nary Spatial Data Using Piecewise Gaussian Pro ces ses.” Journal of the Americ an Statistic a l Asso ciation , 100, 653– 668. Mat ´ ern, B. (1 9 86). Sp atial V ariation . 2nd ed. New Y ork: Springer-V erlag. P aciorek, C. (2003). “Nonstationary G a us sian Pro ces ses for Regression and Spatial Mo delling.” Ph.D. thesis, Carnegie Mellon Univ ersit y , Pittsburgh, Pe nnsylv ania. 30 Rasm ussen, C. and Ghahramani, Z. (2 002). “Inﬁnite Mixtures o f Gaussian Pro cess Exp erts.” In A dvanc es in Neur al Information Pr o c essin g Systems , v ol. 14, 8 81–888. MIT Press. Ric hardson, S. a nd Green, P . J. (1997) . “O n Bay esian Analysis of Mixtures With An Unkno wn Num b er of Comp onen ts.” Journal o f the R oyal S tatistic al So ciety, Series B, Metho dolo gi c al , 59, 731–758 . Rogers, S. E., Afto s mis, M. J., P andy a, S. A., N. M. Chaderjian, E. T. T., and Ahmad, J. U. (2003). “Automated CFD Parameter Studies on Distributed P arallel Com- puters.” In 16th AIAA Computational F luid Dynamics Confer enc e . AIAA P ap er 2003-42 29. Sac ks, J., W elc h, W. J., Mitc hell, T. J., and Wynn, H. P . (1989). “D es ign and Analysis of Computer Exp erimen ts.” Statistic al Scienc e , 4, 409–435. Sampson, P . D. and Gutto rp, P . (1992). “Nonparametric Estimation of Nonstation- ary Spatial Co v ariance Structure.” Journal of the Americ an Statistic al Asso ciation , 87(417), 108–119. San tner, T . J., Williams, B. J., a nd Notz, W. I. (2003). The Design and Analysis of Computer Exp eriments . New Y ork, NY: Springer-V erlag. Sc hmidt, A. M. and O ’Hagan, A. (200 3). “Bay esian Inference for Nonstationary Spa- tial Co v aria nc e Structure via Spat ia l Deformations.” Journal of the R oyal Statistic al So ciety, Series B , 65, 745–758. R Dev elopmen t Core T eam (2004). R : A la n guage and envir onment for statistic al c om - puting . R F oundation for Statistical Computing, Vienna, Aus. ISBN 3-90005 1 -00-3. Silv erman, B. W. (198 5). “Some Asp ects of the Spline Smo othing Approac h to Non- 31 P arametric Curve Fitting.” Journal of the R oyal Statistic al So ciety Series B , 47, 1–52. Stein, M. L. (1999). I nterp olation of Sp atial Data . New Y ork, NY: Springer. Whaley , R. C. and P etitet, A. (2004). “ ATLAS (Automatically Tuned L ine ar Algebra Soft w are).” 32

Bayesian treed Gaussian process models with an application to computer modeling

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment