Dynamic Trees for Learning and Design

Authors: Matthew A. Taddy, Robert B. Gramacy, Nicholas G. Polson

ABSTRACT: Dynamic regression trees are an attractive option for automatic regression and classification with complicated response surfaces in on-line application settings. We create a sequential tree model whose state changes in time with the accumulation of new data, and provide particle learning algorithms that allow for the efficient on-line posterior filtering of tree-states. A major advantage of tree regression is that it allows for the use of very simple models within each partition. The model also facilitates a natural division of labor in our sequential particle-based inference: tree dynamics are defined through a few potential changes that are local to each newly arrived observation, while global uncertainty is captured by the ensemble of particles. We consider both constant and linear mean functions at the tree leaves, along with multinomial leaves for classification problems, and propose default prior specifications that allow for prediction to be integrated over all model parameters conditional on a given tree. Inference is illustrated in some standard nonparametric regression examples, as well as in the setting of sequential experiment design, including both active learning and optimization applications, and in on-line classification. We detail implementation guidelines and problem-specific methodology for each of these motivating applications. Throughout, it is demonstrated that our practical approach is able to provide better results compared to commonly used methods at a fraction of the cost.

KEYWORDS: Sequential design of experiments; search optimization; active learning; on-line classification; partition tree; nonparametric regression; CART; BART; Gaussian process; TGP; particle learning.

Taddy is Assistant Professor and Robert L. Graves Faculty Fellow, Gramacy is Assistant Professor, and Polson is Professor, of Econometrics and Statistics at the University of Chicago Booth School of Business, 5807 S Woodlawn Ave, Chicago, IL 60637 (taddy, rbgramacy, ngp @chicagobooth.edu). Gramacy was also a lecturer in the Statistical Laboratory at University of Cambridge during production of this article. The research was partially funded by EPSRC grant EP/D065704/1 to Gramacy and by Taddy's support under the IBM Corporation Faculty Research Fund at Chicago Booth. Finally, three referees provided advice which greatly improved the article.

1 Introduction

This article develops sequential regression trees, with implementation details and examples for applications in experiment design, optimization, classification, and on-line learning. Our most basic insight is the characterization of partition trees as a dynamic model state. This allows the model to grow with data, such that each new observation leads to only small changes in the posterior, and has the practical effect of reducing tree space dimension without affecting model flexibility. A Bayesian inferential framework is built around particle learning algorithms for the efficient on-line filtering of tree-states, and examples demonstrate that the proposed approach is able to provide better results compared to commonly used methods at a fraction of the cost.

We consider two generic regression formulations. The first has real-valued response (univariate here, but this is not a general restriction) with unspecified mean function and Gaussian noise, and the second has a categorical response with covariate dependent class probabilities:

  a. y = f(x) + ε,  ε ∼ N(0, σ²(x))
  b. p(y = c) = p_c(x),  c = 1, …, C.   (1)

The regression tree framework provides a simple yet powerful solution to modeling these relationships in situations where very little is known a priori.
For such models, the covariate space is partitioned into a set of hyper-rectangles, and a simple model (e.g., linear mean and constant error variance) is fit within each rectangle. This adheres to a general strategy of having a state-space model (i.e., the partition structure) induce flexible prediction despite a restrictive model formulation conditional on this state. For trees, our strategy allows posterior inference to be marginalized over the parameters of simple models in each partition. Given a particular tree, predictive uncertainty is available in closed form and does not depend upon approximate posterior sampling of model parameters. Uncertainty updating is then a pure filtering problem, requiring posterior sampling for the partition tree alone. Thereby this article establishes a natural sequential framework for the specification and learning of regression trees.

Although partitioning is a rough modeling tool, trees will often be better suited to real-world applications than other "more sophisticated" alternatives. Compared to standard basis function models for nonparametric regression, the dynamic trees proposed herein provide some appealing properties (e.g., flexible response surface, nonstationarity, heteroskedasticity) at the expense of others (e.g., smoothness, explicit correlation structure) within a framework that is trivial to specify and allows for conditional inference, given the global partition tree-state, to be marginalized over all model parameters. Prediction is very fast, requiring only a search for the rectangle containing a new x, and individual realizations yield an easily interpretable decision tree for regression and classification problems. Finally, and perhaps most importantly, the models are straightforward to fit through sequential particle algorithms, and thus offer a major advantage in on-line inference problems.
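As an aside for intuition, the rectangle search that makes prediction fast is just a descent through the split rules. The sketch below is ours, not the authors' implementation, and the node layout is hypothetical:

```python
# Hypothetical node layout: an internal node stores a split (dimension, threshold);
# a leaf stores an identifier standing in for its regression-model summaries.
class Node:
    def __init__(self, split_dim=None, split_val=None, left=None, right=None, leaf_id=None):
        self.split_dim, self.split_val = split_dim, split_val
        self.left, self.right = left, right
        self.leaf_id = leaf_id  # set only at leaves

def leaf_of(node, x):
    """Descend the split rules until reaching the leaf rectangle containing x."""
    while node.leaf_id is None:
        node = node.left if x[node.split_dim] <= node.split_val else node.right
    return node.leaf_id

# A small tree on two hypothetical covariates: split coordinate 0 at 0,
# then split the left branch on coordinate 1 at 2.
tree = Node(0, 0.0,
            left=Node(1, 2.0,
                      left=Node(leaf_id="xi<=0,xj<=2"),
                      right=Node(leaf_id="xi<=0,xj>2")),
            right=Node(leaf_id="xi>0"))
print(leaf_of(tree, [-1.0, 3.0]))  # descends left, then right
```

The cost per prediction is the tree depth, independent of the training sample size once the leaf statistics are in hand.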
Consider the setting of sequential experiment design, which serves as one of the motivators for this article. In engineering applications, and especially for computer experiments, the typical model of (1.a) is built from a stationary Gaussian process (GP) prior for f with constant additive error variance (see, e.g., Santner et al., 2003). This GP specification is very flexible, but inference for sample size n is O(n³) due to the need to decompose n × n matrices. Furthermore, full Bayesian inference is usually built around MCMC sampling, which demands O(Bn³) computation for a posterior sample of size B. Sequential design requires this batch sampling to be restarted whenever new [x, y] pairs are added into the data, making an already computationally intensive procedure even more expensive by failing to take advantage of the sequential nature of the problem. With lower order inference (i.e., only a search of partitions followed by simple leaf regression model calculations) and an explicit dynamic structure, our dynamic regression trees are less expensive and can be fit in serial.

Furthermore, it is essential that the fitted model is appropriate for the problem of sequential design. In this, the standard choice of a stationary GP will again often fall short. Whether designing for optimization or for general learning, the goal is to search for distinctive areas of the feature space that warrant further exploration (e.g., due to high variance or low expected response). The single most appropriate global specification may be very different from the ideal local model in areas of interest (e.g., along breaks in the response surface or near optima), such that it is crippling to assume stationarity or constant error variance. However, nonstationary GP modeling schemes (e.g., Higdon et al., 1999) require even more computational effort than their stationary counterparts.
In contrast, nonstationary function means and nonconstant variance are fundamental characteristics of tree-based regression.

These considerations are meant to illustrate a standard point: it may be that some flexible functions {f, σ} or p are ideal for modeling (1.a-b) in any particular setting, but that a simple and fast approximation – with predictions averaged over posterior uncertainty – is better suited to the design scenario or classification problem at hand. We thus put forth dynamic regression trees as a robust class of models with some key properties: prior specification is scale-free and automatic; inference is marginalized over model parameters given a global partition state; it is possible to sequentially filter uncertainty about the partition state; and the resulting prediction surfaces are appropriate for modeling complicated regression functions.

The remainder of the paper is organized as follows. Section 1.1 provides a survey of partition tree models and 1.2 contains a review of Bayesian inference for trees. Our core methodology is outlined in Section 2, which develops a sequential characterization of uncertainty updating for regression trees. The partition evolution is defined in 2.1, leaf model and prediction details are in 2.2, a particle learning algorithm for posterior simulation is provided in 2.3, and marginal likelihood estimation is discussed in 2.4. Section 3 then describes a set of application examples. First, 3.1 illustrates linear and constant mean regression models on some simple datasets, and compares results to alternatives from the literature. We consider the sequential design of experiments, with optimization search in 3.2 and active learning in 3.3. Finally, 3.4 provides two classification examples. Section 4 contains a short closing discussion.

1.1 Partition Trees

The use of partition trees to represent input-output relationships is a classic nonparametric modeling technique.
A decision tree is imposed with switching on input variables and terminal node predictions for the relevant output. Although other schemes are available (e.g., based on Voronoi tessellations), the standard approach relies on a binary recursive partitioning of input variables, as in the classification and regression tree (CART) algorithms of Breiman et al. (1984). This forces axis-aligned partitions, although pre-processing data transformations can be used to alter the partition space. In general, the computational and conceptual simplicity of rectangular partitions will favor them over alternative schemes.

As binary recursive partition trees are fundamental to our regression model, we shall outline some notation and details here. Consider covariates x^t = {x_s}_{s=1}^t. A corresponding tree T consists of a hierarchy of nodes η ∈ T associated with different subsets of x^t. The subsets are determined through a series of splitting rules, and these rules also dictate the terminal node associated with any new x̃. Every tree has a root node, R_T, which includes all of x^t, and every node η is itself a root for a sub-tree containing nodes (and associated subsets of x^t) below η in the hierarchical structure defined by T. A node is positioned in this structure by its depth, D_η, defined as the number of sub-trees of T other than T which contain η (e.g., D_{R_T} = 0).

Figure 1 shows two diagrams of local tree structure (partition trees tend to grow upside-down). Graph (i) illustrates the potential neighborhood of each node η ∈ T, and graph (ii) shows a hypothetical local tree formed through recursive binary partitioning. Left and right children, C_l(η) and C_r(η), are disjoint subsets such that C_l(η) ∪ C_r(η) = η; and the parent node, P(η), contains both η and its sibling node S(η), where η ∩ S(η) = ∅ and η ∪ S(η) = P(η).
Node ancestors are parents which contain a given node, and a node's depth is equivalent to its number of ancestors. Any of the neighbors in (i) may be nonexistent for a specific node; for example, root R_T has no parent or sibling. If a node has children it is considered internal, otherwise it is referred to as a leaf node. The set of internal nodes for T is denoted I_T, and the set of leaves is L_T.

[Figure 1 about here. Caption: Tree heuristics. Graph (i) shows the parent-sibling-child network, and (ii) illustrates partitioning on two hypothetical covariates (internal nodes show split points, and leaves show the node set).]

The tree is completed with a decision rule (i.e., a simple regression or classification model) at each leaf. Suppose that every covariate vector x_s is accompanied by response y_s, such that the complete data is [x, y]^t = {x_s, y_s}_{s=1}^t. Then, with regression models parametrized by θ_η for each leaf η ∈ L_T, independence across tree partitions leads to likelihood p(y^t | x^t, T, θ) = ∏_{η ∈ L_T} p(y^η | x^η, θ_η), where [x, y]^η = {x_i, y_i : x_i ∈ η} is the data subset allocated to η.

1.2 Inference for Tree Models

A novel approach to regression trees was pioneered by Chipman, George, and McCulloch (CGM: 1998, 2002), who designed a prior distribution, π(T), over possible partition structures. This allows for coherent inference via the posterior, p(T | [x, y]^t) ∝ p(y^t | T, x^t) π(T), including the assessment of partition uncertainty and the calculation of predictive bands. The CGM tree prior is generative, in that it specifies tree probability by placing a prior on each individual partition rule. Any given leaf node η may be split with depth-dependent probability p_split(T, η) = α(1 + D_η)^(−β), where α, β > 0. The coordinate (i.e., dimension of x) and location of the split, (i, x)_η, have independent prior p_rule(T, η), which is typically a discrete uniform distribution over all potential split points in x^η = x^t ∩ η. Implicit in the prior is the restriction that a partition may not be created if it would contain too few data points to fit the leaf model (i.e., ≥ 3 for the constant model and ≥ d + 2 for a linear model with d covariates). Ignoring invalid partitions, the joint prior is thus

  π(T) ∝ ∏_{η ∈ I_T} p_split(T, η) ∏_{η ∈ L_T} [1 − p_split(T, η)].   (2)

That is, the tree prior is the probability that internal nodes have split and leaves have not.

In their seminal paper, CGM develop a Metropolis-Hastings MCMC approach for sampling from the posterior distribution of partition trees. This algorithm is able to explore posterior tree space by stochastically proposing incremental modifications to T (grow, prune, change, and swap moves) which are accepted according to the appropriate Metropolis-Hastings ratio. Although this approach provides a search of posterior trees, the resulting Markov chain will generally have poor mixing properties. In particular, the proposed tree moves are very local: an improbable sequence of (first) prunes and (then) grows is required to move between parts of tree space that have similar posterior probability but represent drastically different hierarchical split rules. Furthermore, MCMC is a batch algorithm and must be re-run for each new observation, making it inefficient for on-line applications.

The next section reformulates regression trees as a dynamic model. Our novel characterization leads to an entirely new class of models, and we develop a sequential particle inference framework that is able to alleviate many of the difficulties with MCMC for trees.

2 Dynamic Regression Trees

We now redefine partition trees as a dynamic model for predictive uncertainty.
We introduce model state T_t which, following from Section 1.1, includes the recursive partitioning rules associated with x^t, the set of covariates observed up to time t. Section 2.1 defines the mechanisms of state transition T_{t−1} → T_t as a function of x_t, the newly observed covariates, such that T_t is only allowed to evolve from T_{t−1} through a small set of operations on partition structure in the neighborhood of x_t (see Figure 1). That is, we specify a prior distribution for the evolution, p(T_t | T_{t−1}, x_t). Section 2.2 then shows how to obtain tree likelihood p(y^t | T_t, x^t) as a product of leaf node marginal likelihoods, and hence allows us to assign posterior weight over the discrete set of potential trees generated through the evolution prior. Section 2.2 also details the conditional predictive distribution and describes model particulars for three regression leaves: constant, linear, and multinomial. A particle learning algorithm for on-line posterior simulation is outlined in Section 2.3. In this, each particle consists of a tree (partitioning rules) and sufficient statistics for leaf node predictive models, and our filtering update at time t combines small tree changes around x_t with a particle resampling step that accounts for global uncertainty. Finally, we describe marginal likelihood estimation in Section 2.4.

2.1 Partition Evolution

This section will specify tree dynamics through the evolution equation p(T_t | T_{t−1}, x_t). Keeping possible changes localized in the region relevant to a new observation x_t, the tree evolution T_{t−1} → T_t is defined through one of three equally probable moves on leaf node η(x_t):

stay: The tree hierarchy remains unchanged, and T_t = T_{t−1}.

prune: The tree may be pruned by removing η(x_t) and all of the nodes below and including its sibling S(η(x_t)). Hence, parent P(η(x_t)) becomes a leaf node in T_t.
If T_{t−1} is null (contains only a root), then the prune move is invalid since η(x_t) is parentless.

grow: The grow move creates a new partition within the hyper-rectangle implied by split rules of the ancestors of η(x_t). It consists of uniformly choosing split dimension j and split point x_j^grow (within an interval determined by {j, η(x_t), x_t} as described below), and then dividing η(x_t) according to this rule. The former leaf containing x_t becomes an internal node in T_t, acting as parent to two new leaf nodes (one of which contains x_t).

In this way, tree changes are restricted to the neighborhood of η(x_t) illustrated in Figure 1; we will see in Section 2.3 that this feature is key in limiting the cost of posterior simulation.

Given these three possible moves, we can now define an evolution prior as the product of two parts: a probability on each type of tree move and a distribution over the resultant tree structure. In the former case, we assume that possible moves among stay, prune, and grow are each a priori equally likely (e.g., p_move = 1/3 for each when all are available), although one may use other schemes if desired. For the latter distribution, we build on the inferential framework of Section 1.2 and assume a CGM prior for tree structure such that π(T_t) is as in (2) based on p_split(T_t, η) = α(1 + D_η)^(−β). We then have p(T_t | T_{t−1}, x_t) ∝ Σ_{m ∈ M(x_t)} p_m π(T^m), where T^m is the tree that results from applying move m to T_{t−1} and M(x_t) is the set of possible moves. Hence, one may view p(T_t | T_{t−1}, x_t) as a covariate dependent prior for the next tree, where the penalty on general tree shape, π(T_t), is constant for all t but the set of new potential trees you are willing to entertain is dictated by x_t.
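To make the evolution prior concrete, a small sketch of our own (not the authors' code): candidate trees are summarized here only by the depths of their internal and leaf nodes, and the values of α and β are illustrative, not the paper's defaults.

```python
import math

def log_tree_prior(internal_depths, leaf_depths, alpha=0.95, beta=2.0):
    """log pi(T) up to a constant, per eq. (2): internal nodes have split
    with probability alpha * (1 + depth)^(-beta); leaves have not."""
    p = lambda d: alpha * (1.0 + d) ** (-beta)
    return (sum(math.log(p(d)) for d in internal_depths)
            + sum(math.log(1.0 - p(d)) for d in leaf_depths))

# Three candidate trees from moves on a leaf eta(x_t) at depth 2,
# each summarized as (internal_depths, leaf_depths):
stay  = ([0, 1],    [1, 2, 2])        # unchanged
prune = ([0],       [1, 1])           # eta and sibling removed; P(eta) is a leaf
grow  = ([0, 1, 2], [1, 2, 3, 3])     # eta becomes internal, two depth-3 leaves

# Unnormalized move probabilities p_m * pi(T^m), with p_m = 1/3 each:
weights = {m: math.exp(log_tree_prior(*t)) / 3.0
           for m, t in [("stay", stay), ("prune", prune), ("grow", grow)]}
total = sum(weights.values())
probs = {m: w / total for m, w in weights.items()}
print(probs)  # with these alpha, beta the prior alone favors pruning at depth 2
```

The data, of course, enter through the likelihood ratios described in Section 2.2; the prior alone penalizes deeper splits.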
It is important to note that, in contrast with staying or pruning, the grow move actually encompasses a set of tree evolutions: we can split on any input dimension and, for each dimension j, the set of possible grow locations is the interval {x_j^grow : l_j^{η(x_t)} ≤ x_j^grow ≤ u_j^{η(x_t)}}, where l_j^{η(x_t)} and u_j^{η(x_t)} are sup and inf points such that a minimal number of observations in η(x_t) ∪ {x_t} have jth coordinate less than l_j^{η(x_t)} and strictly greater than u_j^{η(x_t)} (i.e., such that each new leaf satisfies the minimum data restriction for the leaf regression model, as described in Section 2.2). Combining these grow mechanics with our discussion on prior specification, the implied conditional prior for moving from T_{t−1} to a specific grown tree T_t is the product of p_grow (i.e., usually 1/3), the penalty term π(T_t), and the probability of the specific split location used in this grow move. Viewed another way, the marginal prior probability of growing in any manner is p_grow π(T_t) integrated over T_t with respect to the measure for possible split locations.

We place a conditional uniform prior distribution over the above set of candidate split locations. Hence, j is uniformly distributed over covariate dimensions with non-empty sets of possible grow points and, given j, the split point is assigned a uniform prior on [l_j^{η(x_t)}, u_j^{η(x_t)}]. This distribution is uniform over eligible covariate dimensions, regardless of variable scaling within each dimension. As an aside, since the likelihood is unchanged for all grow moves which result in equivalent new leaf nodes – that is, the model is only identified up to location sets which separate elements of η(x_t) – an alternative prior would restrict split locations to members of η(x_t) (such that every possible grow changes the likelihood).
We prefer the former option since it leads to smoother posterior predictive surfaces in our particle approximations.

The formulation in this section has many things in common with the original CGM partitioning approach. Indeed, both involve similar grow and prune moves. However, in CGM these are merely MCMC proposals, whereas in dynamic trees they are embedded directly into the definition of the sequential process. Also, analogues of change and swap are not present in our sequential formulation. Although these could have been incorporated, we felt that limiting computational cost was more desirable. In particular, there are two aspects of our framework – a global particle approach to inference and a filtered posterior that changes only incrementally with each new observation – which allow for the convenience of a smaller set of tree rules.

2.2 Prediction and Leaf Regression Models

Our dynamic tree model is such that posterior inference is driven by two main quantities: the marginal likelihood for a given tree and the posterior predictive distribution for new data. This section establishes these functions for dynamic trees, and quickly details exact forms for our three simple leaf regression models.

First, with each leaf η ∈ L_{T_t} parametrized by θ_η ∼iid π(θ), the likelihood function is available after marginalizing over regression model parameters as

  p(y^t | T_t, x^t) = ∏_{η ∈ L_{T_t}} p(y^η | x^η) = ∏_{η ∈ L_{T_t}} ∫ p(y^η | x^η, θ_η) dπ(θ_η).   (3)

This is combined with the conditional prior of Section 2.1 to obtain posterior p(T_t | [x, y]^t, T_{t−1}).
Second, the predictive distribution for y_{t+1} given x_{t+1}, T_t, and data [x, y]^t, is defined

  p(y_{t+1} | x_{t+1}, T_t, [x, y]^t) = p(y_{t+1} | x_{t+1}, [x, y]^{η(x_{t+1})})   (4)
    = ∫ p(y_{t+1} | x_{t+1}, θ) dP(θ | [x, y]^{η(x_{t+1})}),

where node η(x_{t+1}) ∈ L_{T_t} is the leaf partition containing x_{t+1} (obtained by passing through the split rules at each level of T_t) and dP is the posterior distribution over leaf parameters given the data in η(x_{t+1}). That is, the predictive distribution for covariate vector x is just the regression function at the leaf containing x, integrated over the conditional posterior for model parameters. In a somewhat subtle point, we note that a modeling choice has been made in our statement of (4): prediction at x_{t+1} depends only on T_t, rather than averaging over potential T_{t+1} as would be necessary for a conventional state-space model. Our predictive function in (4) leads to substantially more efficient inference (the alternative is impractical in high dimension), and it seems to us that the distinction should not produce dramatically different results.

We concentrate on three basic options for leaf regression: the canonical constant and linear mean leaf models, and multinomial leaves for categorical data. From (3) and (4), we see that conditioning on a given tree reduces our necessary posterior functionals to the product of independent leaf likelihoods and a single leaf regression function, respectively. Ease of implementation and general efficiency of our models depends upon an ability to evaluate analytically the integrals in (3) and (4), such that prediction and likelihood are always marginalized over unknown regression model parameters (given a default scale-free prior specification).
Fortunately, such quantities are easily available for each leaf, and the following three subsections quickly outline modeling specifics for our three types of regression tree.

Before moving to leaf details, we note that the complexity of leaves is limited only by computational budget and data dimensions. For example, the treed Gaussian process (TGP) approach of Gramacy and Lee (2008) fits a GP regression model at each leaf node. The tree imposes an independence structure on data covariance, providing an inexpensive nonstationary GP model. However, unlike the models proposed herein, GP leaves do not allow inference to be integrated over regression model parameters. This significantly complicates posterior inference, leading to either reversible-jump methods for MCMC or much higher-dimensional particles for sequential inference (which can kill algorithm efficiency). Thus, although dynamic trees can be paired with more complex models, the robust nature of less expensive trees will lead to them being the preferred choice in many industrial applications. Indeed, our results in Section 3 indicate that extra modeling does not necessarily lead to better performance.

2.2.1 Constant Mean Leaves

Consider a tree T_t partitioning data [x, y]^t at time t, and let |η| < t be the number of observations in leaf node η. The constant mean model assumes that y^η = {y_i : x_i ∈ η} are distributed as

  y_1^η, …, y_{|η|}^η ∼iid N(µ_η, σ_η²).   (5)

Under this model, leaf sufficient statistics are ȳ_η = Σ_{i=1}^{|η|} y_i^η / |η| and s_η² = Σ_{i=1}^{|η|} (y_i^η − ȳ_η)² = Σ_{i=1}^{|η|} (y_i^η)² − |η| ȳ_η², where our prior forces the minimum data condition |η| > 2. Note that these statistics are easy to update when the leaf sets change.

Under the motivation of an automatic regression framework, we assume independent scale-invariant priors for each leaf model, such that π(µ_η, σ_η²) ∝ 1/σ_η².
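The remark that these statistics are easy to update sequentially can be made concrete with a standard Welford-style recursion; the helper below is a sketch of ours, not the paper's formulas:

```python
def update_stats(n, ybar, s2, y_new):
    """Add one response to a constant-mean leaf's sufficient statistics
    (|eta|, ybar, s2 = sum of squared deviations), Welford-style,
    so no pass over the leaf's stored responses is needed."""
    n1 = n + 1
    delta = y_new - ybar
    ybar1 = ybar + delta / n1
    s2_1 = s2 + delta * (y_new - ybar1)
    return n1, ybar1, s2_1

# Accumulate a few observations and compare against direct computation:
ys = [1.2, 0.7, 2.1, 1.6]
n, ybar, s2 = 0, 0.0, 0.0
for y in ys:
    n, ybar, s2 = update_stats(n, ybar, s2, y)
mean = sum(ys) / len(ys)
print(abs(ybar - mean) < 1e-12, abs(s2 - sum((y - mean) ** 2 for y in ys)) < 1e-12)
```

An analogous O(d²) rank-one update (via partitioned-inverse equations) handles the Gram matrix of the linear leaves described below.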
The leaf likelihood is then

  p(y^η | x^η) = p(y^η) = ∫ N(y^η | µ_η, σ_η²) (1/σ_η²) dµ_η dσ_η
    = (2π)^{−(|η|−1)/2} |η|^{−1/2} (s_η²/2)^{−(|η|−1)/2} Γ((|η|−1)/2).   (6)

For covariate vector x allocated to leaf node η, the posterior predictive distribution is

  p(y | x, η, [x, y]^η) = p(y | η, y^η) = St(y; ȳ_η, (1 + 1/|η|) s_η²/(|η|−1), |η|−1),   (7)

where St is a Student-t distribution with |η| − 1 degrees of freedom.

2.2.2 Linear Mean Leaves

Consider extending the above model to a linear leaf regression function. Leaf responses y^η are accompanied by design matrix X_η = [x_1^η, …, x_{|η|}^η]′, where {x_i^η}_{i=1}^{|η|} = x^t ∩ η, and

  y^η ∼ N(µ_η 1_{|η|} + X_η β_η, σ_η² I_{|η|})   (8)

for intercept parameter µ_η, d-dimensional slope parameter β_η, and error variance σ_η². Sufficient statistics for leaf node η are then ȳ_η and s_η², as in Section 2.2.1, plus the covariate mean vector x̄_η = Σ_{i=1}^{|η|} x_i^η / |η|, shifted design matrix X̂_η = [x_1^η − x̄_η, …, x_{|η|}^η − x̄_η]′, Gram matrix G_η = X̂_η′X̂_η and slope vector β̂_η = G_η^{−1} X̂_η′ (y^η − ȳ_η) for the shifted design matrix, and regression sum of squares R_η = β̂_η′ G_η β̂_η. We now restrict |η| ≥ d + 2. As before, these statistics are easily updated; in particular, matrices can be adapted through partitioned inverse equations.

We again assume the scale invariant prior π(µ_η, β_η, σ_η²) ∝ 1/σ_η² and, as above, it is straightforward to calculate the essential predictive probability and marginal likelihood functions. The marginal likelihood for leaf node η is then

  p(y^η | x^η) = (2π)^{−(|η|−d−1)/2} (|G_η^{−1}| / |η|)^{1/2} ((s_η² − R_η)/2)^{−(|η|−d−1)/2} Γ((|η|−d−1)/2).   (9)

For x allocated to leaf node η and with x̂ = x − x̄_η, the posterior predictive distribution is

  p(y | x, η, [x, y]^η) = St(y; ȳ_η + x̂′β̂_η, (1 + |η|^{−1} + x̂′G_η^{−1}x̂)(s_η² − R_η)/(|η|−d−1), |η|−d−1).   (10)

2.2.3 Multinomial Leaves

For categorical data, as in (1.b), each leaf response y_s^η is equal to one of C different factors. The set of responses, y^η, can be summarized through a count vector z_η = [z_{η1}, …, z_{ηC}]′, such that z_{ηc} = Σ_{s=1}^{|η|} 1(y_s^η = c). Summary counts for each leaf node are then modeled as

  z_η ∼ MN(p_η, |η|),   (11)

where MN(p, n) is a multinomial over n trials with expected count n p_c for each category. We assume a default Dirichlet Dir(1_C/C) prior for each leaf probability vector, such that posterior information about p_η is summarized by p̂_η = (z_η + 1/C)/(|η| + 1), which is trivial to update. The marginal likelihood for leaf node η is then

  p(y^η | x^η) = p(z_η) = (1/|η|!) ∏_{c=1}^C Γ(z_{ηc} + 1/C)/Γ(1/C).   (12)

Covariates x allocated to leaf node η lead to predictive response probabilities

  p(y = c | x, η, [x, y]^η) = p(y = c | z_η) = p̂_{ηc}, for c = 1, …, C.   (13)

2.3 Particle Learning for Posterior Simulation

Posterior inference about dynamic regression trees is obtained through sequential filtering of a representative sample. The posterior distribution over trees at time t − 1 is characterized by N equally weighted particles, each of which includes a tree T_{t−1}^{(i)} encoding recursive partitioning rules and S_{t−1}^{(i)}, the associated set of all sufficient statistics for each leaf regression model. Upon the arrival of new data, particles are first resampled and then propagated to move the sample from p([T, S]_{t−1} | [x, y]^{t−1}) to p([T, S]_t | [x, y]^t).
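The per-leaf closed forms above are what the filtering steps below repeatedly evaluate. For instance, the multinomial leaf's marginal likelihood (12) and predictive probabilities (13) each reduce to a few lines; a sketch (ours, using log-gamma for stability):

```python
import math

def log_marginal_multinomial(z):
    """log of eq. (12): Dirichlet-multinomial marginal under a Dir(1_C / C)
    prior, with z the vector of category counts in the leaf."""
    C, n = len(z), sum(z)
    return (-math.lgamma(n + 1)  # 1 / |eta|!
            + sum(math.lgamma(zc + 1.0 / C) - math.lgamma(1.0 / C) for zc in z))

def predictive_multinomial(z):
    """Eq. (13): p_hat = (z + 1/C) / (|eta| + 1)."""
    C, n = len(z), sum(z)
    return [(zc + 1.0 / C) / (n + 1) for zc in z]

z = [3, 1, 0]  # e.g., C = 3 classes observed 3, 1, and 0 times in the leaf
print(predictive_multinomial(z))  # posterior class probabilities; they sum to 1
```

As a sanity check, for a single observation in one of C = 2 classes the marginal likelihood is exactly 1/2, as symmetry demands.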
In particular, our approach is a version of the particle learning (PL) sequential Monte Carlo algorithm introduced by Carvalho et al. (2010a) in the context of mixtures of dynamic linear models. The methods in this section are specific to dynamic regression trees and we defer further detail to the original PL paper (in addition, Carvalho et al., 2010b, focuses on PL for general mixtures and thus describes a partition model with parallels to trees). In the context of our dynamic trees, the PL recursive update equations are

  p([T, S]_t | [x, y]^t) = ∫ p([T, S]_t | [T, S]_{t−1}, [x, y]_t) dP([T, S]_{t−1} | [x, y]^t)   (14)
    ∝ ∫ p([T, S]_t | [T, S]_{t−1}, [x, y]_t) p([x, y]_t | [T, S]_{t−1}) dP([T, S]_{t−1} | [x, y]^{t−1}),

where the second line is due to a simple application of Bayes' rule. Hence, the particle set {[T, S]_{t−1}^{(i)}}_{i=1}^N ∼ p([T, S]_{t−1} | [x, y]^{t−1}) can be updated by first resampling particles proportional to p([x, y]_t | [T, S]_{t−1}), and then propagating from p([T, S]_t | [T, S]_{t−1}, [x, y]_t).

Although wider questions of PL efficiency are addressed in Carvalho et al. (2010a), we should make some quick notes on the tree-specific algorithm. Crucially, resampling first to condition on [x, y]_t introduces information from [x, y]_t that is not conditional on T_t, thus reducing direct dependence of particles on the high dimensional tree-history T^t (this same idea is found in the best particle filtering algorithms, such as Kong et al. (1994) and Pitt and Shephard (1999), even if it remains standard to track and weight all of T^t). Second, any analytical integration over model parameters will greatly improve Monte Carlo sampling performance through more efficient Rao-Blackwellized inference.
In our case, simple leaf models allow direct evaluation of $\mathrm{p}(T_t \mid [x,y]^t) = \int \mathrm{p}(T_t, \{\theta_\eta : \eta \in \mathcal{L}_{T_t}\} \mid [x,y]^t)\, d\mathrm{P}(\{\theta_\eta : \eta \in \mathcal{L}_{T_t}\} \mid [x,y]^t)$; without such marginalization, particles would need to be augmented with all leaf parameters. Finally, since only the neighborhood of $\eta(x_t)$ changes under any of these moves, it is possible to ignore all other data and tree structure when calculating relative probabilities for tree changes. This, combined with grow moves that are only identified up to a discrete set of locations, restricts particle updates to a low-dimensional and easily sampled space.

In full detail, suppose that $\{T^{(i)}_{t-1}\}_{i=1}^N \sim \mathrm{p}(T_{t-1} \mid [x,y]^{t-1})$ is a particle approximation to the posterior at time t − 1. The dimension of each $[T,S]^{(i)}_t$ depends upon both the tree structure and the regression model, with each particle containing $|\mathcal{L}^{(i)}_t|$ sets $S^{(i)}_{t\eta}$ of the leaf sufficient statistics described in Section 2.2. All trees/particles begin empty. Due to minimum data restrictions for each partition, stay is the only plausible tree move until enough data have accumulated to split into two partitions. Hence, updating begins at t = 6 for the constant model and at t = 2(d + 2) under linear leaves, with each $T^{(i)}_t$ consisting of a single leaf/root node. Upon observing new covariates $x_t$ and response $y_t$, update as follows.

Resample: Draw particle indices $\{\zeta(i)\}_{i=1}^N$ with predictive probability weight $\mathrm{p}(\zeta(i) = i) \propto \mathrm{p}(y_t \mid x_t, [T,S]^{(i)}_{t-1}) = \mathrm{p}(y_t \mid x_t, S^{(i)}_{t\eta(x_t)})$, as in (7), (10), or (13). There are various low-variance options for resampling (e.g., Cappé et al., 2005), and our implementation makes use of residual resampling. Set $T^{(i)}_{t-1} = T^{\zeta(i)}_{t-1}$ for each i to form a new particle set.
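Residual resampling, the low-variance scheme used in our implementation, deterministically copies each particle in proportion to its weight and draws only the remainder at random. A minimal sketch (illustrative, not the dynaTree code):

```python
import random

def residual_resample(weights, rng=random):
    """Residual resampling: deterministically copy floor(N * w_i) of particle i,
    then fill the remaining slots by sampling on the residual weights."""
    N = len(weights)
    total = sum(weights)
    w = [wi / total for wi in weights]           # normalize
    counts = [int(N * wi) for wi in w]           # deterministic copies
    resid = [N * wi - ci for wi, ci in zip(w, counts)]
    r_total = sum(resid)
    indices = [i for i, c in enumerate(counts) for _ in range(c)]
    for _ in range(N - len(indices)):            # random draws on the residuals
        u = rng.random() * r_total
        acc = 0.0
        for i, ri in enumerate(resid):
            acc += ri
            if u <= acc:
                indices.append(i)
                break
    return indices
```

This reduces Monte Carlo variance relative to plain multinomial resampling, since only the fractional parts of the expected counts are randomized.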
Propagate: We need to update each tree particle with a sample from the discrete distribution $\mathrm{p}([T,S]_t \mid [T,S]_{t-1}, [x,y]_t) \propto \mathrm{p}(T_t \mid T_{t-1}, x_t)\,\mathrm{p}(y_t \mid T_t, x_t)$. First, propose changes for $T^{(i)}_{t-1} \rightarrow T^{(i)}_t$ via stay, prune, or a randomly sampled grow move on $\eta^{(i)}_{t-1}(x_t)$. For each particle's grow, we propose from the uniform grow-prior of Section 2.1 by drawing j from the eligible dimensions (i.e., those with non-empty sets of split locations) and then sampling the split point from a uniform on $[l_{\eta(x_t)j}, u_{\eta(x_t)j}]$. The three candidate trees are now $T_t \in \{T^{stay}, T^{prune}, T^{grow}\}$. Since the candidate trees are equivalent above $P(\eta(x_t))$, the parent node for $x_t$ on tree $T_{t-1}$, we calculate posterior probabilities only for the subtrees rooted at this node. With candidate subtrees denoted $T^{move}_t$ and containing data $[x,y]^t$ (including $x_t$ and $y_t$), the new $T_t$ is sampled with probabilities proportional to $\pi(T^{move}_t)\,\mathrm{p}(y_t \mid x_t, T^{move}_t)$. Here, the prior penalty is (2) and the likelihood is (3), with leaf marginals from (6), (9), or (12). Finish with deterministic sufficient statistic updates $S^{(i)}_{t-1} \rightarrow S^{(i)}_t$.

These two simple steps yield an updated particle approximation $\{T^{(i)}_t\}_{i=1}^N \sim \mathrm{p}(T_t \mid [x,y]^t)$. In an appealing division of labor, resampling incorporates global changes to the tree posterior, while propagation provides local modifications. As with all particle simulation methods, some Monte Carlo error will accumulate and, in practice, one must be careful to assess its effect. However, as mentioned above, our strategy makes major gains by integrating over model parameters to obtain particles which consist of only split rules and sufficient statistics.
Given this efficient low-dimensional particle definition, our resampling-first procedure will sequentially discard all models except those which are predicting well, and this tempers posterior search to a small set of plausible trees with good predictive properties. We will see in Section 3 that PL for trees has some significant advantages over the traditional MCMC approach.

2.4 Marginal Likelihood Estimation

An attractive feature of our dynamic trees is that, due to the use of simple leaf regression models, reliable posterior marginal likelihood estimates are available through the sequential factorization $\mathrm{p}(y^T \mid x^T) = \prod_{t=1}^T \mathrm{p}(y_t \mid x_t, [x,y]^{t-1})$. However, the use of improper priors in constant and linear trees implies that the marginal likelihood is not properly defined. As described in Atkinson (1978), this can be overcome if some of the data are set aside, before any model comparison, and used to form proper priors for the leaf model parameters. Such "training samples" are already enforced within our dynamic tree evolution through the minimum partition-size requirements for each model. As such, a valid approximate marginal likelihood is available by conditioning on the first $t_0$ observations, where $t_0$ is large enough to provide a proper posterior predictive distribution for the unpartitioned "root" tree. Hence, we write $\mathrm{p}(y^T \mid x^T) \approx \prod_{t=t_0+1}^T \mathrm{p}(y_t \mid x_t, [x,y]^{t-1})$, where each factor is approximated via PL as
$$\mathrm{E}\left[ \mathrm{p}(y_t \mid x_t, [x,y]^{t-1}, T_{t-1}) \right] \approx \frac{1}{N} \sum_{i=1}^N \mathrm{p}(y_t \mid x_t, [T,S]^{(i)}_{t-1}), \tag{15}$$
with the expectation taken with respect to $\mathrm{p}(T_{t-1} \mid [x,y]^{t-1})$. This is just the normalizing constant for the PL resampling probabilities (refer to Section 2.3). These marginal likelihood calculations make it possible to compare leaf model specifications through the marginal likelihood ratio, or Bayes factor (BF).
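Computationally, the estimate (15) amounts to accumulating, for t > t₀, the log of the average resampling weight at each step; even prior odds then convert a log BF into a posterior model probability. A minimal sketch, assuming the N per-step predictive densities have been retained (`log_marginal_estimate` and `posterior_prob_linear` are illustrative names, not the dynaTree API):

```python
import math

def log_marginal_estimate(pred_weights, t0=5):
    """Approximate log p(y_T | x_T) as in Section 2.4: the running sum, over
    t > t0, of the log of the average PL resampling weight (15).
    pred_weights[t-1] holds the N predictive densities p(y_t | x_t, [T,S]_{t-1}^(i))."""
    logml = 0.0
    for t, w in enumerate(pred_weights, start=1):
        if t <= t0:
            continue  # condition on the first t0 observations (the training sample)
        logml += math.log(sum(w) / len(w))
    return logml

def posterior_prob_linear(log_ml_lin, log_ml_const):
    """Posterior probability of the linear-leaf model under even prior odds,
    p_lin / (p_lin + p_const), computed on the log scale for stability."""
    return 1.0 / (1.0 + math.exp(log_ml_const - log_ml_lin))
```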
For example, in a comparison of linear leaves against a constant leaf model, $\mathrm{p}_{lin}(y^T \mid x^T) / \mathrm{p}_{const}(y^T \mid x^T)$ measures the relative evidence in favor of linear leaves (see Kass and Raftery, 1995, for information on assessing BFs; also, note that the posterior probability of linear leaves is $\mathrm{p}_{lin}(y^T \mid x^T) / (\mathrm{p}_{lin}(y^T \mid x^T) + \mathrm{p}_{const}(y^T \mid x^T))$ given even prior probability for each model). Linear and constant leaves are just extreme options for the general model choice of which covariates should be leaf regressors, and the BF approach is generally applicable in such problems. This approach to inference is data-order dependent, due to the state-space factorization assumption $\mathrm{p}(y_t \mid x^T, y^{t-1}) = \mathrm{p}(y_t \mid x^t, y^{t-1})$. In addition, BFs are only available conditional on the initial $y^{t_0}$ training sample, and clearly both the training sample and the data order must be the same across models for marginal likelihood estimation. However, we will see in Section 3.1 that these BFs lead to consistent model choices in the face of strong evidence, and that average BFs over repeated reorderings and training samples serve as an effective model selection criterion.

3 Application and Illustration

We now present a series of examples designed to illustrate our approach to on-line regression. Section 3.1 considers two simple 1-d regression problems, illustrating both constant and linear mean leaf models over multiple PL runs, before comparing dynamic tree predictive performance against competing estimators in a 5-d application. The next two sections focus on the sequential design of experiments, with optimization problems in Section 3.2 and active learning in Section 3.3. Finally, Section 3.4 describes the application of dynamic multinomial leaf trees to a 15-d classification problem. Throughout, our dynamic trees were fit using the dynaTree package (Gramacy and Taddy, 2010b) for R under the default parametrization.
In particular, the tree prior of Section 2.1 is specified with α = 0.95 and β = 2; inference is generally robust to reasonable changes to this parametrization (e.g., the four settings described in Figure 3 of CGM (2002) lead to no qualitative differences).

3.1 Regression Model Comparison

We begin by considering two simple 1-d applications. The first, the "parabola data", has 100 observations from $y(x) \sim \mathrm{N}(x + x^2, 1/5^2)$, where x was drawn uniformly from [−3, 3]. The second, the "motorcycle data", available in the MASS library for R, consists of observed acceleration on motorcycle riders' helmets at 133 different time points after simulated impact. The top two rows of Figure 2 show 30 repeated filtered posterior fits for random re-orderings of the data, obtained through the PL algorithm of Section 2.3 with 1000 particles. The first row corresponds to constant leaf models and the second row to linear leaves, and each plot shows the posterior mean and 90% predictive intervals. Although there is clear variation from one PL run to the next, this is not considered excessive given the random data orderings and small number of particles. Linear leaf models appear to be far better than constant models at adapting to the parabola data. With the parabola data, a constant leaf model leads to higher average tree height (9.4, vs 5.4 for linear leaves), while the opposite is true for the motorcycle data (average constant leaf height is 10.5, vs 12.8 for linear leaves).

A more rigorous assessment of the relative evidence in favor of each leaf model is possible through estimated Bayes factors, as described in Section 2.4. Figure 3 presents filtered log BFs comparing linear to constant leaves, calculated for the 30 random data orderings used in Figure 2 and conditional on the first $t_0 = 5$ observations.
As discussed in Section 2.4, due to both random ordering and different $y^{t_0}$ training samples, the BF estimates are not directly comparable across runs. However, for the parabola data, in every case the linear model is clearly preferable. For the motorcycle data, although there is no consistent evidence in favor of either model, the majority of runs produce log BF values below zero, and the mean across runs (which eliminates order dependence as the number of runs increases) shows fairly strong evidence against the linear model. This agrees with the visually similar regression fits drawn in Figure 2, which were obtained after fewer average partitions under constant leaves than with linear leaves.

Finally, the bottom row of Figure 2 compares the means of the 30 dynamic tree fits to two similar modern nonparametric regression techniques: treed Gaussian processes (TGP) and Bayesian additive regression trees (BART; Chipman et al., 2010). Each of these models is designed as an extension to constant or linear regression trees, and takes steps to partially alleviate the mixing problems of CGM's original MCMC inference. As mentioned previously, TGP takes advantage of a more flexible leaf regression model to allow for broader covariate partitions. In a different approach, BART proposes a mixture of relatively short trees, and the authors show that combinations of simple individual partitioning schemes can lead to a complicated predictive response surface. The TGP model is given 10 restarts of a 10,000 iteration MCMC run, while BART results are based on a single 1,100 iteration chain. For the parabola data, all of the models find practically identical fits, except for the poorly performing constant leaf dynamic tree.
However, for the motorcycle data, we see that each model leads to very similar mean functions, but that the posterior predictive 90% intervals for BART and TGP appear to variously over- or under-estimate data uncertainty around the regression mean. In particular, BART's global variance term is misspecified for this heteroskedastic data.

Figure 2: Top two rows show, for constant and linear leaves, posterior mean and 90% interval for each of 30 runs with 1000 particles on randomly ordered data (left column: parabola data; right column: motorcycle data). Bottom row shows mean and 90% interval for the average across dynamic tree runs, as well as for BART and TGP.

Figure 3: Filtered log Bayes factors for linear against constant leaf tree models, calculated for 30 random data orderings (the mean is shown in black) with 1000 particles.

To benchmark our methods on a more realistic high-dimensional example, we also consider out-of-sample prediction for a Friedman test function (originally designed to illustrate multivariate adaptive regression splines (MARS) in Friedman, 1991). The response is $10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5$ with N(0, 1) additive error. Our comparators, and their R implementations, are: treed constant (TC) and linear (TL) models, and GP models from tgp (Gramacy and Taddy, 2009); MARS from mda (Leisch et al., 2009); Random Forests (RF) from randomForest (Liaw and Wiener, 2002); BART from BayesTree (Chipman and McCulloch, 2009); and neural networks with 2-3 hidden layers from the nnet library.
For this experiment, 100 random training and prediction sets, of respective sizes 200 and 1000, were drawn with inputs uniform on $[0,1]^5$. Following Chipman et al. (2002), we measure performance through predictive root mean-square error (RMSE) to the true mean; results are shown in Figure 4. Dynamic tree models (DTC/DTL) were fit in a single (random) pass with N = 1000 particles, and the number of MCMC iterations and restarts, etc., for the other Bayesian methods was such that the Monte Carlo error in RMSE for repeated runs on the same data is negligible against the results in Figure 4. Non-Bayesian comparators were fit under package defaults. The dynamic versions of the tree models clearly outperform their static counterparts, and the dynamic treed linear model (DTL) performs nearly as well as or better than the GP and BART, both of which offer flexible mean functions under a (true) constant variance term.

3.2 Search Optimization

In Taddy et al. (2009), TGP regression was used to augment a local pattern search scheme with ranked lists of locations that maximize a measure of expected improvement. The GP leaf model was successful in this setting – deterministic optimization with 8 inputs – but pairwise covariance calculations and repeated MCMC runs became unwieldy in higher dimensions. At the same time, such hybrid optimization schemes are most useful in complicated high-dimensional settings: although true global optimization is impossible without great expense, a hybrid method uses regression to locate promising input areas and move the local search appropriately, thus iterating towards robust optimality.
Figure 4: Predictive RMSE on the Friedman data; sample of 100 for each estimator.

    method   RMSE mean (sd)
    GP       0.723 (0.053)
    DTL      0.917 (0.104)
    BART     0.935 (0.063)
    TL       1.244 (0.217)
    MARS     1.513 (0.072)
    NNET2    1.913 (0.429)
    NNET1    2.246 (0.424)
    RF       2.320 (0.093)
    DTC      2.459 (0.074)
    TC       3.067 (0.102)

We propose that our dynamic regression trees, due to a flexible on-line inference framework, provide a very attractive regression tool for hybrid local–global search optimization. Although they are not interpolators, and are thus of limited usefulness in deterministic optimization, both constant and linear regression trees provide an efficient model for the prediction and optimization of stochastic functions. Such stochastic optimization problems are common in operations research and, despite being deterministic, many engineering codes are more properly treated as stochastic due to numerical instability. This section will outline the use of dynamic regression trees for the global component of a stochastic optimization search.

We seek to explore the input space in areas that are likely to provide the optimum response (by default, a minimum). In the deterministic setting, Jones et al. (1998) search for inputs which maximize the posterior expectation of an improvement statistic, $\max\{f_{min} - f(x), 0\}$. In more generality, improvement is just the utility derived from a new observation. For global search optimization, there is no negative consequence of an evaluation that does not yield a new optimum (this information is useful), and the utility of a point which does provide a new optimum is, say, linear in the difference between mean responses (other quantiles or moments are also possible).
Thus, the improvement for a stochastic objective y(x) after t observations is $I(x) = \max\{\hat{y}_{min} - \hat{y}(x), 0\}$, where $\hat{y}(x)$ is the regression model mean function (e.g., $\hat{y}(x) = \mu_\eta$ for constant leaves or $\hat{y}(x) = \mu_\eta + x'\beta_\eta$ for linear leaves) and $\hat{y}_{min}$ is this expected response minimized over the input domain. In hybrid schemes, we have found that maximizing $\mathrm{E}[I(x)]$ can be overly myopic. We thus incorporate an active learning idea (refer to Section 3.3), and instead search for the maximizing argument to $G(x; \phi) = \mathrm{E}[I(x)] + \mathrm{sd}(\hat{y}(x))/\phi$, where φ is a precision parameter that can be decreased to favor a more global scope.

Conveniently, the use of constant or linear leaves allows both $\mathrm{E}[I(x)]$ and $\mathrm{sd}(\hat{y}(x))$ to be calculated in closed form conditional on a given tree. In particular, with $\mathrm{St}(a_\eta(x), b_\eta(x), c_\eta)$ the posterior for $\hat{y}(x)$ given tree $T_t$, and $\hat{y}_{min}$ fixed at the minimum of $a_\eta(\cdot)$ over our input domain, the expected improvement $\mathrm{E}[I]$ is (suppressing x)
$$(\hat{y}_{min} - a_\eta)\, \mathrm{T}_{c_\eta}\!\left(\frac{\hat{y}_{min} - a_\eta}{\sqrt{b_\eta}}\right) + \frac{\sqrt{b_\eta}}{c_\eta - 1}\left(c_\eta + \frac{(\hat{y}_{min} - a_\eta)^2}{b_\eta}\right) \mathrm{t}_{c_\eta}\!\left(\frac{\hat{y}_{min} - a_\eta}{\sqrt{b_\eta}}\right),$$
with $\mathrm{T}_c$ and $\mathrm{t}_c$ the standard t cumulative distribution function and density, respectively, with c degrees of freedom (this is similar to the marginal improvement derived in Williams et al., 2000). Finally, since $\mathrm{E}[I] = \frac{1}{N}\sum_{i=1}^N \mathrm{E}[I \mid T^{(i)}]$ and $\mathrm{Var}(\hat{y}) = \mathrm{E}[\mathrm{Var}(\hat{y} \mid T)] + \mathrm{Var}(\mathrm{E}[\hat{y} \mid T])$, it is possible to obtain $G = \mathrm{E}[I] + \mathrm{sd}(\hat{y})/\phi$ by evaluating the appropriate functionals conditional on each $T \in \{T^{(i)}_t\}_{i=1}^N$ before taking moments across particles (see Section 3.3 for further detail).

In a generic approach to sequential design, which is also adopted in the active learning algorithms of Section 3.3, the choice of the "next point" for evaluation is based on criteria optimization over a discrete set of candidate locations.
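As a sanity check on the closed-form expected improvement above, $\mathrm{E}[I] = \mathrm{E}[\max\{\hat{y}_{min} - y, 0\}]$ for a location–scale Student-t posterior can also be computed by direct numerical integration against the t density. The sketch below (not part of the dynaTree package) uses only stdlib gamma functions:

```python
import math

def t_pdf(z, c):
    """Standard Student-t density with c degrees of freedom."""
    return math.exp(math.lgamma((c + 1) / 2.0) - math.lgamma(c / 2.0)
                    - 0.5 * math.log(c * math.pi)
                    - (c + 1) / 2.0 * math.log(1.0 + z * z / c))

def expected_improvement(y_min, a, b, c, n_grid=20000, z_lo=-50.0):
    """E[max(y_min - y, 0)] for y ~ a + sqrt(b) * t_c, by midpoint-rule
    integration; the closed form in terms of the t cdf and pdf is in the text."""
    z_hi = (y_min - a) / math.sqrt(b)  # improvement is zero for y above y_min
    if z_hi <= z_lo:
        return 0.0
    h = (z_hi - z_lo) / n_grid
    total = 0.0
    for i in range(n_grid):
        z = z_lo + (i + 0.5) * h
        total += (y_min - (a + math.sqrt(b) * z)) * t_pdf(z, c)
    return total * h
```

For large c the t posterior is nearly Gaussian, so with $\hat{y}_{min} = a$ the value approaches $\sqrt{b}/\sqrt{2\pi}$, which provides a convenient check.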
After initializing with a small number of function evaluations, each optimization step augments the existing sample $[x,y]^t$ by drawing a space-filling design (we use uniform Latin hypercube samples) of candidate locations, $\tilde{X} = \{\tilde{x}_i\}_{i=1}^M$, and finding the $x^\star \in \tilde{X}$ which maximizes $G(x^\star)$. Next, evaluate the new location, set $[x_{t+1}, y_{t+1}] = [x^\star, y(x^\star)]$, and follow the steps in Section 2.3 to obtain an updated particle approximation to the tree posterior. This is repeated as necessary.

Figure 5 shows some results for mean optimization of the test function $y = \sin(x) - \mathrm{Cauchy}(x; 1.6, 0.15) + \varepsilon$, with $\varepsilon \sim \mathrm{N}(0, \sigma = 0.1)$, for precision values φ = 1 and φ = 10. The fitted regression trees assume linear leaves and, in each case, we used 500 particles and 100 candidate locations. The role of φ in governing global scope is clear, as the higher precision search is quickly able to locate a robust minimum but does not then proceed with wider exploration. The linear leaf model seems to be efficient at fitting our test function.

We also consider mean optimization of the two-dimensional exponential function $y \sim \mathrm{N}(x_1 \exp(-x_1^2 - x_2^2), 10^{-6})$, which is discussed further in Section 3.3. Given data from a random initial sample of 10 locations, our search routine was used to obtain the next 10 function evaluations, with φ = 1, 10, or 100, using 1000 particles and candidate sets of 200 locations. A TGP-based search optimization, as described in Gramacy and Taddy (2010a), was also applied. This routine is the same as that outlined above for dynamic trees, except that it maximizes $\mathrm{E}[I^g]$ rather than G, each iteration requires a new MCMC run (we use 500 iterations after a burn-in of 500), and the candidate set is augmented with the location minimizing the predicted response (such adaptation will often lead to a more efficient search).
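The candidate-based design step described at the start of this section (draw candidates, maximize G, evaluate, update) can be sketched as follows. This is a schematic: the `ei`, `sd`, and `update` methods are stand-in names for the model computations, not the dynaTree API, and plain uniform draws replace the Latin hypercube samples used in the paper.

```python
import random

def design_step(model, bounds, M, phi, f, rng=random):
    """One sequential-design iteration (sketch): draw M random candidates in the
    box `bounds`, pick the maximizer of G(x) = E[I(x)] + sd(yhat(x))/phi,
    evaluate the objective f there, and update the model with the new pair."""
    candidates = [[rng.uniform(lo, hi) for (lo, hi) in bounds] for _ in range(M)]
    x_star = max(candidates, key=lambda x: model.ei(x) + model.sd(x) / phi)
    model.update(x_star, f(x_star))
    return x_star
```

Decreasing `phi` up-weights the predictive standard deviation, pushing the search toward a more global scope, as discussed above.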
Here, the g parameter increases the global scope (refer to Schonlau et al., 1998), and we consider g = 1, 5, or 10. Each optimization scheme was repeated 50 times, and the summary is shown in Table 1. In a result that parallels findings in Section 3.3, constant, rather than linear, leaves lead to a better exploration of the 2-d exponential function; this relative weakness exposes a difficulty for linear models in fitting the rapid oscillations around (0, 0). Interestingly, constant trees also lead to better results than the TGP optimization routine. This is despite the added flexibility of GP leaves, an augmented candidate sample, and the use of a wide range of g values for $\mathrm{E}[I^g]$ (which, although harder to compute when marginalizing over model parameters, is the literature standard for this type of search). In addition, due to repeated MCMC runs, the TGP algorithm requires nearly 10 times the computation of sequential tree optimization.
Figure 5: Optimization with dynamic regression trees on the sin/Cauchy test function. Iterations 10, 20, 40, and 60 with φ = 1 (top two rows), and iterations 2, 15, 30, and 45 with φ = 10 (bottom two rows).
For each iteration, the posterior mean and 90% interval are on top (with the true function and data in grey, and the "next point" in green), while the bottom plot shows E[I] in green, sd(ŷ) in blue, and G = E[I] + sd(ŷ)/φ in red.

Table 1: Minimization of the 2-d exponential function. For each optimization scheme, the top number is the mean solution and the bottom number is the standard deviation (over 50 runs).

            Constant tree               Linear tree                 Treed GP
        φ=100    φ=10     φ=1      φ=100    φ=10     φ=1      g=1      g=5      g=10
 mean   −0.20    −0.21    −0.17    −0.04    −0.07    −0.05    −0.10    −0.08    −0.13
 (sd)   (0.13)   (0.15)   (0.15)   (0.09)   (0.13)   (0.09)   (0.12)   (0.10)   (0.14)

3.3 Active Learning

Another common application of sequential design is focused on evaluating an unknown response surface – the mean function f(x) in (1.a) – at its most informative input location. Such active learning procedures are intended to provide efficient automatic exploration of the covariate space, thus guiding an on-line minimization of prediction error. As in the search optimization of Section 3.2, each active learning iteration will draw candidate locations, $\tilde{X} = \{\tilde{x}_i\}_{i=1}^M$, and select the next ((t+1)st) design point to be the $x^\star \in \tilde{X}$ which maximizes a heuristic statistic. However, instead of aiming for a function optimum, we are now solely interested in understanding the response surface over some specified input range. Hence, we need to choose $x^\star$ to maximize some measure of the predictive information gained by sampling $y(x^\star)$. Two common heuristics for this purpose are active learning MacKay (ALM; 1992) and active learning Cohn (ALC; 1996). An ALM scheme selects the $x^\star$ that leads to maximum variance for $y(x^\star)$, whereas ALC chooses $x^\star$ to maximize the expected reduction in predictive variance averaged over the input space.
A comparison between approaches depends upon the regression model and the application; however, it may be shown that both approximate a maximum expected information design and that ALC improves upon ALM under heteroskedastic noise. Both heuristics have computational demands that grow with $|\tilde{X}|$: ALC requires time in $O(|\tilde{X}|^2)$, whereas ALM is in $O(|\tilde{X}|)$. Gramacy and Lee (2009) provide further discussion of active learning, as well as extensive results for ALC and ALM schemes based on MCMC inference with CGM's static treed constant/linear (TC/TL) models, GPs, and TGPs.

Since active learning is an inherently on-line algorithm, it provides a natural application for dynamic regression trees. The remainder of this section will illustrate ALC/ALM schemes built around particle learning for trees, and show that our methods compare favorably to existing MCMC-based alternatives. As in Section 3.2, both constant and linear leaf models lead to closed-form calculations of the heuristic functionals conditional on a given tree. Fast prediction allows us to evaluate the necessary statistics across candidates, for each particle, and hence find the optimal $x^\star$ under our heuristic. Trees are then updated for $y(x^\star)$, and the process is repeated.

ALM seeks to maximize $\mathrm{Var}(y(x))$, which is fairly simple to calculate. For a given tree $T_t$ such that x is allocated to leaf node $\eta \in \mathcal{L}_{T_t}$, let $\mu_\eta(x) = \mathrm{E}[y(x) \mid \eta]$ and $v_\eta(x) = \mathrm{Var}(y(x) \mid \eta)$ denote the conditional predictive mean and variance, respectively, for y(x). These are available from equations (7) and (10) for each leaf model as $\mu_\eta(x) = \bar{y}_\eta$ or $\mu_\eta(x) = \bar{y}_\eta + \hat{x}'\hat{\beta}_\eta$, and $v_\eta(x) = s_\eta^2 (1 + 1/|\eta|)/(|\eta| - 3)$ or $v_\eta(x) = [s_\eta^2 - R_\eta](1 + 1/|\eta| + \hat{x}' G_\eta^{-1} \hat{x})/(|\eta| - d - 3)$.
Given the particle set $\{T^{(i)}_t\}_{i=1}^N$, the unconditional predictive variance is
$$\mathrm{Var}(y(x)) = \mathrm{E}[\mathrm{Var}(y(x) \mid T)] + \mathrm{Var}(\mathrm{E}[y(x) \mid T]) \approx \frac{1}{N}\left[\sum_{i=1}^N v^{(i)}_\eta(x) + \mu^{(i)}_\eta(x)^2\right] - \left[\frac{1}{N}\sum_{i=1}^N \mu^{(i)}_\eta(x)\right]^2, \tag{16}$$
which is straightforward to evaluate for all $x \in \tilde{X}$ during our search for the maximizing $x^\star$.

The ALC statistic is more complicated. Let $\Delta\sigma^2_x(x')$ denote the reduction in variance at $x'$ when the design is augmented with x. Since y(x) is not observed, the ALC statistic must condition on the existing model state. Hence, we define $\Delta\sigma^2_x(x') = \mathrm{E}[\Delta\sigma^2_x(x' \mid T_t)]$ at time t in terms of the conditional variance reduction
$$\Delta\sigma^2_x(x' \mid T_t) = \mathrm{Var}(y(x') \mid T_t) - \mathrm{Var}(y(x') \mid x, T_t). \tag{17}$$
Expressions for $\Delta\sigma^2_x(x' \mid T_t)$ may be obtained as a special case of the results given by Gramacy and Lee (2009). Clearly, $\Delta\sigma^2_x(x' \mid T_t) = 0$ if x and $x'$ are allocated to different leaves of $T_t$. Otherwise, for x and $x'$ allocated to $\eta \in \mathcal{L}_{T_t}$ in a given tree, we have that $\Delta\sigma^2_x(x' \mid T_t) = \Delta\sigma^2_x(x' \mid \eta)$ is (for constant and linear leaves, respectively)
$$\frac{s_\eta^2}{|\eta| - 3} \times \frac{(1/|\eta|)^2}{1 + 1/|\eta|} \quad \text{or} \quad \frac{s_\eta^2 - R_\eta}{|\eta| - d - 3} \times \frac{\left(1/|\eta| + (\hat{x}')' G_\eta^{-1} \hat{x}\right)^2}{1 + 1/|\eta| + \hat{x}' G_\eta^{-1} \hat{x}}, \tag{18}$$
where $\hat{x} = x - \bar{x}_\eta$ and similarly for $\hat{x}'$. In practice, the integral $\Delta\sigma^2(x) = \int_{\mathbb{R}^d} \Delta\sigma^2_x(x')\, dx'$ is approximated by a discrete sum over the random space-filling set $\tilde{X}$.

Figure 6: Active learning comparison for the parabola (left) and motorcycle data (right). The top row shows $\mathrm{Var}(y(x))$ for ALM and the bottom shows $\sqrt{\Delta\sigma^2(x)}$ for ALC.
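The ALM statistic (16) combines per-particle leaf moments by the law of total variance; a minimal sketch, taking the per-particle means and variances as inputs:

```python
def alm_variance(means, variances):
    """Unconditional predictive variance, as in (16): Var(y(x)) =
    E[Var(y(x)|T)] + Var(E[y(x)|T]), with both expectations approximated by
    averages of the per-particle leaf moments (mu_i, v_i)."""
    N = len(means)
    m1 = sum(means) / N
    return sum(v + m * m for m, v in zip(means, variances)) / N - m1 * m1
```

Note that the second term, the variance of the per-particle means, captures disagreement across trees and so can flag regions where the partition structure itself is uncertain.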
Hence, $x^\star$ is the candidate location which maximizes the sum of $\Delta\sigma^2_{x^\star}(x' \mid \mathcal{T}_t)$ over both $x' \in \tilde{X}$ and $\mathcal{T}_t \in \{\mathcal{T}^{(i)}_t\}_{i=1}^N$.

We use the parabola and motorcycle data examples of Section 3.1 to compare ALC and ALM for both constant and linear leaf dynamic trees. Figure 6 illustrates the statistics associated with each combination of the two heuristics and two regression models, evaluated given the complete datasets. For the parabola data (left column), the ALM and ALC plots look roughly similar and, in each case, the linear model statistics are much flatter than those for the constant model. In contrast, the ALM and ALC plots are very different from each other for the motorcycle data, which exhibit heteroskedastic additive error.

For a further comparison we return to the sin/Cauchy data from Section 3.2. Figure 7 shows sequential design progress under our dynamic tree model with linear leaves and the ALC heuristic in three snapshots: after an initial $t = 10$ Latin hypercube sample (LHS), after 15 samples taken via ALC ($t = 25$), and after 45 ALC samples ($t = 55$).

[Figure 7: Snapshots showing the progression of the ALC heuristic driven by the dynamic tree model with linear leaves on the sin/Cauchy test data. The top panels show the true surface (grey lines), sampled input/output pairs (grey points), and the dynamic tree predictive means and 90% quantiles (red lines). The bottom panels show the corresponding $\Delta\sigma^2(x)$ surface.]

With the exception of $t = 10$, when there is not enough data to support a split in the tree(s), the ALC statistic is high at points where $y(x)$ is changing direction, and highest where it is changing most rapidly. Table 2 (left) shows how the dynamic tree models compare to various static Bayesian treed models, including TGP, on this data. All experiments were started with an initial LHS in $[0, 7]$ of size 10, followed by 40 active learning rounds. Each round uses 20 random LHS candidates $\tilde{X}$ in $[0, 7]$. At the end of the 40 rounds, the root mean squared error (RMSE) was calculated by comparing predictive means to the truth on a size-200 hold-out set from a GP-based maximum entropy (ME) design in $[0, 7]$. This was repeated 30 times, and the average RMSE is shown in the table.

    1-d sin/Cauchy data           2-d exponential data
    model  design  RMSE           model  design  RMSE
    TGP    ALC     0.0641         TGP    ALM     0.00577
    TGP    ALM     0.0660         DTC    ALC     0.00732
    DTL    ALC     0.0755         DTL    ALM     0.00740
    GP     ALM     0.0847         TGP    ALC     0.00852
    DTL    ALM     0.0992         DTL    ALC     0.00936
    –      LHS     0.1251         DTC    ALM     0.00997
    GP     ALC     0.1265         TC     ALM     0.01126
    –      ME      0.1325         TC     ALC     0.01182
    TC     ALC     0.1597         TL     ALM     0.01622
    DTC    ALC     0.1616         GP     ALM     0.02411
    TC     ALM     0.1960         TL     ALC     0.02431
    TL     ALC     0.2193         –      LHS     0.03636
    DTC    ALM     0.2251         –      ME      0.04408
    TL     ALM     0.2458         GP     ALC     0.05419

Table 2: Comparing models and active learning heuristics on the 1-d sin/Cauchy data (left) and the 2-d exponential data (right). Each table is sorted by RMSE.

The expensive TGP methods dominate, with dynamic trees using linear leaves (DTL) trailing. In both cases, ALC edges out ALM. Interestingly, while static treed linear models (TL) fit through MCMC are amongst the worst performers, their sequential alternatives are amongst the best.
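The ALC selection rule described above (choose the candidate maximizing the variance reduction summed over the reference set and over all particles) can be sketched in Python; here `delta_var` is a hypothetical stand-in for equation (18), not a dynaTree function:

```python
def alc_pick(candidates, reference, particles, delta_var):
    """ALC selection: return the candidate x maximizing the sum of
    Delta sigma^2_x(x' | T) over x' in the reference set Xtilde and over
    all particles (trees).  `delta_var(x, xp, tree)` stands in for
    equation (18) and should return 0 whenever x and x' fall in
    different leaves of the tree."""
    def score(x):
        return sum(delta_var(x, xp, tree)
                   for xp in reference for tree in particles)
    return max(candidates, key=score)
```

The double loop makes the $O(|\tilde{X}|^2)$ cost of ALC (versus $O(|\tilde{X}|)$ for ALM) explicit: every candidate is scored against every reference location, for every particle.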
Neither TC nor DTC are strong performers, perhaps due to the almost linear derivatives of our test function. Since the "interesting" aspects of this response are evenly distributed throughout the input space, both the GP with ALM and the offline LHS and ME comparators also do well.

Table 2 (right) shows the results of a similar experiment for the 2-d exponential data from Section 3.2. The setup is as described above, except that the initial LHS is now of size 20 and is followed by 55 active learning rounds. Here, the dynamic trees share the top six spots with TGP alone, and clearly dominate their static analogues. Indeed, DTC is the second-best performer, echoing our findings in Section 3.2 that the constant leaf model is a good fit for this function. In contrast to the results from our simple 1-d example, the offline (LHS and ME) and GP methods are poor here because the region of interest is confined to the bottom-left quadrant of our 2-d input space. Finally, in all of these examples, dynamic regression trees (DTC and DTL) are the only methods that run on-line and do not require batch MCMC processing.

3.4 Classification

Classification is one of the original applications for tree models, and is very often associated with sequential inference settings. Categorical data can be fit with multinomial leaf trees (as in Section 2.2.3) and, without an application-specific loss function, the predictive classification rule assigns to new inputs the class with highest mean conditional posterior probability. That is,
$$ \hat{c}(x) = \arg\max \left\{ \frac{1}{N} \sum_{i=1}^N \left[ \hat{p}_{\eta(x)c} \right]^{(i)} : c = 1, \ldots, C \right\}, \quad (19) $$
for a set of $N$ particles, where each $\hat{p}_{\eta(x)c}$ is the $c$-class probability from (13) for the leaf corresponding to $x$. Evaluating $\hat{c}(x)$ over the input space provides a predictive classification surface.
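The rule in (19) can be sketched in a few lines of Python; the `leaf_probs` interface is a hypothetical hook for equation (13), not the actual dynaTree API:

```python
def classify(x, particles, leaf_probs, C):
    """Predictive class per equation (19): average each class probability
    over the particle set, then report the argmax class.
    `leaf_probs(x, particle)` returns the length-C multinomial probability
    vector from the leaf of that particle's tree containing x (assumed hook)."""
    N = len(particles)
    avg = [0.0] * C
    for p in particles:
        probs = leaf_probs(x, p)
        for c in range(C):
            avg[c] += probs[c] / N
    return max(range(C), key=lambda c: avg[c])
```

Note that the probabilities are averaged across particles first and the argmax is taken once, rather than taking a majority vote of per-particle argmaxes.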
We first apply our methodology to the Cushing's data, again available in the MASS library for R, used throughout the book by Ripley (1996) to illustrate various classification algorithms. This simple data set has two inputs, each patient's urinary excretion rates for tetrahydrocortisone and pregnanetriol, considered on the log scale, and the response is one of three different types of Cushing's syndrome (referred to as a, b, or c). We fit the multinomial leaf dynamic tree model to this data, and compare our results to those from a GP soft-max classifier as outlined in Broderick and Gramacy (2009), which fits independent GP models to each latent response $y_c$ such that $p_c = \exp(y_c) / \sum_{l=1}^C \exp(y_l)$.

Figure 8 illustrates the outcome of this example. The dynamic tree results (top row) are based on a set of 1000 particles, while the GP soft-max results (bottom row) use every 100th observation of a 10,000-iteration MCMC chain. The left column of this figure shows the syndrome classification surface based on maximum mean posterior probability, as in (19). Although it is not possible to assess whether either classifier is outperforming the other, these surfaces clearly illustrate the effect of axis-aligned (tree) classification as opposed to the surface from a radial-basis (GP) model. We also note that the GP soft-max classification rule is very similar to results from Ripley (1996, chap. 5.2) for a neural network fit with weight decay.

Figure 8: Cushing's data example.
The left column shows the class corresponding to maximum mean posterior probability, and the right has entropy for the posterior mean probability vector. Results for a dynamic classification tree are on top, and for a GP soft-max classifier on the bottom. Class (a → c) and entropy rise from black to white, and plotted data is labelled by class.

The right column of Figure 8 shows the entropy, $-\sum_{l \in \{a,b,c\}} p_l \log(p_l)$, of the posterior mean probability surface for each classification model. Such entropy plots, which are commonly used in classification settings to assess response variability, illustrate change in expected value of information over the input space. Indeed, Gramacy and Polson (2010) propose entropy as an active learning criterion for classification problems, analogous to ALC in Section 3.3. In our application, dynamic tree entropy is peaked around a three-class border region in the top-left. This intuitively offers better guidance about the value of additional information than the GP entropy surface, which is high everywhere except right at clusters of same-class observations.

To better assess the performance of our dynamic tree classifier, we turn to a larger "credit approval" data set from the UC Irvine Machine Learning Database. This data includes 690 credit card applicants grouped by approval status (either '+' or '-') and 15 input variables. Eleven of these inputs are categorical, and for each of these we have encoded the categories through a series of binary variables. This convenient reparametrization, which expands the input space to 47 dimensions, allows direct application of the model in Section 1.1, with nodes splitting on $x_i = 0$ or $1$ for each binary variable (see Gramacy and Taddy, 2010a, for discussion of this approach in general partition trees).
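The entropy surface is computed from the posterior mean probability vector at each input, averaging the per-particle probabilities before taking the entropy. A minimal sketch in Python (illustrative only; the paper's surfaces come from the dynaTree fit):

```python
import math

def mean_entropy(prob_vectors):
    """Entropy -sum_c pbar_c * log(pbar_c) of the posterior mean class
    probabilities, where pbar averages the per-particle probability
    vectors (as plotted in the right column of Figure 8)."""
    N = len(prob_vectors)
    C = len(prob_vectors[0])
    pbar = [sum(pv[c] for pv in prob_vectors) / N for c in range(C)]
    # terms with pbar_c = 0 contribute 0 to the entropy by convention
    return -sum(p * math.log(p) for p in pbar if p > 0.0)
```

High entropy flags inputs where the averaged class probabilities are near-uniform, which is why it serves as an active learning criterion for classification.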
As an aside, we note that the binary encoding can also be used to incorporate categorical data into constant and linear leaf trees; the only adaptation required, for linear leaves, is to exclude the binary variables from each regression design matrix.

We applied the multinomial dynamic tree to 100 independent repetitions of training on 90% of the credit approval data and predictive classification on the left-out 10% sample. This experiment is identical to that in Broderick and Gramacy (2009), who tested both a TGP soft-max classifier (which includes binary variables for partitioning of latent TGP processes and fits an independent GP to the continuous inputs within each partition) and a naïve GP soft-max classifier (fit to the reparametrized 47 inputs, treating binary variables as continuous in the correlation function). Their results are repeated here, beside those for our dynamic tree, in Table 3. Once again, despite using a less sophisticated leaf model, the dynamic tree is a clear performance winner and the only classifier able to beat a 14% average misclassification rate. The dynamic tree was also orders of magnitude faster than the GP and TGP classifiers (we use 1000 particles, while they use 15,000 MCMC iterations), and only our approach would be feasible in on-line applications.

    Classifier                  Misclassification rate   Time (CPU hours) per fold
    Dynamic Multinomial Tree    0.136 (0.038)            0.01
    TGP soft-max                0.142 (0.036)            1.62
    GP soft-max                 0.146 (0.040)            5.52

Table 3: Mean (and standard deviation) for out-of-sample misclassification over 10 random 10-fold cross-validations, and computing time, for each classifier on the credit approval data.

4 Discussion

We have reformulated regression trees as a dynamic model. This sequential characterization leads to an entirely new class of models which can be fit on-line; have a scale-free automatic prior specification; avoid the need for parameter sampling; and lead to robust and flexible prediction.
We have shown empirically that our dynamic regression trees can provide superior performance, at lower expense, than common alternatives. In a key point, we have been able to define particles for filtering which contain only split rules and sufficient statistics, leading our sequential filtering to efficiently discard all partition models except those which are predicting well. Given the size and complexity of the potential tree space, this narrowing of the posterior search is an essential aspect of our models' success. In addition, restrictions on partition size allow us to make use of improper priors (for constant and linear leaves), which is not usually possible in particle inference.

The modeling and examples contained herein provide encouragement for further work under a strategy that looks to take advantage of sequential model characterizations. In particular, we hope to find that similar ideas on update mechanisms for covariate-dependent model features will lead to insight about dynamic versions of other graphical structures.

References

Atkinson, A. C. (1978). "Posterior Probabilities for Choosing a Regression Model." Biometrika, 65, 39–48.

Breiman, L., Friedman, J. H., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth.

Broderick, T. and Gramacy, R. B. (2009). "Classification and Categorical Inputs with Treed Gaussian Process Models." Tech. rep., University of Cambridge.

Cappé, O., Douc, R., and Moulines, E. (2005). "Comparison of Resampling Schemes for Particle Filtering." In 4th International Symposium on Image and Signal Processing and Analysis (ISPA). Zagreb, Croatia.

Carvalho, C. M., Johannes, M., Lopes, H. F., and Polson, N. G. (2010a). "Particle Learning and Smoothing." Statistical Science, 25, 88–106.

Carvalho, C. M., Lopes, H. F., Polson, N., and Taddy, M. A. (2010b). "Particle Learning for General Mixtures." Bayesian Analysis. To appear.

Chipman, H.
and McCulloch, R. (2009). BayesTree: an R package implementing Bayesian Methods for Tree Based Models. R package version 0.3-1.

Chipman, H. A., George, E. I., and McCulloch, R. E. (1998). "Bayesian CART Model Search (with discussion)." Journal of the American Statistical Association, 93, 935–960.

— (2002). "Bayesian Treed Models." Machine Learning, 48, 303–324.

— (2010). "BART: Bayesian Additive Regression Trees." The Annals of Applied Statistics, 4, 266–298.

Cohn, D. A. (1996). "Neural Network Exploration using Optimal Experimental Design." In Advances in Neural Information Processing Systems, vol. 6(9), 679–686. Morgan Kaufmann Publishers.

Friedman, J. H. (1991). "Multivariate Adaptive Regression Splines." Annals of Statistics, 19(1), 1–67.

Gramacy, R. B. and Lee, H. K. H. (2008). "Bayesian Treed Gaussian Process Models with an Application to Computer Modeling." Journal of the American Statistical Association, 103, 1119–1130.

— (2009). "Adaptive Design and Analysis of Supercomputer Experiments." Technometrics, 51(2), 130–145.

Gramacy, R. B. and Polson, N. G. (2010). "Particle Learning of Gaussian Process Models for Sequential Design and Optimization." Journal of Computational and Graphical Statistics. To appear.

Gramacy, R. B. and Taddy, M. A. (2009). tgp: an R package for Bayesian treed Gaussian process models. R package version 2.3.

— (2010a). "Categorical inputs, sensitivity analysis, optimization and importance tempering with tgp version 2." Journal of Statistical Software, 33.

— (2010b). dynaTree: an R package implementing dynamic trees for learning and design. R package version 1.0.

Higdon, D., Swall, J., and Kern, J. (1999). "Non-Stationary Spatial Modeling." In Bayesian Statistics 6, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, 761–768. Oxford University Press.

Jones, D., Schonlau, M., and Welch, W. (1998).
"Efficient Global Optimization of Expensive Black-Box Functions." Journal of Global Optimization, 13, 455–492.

Kass, R. E. and Raftery, A. E. (1995). "Bayes Factors." Journal of the American Statistical Association, 90, 773–795.

Kong, A., Liu, J. S., and Wong, W. H. (1994). "Sequential Imputations and Bayesian Missing Data Problems." Journal of the American Statistical Association, 89, 278–288.

Leisch, F., Hornik, K., and Ripley, B. D. (2009). mda: an R package for Mixture and flexible discriminant analysis. R package version 0.4-1; S original by Trevor Hastie and Robert Tibshirani.

Liaw, A. and Wiener, M. (2002). "Classification and Regression by randomForest." R News, 2(3), 18–22.

MacKay, D. J. C. (1992). "Information-based Objective Functions for Active Data Selection." Neural Computation, 4(4), 589–603.

Pitt, M. K. and Shephard, N. (1999). "Filtering via Simulation: Auxiliary Particle Filters." Journal of the American Statistical Association, 94, 590–599.

Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press.

Santner, T. J., Williams, B. J., and Notz, W. I. (2003). The Design and Analysis of Computer Experiments. New York, NY: Springer-Verlag.

Schonlau, M., Welch, W., and Jones, D. (1998). "Global versus local search in constrained optimization of computer models." In New Developments and Applications in Experimental Design, 11–25. Institute of Mathematical Statistics.

Taddy, M. A., Lee, H. K. H., Gray, G. A., and Griffin, J. D. (2009). "Bayesian Guided Pattern Search for Robust Local Optimization." Technometrics, 51, 389–401.

Williams, B. J., Santner, T. J., and Notz, W. I. (2000). "Sequential Design of Computer Experiments to Minimize Integrated Response Functions." Statistica Sinica, 10, 1133–1152.
