Generalized Product of Experts for Automatic and Principled Fusion of Gaussian Process Predictions


Authors: Yanshuai Cao, David J. Fleet

Department of Computer Science, University of Toronto

Abstract

In this work, we propose a generalized product of experts (gPoE) framework for combining the predictions of multiple probabilistic models. We identify four desirable properties that are important for scalability, expressiveness and robustness when learning and inferring with a combination of multiple models. Through analysis and experiments, we show that a gPoE of Gaussian processes (GPs) has these qualities, while no other existing combination scheme satisfies all of them at the same time. The resulting GP-gPoE is highly scalable, as individual GP experts can be learned independently in parallel; very expressive, as the way experts are combined depends on the input rather than being fixed; the combined prediction is still a valid probabilistic model with a natural interpretation; and, finally, it is robust to unreliable predictions from individual experts.

1 Introduction

For both practical and theoretical reasons, it is often necessary to combine the predictions of multiple learned models. Mixture of experts, product of experts (PoE) [1], and ensemble methods are perhaps the most obvious frameworks for such prediction fusion. However, there are four desirable properties that no existing fusion scheme achieves at the same time: (i) predictions are combined without the need for joint training or training meta-models; (ii) the way predictions are combined depends on the input rather than being fixed; (iii) the combined prediction is a valid probabilistic model; (iv) unreliable predictions are automatically filtered out of the combined model.
Property (i) allows individual experts to be trained independently, making the overall model easily scalable via parallelization; property (ii) gives the combined model more expressive power; property (iii) allows uncertainty to be used in subsequent modelling or decision making; and, finally, property (iv) ensures that the combined prediction is robust to poor predictions by some of the experts. In this work, we propose a novel scheme called generalized product of experts (gPoE) that achieves all four properties when the individual experts are Gaussian processes, and consequently excels in terms of scalability, robustness and expressiveness of the resulting model.

In comparison, a mixture of experts (MoE) with fixed mixing probabilities does not satisfy (ii) and (iv), and because experts and mixing probabilities generally need to be learned together, (i) is not satisfied either. If an input-dependent gating function is used, then the MoE can achieve properties (ii) and (iv), but joint training is still needed, and the ability (iv) to filter out poor predictions crucially depends on the joint training. Depending on the nature of the expert model, a PoE may or may not need joint training or re-training, but it does not satisfy property (iv): without a gating function to "shut down" bad experts, the combined prediction is easily misled by a single expert putting low probability on the true label. In the ensemble-method regime, bagging [2] does not satisfy (ii) and (iv), as it uses fixed equal weights to combine models and does not automatically filter poor predictions, although empirically it is usually robust due to the equal-weight voting. Boosting and stacking [3] require sequential joint training and training a meta-predictor, respectively, so they do not satisfy (i).
Furthermore, boosting does not satisfy (ii) and (iv), while stacking has only a limited ability for (iv) that depends on training. As we will demonstrate, the proposed gPoE of Gaussian processes not only enjoys the good qualities given by the four desired properties of prediction fusion, but also retains some important attributes of PoE: many weak, uncertain predictions can together yield a very sharp combined prediction; and the combination has a closed analytical form as another Gaussian distribution.

2 Generalized Product of Experts

2.1 PoE

We start by briefly describing the product of experts model, of which our proposed method is a generalization. A PoE models a target probability distribution as the product of multiple densities, each of which is given by one expert. The product is then renormalized to sum to one. In the context of supervised learning, the distributions are conditional:

P(y | x) = (1/Z) ∏_i p_i(y | x)    (1)

In contrast to mixture models, experts in a PoE hold "veto" power, in the sense that a value has low probability under the PoE if a single expert p_i(y | x) assigns it low probability. As Hinton pointed out [1], training such a model for general experts by maximizing likelihood is hard because of the renormalization term Z. However, in the special case of Gaussian experts p_i(y | x) = N(m_i(x), Σ_i(x)), the product distribution is still Gaussian, with mean and covariance:

m(x) = ( ∑_i m_i(x) T_i(x) ) ( ∑_i T_i(x) )⁻¹    (2)

Σ(x) = ( ∑_i T_i(x) )⁻¹    (3)

where T_i(x) = Σ_i(x)⁻¹ is the precision of the i-th Gaussian expert at point x. Qualitatively, confident predictions have more influence over the combined prediction than less confident ones. If the predicted variance were always the correct confidence to use, then PoE would have exactly the behavior needed.
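For a scalar output y, the Gaussian PoE combination in equations (2) and (3) reduces to a precision-weighted average. A minimal sketch (the function name and NumPy interface are our own, not from the paper):

```python
import numpy as np

def poe_combine(means, variances):
    """Product of independent Gaussian expert predictions, eqs. (2)-(3):
    precisions T_i = 1/var_i add up, and the combined mean is the
    precision-weighted average of the expert means."""
    precisions = 1.0 / np.asarray(variances)          # T_i(x)
    total_precision = precisions.sum()                # sum_i T_i(x)
    mean = (np.asarray(means) * precisions).sum() / total_precision  # eq. (2)
    return mean, 1.0 / total_precision                # eq. (3)
```

For example, two experts predicting 0.0 and 2.0, each with unit variance, combine to mean 1.0 with variance 0.5: the product is sharper than either expert, which is the PoE behavior described above, and also why a single over-confident expert can dominate the combination.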
However, a slight model misspecification can cause an expert to produce an erroneously low predicted variance along with a biased mean prediction. Because of the combination rule, such over-confidence by a single expert about its erroneous prediction is enough to be detrimental to the resulting combined model.

2.2 gPoE

Given that PoE has almost the desired behavior, except that an expert's predictive precision is not necessarily the right measure of the reliability of its prediction for use in weighting, we introduce another measure of reliability to down-weight or ignore bad predictions. Like PoE, the proposed generalized product of experts is also a probability model defined by a product of distributions. Here we again focus on conditional distributions for supervised learning, taking the form:

P(y | x) = (1/Z) ∏_i p_i(y | x)^{α_i(x)}    (4)

where α_i(x) ∈ ℝ⁺ is a measure of the i-th expert's reliability at point x. We will introduce one particular choice for Gaussian processes in the next subsection; for now, let us first analyze the effect of α_i(x). Raising a density to a power as in equation (4) has been widely used as a way of annealing distributions in MCMC, or of balancing different parts of a probabilistic model that have different degrees of freedom ([4], see 6.1.2 Balanced GPDM). If α_i(x) = 1 ∀ i, x, then we recover the PoE as a special case. α_i > 1 sharpens the i-th distribution in the product (4), whereas α_i < 1 broadens it. In the limit α_i → ∞ with the other exponents fixed, the largest mode of the i-th distribution dominates the product distribution with arbitrarily large "veto" power; on the other hand, α_i → 0 causes the i-th expert to have arbitrarily small weight in the combined model, effectively ignoring its prediction.
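The tempering effect of the exponent in equation (4) can be checked numerically: raising a Gaussian density to a power α and renormalizing yields another Gaussian whose precision is scaled by α, i.e. variance σ²/α, matching the algebra in equation (5) below. A small grid-based sketch, with illustrative values of our own choosing:

```python
import numpy as np

# Temper an (unnormalized) Gaussian N(m, var) by exponent alpha, then
# renormalize on a fine grid and measure the resulting mean and variance.
m, var, alpha = 1.0, 2.0, 3.0
y = np.linspace(-20.0, 20.0, 400001)
dy = y[1] - y[0]
p = np.exp(-0.5 * (y - m) ** 2 / var)       # proportional to N(m, var)
q = p ** alpha                              # tempered density p^alpha
q /= q.sum() * dy                           # renormalize numerically
mean_q = (y * q).sum() * dy                 # stays at m
var_q = ((y - mean_q) ** 2 * q).sum() * dy  # shrinks to var / alpha
```

With α > 1 the variance shrinks (the distribution sharpens), and with α < 1 it grows, as stated above.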
Another interesting property of gPoE is that if each p_i is Gaussian, then the resulting P(y | x) is still Gaussian, as in PoE. To see this, it suffices to show that p_i^{α_i} is Gaussian, which is apparent with a little algebraic manipulation:

p_i^{α_i} = exp(α_i ln p_i) = (1/C) exp( −½ (y − m_i)ᵀ (α_i Σ_i⁻¹) (y − m_i) )    (5)

This also shows that the power α_i essentially scales the precision of the i-th Gaussian. Therefore, similar to equations (2) and (3), the mean and covariance of the Gaussian gPoE are:

m(x) = ( ∑_i m_i(x) α_i(x) T_i(x) ) ( ∑_i α_i(x) T_i(x) )⁻¹    (6)

Σ(x) = ( ∑_i α_i(x) T_i(x) )⁻¹    (7)

2.3 gPoE for Gaussian processes

Now that we have established how α_i(x) can be used to control the influence of individual experts, there is a natural choice of α_i(x) for Gaussian processes that can reliably detect whether a particular GP expert generalizes well at a given point x: the change in entropy from prior to posterior at point x, ∆H_i(x). This takes almost no extra computation, since the posterior variance at x is already computed when the GP expert makes its prediction, and the prior variance is simply k(x, x), where k is the kernel used.

When the entropy change at point x is zero, the i-th expert has no information about this point coming from the training observations; therefore it should not contribute to the combined prediction, and our model achieves this because α_i(x) = 0 in (6) and (7). For Gaussian processes, this covers both the case where point x is far away from the training points and the case where the model is misspecified.¹ There are other quantities that could be used as α_i(x), for example the difference between the prior and posterior variances (instead of half the difference of the logs of the two variances).
The reason we choose the entropy change is that it is unit-less in the sense of dimensional analysis in physics, so that the resulting predictive variance in equation (7) has the correct unit. The same is not true if α_i(x) is the difference of variances, which carries the unit of variance (the squared unit of the variable y). The KL divergence between the prior and posterior distributions at point x could also potentially be used as α_i(x), but we find the entropy change to be already effective in our experiments. We will explore the use of KL divergence in future work.

3 Experiments

We compare gPoE against bagging, MoE, and PoE on three different datasets: KIN40K (8D feature space, 10K training points), SARCOS (21D, 44484 training points), and the UK apartment price dataset (2D, 64910 training points) used in the SVI-GP work of Hensman et al. [8]. We try three different ways to build individual GP experts: (SoD) a random subset of data; (local) a local GP around a randomly selected point; (tree) a tree-based construction, in which a ball tree [5] built on the training set recursively partitions the space, and at each level of the tree a random subset of data is drawn to build a GP. On all datasets and for all methods of GP expert construction, we use 256 data points for each expert, and construct 512 GP experts in total. Each GP expert uses a kernel that is the sum of an ARD kernel and a white kernel, and all hyperparameters are learned by scaled conjugate gradient. For MoE, we do not jointly learn the experts and gating functions, as this is very time-consuming; instead we use the same entropy change as the gating function (re-normalized to sum to one). Therefore, all experts in all combination schemes can be learned independently in parallel.
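Putting equations (6) and (7) together with the entropy-change weighting ∆H_i(x) = ½(log prior variance − log posterior variance) and the per-point normalization ∑_i α_i(x) = 1 used in the experiments, the gPoE prediction for a scalar output can be sketched as follows (the function names and NumPy interface are our own, not from the paper):

```python
import numpy as np

def entropy_change_weights(prior_vars, post_vars, normalize=True):
    """alpha_i(x) = dH_i(x) = 0.5 * (log prior_var - log post_var),
    the entropy change from GP prior to posterior at a test point,
    optionally re-normalized so that sum_i alpha_i(x) = 1.  An expert
    whose posterior variance equals its prior variance gets alpha = 0
    and is effectively ignored by the combination."""
    prior_vars = np.asarray(prior_vars, dtype=float)
    post_vars = np.asarray(post_vars, dtype=float)
    alpha = 0.5 * (np.log(prior_vars) - np.log(post_vars))
    if normalize and alpha.sum() > 0:
        alpha = alpha / alpha.sum()
    return alpha

def gpoe_combine(means, variances, alpha):
    """Gaussian gPoE for a scalar output, eqs. (6)-(7): the weighted
    precisions alpha_i * T_i add, and the combined mean is their
    weighted average of the expert means."""
    weighted_prec = np.asarray(alpha) / np.asarray(variances)  # alpha_i T_i
    total = weighted_prec.sum()                                # 1 / eq. (7)
    return (np.asarray(means) * weighted_prec).sum() / total, 1.0 / total
```

For example, with prior variances [1, 1], posterior variances [0.1, 1.0] and expert means [2.0, 100.0], the second expert has learned nothing at this point, receives α = 0, and the combined prediction is mean 2.0 with variance 0.1: the uninformed expert is filtered out automatically, which is exactly property (iv).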
On a 32-core machine with the described setup, training 512 GP experts with independent hyperparameter learning via scaled conjugate gradient on each dataset takes between 20 seconds and just under one minute, including the time for any preprocessing such as fitting the ball tree. For test performance evaluation, we use the commonly used metrics of standardized negative log probability (SNLP) and standardized mean squared error (SMSE). Tables 1 and 2 show that gPoE consistently outperforms the bagging, MoE and PoE combination rules by a large margin on both scores. Under the tree-based expert construction method, we explore a heuristic variant of gPoE (tree-gPoE), which uses only the experts that lie on the path from the root node to the leaf node containing the test point. Alternatively, this can be viewed as defining α_i(x) = 0 for all experts that are not on the root-to-leaf path. This variant gives a slight further boost to performance across the board.

¹ For the experiments, we also normalize the weighting factors at each point x so that ∑_i α_i(x) = 1.

Another interesting observation is that while gPoE performs consistently well, PoE is almost always poor, especially in SNLP score. This empirically confirms our earlier analysis that misguided over-confidence by experts is detrimental to the resulting PoE, and shows that the correction by entropy change in gPoE is an effective way to fix this problem.

Table 1: SMSE

                 SoD                                 Local                               Tree
        Bagging  MoE      PoE      gPoE     Bagging  MoE      PoE      gPoE     Bagging  MoE      PoE      gPoE     tree-gPoE
SARCOS  0.619    0.164    0.438    0.0603   0.685    0.119    0.619    0.0549   0.648    0.208    0.493    0.014    0.009
KIN40K  0.628    0.520    0.543    0.346    0.761    1.174    0.671    0.381    0.735    0.691    0.652    0.285    0.195
UK-APT  0.00219  0.00220  0.00218  0.00214  0.00316  0.00301  0.00315  0.00122  0.00309  0.00193  0.00310  0.00162  0.00144

Table 2: SNLP

                 SoD                                 Local                               Tree
        Bagging  MoE      PoE      gPoE     Bagging  MoE      PoE      gPoE     Bagging  MoE      PoE      gPoE     tree-gPoE
SARCOS  N/A      -0.528   205.27   -1.445   N/A      -1.432   3622.5   -2.456   N/A      -0.896   1305.46  -2.643   -2.77
KIN40K  N/A      -0.344   215.02   -0.542   N/A      0.6136   495.17   -0.518   N/A      -0.155   376.4    -0.643   -0.824
UK-APT  N/A      -0.175   244.06   -0.191   N/A      -0.215   805.4    -0.337   N/A      -0.235   627.07   -0.355   -0.410

Finally, as a testimony to the expressive power given by gPoE, we note that GP experts trained on only 256 points with very generic kernels can combine to give prediction performance close to or even superior to sophisticated sparse Gaussian process approximations such as stochastic variational inference (SVI-GP), as evidenced by the comparison in Table 3 for the UK-APT dataset. Note also that, due to parallelization, training in our case took less than 30 seconds on this problem of 64910 training points, although testing time for our gPoE is much longer than for sparse GP approximations. Similar results, competitive with or even superior to sophisticated FITC [6] or CholQR [7] approximations, are observed on SARCOS and KIN40K as well, but due to space and time constraints they are not included in this extended abstract and are left for future work. We would like to emphasize that such a comparison does not suggest gPoE as a silver bullet for beating benchmarks using any naive expert GP model; rather, it demonstrates the expressiveness of the resulting model and shows its potential to be used in conjunction with other sophisticated techniques for sparsification and automatic model selection.
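The subset-of-data (SoD) construction described above can be sketched end-to-end at toy scale. This is an illustrative reconstruction, not the paper's code: a plain RBF kernel with fixed hyperparameters stands in for the ARD-plus-white kernel with learned hyperparameters, and 8 experts of 64 points stand in for 512 experts of 256 points; all names are our own.

```python
import numpy as np

def rbf(A, B, ell=1.0, sf2=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_predict(Xs, ys, Xt, noise=1e-2):
    """Exact GP regression on a data subset: predictive mean, posterior
    variance, and prior variance (k(x, x) + noise) at the test points."""
    K = rbf(Xs, Xs) + noise * np.eye(len(Xs))
    Ks = rbf(Xs, Xt)
    L = np.linalg.cholesky(K)
    a = np.linalg.solve(L.T, np.linalg.solve(L, ys))   # K^{-1} y
    v = np.linalg.solve(L, Ks)
    prior_var = np.diag(rbf(Xt, Xt)) + noise
    post_var = prior_var - (v ** 2).sum(axis=0)
    return Ks.T @ a, post_var, prior_var

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(500)
Xt = np.array([[0.5]])                     # single test point

# SoD experts: independent exact GPs on random subsets of the data.
means, post, prior = [], [], []
for _ in range(8):
    idx = rng.choice(500, size=64, replace=False)
    m_, pv_, kv_ = gp_predict(X[idx], y[idx], Xt)
    means.append(m_[0]); post.append(pv_[0]); prior.append(kv_[0])
means, post, prior = map(np.array, (means, post, prior))

# gPoE fusion with normalized entropy-change weights, eqs. (6)-(7).
alpha = 0.5 * (np.log(prior) - np.log(post))
alpha /= alpha.sum()
prec = alpha / post
gpoe_mean = (means * prec).sum() / prec.sum()
gpoe_var = 1.0 / prec.sum()
```

Since each expert can be trained without seeing the others, the loop over experts parallelizes trivially, which is what makes the scheme scalable in practice.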
Table 3: RMSE on the UK-APT dataset. We use this score instead of SMSE and SNLP, as it is the measure used by Hensman et al. in [8]. A ∗ next to a method name indicates the number reported in the SVI-GP paper [8].

Method       RMSE
SoD-256      0.566
SoD-500∗     0.522 ± 0.018
SoD-800∗     0.510 ± 0.015
SoD-1000∗    0.503 ± 0.011
SoD-1200∗    0.502 ± 1.012
SVI-GP∗      0.426
SoD-gPoE     0.556
Local-gPoE   0.419
Tree-gPoE    0.484
Tree2-gPoE   0.456

4 Discussion and Conclusion

In this work, we proposed a principled way to combine the predictions of multiple independently learned GP experts without the need for further training. The combined model takes the form of a generalized product of experts, and the combined prediction is Gaussian and has desirable properties such as increased expressiveness and robustness to poor predictions by some of the experts. We showed that gPoE has many interesting qualities over other combination rules. However, one thing it cannot capture is multi-modality, as in mixtures of experts. In future work it would be interesting to explore a generalized product of mixtures of Gaussian processes, which captures both "or" constraints and "and" constraints. Another future direction is to explore other measures of model reliability for GPs. Finally, while the lack of change in entropy is an indicator of irrelevant prediction, the converse statement does not seem to be true: sufficient change in entropy does not necessarily guarantee reliable prediction, because of all the potential ways the model could be misspecified. However, our empirical results suggest that, at least with RBF kernels, the change in entropy is always reliable even when the estimated posterior variance itself is not accurate. Further theoretical work is needed to better understand the converse case.

Acknowledgments

We thank J. Hensman for providing his dataset for benchmarking, as well as the anonymous reviewer for insightful comments and questions about the reliability issue in the converse case.

References

[1] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.
[2] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[3] D. H. Wolpert. Stacked generalization. Neural Networks, 5:241–259, 1992.
[4] R. Urtasun. Motion Models for Robust 3D Human Body Tracking. PhD Thesis 3541, EPFL, 2006.
[5] S. M. Omohundro. Five balltree construction algorithms. International Computer Science Institute Technical Report, 1989.
[6] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. NIPS 18, pp. 1257–1264, 2006.
[7] Y. Cao, M. Brubaker, D. J. Fleet, A. Hertzmann. Efficient optimization for sparse Gaussian process regression. NIPS 2013.
[8] J. Hensman, N. Fusi, N. D. Lawrence. Gaussian processes for big data. NIPS 2013.
