Wavelet Decomposition of Gradient Boosting


Authors: Shai Dekel, Oren Elisha, Ohad Morgan

May 6, 2019

Abstract

In this paper we introduce a significant improvement to the popular tree-based Stochastic Gradient Boosting algorithm using a wavelet decomposition of the trees. This approach is based on harmonic analysis and approximation-theoretic elements, and as we show through extensive experimentation, our wavelet-based method generally outperforms existing methods, particularly in difficult scenarios of class imbalance and mislabeling in the training data.

1 Introduction

In the setting of regression and classification tasks on structured data, decision tree ensembles are extremely useful as off-the-shelf tools [18], being relatively fast to construct and adaptive; when the trees are small, they also produce interpretable models. Tree-based boosting methods [18, 10, 20] are popular ensemble methods constructed by a sequential generation of pruned decision trees. These 'weak learners' are then combined to form a 'strong' estimator. In this work we focus on improving the well-known Gradient Boosting (GB) technique [10, 11], which computes a sequence of trees, each trained to predict the residual between the response variable and the negative gradient direction of the previous tree. These residuals are used as the response variables for the next tree in the iterative process, so that the sum of the weighted trees in the ensemble is the final estimator.

Wavelets [5, 16] are a powerful, yet simple, tool for sparse representations of 'complex' functions. In [8], a wavelet-based GB method was introduced, using a classic wavelet decomposition of the original predictors, followed by a componentwise linear least-squares GB mechanism. Our wavelet approach is very different.
We use the theoretical foundation developed in [9] and apply a wavelet decomposition of the decision trees formed during the boosting process. This mathematical model allows us to design a more robust pruning algorithm by re-ordering the nodes of the trees based on their significance. As we will show in the experimental part, the adaptive wavelet pruning approach generally outperforms existing methods, particularly in difficult scenarios of class imbalance and mislabeling in the training data. In addition, we demonstrate how to employ our method in the stochastic GB version, which uses bagging to gain diversity, and use the Out-Of-Bag (OOB) samples to improve generalization.

The rest of the paper is organized as follows. In Section 2, we overview Gradient Boosting (GB) algorithms, in particular tree-based GB algorithms. In addition, we describe the stochastic GB algorithm, which combines a bagging procedure with GB. In Section 3, we present the "Geometric Wavelets" (GW) decomposition and highlight some theoretical and practical properties that emphasize their correspondence with sparsity. In Section 4, we present the Geometric Wavelets Gradient Boosting (GWGB) algorithm, which combines stochastic GB with the GW decomposition. In Section 5, we conclude with experimental results that compare our algorithm with competing boosting algorithms in different challenging settings.

2 Gradient boosting trees

2.1 Decision trees

In the setting of statistics and machine learning [18, 4], the construction we present in this section is referred to as a Decision Tree or the Classification and Regression Tree (CART).
When we are given a real-valued function or a discrete data-set

$$\{ x_i \in \Omega_0,\; y_i = f(x_i) \}_{i=1}^{m}, \qquad (1)$$

in some convex bounded domain $\Omega_0 \subset \mathbb{R}^n$, our goal is to find an efficient representation $\hat f(x)$ of this data, overcoming the complexity, geometry, and possibly non-smooth nature of the function values. The efficiency of $\hat f(x)$ is typically estimated by minimization of a loss function $L$ with respect to the data (1), usually combined with an additional regularization condition that aims to reduce the generalization error. For example, as seen in (5), a sparsity condition is applied to reduce overfitting artifacts.

The decision tree's first level is formed by a partition of the initial domain $\Omega_0$ into two sub-domains, e.g. by intersecting it with a hyperplane, so as to minimize a given cost function. This subdivision process then continues recursively on the nested sub-domains until some exit criterion is met, which in turn determines the leaves of the tree.

We now describe one instance of the cost function. At each stage of the subdivision process, at a certain node of the tree, the algorithm finds, for the convex domain $\Omega \subset \mathbb{R}^n$ associated with the node, a partition by a hyperplane into two convex sub-domains $\Omega', \Omega''$, and two multivariate low-order polynomials $Q_{\Omega'}, Q_{\Omega''}$ of fixed (typically low) total degree $r - 1$, that minimize the quantity

$$\| f - Q_{\Omega'} \|_{L_p(\Omega')}^p + \| f - Q_{\Omega''} \|_{L_p(\Omega'')}^p, \qquad \Omega' \cup \Omega'' = \Omega. \qquad (2)$$

If the data-set is discrete, consisting of feature vectors $x_i \in \mathbb{R}^n$ with response values $f(x_i)$, then a discrete functional is minimized:

$$\sum_{x_i \in \Omega'} | f(x_i) - Q_{\Omega'}(x_i) |^p + \sum_{x_i \in \Omega''} | f(x_i) - Q_{\Omega''}(x_i) |^p. \qquad (3)$$

Observe that for any given subdividing hyperplane, the approximating polynomials in (2) can be uniquely determined for $p = 2$ by least-squares minimization.
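For the constant case ($r = 1$) and $p = 2$, the cost (3) over candidate axis-aligned splits can be minimized by exhaustive search. A minimal sketch, assuming numpy; `best_axis_aligned_split` is a hypothetical name, not the paper's implementation:

```python
import numpy as np

def best_axis_aligned_split(X, y):
    """Exhaustively search axis-aligned hyperplanes x[dim] <= thr and
    return the split minimizing the p=2 cost in (3) with r=1
    (piecewise-constant fit, i.e. the mean on each side)."""
    m, n = X.shape
    best = (np.inf, None, None)  # (cost, dim, threshold)
    for dim in range(n):
        for thr in np.unique(X[:, dim])[:-1]:  # candidate thresholds
            left = X[:, dim] <= thr
            right = ~left
            # squared-error cost of approximating each side by its mean
            cost = ((y[left] - y[left].mean()) ** 2).sum() + \
                   ((y[right] - y[right].mean()) ** 2).sum()
            if cost < best[0]:
                best = (cost, dim, thr)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = (X[:, 0] > 0.5).astype(float)   # a step along feature 0
cost, dim, thr = best_axis_aligned_split(X, y)
print(dim, round(cost, 6))          # feature 0 separates the step perfectly
```

In practice one avoids the quadratic scan by sorting each feature once and updating the side means incrementally, but the objective being minimized is the same.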
For $r = 1$, the approximating polynomials are simply the means of the function values over each of the sub-domains:

$$Q_{\Omega'} = c_{\Omega'} = \frac{1}{\#\{x_i \in \Omega'\}} \sum_{x_i \in \Omega'} f(x_i), \qquad Q_{\Omega''} = c_{\Omega''} = \frac{1}{\#\{x_i \in \Omega''\}} \sum_{x_i \in \Omega''} f(x_i). \qquad (4)$$

Denote by $\Omega_t^j$ a node on level $j$ of the tree with counting index $t$. It is easy to see that for each fixed level $J$, $\Omega_0 = \bigcup_{t=1}^{2^J} \Omega_t^J$. Therefore, we can describe the tree evaluated at any fixed level $J$ by $T_J(x) = \sum_{t=1}^{2^J} Q_{\Omega_t^J}(x)\, \mathbb{1}_{\Omega_t^J}(x)$, or simply by $T(x)$ when evaluation is done on the terminal nodes or to a predefined level $J$. Here, $\mathbb{1}_\Omega(x) = 1$ if $x \in \Omega$ and $\mathbb{1}_\Omega(x) = 0$ if $x \notin \Omega$.

In classification problems, the input training set consists of data labeled with $P$ classes instead of function values. In this scenario, each input training point $x_i \in \mathbb{R}^n$ is assigned a class $C(x_i)$. To convert the problem to the same 'functional' setting described above, one assigns to each class $C$ the value of a vertex of the regular simplex consisting of $P$ vertices in $\mathbb{R}^{P-1}$ (all with equal pairwise distances). Thus, we may assume that the input data is in the form $\{x_i, y_i\}_{i=1}^m \in \left(\mathbb{R}^n, \mathbb{R}^{P-1}\right)$. In this case, if we choose approximation using constants ($r = 1$), then the calculated mean over any sub-domain $\Omega$ is in fact a point $\vec{E}_\Omega \in \mathbb{R}^{P-1}$ inside the simplex. Obviously, any value inside the multidimensional simplex can be mapped back to a class, along with an estimated confidence level, by calculating the closest vertex of the simplex to it. As will become obvious, these mappings can be applied to any wavelet approximation of functions taking multidimensional values in the simplex.

In many algorithms that are based on decision trees, the high dimensionality of the data does not allow searching through all possible subdivisions.
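The class-to-simplex encoding above can be sketched in numpy. For simplicity, this hypothetical sketch embeds the $P$ regular-simplex vertices in $\mathbb{R}^P$ (centered one-hot vectors, all pairwise distances $\sqrt{2}$) rather than constructing explicit $\mathbb{R}^{P-1}$ coordinates; decoding maps a point, e.g. a node mean $\vec{E}_\Omega$, to the nearest vertex:

```python
import numpy as np

def simplex_vertices(P):
    """Vertices of a regular simplex with P vertices, embedded in R^P
    as centered one-hot vectors. All pairwise distances equal sqrt(2),
    and the points span only a (P-1)-dimensional hyperplane."""
    return np.eye(P) - 1.0 / P

def encode(labels, P):
    """Map integer class labels to simplex vertices."""
    return simplex_vertices(P)[labels]

def decode(points, P):
    """Map arbitrary points (e.g. node means) back to classes by the
    nearest simplex vertex."""
    V = simplex_vertices(P)
    d = ((points[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

P = 3
labels = np.array([0, 1, 2, 1])
Y = encode(labels, P)
# the mean of several encodings lies inside the simplex; decoding
# recovers the majority class
mean = Y[[1, 3, 0]].mean(axis=0)   # two copies of class 1, one of class 0
print(decode(mean[None, :], P))    # → [1]
```

The distance of the decoded mean from the winning vertex also yields the confidence level mentioned above.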
As in our experimental results, one may restrict the subdivisions to the class of hyperplanes aligned with the main axes. In contrast, there are cases where one would like to consider more advanced forms of subdivision that take a certain hyper-surface form, such as conic sections. Our paradigm of wavelet decomposition can, in principle, support all of these forms.

2.2 Pruning decision trees

In many cases, the response variable $f(x_i)$ in (1) is obtained with noise of different types. Thus, bias-variance considerations (e.g. avoiding overfit) [18] encourage pruning techniques that restrict the size of a tree in various ways. Pre-pruning [15] involves a "termination condition" that determines when it is desirable to terminate some of the branches prematurely as the decision tree is generated. For example, in [3] a minimal node size is used as an exit criterion for tree generation. On the other hand, post-pruning [15] may be applied to remove some of the branches after the tree has been generated and can be evaluated. For example, in the CART algorithm [4], after a tree model has been generated, one applies a regularization condition with a factor $\gamma$ that penalizes adding more nodes, by minimizing

$$\sum_{i=1}^{m} (y_i - T(x_i))^2 + \gamma\, \#\left\{\Omega_t^j \in T(x),\; j = 1, \dots, J\right\}. \qquad (5)$$

As will be described in the next sections, while most boosting algorithms [18] set a fixed level $J$ as a pruning strategy, our GWGB method uses a different pruning approach that is based on the data encapsulated in each node rather than the level of the tree.

2.3 Gradient boosting

Gradient Boosting (GB) [10, 17, 18] is based on computing a sequence of weak learners, such as pruned decision trees, using a gradient-descent iterative method. A functional gradient view of boosting was first presented in [17].
This led to the development of boosting algorithms in many areas of machine learning and statistics, beyond regression and classification.

Let $X \in \mathbb{R}^n$ denote a real-valued random input vector and $Y \in \mathbb{R}^{P-1}$ a real-valued random output vector, with $Pr(X, Y)$ their joint distribution. In such a case we have a typically unknown function $f(X) = E[Y \mid X]$, so we may seek an approximating function $\hat f : \mathbb{R}^n \to \mathbb{R}^{P-1}$ based on the training samples (1). Ideally, the approximation $\hat f(x)$ should be the one that minimizes the expected prediction error with respect to some specified loss function $L(Y, f(X))$:

$$\hat f(X) = \arg\min_f E_{X,Y}\, L(Y, f(X)). \qquad (6)$$

Frequently employed loss functions $L$ include squared error $(y - f(x))^2$ and absolute error $|y - f(x)|$ for $y \in \mathbb{R}$ (regression), and negative binomial log-likelihood $\log\left(1 + e^{-2yf(x)}\right)$ when $y \in \{-1, 1\}$ (binary classification). Here, we present a unified approach for regression and classification and choose the squared error, as described in (2) and (3).

In the setting of a tree-based GB algorithm, the approximation function $\hat f(x)$ is a combination of $K + 1$ weak learners

$$\hat f_K(x) := \hat f_0(x) + \nu \sum_{k=1}^{K} T_k(x), \qquad (7)$$

where $K$ is the number of boosting iterations, $T_k(x)$ are the pruned trees, and $\nu$ is the step size (also called Learning Rate or Shrinkage). The function $\hat f_0(x)$ is typically an initial estimate such as $\hat f_0(x) = \arg\min_c \sum_{i=1}^{m} L(y_i, c)$. To generate the weak learners, one typically constructs $K$ decision trees $\{T_k\}_{k=1}^{K}$, while applying a fixed tree level $J$, in the following way. In each iteration $k$, a decision tree $T_k$ is built, so that residuals can be set as the response variables for the next, $(k+1)$-th, step: $\left\{y_i^{k+1} = \hat f_k(x_i) - y_i^k\right\}_{i=1}^{m}$ with $\left\{y_i^0 = y_i\right\}_{i=1}^{m}$.
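The boosting recursion (7) with squared loss can be sketched as follows, assuming numpy and using depth-1 trees (stumps) as the weak learners $T_k$; all names are hypothetical:

```python
import numpy as np

def fit_stump(X, y):
    """Weak learner: a depth-1 regression tree (best axis-aligned split,
    constant fit on each side)."""
    best = None
    for d in range(X.shape[1]):
        for thr in np.unique(X[:, d])[:-1]:
            L = X[:, d] <= thr
            sse = ((y[L] - y[L].mean()) ** 2).sum() + \
                  ((y[~L] - y[~L].mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, d, thr, y[L].mean(), y[~L].mean())
    _, d, thr, cl, cr = best
    return lambda Z: np.where(Z[:, d] <= thr, cl, cr)

def gradient_boost(X, y, K=50, nu=0.1):
    """Squared-loss GB as in (7): f_K = f_0 + nu * sum_k T_k, where each
    T_k is fitted to the current residuals."""
    f0 = y.mean()
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(K):
        T = fit_stump(X, y - pred)   # fit the current residuals
        pred += nu * T(X)
        trees.append(T)
    return lambda Z: f0 + nu * sum(T(Z) for T in trees)

rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1]
f = gradient_boost(X, y)
rmse = np.sqrt(((f(X) - y) ** 2).mean())
print(round(rmse, 3))   # training RMSE falls below the std of y
```

Each iteration shrinks the training error, since adding $\nu$ times a least-squares fit of the residual reduces the squared loss for any $\nu \in (0, 1]$.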
This iterative procedure resembles gradient descent in the sense that at each iteration we set the next step in the opposite direction of the pseudo-gradient of the loss function $L$. As described in [18], it is common practice to set $4 \le J \le 8$ to obtain good results in the context of boosting. As we shall see, our approach is to choose a higher level $J$ and then apply the wavelet-based approach of pruning specific nodes. The selection of $K$ should balance reducing the training error against overfitting as $K \to \infty$. Thus, [19] uses validation samples or OOB samples (when applying the stochastic version of GB) to find the optimal $K$.

A modification to the tree-based GB algorithm, called Stochastic GB (SGB), was proposed in [11] by applying a bagging step at each iteration that uses only a random subsample of the training data. This randomly selected subsample is then used, instead of the full sample, to fit the regression tree. The OOB samples are then used for validation. In our algorithm, as in previous algorithms, the OOB samples are used to determine the pruning, not only for validation.

3 Geometric wavelets

The Geometric Wavelet (GW) decomposition of decision trees was presented in [6], based on the theory of [13, 14]. It was recently generalized to a decomposition of Random Forests [9] and used to enhance their performance as well as to introduce a novel algorithm for feature importance.

Let $\Omega'$ be a child of $\Omega$ in a decision tree $T$, i.e. $\Omega' \subset \Omega$ and $\Omega'$ has been created by a partition of $\Omega$, and let $Q_{\Omega'}, Q_{\Omega''}$ be the two polynomials that minimize the quantity (2). We use the polynomial approximations $Q_{\Omega'}, Q_\Omega \in \Pi_{r-1}(\mathbb{R}^n)$ and define

$$\psi_{\Omega'} := \psi_{\Omega'}(f) := \mathbb{1}_{\Omega'}\left(Q_{\Omega'} - Q_\Omega\right) \qquad (8)$$

as the geometric wavelet associated with the sub-domain $\Omega'$ and the function $f$, or the given discrete data-set (1).
Each wavelet $\psi_{\Omega'}$ is a 'local difference' component that belongs to the detail space between two levels in the tree, a 'low resolution' level associated with $\Omega$ and a 'high resolution' level associated with $\Omega'$. Also, the wavelets (8) have the 'zero moments' property, i.e., if the response variable is sampled from a polynomial of degree $r - 1$ over $\Omega$, then our local scheme will compute $Q_{\Omega'} = Q_\Omega = f(x)$, $\forall x \in \Omega'$, and therefore $\psi_{\Omega'} = 0$.

Under certain mild conditions on the tree $T$ and the function $f$, we have, by the nature of the wavelets, the 'telescopic' sum of differences

$$f = \sum_{\Omega \in T} \psi_\Omega, \quad \text{where } \psi_{\Omega_0} := Q_{\Omega_0}. \qquad (9)$$

For example, (9) holds in the $L_p$-sense ($1 \le p < \infty$) if $f \in L_p(\Omega_0)$ and, for any $x \in \Omega_0$ and series of domains $\Omega^j \in T$, each on a level $j$ with $x \in \Omega^j$, we have that $\lim_{j \to \infty} \mathrm{diam}(\Omega^j) = 0$ (see Theorem 2.1 in [6]).

In the theoretical setting, the norm of a wavelet is computed by

$$\| \psi_{\Omega'} \|_2^2 = \int_{\Omega'} \left(Q_{\Omega'}(x) - Q_\Omega(x)\right)^2 dx, \qquad (10)$$

and in the discrete case by

$$\| \psi_{\Omega'} \|_2^2 = \sum_{x_i \in \Omega'} | Q_{\Omega'}(x_i) - Q_\Omega(x_i) |^2, \qquad (11)$$

where $\Omega'$ is a child of $\Omega$. This wavelet norm tells us how much information the wavelet encapsulates (see [6], [9]).

Recall that our approach converts classification problems into a 'functional' setting by assigning the $P$ class labels to vertices of a simplex in $\mathbb{R}^{P-1}$ (see the discussion in 2.1). In such cases of multi-valued functions, choosing $r = 1$, the wavelet $\psi_{\Omega'} : \mathbb{R}^n \to \mathbb{R}^{P-1}$ is $\psi_{\Omega'} = \mathbb{1}_{\Omega'}\left(\vec{E}_{\Omega'} - \vec{E}_\Omega\right)$, and its norm is given by

$$\| \psi_{\Omega'} \|_2^2 = \left\| \vec{E}_{\Omega'} - \vec{E}_\Omega \right\|_{l_2}^2 \, \#\{x_i \in \Omega'\},$$

where for $\vec{v} \in \mathbb{R}^{P-1}$, $\|\vec{v}\| := \sqrt{\sum_{i=1}^{P-1} v_i^2}$. Accordingly, we can consider the squared error as the loss function for classification problems.

It is easy to see that the decision tree $T$ can be written as

$$T(x) = \sum_{\Omega \in T} \psi_\Omega(x). \qquad (12)$$

The theory (see Theorem 4 in [6]) tells us that sparse approximation is achieved by ordering the wavelet components based on their norm:

$$\|\psi_{\Omega_{k_1}}\|_2 \ge \|\psi_{\Omega_{k_2}}\|_2 \ge \|\psi_{\Omega_{k_3}}\|_2 \ge \cdots \qquad (13)$$

Figure 1: Illustration of greedy node selection by wavelet norms.

Thus, the adaptive $M$-term approximation of a decision tree $T$ is

$$T_M(x) := \sum_{j=1}^{M} \psi_{\Omega_{k_j}}(x). \qquad (14)$$

This pruning method is, in some sense, a generalization of the classical $M$-term wavelet sum, where the wavelets are constructed over dyadic cubes (see [7]).

4 Wavelet decomposition of gradient boosting

In this section, we introduce a combination of the Stochastic GB Tree algorithm and the Geometric Wavelet decomposition. In our setting, instead of a decision tree $T$ pruned at some fixed level $J$ as the weak learner, we retrieve the $M$ "most important" nodes, in terms of wavelet norm (see (13) and (14)), which form the $M$-term approximation $T_M$. This $M$-term approximation is used as the weak learner at each boosting iteration. To select $M$ at each iteration, we use the OOB data as in Section 2 and add the wavelets one by one, in order of wavelet norm, until the error of the model is minimized on the OOB set. Our implementation, "Geometric Wavelets Gradient Boosting", is described in Algorithm 1.

One of the advantages of this form of tree pruning is the fact that nodes are selected according to their contribution to the prediction (9), rather than their position at a certain level in the tree. This allows an adaptive selection of high and low resolutions at the same step of the boosting. An illustration of an $M$-term GW collection whose graph representation includes some unconnected components is shown in Figure 1. The $M$-term nodes are marked in red, while the rest of the nodes in the tree are not used for the estimation.
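The decomposition (12), the ordering (13), and the truncation (14) can be illustrated for constant fits ($r = 1$), where each wavelet is the indicator of a node times the difference between the node mean and its parent's mean, with discrete norm (11) equal to $|c_{\Omega'} - c_\Omega|^2 \cdot \#\{x_i \in \Omega'\}$. A toy sketch assuming numpy; `build_tree` and `m_term` are hypothetical helper names:

```python
import numpy as np

def build_tree(X, y, idx, depth, parent_mean, wavelets):
    """Grow an axis-aligned tree and record, for every node, the
    geometric wavelet (8) with r = 1: psi = 1_node * (node mean - parent
    mean); the root gets psi_0 = Q_0 as in (9)."""
    mean = y[idx].mean()
    wavelets.append((abs(mean - parent_mean) ** 2 * len(idx),  # norm (11)
                     idx, mean - parent_mean))
    if depth == 0 or len(idx) < 4:
        return
    best = None  # greedy best axis-aligned split, squared-error cost (3)
    for d in range(X.shape[1]):
        for thr in np.unique(X[idx, d])[:-1]:
            L = idx[X[idx, d] <= thr]
            R = idx[X[idx, d] > thr]
            sse = ((y[L] - y[L].mean()) ** 2).sum() + \
                  ((y[R] - y[R].mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, L, R)
    _, L, R = best
    build_tree(X, y, L, depth - 1, mean, wavelets)
    build_tree(X, y, R, depth - 1, mean, wavelets)

def m_term(X, y, depth, M):
    """Sort wavelets by norm (13) and keep the M largest: the adaptive
    M-term approximation (14), evaluated here on the training points."""
    wavelets = []
    build_tree(X, y, np.arange(len(y)), depth, 0.0, wavelets)
    wavelets.sort(key=lambda w: -w[0])
    pred = np.zeros(len(y))
    for _, idx, delta in wavelets[:M]:
        pred[idx] += delta       # truncated telescoping sum (9)
    return pred

rng = np.random.default_rng(2)
X = rng.uniform(size=(400, 2))
y = (X[:, 0] > 0.5).astype(float) + 0.05 * rng.standard_normal(400)
pred = m_term(X, y, depth=4, M=8)
print(round(np.sqrt(((pred - y) ** 2).mean()), 3))
```

Keeping all wavelets reproduces the full tree exactly (the sums telescope to the leaf means); truncating at $M$ drops the low-norm nodes first, regardless of their depth.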
Another advantage of relying on the wavelet norms for pruning in the GB setting is an efficient feature selection for the ensemble. In some cases, explanatory attributes may be non-descriptive and even noisy, leading to the creation of problematic nodes in the decision trees. Nevertheless, in these cases the corresponding wavelet norms are controlled, and these nodes can be omitted from the representation (14). An example that demonstrates this phenomenon is presented in [9] (see Example 1). The example shows that, with high probability, the wavelets associated with the correct variables have relatively higher norms than wavelets associated with non-descriptive variables. Hence the wavelet-based criterion will choose, with high probability, the correct variable. Since the tree partitions are based on (2) and (13), the non-descriptive variables are less likely to form partitions that are part of the GWGB ensemble.

Algorithm 1: Geometric Wavelets Gradient Boosting

1. Initialize $\hat f_0(x) = \arg\min_c \sum_{i=1}^{m} L(y_i, c)$.

2. Set $\left\{y_i^0 = y_i\right\}_{i=1}^{m}$.

3. For $k = 1, 2, \dots, K$:

   (a) Update the residuals $\left\{y_i^k = \hat f_{k-1}(x_i) - y_i\right\}_{i=1}^{m}$.

   (b) Choose randomly a subset of $m'$ samples from the original data-set, denoted by $\left\{x_i, y_i^k\right\}_{i=1}^{m'}$. Based on this training set, generate a tree $T(x) = \sum_j \psi_{\Omega_{k_j}}$, where $\left\{\psi_{\Omega_{k_j}}\right\}_j$ are the GWs sorted by wavelet norm (see (13)).

   (c) Denote the OOB subset by $OOB = \left\{(x, y) \mid (x, y) \notin \{x_i, y_i\}_{i=1}^{m'}\right\}$, and compute

   $$M_k = \arg\min_M \sum_{(x,y) \in OOB} L\Big(y^k, \sum_{j=1}^{M} \psi_{\Omega_{k_j}}(x)\Big).$$

   (d) Update the prediction model: $\hat f_k(x) = \hat f_{k-1}(x) + \nu \sum_{j=1}^{M_k} \psi_{\Omega_{k_j}}(x)$.

4. Output $\hat f(x) = \hat f_K(x)$.
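A minimal Python sketch of the shape of Algorithm 1 (not the authors' C# implementation): `gwgb` and `stump_wavelets` are hypothetical names, depth-1 trees stand in for the deeper trees used in the paper, and squared loss is assumed throughout.

```python
import numpy as np

def stump_wavelets(X, y):
    """Fit a depth-1 tree and return its geometric wavelets, each as
    (squared norm (11), predict_fn): the root constant plus a 'delta'
    on each side of the best axis-aligned split, sorted as in (13)."""
    c = y.mean()
    best = None
    for d in range(X.shape[1]):
        for thr in np.unique(X[:, d])[:-1]:
            L = X[:, d] <= thr
            sse = ((y[L] - y[L].mean()) ** 2).sum() + \
                  ((y[~L] - y[~L].mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, d, thr)
    _, d, thr = best
    L = X[:, d] <= thr
    dl, dr = y[L].mean() - c, y[~L].mean() - c
    w = [(c * c * len(y),     lambda Z, c=c: np.full(len(Z), c)),
         (dl * dl * L.sum(),  lambda Z, d=d, thr=thr, dl=dl:
                                  np.where(Z[:, d] <= thr, dl, 0.0)),
         (dr * dr * (~L).sum(), lambda Z, d=d, thr=thr, dr=dr:
                                  np.where(Z[:, d] > thr, dr, 0.0))]
    return sorted(w, key=lambda t: -t[0])   # wavelet-norm ordering (13)

def gwgb(X, y, K=40, nu=0.1, frac=0.8, seed=0):
    """Sketch of Algorithm 1: at each iteration fit a tree to the current
    residuals on an in-bag subsample, order its wavelets by norm, and keep
    the M_k-term prefix minimizing squared error on the OOB samples."""
    rng = np.random.default_rng(seed)
    pred = np.zeros(len(y))
    ensemble = []
    for _ in range(K):
        r = y - pred
        inbag = rng.random(len(y)) < frac
        w = stump_wavelets(X[inbag], r[inbag])
        best_M, best_err = 1, np.inf       # step (c): choose M_k on OOB
        for M in range(1, len(w) + 1):
            oob_pred = sum(f(X[~inbag]) for _, f in w[:M])
            err = ((r[~inbag] - oob_pred) ** 2).mean()
            if err < best_err:
                best_M, best_err = M, err
        kept = [f for _, f in w[:best_M]]
        ensemble.append(kept)
        pred += nu * sum(f(X) for f in kept)   # step (d)
    return lambda Z: nu * sum(f(Z) for fs in ensemble for f in fs)

rng = np.random.default_rng(3)
X = rng.uniform(size=(300, 2))
y = np.where(X[:, 0] > 0.6, 1.0, -1.0)
f = gwgb(X, y)
rmse = np.sqrt(((f(X) - y) ** 2).mean())
print(round(rmse, 3))
```

The prefix search over $M$ is exactly step (c): because the wavelets are pre-sorted by norm, the OOB set only decides how far down the ordering to keep nodes, independent of their depth in the tree.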
5 Experimental results

In this section we compare the Geometric Wavelets GB algorithm (GWGB, Algorithm 1) with other boosting and bagging methods in terms of classifying imbalanced datasets, improving regression tasks, and overcoming mislabeling noise in classification tasks. Our GWGB code was written in C# and is publicly available at https://github.com/ohadmorgan/GeometricWaveletGradeintBoosting.git. At each iteration we use the OOB technique as described in Section 2, with 80% of the training set used to build the wavelet tree and 20% for M-term selection. Moreover, a fixed step size $\nu$ of 0.1 is used throughout all of our experiments.

5.1 Classification with imbalanced class distributions

Classification with data-sets that suffer from imbalanced class distributions is a challenging problem in the field of machine learning. We present a comparison of our algorithm's performance with state-of-the-art ensemble-based techniques for imbalanced data-sets presented in [12]. The experiment is based on testing a variety of methodologies on 44 real-world imbalanced problems from the KEEL data-set repository [1]. We use the same 5-fold cross-validation data and partitions that are provided in [12] to measure the AUC (Area Under the Curve) metric. Moreover, at each iteration we grow the tree to a fixed depth level of 8 and select $M_k$ terms according to our algorithm, while using the same $K = 10$ that was used in [12]. Comparison results, including the GWGB algorithm, are presented in Table 1.
Since the authors of [12] reviewed 37 different bagging, boosting, and classic algorithms, for brevity and space limitation we present a comparison of our method to the best algorithm in each category (in terms of mean AUC) for each data set. The success of our method is due to the fact that we could build deeper trees and reach high-resolution areas where the rare categories might occur, and select these nodes in early stages of the ensemble (lower $K$). This is the advantage of reordering the wavelets according to their norm, which enables a pruning strategy that does not depend on the tree's depth or level.

5.2 Regression

In this section we compare our method with the most recent boosting schemes presented in [20]. We follow the same randomization process presented in [20], with 20 random trials of 2-fold cross-validation, and we have followed the same technique for adaptive selection of the number of iterations as in [20], choosing the best $k \in [0, 500]$ in terms of RMSE on the validation set. As in the previous section, at each iteration we grow the tree to a fixed depth level of 8 and select $M_k$ terms according to our algorithm. The results presented in Table 2 are the average RMSE and standard deviation over the 20 random trials, compared to the three best algorithms from [20].

5.3 Overcoming mislabeling noise in classification

Boosting methods are known to be sensitive to label noise [2]. The experiment is based on the same testing methodology presented in [2], with the injection of two noise levels (NL) of random 10% and 30% of the original labels in the datasets. Averages and standard deviations of the misclassification rates are computed from 10-fold cross-validation. As in [2], we have restricted the number of iterations to $K = 150$ and restricted the tree level to 2.
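The noise-injection protocol can be sketched as follows (a hypothetical helper, assuming binary labels in $\{-1, +1\}$):

```python
import numpy as np

def inject_label_noise(y, noise_level, seed=0):
    """Flip a random fraction `noise_level` of binary {-1, +1} labels,
    as in the mislabeling experiments of Section 5.3."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    flip = rng.choice(len(y), size=int(noise_level * len(y)), replace=False)
    y[flip] = -y[flip]
    return y

y = np.ones(100)
y_noisy = inject_label_noise(y, 0.3)
print((y_noisy != y).sum())   # → 30
```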
The first method (rAdaBoost) is a modification of the AdaBoost algorithm, using "robust classifiers" that are combined and boosted using the known AdaBoost algorithm. The next two methods (rBoost-Fixed $\gamma$ and rBoost) are new robust boosting algorithms where the objective function is a convex combination of two exponential losses. The results for the well-known GentleBoost (GBoost) and ModestBoost (MBoost) are also taken from [2] and presented in Table 3.

The reason for the improved results of our method in the case of mislabeling noise is the removal of wavelets with small GW norm. As seen in (11), the magnitude of a wavelet norm that corresponds to a single point (a typical mislabeled point) is small, and hence such a wavelet will typically be pruned while informative nodes on the same level of the tree are kept.

Table 1: Class imbalance results comparison (AUC)

Dataset            Best Bagging-   Best Boosting-   Best Classic    Geometric Wavelets
                   based (UB4)     based (RUS1)     method (SMT)    (GWGB)
glass1             0.737           0.763            0.737           0.816
ecoli0vs1          0.980           0.969            0.973           0.986
Wisconsin          0.960           0.964            0.953           0.985
Pima               0.760           0.726            0.725           0.809
Iris0              0.990           0.990            0.990           1.000
glass0             0.814           0.813            0.775           0.880
yeast1             0.722           0.719            0.709           0.775
vehicle1           0.787           0.747            0.730           0.810
vehicle2           0.964           0.970            0.950           0.982
vehicle3           0.802           0.765            0.728           0.805
Haberman           0.664           0.655            0.616           0.651
glass0123vs456     0.904           0.930            0.923           0.960
vehicle0           0.952           0.958            0.919           0.982
ecoli1             0.900           0.883            0.911           0.951
new-thyroid2       0.958           0.938            0.966           0.996
new-thyroid1       0.964           0.958            0.963           0.993
ecoli2             0.884           0.899            0.811           0.918
segment0           0.988           0.993            0.993           0.987
glass6             0.904           0.918            0.884           0.935
yeast3             0.934           0.925            0.891           0.957
ecoli3             0.908           0.856            0.812           0.923
page-blocks0       0.958           0.948            0.950           0.990
yeast2vs4          0.936           0.933            0.859           0.981
yeast05679vs4      0.794           0.803            0.760           0.863
vowel0             0.947           0.943            0.951           0.988
glass016vs2        0.754           0.617            0.606           0.720
glass2             0.769           0.780            0.639           0.690
ecoli4             0.888           0.942            0.779           0.906
shuttle0vs4        1.000           1.000            1.000           1.000
yeast1vs7          0.786           0.715            0.700           0.760
glass4             0.846           0.915            0.887           0.963
page-blocks13vs4   0.978           0.987            0.996           0.992
abalone9vs18       0.719           0.693            0.628           0.827
glass016vs5        0.943           0.989            0.813           0.946
shuttle2vs4        1.000           1.000            0.992           0.994
yeast1458vs7       0.606           0.567            0.537           0.594
glass5             0.949           0.943            0.881           0.982
yeast2vs8          0.783           0.789            0.834           0.616
yeast4             0.855           0.812            0.712           0.865
yeast1289vs7       0.734           0.721            0.683           0.765
yeast5             0.952           0.959            0.934           0.968
ecoli0137vs26      0.745           0.794            0.814           0.814
yeast6             0.869           0.823            0.829           0.876
Abalone19          0.721           0.631            0.521           0.594

Table 2: Regression results comparison (average RMSE ± standard deviation)

                   Decision stumps                              Vanilla neural networks                      GWGB
Dataset            R-Boosting   ε-Boosting   RT-Boosting        R-Boosting   ε-Boosting   RT-Boosting
Diabetes           58.71±1.2    58.94±1.9    58.61±2.7          58.03±2.3    58.03±2.4    58.03±2.4          57.01±3.0
Housing            4.13±0.3     4.33±0.2     4.14±0.4           4.02±0.3     4.45±0.3     4.45±0.3           3.38±0.4
CCS                5.47±0.1     6.10±0.7     5.35±0.3           6.59±0.4     6.62±0.2     6.52±0.3           4.80±0.4
Abalone            2.28±0.02    2.40±0.05    2.28±0.05          2.12±0.05    2.10±0.03    2.13±0.02          2.17±0.06

Table 3: Misclassification results comparison

Dataset    NL    r-AdaBoost   rBoost-Fixed γ   rBoost       GBoost       MBoost       GWGB
Banana     0.1   86.87±1.1    87.06±0.9        87.04±0.9    83.91±1.6    78.13±3.4    87.60±1.6
           0.3   85.27±3.0    85.53±2.1        85.06±2.7    79.38±1.6    75.31±2.5    85.49±1.3
PID        0.1   74.20±2.3    74.37±1.5        74.80±2.4    72.60±2.0    75.67±1.9    74.21±6.2
           0.3   72.53±1.9    70.43±2.4        71.43±2.3    69.40±2.9    73.33±2.3    75.65±7.5
Heart      0.1   78.40±3.1    79.70±3.5        79.10±4.4    76.40±3.1    77.60±3.5    80.74±8.2
           0.3   78.50±4.0    77.40±6.5        78.10±4.3    70.00±5.5    75.20±3.7    73.70±11.5
Two-Norm   0.1   95.70±0.8    95.58±0.9        95.59±0.7    90.35±1.0    92.79±0.5    96.40±0.8
           0.3   93.33±0.9    93.13±1.3        93.40±1.1    83.94±2.0    91.16±0.9    94.82±0.7

References

[1] J. Alcalá, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera.
Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing, 17(2-3):255–287, 2010.

[2] Jakramate Bootkrajang and Ata Kabán. Boosting in the presence of label noise. Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI 2013), 2013.

[3] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, October 2001.

[4] Leo Breiman, Jerome Friedman, Charles J. Stone, and Richard A. Olshen. Classification and Regression Trees. CRC Press, 1984.

[5] Ingrid Daubechies. Ten Lectures on Wavelets. SIAM, 1992.

[6] S. Dekel and D. Leviatan. Adaptive multivariate approximation using binary space partitions and geometric wavelets. SIAM Journal on Numerical Analysis, 43:707–732, 2005.

[7] Ronald A. DeVore. Nonlinear approximation. Acta Numerica, 7:51–150, 1998.

[8] Eugene Dubossarsky, Jerome H. Friedman, John T. Ormerod, and Matthew P. Wand. Wavelet-based gradient boosting. Statistics and Computing, 26(1-2):93–105, 2016.

[9] Oren Elisha and Shai Dekel. Wavelet decompositions of random forests - smoothness analysis, sparse approximation and applications. Journal of Machine Learning Research, 17(198):1–38, 2016.

[10] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232, 2000.

[11] Jerome H. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, February 2002.

[12] Mikel Galar, Alberto Fernandez, Edurne Barrenechea, Humberto Bustince, and Francisco Herrera. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4):463–484, 2012.
[13] Borislav Karaivanov and Pencho Petrushev. Nonlinear piecewise polynomial approximation beyond Besov spaces. Applied and Computational Harmonic Analysis, pages 177–223, 2003.

[14] Borislav Karaivanov, Pencho Petrushev, and Robert C. Sharpley. Algorithms for nonlinear piecewise polynomial approximation: Theoretical aspects. Transactions of the American Mathematical Society, 355:2585–2631, 2002.

[15] Sotiris B. Kotsiantis. Decision trees: a recent overview. Artificial Intelligence Review, pages 1–23, 2013.

[16] Stephane Mallat. A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, 2008.

[17] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Boosting algorithms as gradient descent in function space, 1999.

[18] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning, second edition. Springer, 2009.

[19] Greg Ridgeway with contributions from others. Package gbm, 2.1.1 edition, March 2015.

[20] Lin Xu, Shaobo Lin, Yao Wang, and Zongben Xu. Shrinkage degree in L2-rescale boosting for regression. IEEE Transactions on Neural Networks and Learning Systems, 2016.

Shai Dekel is with WIX AI and the School of Mathematics, Tel-Aviv University, Tel-Aviv. shaidekel6@gmail.com

Oren Elisha is with Microsoft Israel and the School of Mathematics, Tel-Aviv University, Tel-Aviv. orenelis@gmail.com

Ohad Morgan is with the School of Mathematics, Tel-Aviv University, Tel-Aviv. ohadmorgan1989@gmail.com
