Evolutionary algorithms for constructing an ensemble of decision trees


Authors: Evgeny Dolotov, Nikolai Zolotykh

Evgeny Dolotov 1,2 and Nikolai Zolotykh 1,2

1 Yandex, Moscow, Russia
2 National Research University Higher School of Economics, Moscow, Russia
{dolotov-e, nikzolotykh}@yandex-team.ru

Abstract. Most decision tree induction algorithms are based on a greedy top-down recursive partitioning strategy for tree growth. In this paper, we propose several methods for the induction of decision trees and their ensembles based on evolutionary algorithms. The main difference of our approach is the use of a real-valued vector representation of a decision tree, which allows a large number of different optimization algorithms to be applied and the whole tree or ensemble to be optimized at once, avoiding local optima. Differential evolution and evolution strategies were chosen as the optimization algorithms, since they have shown good results in reinforcement learning problems. We test the predictive performance of these methods on several public UCI data sets, and the proposed methods show better quality than classical methods.

Keywords: classification · decision tree induction · evolutionary algorithm · differential evolution

1 Introduction

Decision trees are a popular machine learning method for solving classification and regression problems, and because of their popularity many algorithms exist to build them [1,2]. However, constructing an optimal or near-optimal decision tree is a very complex task. Most decision tree induction algorithms are based on a greedy top-down recursive partitioning strategy for tree growth. They use different variants of impurity measures, such as information gain [2], gain ratio [3], the Gini index [4] and distance-based measures [5], to select the input attribute associated with an internal node. One major drawback of greedy search is that it usually leads to sub-optimal solutions.
The underlying reason is that the local decisions at each node are in fact interdependent and cannot be found in this way. A popular approach that can partially solve these problems is the induction of decision trees through evolutionary algorithms (EAs) [15]. In this approach, each individual represents a solution to the classification problem. Each solution is evaluated by a fitness function, which measures its quality. At each new generation, the best solutions have a higher probability of being selected for reproduction. The selected solutions undergo operations inspired by genetics, such as crossover and mutation, producing new solutions that replace the parents and form a new population. This process is repeated until a stopping criterion is satisfied. Instead of a local search, EAs perform a robust global search in the space of candidate solutions. As a result, EAs tend to cope better with attribute interactions than greedy methods and to avoid local optima.

In this paper we propose an approach that encodes a decision tree as a homogeneous real-valued vector: feature indices are also encoded by real numbers and decoded using a minimum-finding operation. This allows a large number of different optimization algorithms to be used, such as differential evolution [6] and evolution strategies [7].

2 Related work

The number of proposed evolutionary algorithms for decision tree induction has grown in the past few years, mainly because they report good predictive accuracy while keeping the comprehensibility of decision trees. The two most common approaches to encoding decision trees for evolutionary algorithms are tree-based encoding and fixed-length vector encoding.
They differ in how they encode feature indices, threshold values, leaves, and the operators in nodes. The main differences among tree-based approaches are the presence of pointers to nodes and the ability to encode trees of various sizes. Axis-parallel decision trees are the most common type found in the literature, mainly because this type of tree is usually much easier to interpret than an oblique tree. A node in an axis-parallel decision tree can be described by two parameters: the index of the tested feature and a threshold value. A popular approach [8] encodes each node with one integer and one real number, but this yields a heterogeneous and more complex representation of the decision tree than the one proposed in this article, which makes finding the optimal solution harder. The authors of [9] describe a very similar approach that encodes oblique decision trees with real-valued vectors and optimizes them with differential evolution, but here we propose a more compact representation specifically for axis-parallel decision trees. A more detailed overview of evolutionary methods for constructing decision trees can be found in [10].

3 Proposed approach

In this paper we propose a new approach to constructing axis-parallel decision trees for classification problems using evolutionary algorithms.

3.1 Real-valued vector representation

In axis-parallel trees, each node splits the data set according to the following rule:

    f(x) = 1, if a_i <= t;  0, otherwise.    (1)

Thus, each node of the tree is described by two parameters: the index of a feature and a threshold value. Suppose we have a fixed-length real-valued vector with values in the segment [0, 1].
This vector consists of two parts of equal length: the first part encodes feature indices, and the second part encodes threshold values. Also suppose that all features of the objects belong to the segment [0, 1]; if this is not the case, we normalize the features using the maximum and minimum values from the training data set. To restore the index of a feature from the vector, we find the position of the minimum value in the first part of the vector and take its remainder after integer division by the number of features. The value at the same position in the second part of the vector is used as the threshold in the corresponding node. Then the next smallest value in the first part is found, together with the corresponding feature index and threshold, and this operation is repeated until the entire vector is used. Using the feature indices and thresholds for all nodes, the decision tree without leaves is built by sequentially adding the nodes. After that, the leaves are added to the tree using the training data set and the majority rule. Thus, we can construct a decision tree from a real-valued vector and evaluate its characteristics.

3.2 Differential evolution

Differential evolution (DE) [6] is an effective evolutionary algorithm designed to solve optimization problems with real-valued parameters. A population in DE consists of N individuals:

    P = {x_1, x_2, ..., x_N}.    (2)

The j-th value of the individual x_i in the initial population is calculated as

    x_ij = x_j^min + r (x_j^max - x_j^min),    (3)

where r in [0, 1] is a uniformly distributed random number. The evolutionary process implements an iterative scheme to evolve the initial population. At each iteration of this process, known as a generation, a new population of individuals is generated from the previous one.
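As an illustration, the vector-to-splits decoding of Section 3.1 and the initialization rule (3) might be sketched as follows (a minimal sketch with x_min = 0 and x_max = 1; the function and variable names are ours, not the authors'):

```python
import numpy as np

def decode_splits(vec, n_features):
    """Recover (feature index, threshold) pairs from a real-valued vector.

    The first half of `vec` encodes feature indices, the second half
    thresholds: positions are read off in order of increasing value in
    the first half, and each position is mapped to a feature by taking
    its remainder modulo the number of features (Section 3.1)."""
    half = len(vec) // 2
    keys, thresholds = vec[:half], vec[half:]
    splits = []
    for pos in np.argsort(keys):  # successive minima of the first half
        splits.append((pos % n_features, thresholds[pos]))
    return splits

# Initial DE population per Eq. (3); with x_min = 0 and x_max = 1
# the individual reduces to the uniform random number r itself.
rng = np.random.default_rng(0)
N, d = 20, 10                      # population size, vector length
population = rng.random((N, d))
splits = decode_splits(population[0], n_features=4)
```

Each such list of splits would then be turned into a tree by sequentially adding nodes and attaching majority-rule leaves, as described above.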
Each individual is used to build a new vector by applying the mutation and crossover operators:

- Mutation. Three randomly chosen individuals are linearly combined as follows:

    v_i = x_{j1} + alpha (x_{j2} - x_{j3}),    (4)

where alpha is a user-specified constant.

- Crossover. The mutated vector is recombined with the target vector to build the trial vector:

    u_ij = v_ij, if r <= CR or j = l;  x_ij, otherwise,    (5)

where r in [0, 1] is a uniformly distributed random number and CR is the crossover rate.

- Selection. A one-to-one tournament is applied to determine which individual is selected as a member of the new population.

In the final step, when a stop condition is fulfilled, DE returns the best individual in the current population.

3.3 Evolution strategies

Unlike differential evolution, the population in the method of evolution strategies [7] consists of only one individual:

    P ~ x.    (6)

The initial individual is calculated as

    x_j = x_j^min + r (x_j^max - x_j^min),    (7)

where r in [0, 1] is a uniformly distributed random number. We sample several offsets, represented as normally distributed random vectors e_1, e_2, ..., e_n ~ N(0, I). Then we shift the individual in the direction of the weighted sum of the offsets, which approximates the gradient:

    x <- x + alpha * (1 / (n * sigma)) * sum_{i=1}^{n} f(x + sigma e_i) e_i,    (8)

where alpha and sigma are user-specified constants.

3.4 Construction of ensembles

Two of the most popular approaches to constructing ensembles of decision trees are bagging and boosting. An example of a method that uses bagging is random forest, and an example of a method that uses boosting is AdaBoost. In this part of the paper we propose to replace the classical decision tree induction algorithms in these methods with the evolutionary algorithms described earlier.
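As a sketch of the bagging variant, the loop below uses a hypothetical `evolve_tree` inducer in place of the DE/ES optimizers above; the stand-in shown simply predicts the majority class so the example is self-contained, and all names are ours, not the authors':

```python
import numpy as np

def evolve_tree(X, y, rng):
    """Stand-in for an evolutionary tree inducer (DE or ES of Section 3).
    It returns a constant majority-class predictor; the real method would
    optimize a real-valued tree encoding against a fitness function."""
    majority = np.bincount(y).argmax()
    return lambda X_new: np.full(len(X_new), majority)

def evo_random_forest(X, y, n_trees=10, seed=0):
    """Bagging in the spirit of EvoRF: each tree is fit on a bootstrap
    sample by an evolutionary search; prediction is a majority vote."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample
        trees.append(evolve_tree(X[idx], y[idx], rng))
    def predict(X_new):
        votes = np.stack([t(X_new) for t in trees])  # (n_trees, n_samples)
        return np.array([np.bincount(votes[:, j]).argmax()
                         for j in range(votes.shape[1])])
    return predict
```

The boosting variant (EvoBoost) would differ only in reweighting the training examples between rounds instead of resampling them uniformly.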
Thus, we obtain two new methods: an evolutionary random forest (EvoRF) as the analogue of random forest, and EvoBoost as the analogue of AdaBoost. In addition, we consider a method (EvoEnsemble) in which each individual in the population is the whole ensemble, represented as one large real-valued vector obtained by concatenating the vectors of all trees in the ensemble. In this method, the evolutionary ensemble, we optimize the whole ensemble at once, which in theory should lead to a better result.

4 Experiments

For the experiments we use several popular data sets from the UCI repository. The experiments are divided into two parts.

First, we evaluate the classification accuracy of the methods based on evolutionary algorithms and compare their results with classical classification methods. The experiments show that the proposed methods do not exceed the results of classical decision tree algorithms on some data sets, but on the vast majority of data sets the use of evolution strategies achieves a significant improvement in prediction accuracy of several percent (Table 1). Therefore, we use this algorithm to build ensembles in the subsequent experiments.

Table 1. Comparison of popular classification algorithms, CART [11] and the multilayer perceptron (MLP) [12], with the proposed approaches: differential evolution (DE) and evolution strategies (ES). Accuracy, %.

Dataset          CART   MLP    DE     ES     | Dataset          CART   MLP    DE     ES
car              96.74  98.32  90.59  91.18  | molecular-p      75.85  86.54  85.57  86.01
tic-tac-toe      93.65  93.38  87.96  86.39  | diabetes         74.49  73.89  75.03  75.07
glass            71.42  71.36  73.02  73.45  | balance-scale    78.07  79.68  80.05  80.04
iris             94.45  97.13  96.97  97.24  | ionosphere       88.23  89.78  91.32  91.17
australian       85.67  86.12  86.42  86.05  | cmc              54.83  56.05  55.89  56.01
wine             92.43  93.75  94.58  94.65  | vehicle          69.75  72.31  71.96  72.18
liver-disorders  67.73  66.96  68.36  68.25  | lymph            77.97  78.13  78.42  78.36
haberman         73.25  74.89  75.43  75.76  | dermatology      94.32  93.56  95.67  95.75
heart-statlog    78.75  75.43  79.34  80.20  | sonar            75.33  77.35  76.49  79.43
page-blocks      96.98  95.76  97.35  97.03  | credit-g         72.25  75.43  74.32  73.85

Second, we evaluate the classification accuracy of several approaches to constructing ensembles of decision trees. For these experiments we use data sets that have only two different labels, i.e., we solve binary classification problems. Various hyperparameters of the random forest and AdaBoost algorithms, such as the depth of the trees and the maximum number of trees, were selected by grid search, and the same parameters were then used for their evolutionary analogues. As in the single-tree experiments, the proposed methods do not exceed the results of the classical ensemble algorithms on some data sets, but the method that represents the whole ensemble as one real-valued vector shows the best accuracy on most data sets (Table 2).

Table 2.
Comparison of random forest (RF) [13] and AdaBoost [14] with the proposed approaches for constructing ensembles of decision trees: the evolutionary version of random forest (EvoRF), EvoBoost, and the evolutionary ensemble (EvoEnsemble). Accuracy, %.

Dataset          RF     EvoRF  EvoEnsemble  AdaBoost  EvoBoost
tic-tac-toe      97.48  97.76  97.84        96.31     96.91
australian       92.03  91.59  92.73        91.36     90.93
liver-disorders  77.32  75.27  76.73        76.31     76.45
molecular-p      89.24  90.64  91.03        90.21     90.40
diabetes         82.31  83.74  83.67        82.23     85.07
ionosphere       92.35  92.89  93.11        91.76     92.17
haberman         79.45  80.12  80.79        79.21     80.69
heart-statlog    83.43  84.24  83.78        83.09     83.85
sonar            85.14  86.38  86.19        85.02     86.03
credit-g         77.31  77.24  79.15        79.07     79.63

5 Conclusion and Future Work

In this paper, we have proposed several methods that use different evolutionary algorithms to construct decision trees and their ensembles. The main contribution of this paper is a real-valued vector representation of a decision tree that allows different evolutionary algorithms to be used for constructing decision trees and their ensembles. The proposed algorithms show better quality than classical methods such as CART, random forest and AdaBoost on popular data sets from the UCI repository, but achieving such results takes more time than the classical algorithms. This is because the evolutionary methods build trees and evaluate their quality many times during training, while the classical algorithms do so only once.

A detailed analysis of the computational performance of the proposed methods, parallel computation in evolutionary algorithms, initialization of the population with the results of classical decision tree induction algorithms, and an evolutionary analogue of gradient boosting are possible areas for further research.

References

1. R. O. Duda, P. E. Hart, and D. G. Stork.
Pattern Classification, 2nd ed. Wiley-Interscience, 2001.
2. J. R. Quinlan. Induction of decision trees. Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.
3. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., 1993.
4. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees, 1984.
5. R. L. De Mántaras. A distance-based attribute selection measure for decision tree induction. Machine Learning, vol. 6, no. 1, pp. 81-92, 1991.
6. M. Tasgetiren, Y. Liang, M. Sevkli, and G. Gencyilmaz. Differential evolution algorithm for permutation flow shop sequencing problem with makespan criterion. In Proceedings of the 4th International Symposium on Intelligent Manufacturing Systems (IMS2004), pp. 442-452, 2004.
7. I. Rechenberg and M. Eigen. Evolutionsstrategie: Optimierung Technischer Systeme nach Prinzipien der Biologischen Evolution. Frommann-Holzboog, Stuttgart, 1973.
8. D. Jankowski and K. Jackowski. Evolutionary algorithm for decision tree induction. In IFIP International Conference on Computer Information Systems and Industrial Management, pp. 23-32, 2015.
9. R. Rivera-Lopez and J. Canul-Reich. Differential evolution algorithm in the construction of interpretable classification models. In Artificial Intelligence: Emerging Trends and Applications, 2018.
10. M. P. Basgalupp, A. Carvalho, R. C. Barros, and A. Freitas. A survey of evolutionary algorithms for decision-tree induction. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2012.
11. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA; republished by CRC Press, 1984.
12. S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, 1994.
13. L. Breiman. Random forests.
Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
14. Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.
15. T. Bäck. Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms. Oxford University Press, 1996.
