Selection Heuristics on Semantic Genetic Programming for Classification Problems
Authors: Claudia N. Sanchez, Mario Graff
Claudia N. Sánchez (1, 2), Mario Graff (1, 3)

1 INFOTEC Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación, Circuito Tecnopolo Sur No. 112, Fracc. Tecnopolo Pocitos II, Aguascalientes 20313, México
2 Facultad de Ingeniería, Universidad Panamericana, Aguascalientes, México
3 CONACyT, Consejo Nacional de Ciencia y Tecnología, Dirección de Cátedras, Insurgentes Sur 1582, Crédito Constructor, Ciudad de México 03940, México

This work has been submitted to the Evolutionary Computation journal for possible publication.

Abstract

Individuals' semantics have been used to guide the learning process of Genetic Programming when solving supervised learning problems. Semantics has been used to propose novel genetic operators as well as different ways of performing parent selection. The latter is the focus of this contribution, which proposes three heuristics for parent selection that replace the fitness function in the selection mechanism entirely. These heuristics complement previous work by being inspired by the characteristics of the addition, Naive Bayes, and Nearest Centroid functions, and by being applied only when the corresponding function is used to create an offspring. The heuristics use different similarity measures among the parents to decide which of them is more appropriate for a given function. The similarity functions considered are the cosine similarity, Pearson's correlation, and agreement. We analyze these heuristics' performance against random selection, state-of-the-art selection schemes, and 18 classifiers, including auto-machine-learning techniques, on 30 classification problems with a variable number of samples, variables, and classes.
The results indicate that the combination of parent selection based on agreement and random selection to replace an individual in the population produces statistically better results than the classical selection and state-of-the-art schemes, and it is competitive with state-of-the-art classifiers. Finally, the code is released as open-source software.

1 Introduction

Classification is a supervised learning problem that consists of finding a function that learns the relation between inputs and outputs, where the outputs are a set of labels. It can be applied to problems such as object recognition, medical diagnosis, and identification of symbols, among others. The starting point is the training set composed of input-output pairs, i.e., $X = \{(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n)\}$, where $\forall \vec{x} \in X, \vec{x} \in \mathbb{R}^m$, and $\forall y \in X, y \in \mathbb{R}$. The training set $X$ is used to find a function $f$ that minimizes a loss function $L$; that is, $f$ is the function that minimizes $\sum_{(\vec{x}, y) \in X} L(f(\vec{x}), y)$, where the ideal scenario would be $\forall (\vec{x}, y) \in X, f(\vec{x}) = y$, and that also accurately predicts the labels of unseen inputs. Genetic Programming (GP) is a flexible and powerful evolutionary technique with some features that can be very valuable and suitable for the evolution of classifiers (Espejo et al., 2010). The first document related to GP was presented by Friedberg (1958), but the term was coined by Koza (1992). Classifiers constructed with GP started appearing around the 2000s (Loveard and Ciesielski, 2001; Brameier and Banzhaf, 2001), and nowadays they can be comparable with state-of-the-art machine learning techniques for solving hard classification problems. For example, GP has been used for medical purposes (Brameier and Banzhaf, 2001; Olson et al., 2016), image classification (Iqbal et al., 2017), and fault classification (Guo et al., 2005), to mention only some of them.
Also, GP's performance has been successfully compared with other machine learning techniques such as Neural Networks (Brameier and Banzhaf, 2001) and Support Vector Machines (Lichodzijewski and Heywood, 2008; McIntyre and Heywood, 2011). Specifically, GP can be applied in the preprocessing task (Guo et al., 2005; Badran and Rockett, 2012; Ingalalli et al., 2014; La Cava et al., 2019), in model extraction (Loveard and Ciesielski, 2001; Brameier and Banzhaf, 2001; Muni et al., 2004; Zhang and Smart, 2006; Folino et al., 2008; McIntyre and Heywood, 2011; Graff et al., 2016; Iqbal et al., 2017), or for building machine learning pipelines (Olson et al., 2016). Traditionally, GP uses individuals' fitness to select the parents that build the next generation of individuals (Poli et al., 2008). Recently, new approaches for parent selection that use angles among individuals' semantics have been developed. For example, Angle-Driven Selection (ADS), proposed by Chen et al. (2019), chooses parents by maximizing the angle between their relative semantics, aiming to have parents with different behaviors. Vanneschi et al. (2019) introduced a selection scheme based on the angle between the error vectors. Besides, there have been approaches that analyze evolutionary algorithms' behavior when the fitness function is replaced with a heuristic. Nguyen et al. (2012) proposed Fitness Sharing, a technique that promotes dispersion and diversity of individuals. Lehman and Stanley (2011) proposed Novelty Search, where the individual's fitness is related entirely to its novelty, without regard to its target behavior. The individual's novelty is computed as the average distance between its behavior and its k-nearest neighbors' behaviors. Naredo et al. (2016) used Novelty Search in GP for solving classification problems.
However, one of the main characteristics of GP is that the function set defines its search space, and, to the best of our knowledge, there are no documents that propose the use of functions' properties for parent selection. The intersection between our proposal and the previous research works is at the selection stage of the evolutionary process. Most of the proposed methods produce a selection mechanism that considers the semantics of individuals as well as the individuals' fitness to choose the parents. The exception is the proposals using Novelty Search, where the selection process is performed entirely using the semantics, which enhances population diversity. Our approach follows this path by abandoning the fitness function in the selection process. The difference with Novelty Search is that our proposal searches for individuals whose semantics might help the function used to create the offspring. Besides, once the individual is created, local optimization is performed using the training outputs, traditionally named the target semantics. Functions' properties inspire our selection heuristics; in particular, these were inspired by the addition and the classifiers Naive Bayes and Nearest Centroid. The proposed heuristics measure the similarity among parents. They are based on cosine similarity, Pearson's correlation coefficient, and agreement (we thank the anonymous reviewer for suggesting this name, which considerably improves the description's clarity). Specifically, our heuristic based on Pearson's correlation coefficient is quite similar to ADS, but the difference is that our proposal is applied only to specific functions. In this document, we present a comparison of ADS and our proposal. The selection heuristics are tested on a steady-state GP system called EvoDAG. It was inspired by the geometric semantic genetic operators proposed by Moraglio et al. (2012) with the implementation of Vanneschi et al. (Vanneschi et al., 2013; Castelli et al., 2015a).
It has been successfully applied to a variety of text classification problems (Graff et al., 2020). In steady-state evolution, the selection mechanism is used twice: to perform parent selection and to decide the individual being replaced by an offspring; we refer to the latter as negative selection. The proposed selection is used to select the parents; however, for the negative selection, we also tested the system's performance using random selection or the traditional selection guided by fitness. We analyze the performance of the different GP systems obtained by combining the schemes for parent selection (our heuristics, random selection, and traditional selection) and negative selection (random and traditional selection). Besides, we compare our selection heuristics against two state-of-the-art techniques: Angle-Driven Selection (Chen et al., 2019) and Novelty Search (Naredo et al., 2016). To provide a complete picture of the performance of our selection heuristics, we decided to compare them against state-of-the-art classifiers and two auto-machine-learning algorithms. The results show that our selection heuristic guided by agreement, with random negative selection, outperforms traditional selection, presenting a statistically significant difference. On the other hand, our selection heuristics outperform the state-of-the-art techniques Angle-Driven Selection and Novelty Search. Furthermore, compared with state-of-the-art classifiers, our GP system obtained the second-lowest rank, the first one being TPOT, an auto-machine-learning technique, although the difference in performance between these systems, in terms of macro-F1, is not statistically significant. The rest of the manuscript is organized as follows. Section 2 presents the related work.
The GP system used to test the proposed heuristics is described in Section 3. The proposed selection heuristics are presented in Section 4. Section 5 presents the experiments and results. The discussion and limitations of the approach are treated in Section 6. Finally, Section 7 concludes this research.

2 Related work

We can define supervised learning as follows (Vanneschi, 2017): given the training set composed of input-output pairs, $X = \{(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n)\}$, where $\forall \vec{x} \in X, \vec{x} \in \mathbb{R}^m$ and $\forall y \in X, y \in \mathbb{R}$, the learning process can be defined as the problem of finding a function $f$ that minimizes a loss function $L$; that is, $f$ is the function that minimizes $\sum_{(\vec{x}, y) \in X} L(f(\vec{x}), y)$, where the ideal scenario would be $\forall (\vec{x}, y) \in X, f(\vec{x}) = y$. Traditionally, the vector composed of all the outputs, $\vec{t} = (y_1, \ldots, y_n)$, is called the target vector. In this way, a GP individual P can be seen as a function h that, for each input vector $\vec{x}_i$, returns the scalar value $h(\vec{x}_i)$, and the objective is to find the GP individual that minimizes $\sum_{(\vec{x}, y) \in X} L(h(\vec{x}), y)$. In Semantic Genetic Programming (SGP), for example in (Vanneschi, 2017), each individual P can be represented by its semantics vector $\vec{S}_P$, which corresponds to the evaluation of the function h on all the inputs, that is, $\vec{S}_P = (h(\vec{x}_1), \ldots, h(\vec{x}_n))$. We can imagine the existence of two spaces: the genotype space, where individuals are represented by their structures, and the phenotype or semantic space, where individuals are represented by points, which are their semantics. Note that the target vector $\vec{t}$ is a point in the semantic space. The dimensionality of the semantic space is the number of input vectors. Using this notation, the objective of SGP is to find the individual P whose semantics $\vec{S}_P$ is as close as possible to the target vector $\vec{t}$ in the semantic space.
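The definitions above are compact, so a minimal numpy sketch may help; it computes the semantics vector of a toy individual and its distance to the target vector. The helper name `semantics` and the toy data are illustrative, not part of any particular system.

```python
import numpy as np

def semantics(h, X):
    """Semantics vector S_P: the individual's function h evaluated on every input."""
    return np.array([h(x) for x in X])

# Toy training set: three inputs in R^2 with one target output per input.
X = np.array([[1.0, 2.0], [0.0, 1.0], [2.0, 0.0]])
t = np.array([3.0, 1.0, 2.0])  # target vector

# A candidate individual, e.g. h(x) = x_0 + x_1.
h = lambda x: x[0] + x[1]
S_p = semantics(h, X)

# Distance between the semantics and the target vector (used as fitness).
fitness = np.linalg.norm(S_p - t)
print(S_p, fitness)  # here S_p equals t, so the distance is 0.0
```

Note that the semantic space has one dimension per training case, so an individual is a point whose coordinates are its outputs on the training set.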
The distance between the individual's semantics $\vec{S}_P$ and the target vector $\vec{t}$ is used as the individual's fitness. To provide a complete picture of the research documents related to this contribution, the section starts by describing semantic genetic operators; this is followed by research proposals on selection mechanisms; and, lastly, some proposals where GP has been used to develop classifiers.

2.1 Semantic Genetic Operators

Semantic Genetic Programming uses the target behavior, $\vec{t}$, to guide the search. Krawiec (2016) affirmed that semantically aware methods make search algorithms better informed. Specifically, several crossover and mutation operators have been developed with the use of semantics. Beadle and Johnson (2008) proposed a crossover operator that measures the semantic equivalence between parents and offspring and rejects an offspring that is semantically equivalent to its parents. Uy et al. (2011) developed a semantic crossover and a mutation operator. The crossover operator searches for a crossover point in each parent so that the subtrees are semantically similar, and the mutation operator allows the replacement of an individual's subtree only if the new subtree is semantically similar. Hara et al. (2012) proposed the Semantic Control Crossover that uses semantics to combine individuals. A global search is performed in the first generations and a local search in the last ones. Graff et al. used subtree semantics and partial derivatives to propose a crossover (Graff et al., 2014b; Suárez et al., 2015) and a mutation (Graff et al., 2014a) operator. Moraglio et al. proposed Geometric Semantic Genetic Programming (GSGP) (Moraglio and Poli, 2004; Moraglio et al., 2012). Their work caught the GP scientific community's attention because the crossover operator produces an offspring that stands on the segment joining the parents' semantics.
Therefore, the offspring's fitness cannot be worse than the worst fitness of the parents. Given two parents' semantics $\vec{p}_1$ and $\vec{p}_2$, the crossover operator generates an offspring whose semantics is $r \cdot \vec{p}_1 + (1 - r) \cdot \vec{p}_2$, where r is a real value between 0 and 1. This property transforms the fitness landscape into a cone. Unfortunately, the offspring is always bigger than the sum of its parents' sizes; this makes the operator unusable in practice. Later, some operators were developed to improve Moraglio's GSGP, for example, Approximately Geometric Semantic Crossover (SX) (Krawiec and Lichocki, 2009), Deterministic Geometric Semantic Crossover (Hara et al., 2012), Locally Geometric Crossover (LGX) (Krawiec and Pawlak, 2012, 2013), Approximated Geometric Crossover (AGX) (Pawlak et al., 2015), and Subtree Semantic Geometric Crossover (SSGX) (Nguyen et al., 2016). Graff et al. (2015b) proposed a new crossover operator based on projections in the phenotype space. It creates a plane in the semantic space using the parents' semantics, and the offspring is calculated as the projection of the target onto that plane. Given the parents' semantics $\vec{p}_1$ and $\vec{p}_2$, and the target semantics $\vec{t}$, the offspring is calculated as $\alpha \vec{p}_1 + \beta \vec{p}_2$, where $\alpha$ and $\beta$ are real values calculated by solving the equation $A [\alpha, \beta]' = \vec{t}$, where $A = (\vec{p}_1, \vec{p}_2)$. This implies that the offspring will be at least as good as the best parent. Memetic Genetic Programming based on Orthogonal Projections in the Phenotype Space was also proposed by Graff et al. (2015a). In that work, they used a linear combination of k parents, $\sum_k \alpha_k \vec{p}_k$, where $\vec{p}_k$ represents the semantics of the k-th parent. The main idea is to optimize the coefficients $\{\alpha_k\}$ with ordinary least squares (OLS) to guarantee that the offspring is the best of its family. As a result, the generated tree's fitness is always better than or equal to that of any internal tree.
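The projection crossover can be sketched in a few lines of numpy; the least-squares solution of $A[\alpha, \beta]' = \vec{t}$ is exactly the projection of the target onto the plane spanned by the parents. The function name is illustrative.

```python
import numpy as np

def projection_crossover(p1, p2, t):
    """Offspring semantics as the projection of the target t onto the plane
    spanned by the parents' semantics p1 and p2. alpha and beta solve
    A [alpha, beta]' ~= t in the least-squares sense, where A has the
    parents' semantics as columns."""
    A = np.column_stack([p1, p2])
    (alpha, beta), *_ = np.linalg.lstsq(A, t, rcond=None)
    return alpha * p1 + beta * p2

p1 = np.array([1.0, 0.0, 1.0])
p2 = np.array([0.0, 1.0, 1.0])
t = np.array([2.0, 3.0, 5.0])
o = projection_crossover(p1, p2, t)

# Each parent lies on the plane (alpha=1, beta=0 and vice versa), so the
# offspring's distance to the target cannot exceed either parent's.
assert np.linalg.norm(o - t) <= min(np.linalg.norm(p1 - t),
                                    np.linalg.norm(p2 - t))
```

The final assertion is the "at least as good as the best parent" property stated above: the projection minimizes the distance to $\vec{t}$ over the whole plane, and both parents belong to that plane.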
It was not the first time that parameters were added to GP nodes; Smart and Zhang (2004) defined the Inclusion Factors as numeric values between 0 and 1 assigned to each node in the tree structure, except the root node. This value represents the inclusion proportion of the node in the tree. Castelli et al. (2015b) presented a mutation operator, called in their work Geometric Semantic Genetic Programming with Local Search, based on Moraglio's mutation operator, that also uses parameters. In this operator, an individual's semantics $\vec{p}$ is modified with the following equation: $\alpha_0 + \alpha_1 \vec{p} + \alpha_2 (\vec{r}_1 - \vec{r}_2)$, where $\vec{r}_1$ and $\vec{r}_2$ are the semantics of random trees, and $\alpha_i \in \mathbb{R}$. The $\{\alpha_i\}$ are calculated using the target semantics and OLS to obtain the best linear combination of the individual's semantics and the random trees. They extended their work in (Castelli et al., 2019), applying Local Search to all the individuals in a separate step after mutation and crossover. For each individual p, they calculated another one as $p' = \alpha p + \beta$, where $\alpha$ and $\beta$ are optimized with OLS, minimizing the error between the individual's semantics and the target semantics. Moreover, they generalized the idea and transformed it into a regression problem $p' = \sum_j \alpha_j f_j(p)$, where $f_j : \mathbb{R} \to \mathbb{R}$. All those operators use semantics to guide the learning process; however, one of the main characteristics of GP is that the function set defines its search space, and, to the best of our knowledge, there are no documents that propose the use of functions' properties for designing operators.

2.2 Fitness and Selection in Genetic Programming

According to Vanneschi et al. (2014), one way to promote diversity in GP is by using different selection schemes. Nguyen et al. (2012) proposed Fitness Sharing, a technique that promotes dispersion and diversity of individuals.
Their proposal consists of calculating an individual's shared fitness as $f'_i = f_i (m_i + 1)$, where $f_i$ is the individual's fitness and $m_i$ is approximately equal to the number of individuals that behave similarly to individual i. Galvan-Lopez et al. (2013) applied crossover only to those individuals whose difference in behavior is greater than a defined threshold for every element of the semantic vectors. Hara et al. proposed the Deterministic Geometric Semantic Crossover (Hara et al., 2012), and later they proposed to select the parents in such a way that the line connecting them is as close as possible to the target in the semantic space (Hara et al., 2016). Ruberto et al. (2014) defined the Error Vector and the Error Space. The individual's error vector $\vec{e}_p$ is defined as $\vec{e}_p = \vec{p} - \vec{t}$, where $\vec{p}$ is the individual's semantics and $\vec{t}$ represents the target semantics. The error space contains all the individuals represented by their error vectors, where $\vec{t}$ is the origin. The proposal is to search, in the error space, for two or three aligned individuals instead of using the fitness function; the rationale comes from the fact that, given the aligned individuals, there is a straightforward procedure to compute the optimal solution. Chu et al. (2016, 2018) used the error vectors and the Wilcoxon signed-rank test to decide whether to select the fittest or the smaller individual as a parent. Their results show that the proposed techniques enhance semantic diversity and reduce code bloat in GP. Vanneschi et al. (2019) introduced a selection scheme based on the angle between the error vectors. Chen et al. (2019) proposed Angle-Driven Geometric Semantic Genetic Programming (ADGSGP). Their work attempts to further explore the geometry of geometric operators in the search space to improve GP for symbolic regression.
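The Fitness Sharing formula $f'_i = f_i (m_i + 1)$ can be sketched directly; the version below assumes fitness is an error to be minimized (so crowded individuals receive a worse, larger shared error) and counts neighbors within an illustrative distance `radius`. Both the function name and the radius are assumptions for the example, not details of Nguyen et al.'s implementation.

```python
import numpy as np

def shared_fitness(fitness, semantics, radius=0.5):
    """Fitness Sharing sketch: scale each individual's error f_i by
    (m_i + 1), where m_i counts the other individuals whose semantics
    lie within `radius` of individual i."""
    semantics = np.asarray(semantics, dtype=float)
    shared = []
    for i, s in enumerate(semantics):
        dists = np.linalg.norm(semantics - s, axis=1)
        m_i = np.sum(dists <= radius) - 1  # exclude the individual itself
        shared.append(fitness[i] * (m_i + 1))
    return np.array(shared)

# Two individuals share the same semantics; the third stands alone.
errors = np.array([1.0, 1.0, 1.0])
sems = [[0.0, 0.0], [0.0, 0.0], [5.0, 5.0]]
print(shared_fitness(errors, sems))  # [2. 2. 1.]: the crowded pair is penalized
```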
Their proposal included Angle-Driven Selection (ADS), which selects a pair of parents that have good fitness values and are far away from each other in terms of the angle-distance of their relative semantics. The first parent is selected using fitness, and the second one is chosen by maximizing the relative angle-distance between its semantics and the semantics of the first parent. The angle-distance is defined as $\gamma_r = \arccos\left(\frac{(\vec{t} - \vec{p}_1) \cdot (\vec{t} - \vec{p}_2)}{\|\vec{t} - \vec{p}_1\| \, \|\vec{t} - \vec{p}_2\|}\right)$, where $\vec{t}$, $\vec{p}_1$, and $\vec{p}_2$ represent the target semantics and the semantics of the first and second parent, respectively. They also proposed Perpendicular Crossover (PC) and Random Segment Mutation (RSM), which likewise use angles to guide their process. Their experiments show that the angle-driven geometric operators drive the evolutionary process to fit the target semantics more efficiently and improve the generalization performance. Our proposal, like these documents, aims to promote individuals' diversity, but it uses functions' properties to guide the parent selection.

2.3 Genetic Programming for classification

Genetic Programming (GP) is a flexible and powerful evolutionary technique with some features that can be very valuable and suitable for the evolution of classifiers (Espejo et al., 2010). For example, GP has been used for medical purposes (Brameier and Banzhaf, 2001; Olson et al., 2016), image classification (Iqbal et al., 2017), and fault classification (Guo et al., 2005). Also, GP classifiers have been successfully compared with state-of-the-art classifiers. Brameier and Banzhaf (2001) compared GP against neural networks on medical classification problems from a benchmark database. Their results show that GP performs comparably in classification and generalization.
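The ADS angle-distance $\gamma_r$ above is straightforward to compute; a minimal sketch (with an illustrative function name, and a clip to guard against floating-point values just outside $[-1, 1]$):

```python
import numpy as np

def angle_distance(t, p1, p2):
    """Relative angle gamma_r between two parents as seen from the target t:
    the angle between the relative semantics (t - p1) and (t - p2)."""
    u, v = t - p1, t - p2
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

t = np.array([1.0, 1.0])
p1 = np.array([0.0, 1.0])  # relative semantics along the first axis
p2 = np.array([1.0, 0.0])  # relative semantics along the second axis
print(angle_distance(t, p1, p2))  # pi/2: maximally different behaviors here
```

Maximizing this quantity over the candidate second parents yields parents whose residuals with respect to the target point in very different directions.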
McIntyre and Heywood (2011) compared their GP framework against SVM classifiers over 12 UCI datasets with between 150 and 200,000 training instances. Solutions from the GP framework appear to provide a good balance between classification performance and model complexity, especially as the dataset instance count increases. La Cava et al. (2019) compared their GP framework to several state-of-the-art classification techniques (Random Forest, Neural Networks, and Support Vector Machines) across a broad set of problems, and showed that their technique achieves competitive test accuracies while also producing concise models. Data can be transformed at the preprocessing stage to increase the quality of the knowledge obtained, and GP can be used to perform this transformation (Espejo et al., 2010). Badran and Rockett (2012) proposed a multi-objective GP to evolve a feature-extraction stage for multiple-class classifiers. They found mappings that transform the input space into a new multi-dimensional decision space to increase the discrimination between all classes; the number of dimensions of this decision space is optimized as part of the evolutionary process. Ingalalli et al. (2014) introduced a GP framework called Multi-dimensional Multi-class Genetic Programming (M2GP). The main idea is to transform the original space into another one using functions evolved with GP; then, a centroid is calculated for each class, and the vectors are assigned to the class that corresponds to the nearest centroid using the Mahalanobis distance. M2GP takes as an argument the dimension of the transformed space. This parameter is evolved in M3GP (Munoz et al., 2015) by including specialized search operators that can increase or decrease the number of feature dimensions produced by each tree.
They extended M3GP and proposed M4GP (La Cava et al., 2019), which uses a stack-based representation in addition to new selection methods, namely lexicase selection and age-fitness Pareto survival. Naredo et al. (2016) used Novelty Search (NS) for evolving GP classifiers based on M3GP, where the difference is the procedure to compute the fitness. Each GP individual is represented as a binary vector whose length is the training set size, and each vector element is set to 1 if the classifier assigns the class label correctly and 0 otherwise. Then, those binary vectors are used to measure the sparseness among individuals: the greater the sparseness, the higher the fitness value. Their results show that all their NS variants achieve competitive results relative to the traditional objective-based search. Auto machine learning consists of obtaining a classifier (or a regressor) automatically. It includes the steps of preprocessing, feature selection, classifier selection, and hyperparameter tuning. Feurer et al. (2015) developed a robust automated machine learning (AutoML) technique using Bayesian optimization methods. It is based on scikit-learn (Pedregosa et al., 2011), using 15 classifiers, 14 feature-preprocessing methods, and 4 data-preprocessing methods, giving rise to a structured hypothesis space with 110 hyperparameters. Olson et al. (2016) proposed the use of GP to develop an algorithm that automatically constructs and optimizes machine learning pipelines through a Tree-based Pipeline Optimization Tool (TPOT). On classification, the objective consists of maximizing the accuracy score by searching over the combinations of 14 preprocessors, five feature selectors, and 11 classifiers; all these techniques are implemented on scikit-learn.
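The novelty computation described for NS classifiers can be sketched as follows: each individual's behavior is a binary vector over the training cases, and its score is the average distance to its k nearest neighbors' behaviors. The function name and the toy population are illustrative.

```python
import numpy as np

def novelty(behaviors, k=2):
    """Novelty Search sketch for classifiers: behaviors are binary vectors
    (1 = training case classified correctly). An individual's novelty is the
    average distance to its k nearest neighbors; sparser behaviors score
    higher."""
    B = np.asarray(behaviors, dtype=float)
    scores = []
    for i, b in enumerate(B):
        d = np.linalg.norm(B - b, axis=1)
        d[i] = np.inf  # ignore the distance to itself
        scores.append(np.mean(np.sort(d)[:k]))
    return np.array(scores)

pop = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
print(novelty(pop, k=1))  # the third, unique behavior is the most novel
```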
It is interesting to note that TPOT uses a tree-based GP approach where the different learning-process components are nodes in a tree, and traditional subtree crossover is used as a genetic operator. The use of the proposed selection heuristics improves EvoDAG, a GP system described in Section 3, making it competitive with the auto-machine-learning techniques explained above and other state-of-the-art classifiers.

3 Genetic Programming System

We decided to implement the proposed selection heuristics and the selection heuristics of the state of the art in our previously developed GP system called EvoDAG (https://github.com/mgraffg/EvoDAG) (Graff et al., 2016, 2017). EvoDAG is inspired by the implementation of GSGP performed by Castelli et al. (2015a), where the main idea is to keep track of all the individuals and their behavior, leading to an efficient evaluation of the offspring whose complexity depends only on the number of fitness cases. Let us recall that the offspring, in the geometric semantic crossover, is $\vec{o} = r \vec{p}_1 + (1 - r) \vec{p}_2$, where r is a random function or a constant, and $\vec{p}_1$ and $\vec{p}_2$ are the parents' semantics. As explained in the previous section, in (Graff et al., 2015b) we decided to extend this operation by allowing the offspring to be a linear combination of the parents, that is, $\vec{o} = \theta_1 \vec{p}_1 + \theta_2 \vec{p}_2$, where $\theta_1$ and $\theta_2$ are obtained using ordinary least squares (OLS), minimizing the difference between the offspring and the target semantics. Continuing with this line of research, in (Graff et al., 2016) we investigated the case where the offspring is a linear combination of more than two parents, and also included the possibility that the parents could be combined using a function randomly selected from the function set. EvoDAG, as customary, uses a function set $F = \{\Sigma_{60}, \Pi_{20}, \max_5, \min_5, \sqrt{\cdot}, |\cdot|, \sin, \tan, \arctan, \tanh, \mathrm{hypot}_2, \mathrm{NB}_5, \mathrm{MN}_5, \mathrm{NC}_2\}$ and a terminal set $T = \{x_1, \ldots, x_m\}$ to create the individuals. F also includes classifiers such as Naive Bayes with a Gaussian distribution ($\mathrm{NB}_5$), Naive Bayes with a multinomial distribution ($\mathrm{MN}_5$), and Nearest Centroid ($\mathrm{NC}_2$). The remaining function-set elements are traditional operations, where the subscript indicates the number of arguments. EvoDAG's default parameters, including the number of arguments, were defined by performing a random search (Bergstra and Bengio, 2012) on the parameter space, using as a benchmark classification problems that included different sentiment analysis problems as well as problems taken from the UCI repository (not included in the problems used to measure the performance of the selection heuristics). The final values were the consensus of the parameters obtaining the best performance on the problems tested. The initial population starts with $P = \{\theta_1 x_1, \ldots, \theta_m x_m, \mathrm{NB}(x_1, \ldots, x_m), \mathrm{MN}(x_1, \ldots, x_m), \mathrm{NC}(x_1, \ldots, x_m)\}$, where $x_i$ is the i-th input and $\theta_i$ is obtained using OLS. In the case where the number of individuals is lower than the population size, the process continues by including an individual created by randomly selecting a function from F, with the arguments drawn from the current population P. For example, let hypot be the selected function, and let the first and second arguments be $\theta_2 x_2$ and $\mathrm{NB}(x_1, \ldots, x_m)$; then, the individual inserted into P is $\theta \, \mathrm{hypot}(\theta_2 x_2, \mathrm{NB}(x_1, \ldots, x_m))$, where $\theta$ is obtained using OLS. This process continues until the population size is reached; EvoDAG sets a population size of 4000. EvoDAG uses a steady-state evolution; consequently, P is updated by replacing a current individual, selected using negative selection, with an offspring that can be selected as a parent just after being inserted into P.
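The OLS fitting of the $\theta$ parameters can be illustrated with a small numpy sketch. It fits an addition node $\sum_i \theta_i x_i$ over the parents' semantics, with one $\theta$ vector per class under a one-vs-rest encoding (class mapped to +1, rest to −1), as EvoDAG does for classification; the function names and toy data are illustrative, not EvoDAG's actual API.

```python
import numpy as np

def fit_addition_node(parents, labels):
    """Fit an addition node sum_i theta_i x_i with OLS, one theta vector per
    class under a one-vs-rest encoding (class -> +1, rest -> -1). Returns a
    (num_parents, num_classes) coefficient matrix and the class labels."""
    X = np.column_stack(parents)  # parents' semantics as columns
    classes = np.unique(labels)
    T = np.where(labels[:, None] == classes, 1.0, -1.0)
    theta, *_ = np.linalg.lstsq(X, T, rcond=None)
    return theta, classes

def predict(parents, theta, classes):
    """Each node outputs one value per class; the prediction is the argmax."""
    scores = np.column_stack(parents) @ theta
    return classes[np.argmax(scores, axis=1)]

# Toy case: two parent semantics over four training cases, two classes.
p1 = np.array([1.0, 1.0, -1.0, -1.0])
p2 = np.array([1.0, -1.0, 1.0, -1.0])
labels = np.array([0, 0, 1, 1])
theta, classes = fit_addition_node([p1, p2], labels)
print(predict([p1, p2], theta, classes))  # [0 0 1 1]
```

The same pattern applies to the other functions: evaluate $f(\ldots, x_i, \ldots)$ on the training cases and fit a scaling $\theta$ (one per class) against the one-vs-rest targets with OLS.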
The evolution process is similar to the one used to create the initial population; the difference is in the procedure used to select the arguments. That is, a function f is selected from F, its arguments are selected from P using tournament selection or any of the proposed selection heuristics, and, finally, the parameters $\theta$ associated with f are optimized using OLS. The addition is defined as $\sum_i \theta_i x_i$, where $x_i$ is an individual in P. The rest of the arithmetic functions, the trigonometric functions, min, and max are defined as $\theta f(\ldots, x_i, \ldots)$, where f is the function at hand and $x_i$ is an individual in P. To prevent overfitting, EvoDAG stops the evolutionary process using early stopping; that is, the training set is split into a smaller training set (50% reduction) and a validation set containing the remaining elements. The training set is used to calculate the fitness and the parameters $\theta$. The evolution stops when the best individual on the validation set has not been updated in a defined number of evaluations; EvoDAG sets this to 4000. The final model corresponds to the best individual on the validation set found during the whole evolutionary process. At this point, it is worth mentioning that EvoDAG uses a one-vs-rest scheme on classification problems. That is, a problem with k different classes is converted into k problems; each one assigns 1 to the current class and -1 to the other labels. Instead of evolving one tree per class, as done, for example, in Muni et al. (2004), we decided to use only one tree and optimize k different $\theta$ parameters, one for each label. The result is that each node outputs k values, and the class is the one with the highest value. In the nodes representing classifiers, like Naive Bayes or Nearest Centroid, the output is the log-likelihood. To provide an idea of the type of models produced by EvoDAG, Figure 1 presents a model of the Iris dataset.
The inputs ($x_0, \ldots, x_3$, NB, MN, NC) are at the bottom of the figure. The computation flows from bottom to top; the output node, i.e., Naive Bayes using a Gaussian distribution, is at the top of the figure. The figure helps to understand the role of optimizing the k sets of parameters, one for each class, where each node outputs k values; consequently, each node is a classifier. EvoDAG uses the macro-F1 score to calculate the individuals' fitness. The macro-F1 score was chosen because it helps to handle imbalanced datasets. The class imbalance problem typically occurs when, in a classification problem, there are many more instances of some classes than of others. In such cases, standard classifiers tend to be overwhelmed by the class with more examples, ignoring the less represented classes (Chawla et al., 2004). It is well known that in evolutionary algorithms there are runs that do not produce an acceptable result; to improve stability, we decided to use Bagging (Breiman, 1996) in our approach. We create 30 different models by randomly selecting 50% of the samples for training and the remaining ones for validation. A bagging estimator can be expected to perform similarly by either drawing n elements from the training set with replacement or selecting n/2 elements without replacement (Friedman and Hall, 2007). In addition, we reduce the learning complexity that, in EvoDAG's case, is measured in terms of training samples. EvoDAG's final prediction is the average of the models' predictions.

Figure 1: A model evolved by EvoDAG on the Iris dataset. The inputs are at the bottom of the figure and the output is at the top.

4 Selection Heuristics

This document proposes selection heuristics for GP tailored to classification problems, based on the idea that functions' properties and individuals' semantics can guide parent selection.
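Since macro-F1 is EvoDAG's fitness, a short sketch of the metric may be useful; it is the unweighted mean of the per-class F1 scores, so a classifier that ignores a minority class is penalized even when its accuracy is high. (scikit-learn's `f1_score` with `average='macro'` computes the same quantity; the pure-numpy version below just makes the definition explicit.)

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Macro-F1: the unweighted mean of the per-class F1 scores, so every
    class counts equally regardless of its number of instances."""
    score = 0.0
    classes = np.unique(y_true)
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score += 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return score / len(classes)

# A classifier that always predicts the majority class: 80% accuracy,
# but the macro-F1 exposes the ignored minority class.
y_true = np.array([0, 0, 0, 0, 1])
y_pred = np.array([0, 0, 0, 0, 0])
print(macro_f1(y_true, y_pred))  # 0.444..., i.e. (8/9 + 0) / 2
```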
The heuristics replace the fitness function used in the selection procedure (tested, in particular, in tournament selection) to select the parents. Let us recall that in a steady-state evolution there are two stages where selection takes place. On the one hand, selection is used to choose the parents; on the other hand, selection is applied to decide which individual in the current population is replaced by the offspring. The latter is called negative selection. The most popular selection method in GP is tournament selection (Fang and Li, 2010), and negative selection is commonly performed using the same scheme; nonetheless, in the latter case, the winner of the tournament is the one with the worst fitness. In the rest of the section, we describe the process of creating an offspring in EvoDAG, the traditional tournament selection (based on fitness), random selection, and our three proposed selection heuristics.

In EvoDAG, the process of creating an offspring starts by selecting a function from the function set F; then, parent selection needs to be performed to choose each one of the k arguments (or parents). Figure 2 shows an example of traditional tournament selection (with tournament size two) where the function Σ was selected, and 3 individuals need to be selected as arguments from the population P for creating the new offspring. It can be seen that for selecting each argument, a binary tournament needs to be performed. For selecting an argument, two individuals from the population are randomly chosen, and the one with the highest fitness is selected as the argument; each tournament is depicted using a different color. The procedure represented in Figure 2 also helps to describe random selection, where each argument of the function Σ is selected randomly from the population, and a tournament is not needed.
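The two baseline schemes, fitness-based binary tournament and plain random selection, can be sketched as follows; this is a minimal illustration with names of our own choosing, not EvoDAG's API.

```python
import random

def tournament_selection(population, fitness, size=2):
    """Classic tournament: sample `size` individuals at random and
    return the one with the highest fitness."""
    contestants = random.sample(population, size)
    return max(contestants, key=fitness)

def random_selection(population):
    """Random parent selection: no tournament at all."""
    return random.choice(population)

# Toy usage: individuals are integers and fitness is the value itself.
pop = list(range(10))
winner = tournament_selection(pop, fitness=lambda x: x)
```

A tournament over the whole population (`size=len(pop)`) degenerates to picking the fittest individual, while `size=1` degenerates to random selection; the tournament size thus controls selection pressure.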
As can be seen, random selection is the most straightforward and least expensive strategy, given that there is no need to perform the tournament.

Let us start by describing the selection heuristics that were inspired by functions' properties. The functions are the addition and the classifiers Naive Bayes and Nearest Centroid.

Figure 2: Diagram of tournament selection using the fitness function to decide the winner of the tournament.

The addition is defined in our GP system as $\sum_k \theta_k p_k$, where OLS and the target semantics are used to estimate $\theta_k$. To accurately identify the k coefficients, the exogenous variables $p_k$ must be linearly independent. In general, knowing whether a set of vectors is linearly independent requires a non-zero determinant; however, the trivial case is when these vectors are orthogonal. Based on Brereton (2016), uncorrelated vectors are also linearly independent. In Naive Bayes's case, the model assumes that, given a class, the features are independent. As expected, the process to test independence is expensive, so, instead, we use the correlation among features in our heuristic. The correlation of two statistically independent variables is zero, although the inverse is not necessarily true. Finally, Nearest Centroid (NC) is a classifier that represents each class with its centroid, calculated with the elements associated with that class. The label of a given instance corresponds to the class of the closest centroid. Therefore, we think that increasing the diversity of the inputs might improve the performance of NC. To sum up, the three functions (addition, Naive Bayes, and Nearest Centroid) perform better when their input vectors are orthogonal, uncorrelated, or independent. The main idea is quite similar to the proposal in Novelty Search (Lehman and Stanley, 2011), where instead of promoting fitness, they promote diversity.
In our case, we want fit individuals for the final solution, but we select parents promoting diversity among them. The proposed selection heuristics use the individuals' semantics for choosing diverse parents.

The first selection heuristic would ideally select orthogonal vectors; evidently, this event is unlikely, so a function that measures the closeness to orthogonality is needed. The cosine similarity is such a measure; it is defined in Equation 1, where $\vec{v}_1$ and $\vec{v}_2$ are vectors, $\cdot$ represents the dot product, and $\|\vec{v}\|$ the norm of $\vec{v}$. Its range is between −1 and 1, where 1 indicates that the vectors point in the same direction, −1 in exactly the opposite direction, and 0 indicates that the vectors are orthogonal. It is worth mentioning that the absolute value of the cosine similarity is used instead because cosine-similarity values of 1 and −1 are equivalent regarding linear independence.

$$CS(\vec{v}_1, \vec{v}_2) = \cos(\theta) = \frac{\vec{v}_1 \cdot \vec{v}_2}{\|\vec{v}_1\|\,\|\vec{v}_2\|} \qquad (1)$$

Figure 3: Diagram of parent selection based on heuristics.

The process of selecting a parent using the absolute cosine similarity is depicted in Figure 3. Let us recall that tournament selection (with a tournament size of two) is being used, and the selection heuristic replaces the fitness function traditionally used in the tournament to select the parents. Under this configuration, the figure depicts the selection of three arguments to be used with the addition function. The first of the arguments is selected randomly from the population; this is depicted in the red box. Selecting the second argument (box in green) requires choosing two individuals from the population randomly and then comparing them using the absolute cosine similarity between each of the selected individuals and the first argument. The second argument is the one with the lowest value: the closest to a 90-degree angle.
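A minimal sketch of this tournament step, assuming each individual's semantics is a single NumPy vector; the function names are illustrative, not EvoDAG's API.

```python
import numpy as np

def abs_cos_similarity(v1, v2):
    """|cos| of the angle between two semantics vectors (Equation 1);
    0 means orthogonal, 1 means parallel or anti-parallel."""
    return abs(v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def select_second_parent(first, candidates):
    """Binary-tournament step: among the sampled candidates, keep the one
    whose semantics is closest to orthogonal to the first parent."""
    return min(candidates, key=lambda c: abs_cos_similarity(first, c))

# Toy usage: v_b is orthogonal to `first`, v_a is parallel to it,
# so the heuristic picks v_b.
first = np.array([1.0, 0.0])
v_a = np.array([2.0, 0.0])
v_b = np.array([0.0, 3.0])
chosen = select_second_parent(first, [v_a, v_b])
```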
The absolute cosine similarity is obtained from the individuals' semantics, which are vectors or a list of vectors (in the case of multi-class problems); in the latter case, the average of the absolute cosine similarities is used instead. Finally, selecting the last argument (blue box) is equivalent to selecting the previous one. That is, the individuals selected from the population are compared using the absolute cosine similarity with respect to the first selected argument. Although this process does not guarantee that all the arguments are unique, the implementation ensures that all individuals are different; this is depicted in the figure by representing each possible argument with a different tree.

The second heuristic uses Pearson's Correlation Coefficient to select uncorrelated inputs. The correlation coefficient is defined in Equation 2, where $\vec{v}_1$ and $\vec{v}_2$ are vectors with the values of the variables, $\cdot$ represents the dot product, $\bar{\vec{v}}$ is the average value of vector $\vec{v}$, and $\|\vec{v}\|$ the norm of $\vec{v}$. Pearson's range is between −1 and 1, where 1 indicates a total positive linear correlation, 0 represents no linear correlation, and −1 represents a total negative linear correlation. It can be observed that Equations 1 and 2 are similar; the difference is that the correlation (Equation 2) subtracts the average value from the vectors. It means that the heuristics based on cosine similarity and correlation are the same when the data is zero-centered.

$$\rho_{\vec{v}_1, \vec{v}_2} = \frac{(\vec{v}_1 - \bar{\vec{v}}_1) \cdot (\vec{v}_2 - \bar{\vec{v}}_2)}{\|\vec{v}_1 - \bar{\vec{v}}_1\|\,\|\vec{v}_2 - \bar{\vec{v}}_2\|} \qquad (2)$$

Figure 3 depicts the process of selecting three arguments for the addition function. The process is similar to the one used for the cosine similarity, the only difference between them being the use of the absolute value of Pearson's coefficient instead of the cosine similarity.
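The Pearson-based variant only swaps the similarity measure; the sketch below (illustrative names, single-vector semantics assumed) also makes the equivalence with the centered cosine similarity explicit.

```python
import numpy as np

def abs_pearson(v1, v2):
    """Absolute Pearson correlation (Equation 2): the cosine similarity
    of the mean-centered vectors."""
    c1, c2 = v1 - v1.mean(), v2 - v2.mean()
    return abs(c1 @ c2) / (np.linalg.norm(c1) * np.linalg.norm(c2))

# With zero-centered data, Pearson and cosine similarity coincide,
# as noted above: v2 = 2 * v1 gives |rho| = 1, while v3 is uncorrelated
# with v1 and gives |rho| = 0.
v1 = np.array([1.0, -1.0, 2.0, -2.0])
v2 = np.array([2.0, -2.0, 4.0, -4.0])
v3 = np.array([1.0, 1.0, -1.0, -1.0])
```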
That is, the first argument is selected randomly from the population, whereas the second and third arguments are the individuals whose semantics obtained the lowest absolute Pearson's correlation coefficient in each tournament.

The previous selection heuristics increase the variety of the inputs using the cosine similarity and Pearson's correlation coefficient. However, these similarities do not consider the predicted labels. In classification problems, the individuals' outputs are transformed to obtain the labels by taking, for example, the index of the maximum value in a multiple-output representation. The idea of the last heuristic is to complement the previous selection heuristics by measuring diversity using the predicted labels; the measure used for this purpose is named agreement, defined in Equation 3, where $\vec{p}_1$ and $\vec{p}_2$ represent the label vectors of two individuals, n is the number of samples, and $\delta(\cdot)$ returns 1 if its input is true and 0 otherwise.

$$agr(\vec{p}_1, \vec{p}_2) = \frac{1}{n} \sum_i \delta(p_{1i} == p_{2i}) \qquad (3)$$

Figure 3 depicts the procedure to select three arguments for the addition function using the agreement as the selection heuristic. The process is similar to the ones used for the previous selection heuristics. That is, the first argument is an individual randomly selected from the population. The second and third arguments are selected by performing a tournament where the fitness function is replaced by the agreement. For example, the second argument (green box) is selected by first transforming the first argument's outputs into labels and likewise transforming the outputs of the two individuals selected in the tournament. The labels obtained are used to compute two agreement values, one for each individual in the tournament, both against the labels of the first argument, i.e., the individual selected randomly.
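The agreement of Equation 3 is straightforward to compute over label vectors; a small sketch with illustrative names:

```python
import numpy as np

def agreement(p1, p2):
    """Fraction of samples where two label vectors coincide (Equation 3)."""
    p1, p2 = np.asarray(p1), np.asarray(p2)
    return float(np.mean(p1 == p2))

# Toy usage: the two label vectors agree on 2 of 4 samples.
labels_a = [0, 1, 2, 1]
labels_b = [0, 2, 2, 0]
```

Note that comparing an individual's labels against the target labels with this same function would yield the accuracy, which is the observation made in footnote 3.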
The second argument selected is the individual with the lowest agreement, i.e., the one that maximizes the variety. The third argument is selected by performing another tournament following a method equivalent to the one used for the second argument.

To sum up, three selection heuristics are proposed in this contribution, corresponding to the use of the absolute cosine similarity, Pearson's correlation coefficient, and the agreement. These heuristics replace the fitness function in the tournament selection procedure; consequently, the selected individuals are the ones with the lowest values on the particular heuristic used.

Footnote 3: Note that in the case where $\vec{p}_2$ is the target behavior, the agreement computes the accuracy of $\vec{p}_1$.

5 Experiments and Results

The selection heuristics' performance is analyzed in this section and compared against our GP system with the default parameters, against state-of-the-art selection heuristics, and against traditional classifiers and classifiers using full-model selection.

5.1 Datasets

The classification problems used as benchmarks are 30 datasets taken from the UCI repository (Dua and Graff, 2017). Table 1 shows the dataset information. It can be seen that the datasets are heterogeneous in terms of the number of samples, variables, and classes. Additionally, some of the classification problems are balanced, and others are imbalanced. We use Shannon's entropy to indicate the degree of class imbalance in the problem. It is defined as $H(X) = -\sum_i p_i \log(p_i)$, where $p_i$ represents the probability of category i. We calculate those probabilities by counting the frequencies of each category. Besides, for normalization, we base the logarithm on the number of categories. For example, if the classification problem has four categories, we calculate Shannon's entropy as $H(X) = -\sum_i p_i \log_4(p_i)$. In this sense, a value equal to 1.0 indicates a perfectly balanced problem.
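The normalized entropy just described can be computed as follows (a sketch with illustrative names):

```python
import math
from collections import Counter

def normalized_entropy(labels):
    """Shannon entropy of the class distribution, with the logarithm
    base set to the number of classes, so 1.0 means perfect balance."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    if k == 1:
        return 0.0          # a single class carries no uncertainty
    return -sum((c / n) * math.log(c / n, k) for c in counts.values())

# A perfectly balanced 3-class problem scores 1.0; a 9-to-1 split
# scores well below 1.0.
balanced = ["a", "b", "c"] * 10
skewed = ["a"] * 9 + ["b"]
```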
Conversely, the smaller the value, the greater the imbalance.

Table 1: Datasets used to compare the performance of the algorithms. These problems are taken from the UCI repository. The table includes Shannon's entropy to indicate the degree of class imbalance, where the value 1.0 indicates that the samples are perfectly balanced; the smaller the value, the greater the imbalance.

Dataset | Train samples | Test samples | Variables | Classes | Class entropy
ad | 2295 | 984 | 1557 | 2 | 0.58
adult | 32561 | 16281 | 14 | 2 | 0.8
agaricus-lepiota | 5686 | 2438 | 22 | 7 | 0.81
aps-failure | 60000 | 16000 | 170 | 2 | 0.12
banknote | 960 | 412 | 4 | 2 | 0.99
bank | 31647 | 13564 | 16 | 2 | 0.52
biodeg | 738 | 317 | 41 | 2 | 0.91
car | 1209 | 519 | 6 | 4 | 0.6
census-income | 199523 | 99762 | 41 | 2 | 0.34
cmc | 1031 | 442 | 9 | 3 | 0.98
dota2 | 92650 | 10294 | 116 | 2 | 1.0
drug-consumption | 1319 | 566 | 30 | 7 | 0.44
fertility | 69 | 30 | 9 | 2 | 0.43
IndianLiverPatient | 407 | 175 | 10 | 2 | 0.85
iris | 105 | 45 | 4 | 3 | 1.0
krkopt | 19639 | 8417 | 6 | 18 | 0.84
letter-recognition | 14000 | 6000 | 16 | 26 | 1.0
magic04 | 13314 | 5706 | 10 | 2 | 0.93
ml-prove | 4588 | 1530 | 56 | 2 | 0.98
musk1 | 333 | 143 | 166 | 2 | 0.99
musk2 | 4618 | 1980 | 166 | 2 | 0.61
optdigits | 3823 | 1797 | 64 | 10 | 1.0
page-blocks | 3831 | 1642 | 10 | 5 | 0.27
parkinsons | 135 | 59 | 22 | 2 | 0.79
pendigits | 7494 | 3498 | 16 | 10 | 1.0
segmentation | 210 | 2100 | 19 | 7 | 1.0
sensorless | 40956 | 17553 | 48 | 11 | 1.0
tae | 105 | 45 | 5 | 3 | 0.99
wine | 123 | 53 | 13 | 3 | 0.99
yeast | 1038 | 446 | 9 | 10 | 0.76

The performance of the classifiers is measured on a test set. Some of the problems are already split into a training set and a test set in the repository. For those problems where this partition is not present, we randomly split the dataset using 70% of the samples for the training set and 30% for the test set.

5.2 Computer Equipment

The characteristics of the computer where the experiments were executed are shown in Table 2. For a fair comparison, the experiments were executed using only one core.
Table 2: Characteristics of the computer where the experiments were executed.

Operating system: Ubuntu 16.04.2 LTS
Processor (CPU): Intel(R) Xeon(R) CPU E5-2680 v4
Processor (CPU) speed: 2.5 GHz
Computer memory size: 256 GB
Hard disk size: 1 TB
Number of cores: 14

5.3 Performance Metrics

The classifiers' performance is analyzed in terms of predictive quality and time spent on training, using two metrics: macro-F1 and time (in seconds) per sample. Accuracy is perhaps the most used metric for measuring the performance of classifiers. Its value ranges from 0 to 1, one being the best performance and zero the worst. It can be seen as the percentage of samples that are correctly predicted. However, if the classes are imbalanced, as in the problems of this benchmark, accuracy is not reliable. On the other hand, the F1 score measures a binary classifier's performance taking into account the positive class; it is robust to imbalanced problems. For a multi-class problem, the F1 score can be extended to the macro-F1 score, which corresponds to the average of the F1 scores per class.

Besides, most of the comparisons are performed based on the rank of macro-F1. That is, for each dataset, the classifiers are ranked according to their performance. Rank 1 is assigned to the classifier with the highest macro-F1, rank 2 corresponds to the one with the second-highest macro-F1, and so on. If several classifiers have the same macro-F1 value, they get the same rank, and the following rank number is increased by the number of repeated values.

In addition to a classifier's performance in predicting the samples correctly, time is an essential factor in an algorithm. When the number of samples in the training set is small, all the algorithms learn the model quickly.
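For reference, a from-scratch macro-F1 (equivalent in intent to scikit-learn's `f1_score` with `average='macro'`) makes its robustness to imbalance concrete; this is an illustrative sketch, not the authors' code.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Macro-F1: the unweighted mean of the per-class F1 scores, so
    minority classes weigh as much as majority ones."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return float(np.mean(scores))

# A classifier that always predicts the majority class reaches 75%
# accuracy here, yet its macro-F1 is only 3/7 because the minority
# class is completely missed.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]
```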
However, if the number of samples grows, some algorithms spend considerably more time, and, in some cases, it could be impossible to wait until the algorithm converges. As we mention in the previous section, the datasets vary in the number of samples, and, logically, algorithms spend more time learning from big datasets. In that sense, to normalize the time and make comparisons based on this measure, we divided the time (in seconds) that the algorithms spend in the training phase by the number of samples in the dataset.

5.4 Comparison of the Proposed Selection Heuristics against Classic Tournament Selection

We performed a comparison of different selection schemes for parent and negative selection. Specifically, for parent selection, we compare the use of the following techniques: (1) traditional tournament selection, which uses the individual's fitness (fit); (2) random selection (rnd); (3) tournament selection with the absolute value of the cosine similarity (sim); (4) tournament selection with the absolute value of Pearson's correlation coefficient (prs); and (5) tournament selection with the agreement (agr). The last three selection methods correspond to the proposed selection heuristics; in this case, the selection heuristics are applied only to the functions that inspired them, i.e., addition (Σ), Naive Bayes (NB and MN), and Nearest Centroid (NC); in the rest of the functions, random selection is used instead. In addition, for negative selection, we analyze the use of traditional negative selection, which uses the individual's fitness to select the worst individual in the tournament (fit), and random selection (rnd). The selection schemes were tested on our GP system (EvoDAG). To improve readability, we use the following notation: the selection scheme used for parent selection is followed by the symbol "-", and then comes the abbreviation of the negative selection scheme.
For example, sim-fit means that tournament selection with the absolute cosine similarity is used for parent selection together with traditional negative selection. In total, we analyze the performance of eight combinations: fit-fit, rnd-rnd, sim-fit, sim-rnd, prs-fit, prs-rnd, agr-fit, and agr-rnd. Furthermore, to complete the picture of the proposed selection heuristics and their relation with the functions that served as inspiration, we decided to include in the comparison the performance of GP systems where the heuristics are used in all the functions with two or more arguments. These systems are identified with the symbol *. For example, agr-rnd indicates that the agreement is used with the functions Σ, NB, MN, and NC, whereas agr-rnd* means that the heuristic is used with the functions Σ, Π, max, min, hypot, NB, MN, and NC.

Figure 4 shows the performance of the different techniques used for parent and negative selection on classification tasks. The detailed results can be observed in Table 4. Figure 4a shows the performance results based on macro-F1 ranks over the test sets. It can be seen that the best performance is obtained by selecting the parents with the agreement heuristic and random negative selection (agr-rnd), followed by the use of the same scheme for parent selection and negative tournament selection (agr-fit). The combinations agr-rnd, agr-fit, sim-fit, prs-fit*, prs-rnd, prs-fit, prs-rnd*, sim-fit*, sim-rnd, and sim-rnd* are better than fit-fit (i.e., selection using the fitness function) in terms of macro-F1 average rank. This means that our proposed heuristics improve the performance of the classical selection scheme (fit-fit). Random selection for both parent and negative selection, rnd-rnd, also improves the performance of EvoDAG with respect to the classical fitness-based selection scheme (fit-fit). This indicates the importance of population diversity, as mentioned in Novelty Search (Lehman and Stanley, 2011).
Besides, as we mention in Section 3, once the individual is created, the function parameters are optimized using OLS and the target semantics. Our heuristic based on the agreement works well because it is quite similar to the Novelty Search implemented in (Naredo et al., 2016), but instead of improving diversity among all the individuals in the population, it enhances the diversity among the parents. From the figure, it can also be observed that there is a tendency for the heuristics (identified with the symbol *), when applied to all functions with more than one argument (Σ, Π, max, min, hypot, NB, MN, and NC), to produce worse results than the systems that use the proposed heuristics only on the functions that inspired them (i.e., addition, Naive Bayes, and Nearest Centroid). This affirms the importance of designing heuristics specifically based on the functions' properties. Comparing by the time that the classifiers spend in the training phase (see Figure 4b), it can be seen that rnd-rnd is the fastest; this is because it is the most straightforward.

For the statistical analysis, we use the Friedman and Nemenyi tests (Demšar, 2006). Macro-F1 rank values were used for the Friedman test, which rejected the null hypothesis with a p-value of 4.39e−22. Based on the Nemenyi test, the groups of techniques that are not significantly different (at p = 0.10) are: Group 1 (agr-rnd, agr-fit, rnd-rnd, sim-fit, prs-fit*, prs-rnd, prs-rnd*, prs-fit, sim-fit*, sim-rnd, and sim-rnd*), Group 2 (agr-fit, rnd-rnd, sim-fit, prs-fit*, prs-rnd, prs-rnd*, prs-fit, sim-fit*, sim-rnd, sim-rnd*, and fit-fit), and Group 3 (agr-rnd*, agr-fit*).
This indicates that our proposed heuristic combination based on the agreement for parent selection and random negative selection (agr-rnd) performs statistically better than the classical tournament selection using fitness (fit-fit) because they belong to different groups, i.e., Group 1 and Group 2, respectively. The test also indicates that there is not enough evidence to differentiate the systems of Group 2 (except fit-fit) from the system agr-rnd.

Figure 4: Selection schemes comparison. (a) Macro-F1 ranks measured over the test datasets. (b) Time, in seconds, required by the different selection technique combinations in the training phase; the time is divided by the number of training samples in the datasets. In both figures, green boxplots represent the cases where the selection heuristics (sim, prs, or agr) are applied to all functions. The classifiers are sorted by average rank, or time per sample, which appears on the left.

5.5 Comparison of the Proposed Selection Heuristics against State-of-the-Art Selection Schemes

As we mention in Section 2, there are selection heuristics related to this research. Consequently, we decided to compare our selection heuristics with the two most similar methods: Angle-Driven Selection (Chen et al., 2019) and Novelty Search (Lehman and Stanley, 2011). Angle-Driven Selection (ads) selects the first individual using traditional tournament selection and then replaces the fitness function, in the tournament selection, with the relative angle in the error space. As can be seen, in Angle-Driven Selection, the first individual is chosen using the fitness, whereas, in our proposal, it is selected randomly. Therefore, we decided to add another parameter to indicate whether the first individual is selected using the fitness (fit) or randomly (rnd). The notation for the combinations of selection techniques is as follows.
The parent selection technique is followed by the symbol "-", then comes the abbreviation of the negative selection scheme, and, for our heuristics and ads, at the end, after the symbols "–", comes the abbreviation of the scheme used to select the first individual. For example, agr-rnd–fit means that the agreement is used for parent selection, the negative selection is performed randomly, and the first individual is selected using the fitness.

Figure 5: Proposed heuristics against state-of-the-art selection schemes based on macro-F1. Boxplots present the ranks, measured using macro-F1 over the test datasets. Gray boxplots represent the selection techniques from the state of the art, Novelty Search (nvs) and Angle-Driven Selection (ads). The classifiers are sorted by average rank, which appears on the left.

Figure 5 presents the results of the comparison, based on macro-F1, of our proposed heuristics, the agreement, Pearson's correlation coefficient, and the cosine similarity (agr, prs, and sim), against the state-of-the-art selection techniques: Angle-Driven Selection (ads) and Novelty Search (nvs). Tables 5 and 6 show the detailed results. It can be observed that the performance of our heuristics is generally better than that of Angle-Driven Selection (ads) and Novelty Search (nvs). Using the Friedman and Nemenyi tests (Demšar, 2006), it was found that agr-rnd–rnd, agr-fit–fit, agr-fit–rnd, rnd-rnd, sim-fit–rnd, agr-rnd–fit, prs-fit–rnd, prs-rnd–rnd, ads-rnd–rnd, and sim-rnd–rnd are not significantly different (at p = 0.10), but agr-rnd–rnd is significantly better than Novelty Search (nvs-rnd), Angle-Driven Selection with the original proposal of selecting the first individual using the fitness (ads-fit–fit), fit-fit, ads-rnd–rnd*, and ads-fit–fit*. In the original proposal of Angle-Driven Selection (Chen et al., 2019), it is implemented in a Geometric Semantic GP system; however, in this case, it is applied in EvoDAG.
Angle-Driven Selection is quite similar to the proposed heuristics based on Pearson's correlation coefficient and the cosine similarity (albeit following a different path). These techniques use the geometry of the individuals' semantics for parent selection. ADS measures the angle between relative semantics (see Section 2), while our heuristics measure the angle between the semantics and the centered semantics (see Section 4). In Figure 5, it can be observed that Angle-Driven Selection (ads-rnd–rnd) performs similarly to Pearson's correlation coefficient (prs-rnd–rnd) and the cosine similarity (sim-rnd–rnd); in fact, its rank lies between those of these two systems. Besides, Angle-Driven Selection is better when it uses the combination of selection schemes ads-rnd–rnd than with the original proposal ads-fit–fit. As with our heuristics, we can see that Angle-Driven Selection works better when applied only to the functions addition, Naive Bayes, and Nearest Centroid than when applied to all functions with more than one argument.

On the other hand, Novelty Search was used in a traditional GP system to optimize the inputs of a Nearest Centroid classifier (Naredo et al., 2016). An individual's novelty is calculated from the whole population, not only from the individuals participating in the tournament, as done in our proposed selection heuristics. Novelty Search's performance is just below that of our GP system with the default parameters (fit-fit), and it is the system with the third-worst rank. The performance obtained by Novelty Search might indicate that it is better to use only the information of the individuals participating in the tournament to compute the similarity. The agreement selection heuristic could be seen as a way of transforming the novelty search measure to use only the individuals participating in the tournament.
5.6 Comparison of the Proposed Selection Heuristics against State-of-the-Art Classifiers

After analyzing the different selection schemes' performance, it is time to compare our selection heuristics against state-of-the-art classifiers. We chose the following combinations of selection schemes: agr-rnd, rnd-rnd, fit-fit, ads-rnd, and nvs-rnd. The reason is that agr-rnd is the combination that gives the best results, rnd-rnd represents the simplest scheme while also being highly competitive, fit-fit represents the traditional tournament selection, and, finally, ads-rnd and nvs-rnd are the state-of-the-art selection schemes. We decided to perform the comparison against sixteen classifiers of the scikit-learn Python library (Pedregosa et al., 2011), all of them using their default parameters. Specifically, these classifiers are Perceptron, MLPClassifier, BernoulliNB, GaussianNB, KNeighborsClassifier, NearestCentroid, LogisticRegression, LinearSVC, SVC, SGDClassifier, PassiveAggressiveClassifier, DecisionTreeClassifier, ExtraTreesClassifier, RandomForestClassifier, AdaBoostClassifier, and GradientBoostingClassifier. Two auto-machine-learning libraries are also included in the comparison: autosklearn (Feurer et al., 2015) and TPOT (Olson et al., 2016).

Figure 6: Comparison of selection heuristics against state-of-the-art classifiers based on macro-F1 rank. The classifiers are sorted by average rank, and those values appear on the left. The blue boxplots represent the selection heuristics.
Table 3: Performance using macro-F1 (with ranks) of: tpot, autosklearn, Selection Heuristics (agr-rnd, rnd-rnd, fit-fit, nvs-rnd, and ads-rnd), Perceptron (PER), MLPClassifier (MLP), BernoulliNB (NBB), GaussianNB (NB), KNeighborsClassifier (KN), NearestCentroid (NC), LogisticRegression (LR), LinearSVC (LSVC), SVC, SGDClassifier (SDG), PassiveAggressiveClassifier (PA), DecisionTreeClassifier (DT), ExtraTreesClassifier (ET), RandomForestClassifier (RF), AdaBoostClassifier (AB), and GradientBoostingClassifier (GB). The symbol "-" indicates that the classifier could not solve the classification problem. The rest of the systems are in Table 7.

tpot acc-rnd autosklearn rnd-rnd GB ads-rnd fit-fit nvs-rnd ET RF MLP DT
ad 0.96(4) 0.94(8) 0.96(3) 0.93(14) 0.96(2) 0.96(5) 0.93(12) 0.94(7) 0.94(6) 0.96(1) 0.93(11) 0.94(10)
adult 0.81(1) 0.79(4) 0.81(2) 0.79(5) 0.8(3) 0.79(9) 0.79(6) 0.79(8) 0.76(11) 0.77(10) 0.58(18) 0.74(12)
agaricus-lepiota 0.67(7) 0.68(2) 0.68(5) 0.68(3) 0.56(16) 0.68(4) 0.68(6) 0.68(1) 0.43(21) 0.45(20) 0.61(11) 0.42(22)
aps-failure 0.9(1) 0.83(8) 0.87(2) 0.86(3) 0.85(4) 0.84(6) 0.85(5) 0.83(9) 0.76(18) 0.8(12) 0.82(10) 0.83(7)
banknote 1.0(1) 1.0(1) 1.0(10) 1.0(1) 0.99(17) 1.0(1) 1.0(1) 1.0(1) 1.0(12) 0.99(14) 1.0(1) 0.98(19)
bank 0.74(6) 0.76(3) 0.71(8) 0.76(1) 0.72(7) 0.76(5) 0.76(4) 0.76(2) 0.68(12) 0.7(11) 0.67(13) 0.7(9)
biodeg 0.81(8) 0.84(2) 0.82(6) 0.85(1) 0.79(12) 0.82(7) 0.83(5) 0.84(3) 0.81(10) 0.81(8) 0.8(11) 0.79(13)
car 1.0(1) 0.87(6) 0.96(3) 0.86(7) 0.97(2) 0.8(11) 0.84(9) 0.81(10) 0.87(5) 0.85(8) 0.59(15) 0.94(4)
census-income 0.78(1) 0.77(3) 0.75(6) 0.77(2) 0.75(5) 0.75(7) 0.75(4) 0.38(23) 0.72(9) 0.49(19) 0.71(10) 0.48(21)
cmc 0.54(2) 0.54(4) 0.55(1) 0.53(8) 0.52(9) 0.53(5) 0.53(7) 0.53(6) 0.47(16) 0.48(14) 0.52(10) 0.45(18)
dota2 0.59(7) 0.59(2) 0.59(8) 0.59(3) 0.56(13) 0.59(4) 0.59(1) 0.59(5) 0.55(15) 0.54(16) 0.59(10) 0.52(18)
drug-consumption 0.18(12) 0.23(2) 0.13(20) 0.2(7) 0.19(11) 0.21(4) 0.2(8) 0.2(5) 0.16(16) 0.17(13) 0.25(1) 0.2(6)
fertility 0.44(16) 0.45(4) 0.45(4) 0.45(4) 0.44(16) 0.45(4) 0.44(16) 0.45(4) 0.45(4) 0.45(4) 0.45(4) 0.59(2)
IndianLiverPatient 0.55(16) 0.66(3) 0.56(14) 0.69(1) 0.55(15) 0.66(2) 0.64(5) 0.65(4) 0.61(6) 0.57(11) 0.59(8) 0.54(17)
iris 0.98(8) 0.98(2) 0.98(2) 0.98(2) 0.94(14) 0.96(10) 0.98(2) 0.96(10) 0.92(17) 0.94(14) 0.98(2) 0.96(10)
krkopt 0.91(1) 0.19(11) -(23) 0.2(9) 0.61(6) 0.2(10) 0.15(15) 0.18(12) 0.7(4) 0.76(3) 0.54(8) 0.83(2)
letter-recognition 0.97(2) 0.65(11) -(23) 0.66(10) 0.91(7) 0.65(13) 0.65(13) 0.65(12) 0.94(4) 0.93(5) 0.92(6) 0.87(8)
magic04 0.87(2) 0.85(5) 0.87(1) 0.85(6) 0.85(3) 0.84(9) 0.83(10) 0.84(8) 0.84(7) 0.85(4) 0.77(13) 0.79(12)
ml-prove 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(13) 1.0(1) 1.0(17) 0.97(20) 1.0(13) 1.0(1)
musk1 -(23) 0.88(5) 0.87(10) 0.87(11) 0.91(1) 0.89(4) 0.86(13) 0.9(3) 0.91(2) 0.87(9) 0.81(14) 0.79(16)
musk2 0.98(1) 0.94(6) 0.98(2) 0.94(7) 0.93(10) 0.94(9) 0.95(3) 0.92(13) 0.95(4) 0.94(8) 0.95(5) 0.92(12)
optdigits 0.98(1) 0.95(7) 0.98(2) 0.96(6) 0.96(4) 0.92(15) 0.94(11) 0.95(8) 0.95(9) 0.94(12) 0.96(5) 0.86(19)
page-blocks 0.85(2) 0.83(4) 0.89(1) 0.76(12) 0.82(5) 0.76(11) 0.77(10) 0.78(9) 0.82(6) 0.79(7) 0.65(15) 0.85(3)
parkinsons 0.81(5) 0.75(9) 0.82(3) 0.73(11) 0.85(1) 0.75(8) 0.67(15) 0.67(12) 0.84(2) 0.81(5) 0.43(18) 0.74(10)
pendigits 0.98(1) 0.94(8) 0.98(3) 0.94(9) 0.96(6) 0.93(11) 0.94(10) 0.92(13) 0.96(5) 0.96(7) 0.97(4) 0.92(12)
segmentation 0.95(1) 0.91(7) 0.94(2) 0.91(8) 0.94(4) 0.91(6) 0.9(11) 0.89(12) 0.93(5) 0.94(3) 0.68(16) 0.91(9)
sensorless 1.0(1) 0.96(8) 1.0(3) 0.95(10) 0.99(5) 0.95(9) 0.96(7) 0.92(12) 1.0(2) 1.0(4) 0.95(11) 0.98(6)
tae 0.51(5) 0.36(14) 0.45(6) 0.32(16) 0.52(4) 0.37(12) 0.44(7) 0.3(18) 0.54(3) 0.6(2) 0.4(8) 0.63(1)
wine 1.0(1) 0.98(4) 1.0(1) 0.98(4) 0.98(4) 0.98(4) 0.98(4) 0.98(4) 0.98(4) 1.0(1) 0.14(23) 0.94(14)
yeast 0.53(5) 0.47(6) 0.55(2) 0.45(9) 0.59(1) 0.46(7) 0.46(8) 0.45(10) 0.54(3) 0.44(12) 0.05(21) 0.45(11)
Average rank 4.8 5.3 5.9 6.4 6.9 7.1 8.0 8.2 8.5 9.2 10.5 10.8

Figure 7: Performance of Selection Heuristics and state-of-the-art classifiers by dataset based on macro-F1. The classifiers keep their position in all the images; closer classifiers perform similarly. The macro-F1 value is represented by color, where dark red represents 1.0 and dark blue represents 0.0. The color scale is shown on the right. The systems are: tpot, autosklearn, Selection Heuristics (agr-rnd, rnd-rnd, fit-fit, nvs-rnd, and ads-rnd), Perceptron (PER), MLPClassifier (MLP), BernoulliNB (NBB), GaussianNB (NB), KNeighborsClassifier (KN), NearestCentroid (NC), LogisticRegression (LR), LinearSVC (LSVC), SVC, SGDClassifier (SDG), PassiveAggressiveClassifier (PA), DecisionTreeClassifier (DT), ExtraTreesClassifier (ET), RandomForestClassifier (RF), AdaBoostClassifier (AB), and GradientBoostingClassifier (GB).

Table 3 and Figure 6 show the comparison of classifiers based on macro-F1 ranks. The best classifier, based on the results of these experiments, is TPOT, followed by our GP system using agreement and random negative selection (agr-rnd), autosklearn, and GP with random selection (rnd-rnd) in both tournaments (positive and negative). It can be seen that the use of our proposed selection heuristic based on accuracy with random negative selection improves on traditional selection (fit-fit) and places the system in second place. The agreement selection heuristic with random negative selection produces a system that outperforms the scikit-learn classifiers and is competitive with the auto-machine-learning libraries, i.e., TPOT and autosklearn. Additionally, Table 3 shows that autosklearn and TPOT cannot solve some classification problems.

Friedman's test rejects the null hypothesis that all classifiers perform similarly, with a p-value of 2.3e-54. The results of the Nemenyi test (following the steps described in Demšar (2006)) can be observed in Figure 8.
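The rank-based procedure above can be sketched in a few lines. This is a minimal illustration following the formulation in Demšar (2006); the score matrix below is a random stand-in, not the paper's macro-F1 results.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in: macro-F1 of 5 classifiers (columns) on 30 datasets (rows).
scores = rng.uniform(0.4, 1.0, size=(30, 5))

# Rank the classifiers on each dataset (rank 1 = best macro-F1).
order = (-scores).argsort(axis=1)
ranks = np.empty_like(order)
ranks[np.arange(30)[:, None], order] = np.arange(1, 6)

n, k = scores.shape
avg_ranks = ranks.mean(axis=0)

# Friedman statistic, chi-squared distributed with k - 1 degrees of freedom
# under the null hypothesis that all classifiers perform similarly.
chi2 = 12.0 * n / (k * (k + 1)) * (np.sum(avg_ranks ** 2) - k * (k + 1) ** 2 / 4)
print(avg_ranks, chi2)
```

The average ranks computed this way correspond to the "Average rank" rows reported in the tables; the Nemenyi post-hoc test then compares pairs of classifiers through differences of these average ranks.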
There were no statistical differences between TPOT, our proposed heuristics (agr-rnd, rnd-rnd, ads-rnd, nvs-rnd), autosklearn, GradientBoosting, ExtraTrees, RandomForest, and DecisionTree. It can also be observed that TPOT and agr-rnd belong to different groups (i.e., they are statistically different) than LogisticRegression, KNeighbors, AdaBoost, SVC, LinearSVC, Naive Bayes (Gaussian and Bernoulli), NearestCentroid, PassiveAggressive, Perceptron, and StochasticGradientDescent.

Figure 8: Comparison of all classifiers against each other using the macro-F1 average ranks with the Nemenyi test. Groups of classifiers that are not significantly different (at p = 0.10) are connected.

In order to analyze the systems' performance, the differences in behavior, and the hardness of the problems, we depict this information using our visualization technique proposed in Sánchez et al. (2019). The idea is to represent each classifier with a point in a plane. Each system can be seen as a vector where each dimension represents a problem and the value is the system's performance (macro-F1) on that problem; using this representation, the idea is to depict this vector in a plane where the distances are preserved. Figure 7 shows the classifiers' visualization; the color represents the macro-F1, and all the systems keep the same position in all boxes. Each small box represents a problem, and the one in the center is the average over all problems. The figure helps to identify those problems where all the systems behave similarly. For example, it can be observed that all the systems find the banknote and ml-prove problems easy. On the other hand, all the systems find the drug-consumption problem hard. From the figure, we can observe that the selection heuristics are close to each other, the classifiers based on decision trees are close to the heuristics, and at opposite extremes are TPOT and autosklearn.
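The distance-preserving projection just described can be approximated with classical multidimensional scaling; we use it here only as a stand-in for the technique of Sánchez et al. (2019), and the performance matrix is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in: macro-F1 of 22 systems (rows) on 30 problems (columns).
perf = rng.uniform(0.3, 1.0, size=(22, 30))

# Pairwise squared Euclidean distances between the systems' performance vectors.
sq = ((perf[:, None, :] - perf[None, :, :]) ** 2).sum(axis=-1)

# Classical MDS: double-center the squared distances and keep the two leading
# eigenvectors, scaled by the square roots of their eigenvalues.
m = sq.shape[0]
J = np.eye(m) - np.ones((m, m)) / m
B = -0.5 * J @ sq @ J
vals, vecs = np.linalg.eigh(B)
top = np.argsort(vals)[::-1][:2]
coords = vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))
# Each row of `coords` is now a point in the plane, one per system.
```

Plotting `coords` with a color per system reproduces the kind of map shown in Figure 7: systems whose performance vectors are similar land close together.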
Let us draw a line from the left upper corner to the right upper corner; the systems to the right of the line (including autosklearn) are in the top group shown in Figure 8; the only system missing is MLP, which is on the left.

The classifiers' comparison based on the time spent learning the model is presented in Figure 9. Tables 8, 9, and 10 show the detailed results. It can be seen that the scikit-learn classifiers have the best ranks; these spend from 0.007 to 0.01 seconds per sample. With the different selection schemes, our GP system spends more time than the scikit-learn classifiers in the learning phase: on average, from 0.5 to 5 seconds per sample. However, it is considerably faster than the auto-machine-learning libraries, autosklearn and TPOT, which consume on average 11.5 and 57.68 seconds, respectively. Friedman's test rejects the null hypothesis that all classifiers spend the same time, with a p-value of 7.186e-94. The results of the Nemenyi test (following Demšar (2006)) can be observed in Figure 10. The figure shows a group formed by TPOT, the selection heuristics (except rnd-rnd), and autosklearn. The only selection heuristic that is statistically different from TPOT is rnd-rnd.

Figure 9: Comparison of selection heuristics against state-of-the-art classifiers based on the time required by the classifiers' training phase. The time is presented in seconds and is the average time per sample. The classifiers are sorted by average time, and those values are on the left. The blue boxplots represent the selection heuristics. The time, represented on the x-axis, grows exponentially.

6 Discussion

In this contribution, we have analyzed different selection schemes for GP. The system used to perform the analysis is EvoDAG.
It is essential to mention that the significant difference between EvoDAG and other GP systems is that each node contains constant(s) optimized using OLS to minimize the error. This characteristic is also present in TPOT and the Novelty Search Classifier (Naredo et al., 2016), albeit with a different optimizer in those research works. The results obtained are in line with the results presented on Novelty Search (Lehman and Stanley, 2011; Naredo et al., 2016), where the idea is to abandon the fitness function, although the work presented by Naredo et al. (2016) and ours optimize constants in the evolutionary process. For those systems that do not optimize constants during evolution, we believe the random selection schemes would not be as competitive as they are in the current scenario.

Figure 10: Comparison of all classifiers against each other using the time-per-sample average ranks with the Nemenyi test. Groups of classifiers that are not significantly different (at p = 0.10) are connected.

It is pertinent to mention the characteristics of the problems used as benchmarks, particularly the number of variables. It can be observed in Table 1 that the maximum number of variables is 1557, which corresponds to the "ad" dataset; this dataset is also easy for the majority of classifiers (see Figure 7), and the problem with the second most features is "aps-failure" with 170. Consequently, the analysis performed can guide the selection of a classifier when the number of features is around 100, and it might not be of help regarding high-dimensional problems. As a side note, in preliminary experiments while doing the research described in Graff et al. (2020), we compared EvoDAG on text classification problems with a representation that contains more than 10 thousand features, and the result is that EvoDAG is not competitive against LinearSVC, neither in time nor in performance.
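The per-node OLS step mentioned above can be illustrated as follows. This is a sketch on made-up data with our own naming, not EvoDAG's actual code: a node's constants are the least-squares coefficients that combine its children's semantics to approximate the target.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=100)          # training target (illustrative)
S = rng.normal(size=(100, 3))     # semantics of three child subtrees (columns)

# OLS: theta minimizes ||y - S @ theta||^2; these are the node's constants.
theta, *_ = np.linalg.lstsq(S, y, rcond=None)
node_semantics = S @ theta        # the node's output on the training set

# Sanity check of the OLS property: the residual is orthogonal to the inputs.
residual = y - node_semantics
print(residual @ S)
```

The orthogonality of the residual to the child semantics is what motivates pairing parents with near-orthogonal or uncorrelated semantics: a child that is a near-linear combination of its siblings contributes little to the fit.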
Figure 7 allows us to identify a limitation of our approach; let us look at the problems "krkopt" and "letter-recognition": these problems are easily solved by TPOT and hard for GP using our selection heuristics. Although there is not enough evidence to draw firm conclusions, it is observed that these problems are easily solved by classifiers based on Decision Trees (TPOT considers Decision Trees among its base learners) and contain the maximum numbers of classes, 18 and 26. As can be seen, Decision Trees are utterly different from the trees evolved by GP. Perhaps the most distinctive characteristic is that the computation flow is the complement of GP trees; that is, the starting point is the root, and the output is a leaf.

One GP characteristic that has captured researchers' attention is the ability to create white-box models; in this contribution, we have not adequately analyzed whether the models evolved by our GP system, using any selection heuristic, are easy or difficult to understand. Nonetheless, we have observed some of the models evolved in a few of the problems used as benchmarks. Our general impression is that the models evolved are complex, containing at least ten inner nodes. For example, Figure 1 presents a model for the Iris problem; clearly, this problem could have been solved with a more straightforward tree obtaining similar performance. However, the system used did not promote the development of simple models, and we will address this issue in future work.

7 Conclusion

In this research, we proposed three selection heuristics for parent selection in GP that use individuals' semantics and were inspired by functions' properties. These are described as follows. First, tournament selection based on cosine similarity (sim) aims to promote the selection of parents whose semantics vectors are ideally orthogonal.
Tournament selection based on Pearson's correlation coefficient (prs) aims to promote the selection of parents whose semantics vectors are uncorrelated. Finally, tournament selection based on agreement (agr) tries to select parents whose predictions differ, based on their predicted labels. These heuristics were inspired by the properties of the addition function and the classifiers Naive Bayes and Nearest Centroid. To the best of our knowledge, this is the first time in Genetic Programming that functions' properties are taken into account to design methodologies for parent selection.

We compared our proposed heuristics against the classical parent selection technique, traditional tournament selection, and random parent selection. We also tested two state-of-the-art selection schemes, Novelty Search (nvs) and Angle-Driven Selection (ads). For negative selection, we tested the use of standard negative tournaments and random selection. Furthermore, our selection heuristics were compared against 18 state-of-the-art classifiers, 16 of them from the scikit-learn Python library, and two auto-machine-learning algorithms. The performance was analyzed on thirty classification problems taken from the UCI repository. The datasets were heterogeneous in terms of the number of samples and variables, and some of them are balanced while others are imbalanced.

The results indicate that the selection heuristic using agreement combined with random negative selection (agr-rnd) is statistically better than the traditional selection that uses fitness (i.e., the system identified as fit-fit). On the other hand, in the comparison of the selection heuristics against different classifiers, it is observed that agr-rnd is a competitive classifier obtaining the second-best rank; additionally, the difference in performance with TPOT, which obtained the best rank, is not statistically significant.
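The three tournament heuristics summarized above can be sketched compactly. All naming and simplifications below are ours, not EvoDAG's API: the tournament winner is the candidate least similar to the already-chosen parent under the measure associated with the function that will combine them.

```python
import numpy as np

def cosine(a, b):
    # sim: a small |cos| means the semantics are close to orthogonal.
    return abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pearson(a, b):
    # prs: a small |r| means the semantics are close to uncorrelated.
    return abs(np.corrcoef(a, b)[0, 1])

def agreement(a, b):
    # agr: fraction of samples on which the predicted labels coincide.
    return float(np.mean(np.sign(a) == np.sign(b)))

def tournament(first, population, similarity, rng, size=2):
    """Return the index of the tournament candidate least similar to `first`."""
    idx = rng.choice(len(population), size=size, replace=False)
    return int(idx[np.argmin([similarity(first, population[i]) for i in idx])])

rng = np.random.default_rng(3)
semantics = rng.normal(size=(8, 50))   # 8 individuals, 50 training samples each
second_parent = tournament(semantics[0], semantics, agreement, rng)
```

Swapping `agreement` for `cosine` or `pearson` yields the sim and prs variants; in the paper, the measure is applied only when the function creating the offspring matches the heuristic (e.g., agreement for Naive Bayes and Nearest Centroid).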
Furthermore, it is observed that the selection heuristic identified as agr-rnd is grouped with the classifiers based on ensembles and the auto-machine-learning algorithms; the group also includes the Multilayer Perceptron and Decision Trees.

Finally, we have only tested our GP systems on classification problems and left aside regression problems. As can be observed, two of the selection heuristics developed, namely cosine similarity and Pearson's correlation, can be used without any modification on regression problems. On the other hand, the agreement heuristic is only defined for classification problems. We have performed some preliminary runs on regression problems. The results indicate that the selection heuristics are competitive; however, we do not have enough evidence on whether these heuristics are different from traditional selection schemes or random selection on regression. We will deal with the comparison on regression problems in future work.

References

Badran, K. and Rockett, P. (2012). Multi-class pattern classification using single, multi-dimensional feature-space feature extraction evolved by multi-objective genetic programming and its application to network intrusion detection. Genetic Programming and Evolvable Machines, 13(1):33–63.

Beadle, L. and Johnson, C. G. (2008). Semantically driven crossover in genetic programming. In 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), pages 111–116. IEEE.

Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.

Brameier, M. and Banzhaf, W. (2001). A comparison of linear genetic programming and neural networks in medical data mining. IEEE Transactions on Evolutionary Computation, 5(1):17–26.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140.

Brereton, R. G. (2016).
Orthogonality, uncorrelatedness, and linear independence of vectors. Journal of Chemometrics, 30(10):564–566.

Castelli, M., Manzoni, L., Mariot, L., and Saletta, M. (2019). Extending local search in geometric semantic genetic programming. In EPIA Conference on Artificial Intelligence, pages 775–787. Springer.

Castelli, M., Silva, S., and Vanneschi, L. (2015a). A C++ framework for geometric semantic genetic programming. Genetic Programming and Evolvable Machines, 16(1):73–81.

Castelli, M., Trujillo, L., Vanneschi, L., Silva, S., Z-Flores, E., and Legrand, P. (2015b). Geometric semantic genetic programming with local search. In Proceedings of the 2015 Genetic and Evolutionary Computation Conference - GECCO '15, pages 999–1006, New York, New York, USA. ACM Press.

Chawla, N. V., Japkowicz, N., and Kotcz, A. (2004). Editorial: Special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6(1):1.

Chen, Q., Xue, B., and Zhang, M. (2019). Improving generalization of genetic programming for symbolic regression with angle-driven geometric semantic operators. IEEE Transactions on Evolutionary Computation, 23(3):488–502.

Chu, T. H., Nguyen, Q. U., and O'Neill, M. (2016). Tournament selection based on statistical test in genetic programming. In International Conference on Parallel Problem Solving from Nature, pages 303–312. Springer.

Chu, T. H., Nguyen, Q. U., and O'Neill, M. (2018). Semantic tournament selection for genetic programming based on statistical analysis of error vectors. Information Sciences, 436-437:352–366.

Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Technical report.

Dua, D. and Graff, C. (2017). UCI Machine Learning Repository.

Espejo, P. G., Ventura, S., and Herrera, F. (2010). A survey on the application of genetic programming to classification.
IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 40(2):121–144.

Fang, Y. and Li, J. (2010). A review of tournament selection in genetic programming. In International Symposium on Intelligence Computation and Applications ISICA 2010, pages 181–192. Springer, Berlin, Heidelberg.

Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. (2015). Efficient and robust automated machine learning.

Folino, G., Pizzuti, C., and Spezzano, G. (2008). Training distributed GP ensemble with a selective algorithm based on clustering and pruning for pattern classification. IEEE Transactions on Evolutionary Computation, 12(4):458–468.

Friedberg, R. M. (1958). A learning machine: Part I. IBM Journal of Research and Development, 2(1):2–13.

Friedman, J. H. and Hall, P. (2007). On bagging and nonlinear estimation. Journal of Statistical Planning and Inference, 137(3):669–683.

Galvan-Lopez, E., Cody-Kenny, B., Trujillo, L., and Kattan, A. (2013). Using semantics in the selection mechanism in genetic programming: A simple method for promoting semantic diversity. In 2013 IEEE Congress on Evolutionary Computation, pages 2972–2979. IEEE.

Graff, M., Flores, J. J., and Ortiz, J. (2014a). Genetic programming: Semantic point mutation operator based on the partial derivative error. In 2014 IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC), pages 1–6. IEEE.

Graff, M., Graff-Guerrero, A., and Cerda-Jacobo, J. (2014b). Semantic crossover based on the partial derivative error. In European Conference on Genetic Programming, pages 37–47. Springer.

Graff, M., Miranda-Jiménez, S., Tellez, E. S., and Moctezuma, D. (2020). EvoMSA: A multilingual evolutionary approach for sentiment analysis. Computational Intelligence Magazine, 15:76–88.

Graff, M., Tellez, E. S., Escalante, H. J., and Miranda-Jiménez, S. (2017).
Semantic genetic programming for sentiment analysis. In NEO 2015, pages 43–65. Springer.

Graff, M., Tellez, E. S., Escalante, H. J., and Ortiz-Bejar, J. (2015a). Memetic genetic programming based on orthogonal projections in the phenotype space. In 2015 IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC), pages 1–6. IEEE.

Graff, M., Tellez, E. S., Miranda-Jimenez, S., and Escalante, H. J. (2016). EvoDAG: A semantic genetic programming Python library. In 2016 IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC), pages 1–6. IEEE.

Graff, M., Tellez, E. S., Villaseñor, E., and Miranda-Jiménez, S. (2015b). Semantic genetic programming operators based on projections in the phenotype space. In Research in Computing Science, pages 73–85.

Guo, H., Jack, L. B., and Nandi, A. K. (2005). Feature generation using genetic programming with application to fault classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 35(1):89–99.

Hara, A., Kushida, J.-i., and Takahama, T. (2016). Deterministic geometric semantic genetic programming with optimal mate selection. In 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 003387–003392. IEEE.

Hara, A., Ueno, Y., and Takahama, T. (2012). New crossover operator based on semantic distance between subtrees in genetic programming. In 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 721–726. IEEE.

Ingalalli, V., Silva, S., Castelli, M., and Vanneschi, L. (2014). A multi-dimensional genetic programming approach for multi-class classification problems. In European Conference on Genetic Programming, pages 48–60. Springer.

Iqbal, M., Xue, B., Al-Sahaf, H., and Zhang, M. (2017). Cross-domain reuse of extracted knowledge in genetic programming for image classification. IEEE Transactions on Evolutionary Computation, 21(4):569–587.
Koza, J. R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press.

Krawiec, K. (2016). Semantic genetic programming. In Behavioral Program Synthesis with Genetic Programming, pages 55–66. Springer, Cham.

Krawiec, K. and Lichocki, P. (2009). Approximating geometric crossover in semantic space. In Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation - GECCO '09, page 987, New York, New York, USA. ACM Press.

Krawiec, K. and Pawlak, T. (2012). Locally geometric semantic crossover. In Proceedings of the Fourteenth International Conference on Genetic and Evolutionary Computation Conference Companion - GECCO Companion '12, page 1487, New York, New York, USA. ACM Press.

Krawiec, K. and Pawlak, T. (2013). Locally geometric semantic crossover: a study on the roles of semantics and homology in recombination operators. Genetic Programming and Evolvable Machines, 14(1):31–63.

La Cava, W., Silva, S., Danai, K., Spector, L., Vanneschi, L., and Moore, J. H. (2019). Multidimensional genetic programming for multiclass classification. Swarm and Evolutionary Computation, 44:260–272.

Lehman, J. and Stanley, K. O. (2011). Abandoning objectives: Evolution through the search for novelty alone. Evolutionary Computation, 19(2):189–223.

Lichodzijewski, P. and Heywood, M. I. (2008). Managing team-based problem solving with symbiotic bid-based genetic programming. In Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation, pages 363–370.

Loveard, T. and Ciesielski, V. (2001). Representing classification problems in genetic programming. In Proceedings of the 2001 Congress on Evolutionary Computation (IEEE Cat. No.01TH8546), volume 2, pages 1070–1077. IEEE.

McIntyre, A. R. and Heywood, M. I. (2011). Classification as clustering: A Pareto cooperative-competitive GP approach.
Evolutionary Computation, 19(1):137–166.

Moraglio, A., Krawiec, K., and Johnson, C. G. (2012). Geometric semantic genetic programming. In International Conference on Parallel Problem Solving from Nature, pages 21–31. Springer.

Moraglio, A. and Poli, R. (2004). Topological interpretation of crossover. In Genetic and Evolutionary Computation Conference, pages 1377–1388. Springer.

Muni, D. P., Pal, N. R., and Das, J. (2004). A novel approach to design classifiers using genetic programming. IEEE Transactions on Evolutionary Computation, 8(2):183–196.

Munoz, L., Silva, S., and Trujillo, L. (2015). M3GP - multiclass classification with GP. In European Conference on Genetic Programming, pages 78–91. Springer.

Naredo, E., Trujillo, L., Legrand, P., Silva, S., and Muñoz, L. (2016). Evolving genetic programming classifiers with novelty search. Information Sciences, 369:347–367.

Nguyen, Q. U., Nguyen, X. H., O'Neill, M., and Agapitos, A. (2012). An investigation of fitness sharing with semantic and syntactic distance metrics. In European Conference on Genetic Programming, pages 109–120. Springer.

Nguyen, Q. U., Pham, T. A., Nguyen, X. H., and McDermott, J. (2016). Subtree semantic geometric crossover for genetic programming. Genetic Programming and Evolvable Machines, 17(1):25–53.

Olson, R. S., Urbanowicz, R. J., Andrews, P. C., Lavender, N. A., Moore, J. H., et al. (2016). Automating biomedical data science through tree-based pipeline optimization. In European Conference on the Applications of Evolutionary Computation, pages 123–137. Springer.

Pawlak, T. P., Wieloch, B., and Krawiec, K. (2015). Semantic backpropagation for designing search operators in genetic programming. IEEE Transactions on Evolutionary Computation, 19(3):326–340.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830.

Poli, R., Langdon, W. B., and McPhee, N. F. (2008). A Field Guide to Genetic Programming. Published via lulu.com and freely available at www.gp-field-guide.org.uk.

Ruberto, S., Vanneschi, L., Castelli, M., and Silva, S. (2014). ESAGP - a semantic GP framework based on alignment in the error space. In European Conference on Genetic Programming, pages 150–161. Springer.

Sánchez, C. N., Domínguez-Soberanes, J., Escalona-Buendía, H. B., Graff, M., Gutiérrez, S., and Sánchez, G. (2019). Liking product landscape: going deeper into understanding consumers' hedonic evaluations. Foods, 8(10):461.

Smart, W. and Zhang, M. (2004). Continuously evolving programs in genetic programming using gradient descent. In Proceedings of the 7th Asia-Pacific Conference on Complex Systems.

Suárez, R. R., Graff, M., and Flores, J. J. (2015). Semantic crossover operator for GP based on the second partial derivative of the error function. Research in Computing Science, 94:87–96.

Uy, N. Q., Hoai, N. X., O'Neill, M., McKay, R. I., and Galván-López, E. (2011). Semantically-based crossover in genetic programming: application to real-valued symbolic regression. Genetic Programming and Evolvable Machines, 12(2):91–119.

Vanneschi, L. (2017). An introduction to geometric semantic genetic programming. In Oliver Schütze, Leonardo Trujillo, Pierrick Legrand, and Yazmin Maldonado, editors, NEO 2015, pages 3–42. Springer, Cham.

Vanneschi, L., Castelli, M., Manzoni, L., and Silva, S. (2013).
A new implementation of geometric semantic GP and its application to problems in pharmacokinetics. In European Conference on Genetic Programming, pages 205–216. Springer.

Vanneschi, L., Castelli, M., Scott, K., and Trujillo, L. (2019). Alignment-based genetic programming for real life applications. Swarm and Evolutionary Computation, 44:840–851.

Vanneschi, L., Castelli, M., and Silva, S. (2014). A survey of semantic methods in genetic programming. Genetic Programming and Evolvable Machines, 15(2):195–214.

Zhang, M. and Smart, W. (2006). Using Gaussian distribution to construct fitness functions in genetic programming for multiclass object classification. Pattern Recognition Letters, 27(11):1266–1274.

8 Appendix A

This appendix contains all the detailed results.

Table 4: Selection schemes comparison based on macro-F1; the ranks are in parentheses. Macro-F1 values were measured over the test datasets.

acc-rnd agr-fit rnd-rnd sim-fit prs-fit* prs-rnd prs-rnd* prs-fit sim-fit* sim-rnd sim-rnd* fit-fit agr-rnd* agr-fit*
ad 0.94(1) 0.93(3) 0.93(8) 0.93(6) 0.93(10) 0.93(6) 0.92(12) 0.93(9) 0.93(3) 0.93(2) 0.92(11) 0.93(3) 0.55(13) 0.55(14)
adult 0.79(2) 0.79(5) 0.79(3) 0.79(11) 0.79(12) 0.79(6) 0.79(4) 0.79(1) 0.79(7) 0.79(8) 0.79(9) 0.79(10) 0.69(13) 0.69(13)
agaricus-lepiota 0.68(5) 0.68(3) 0.68(11) 0.68(4) 0.68(6) 0.68(8) 0.68(9) 0.68(1) 0.68(2) 0.68(10) 0.68(7) 0.68(12) 0.04(13) 0.04(13)
aps-failure 0.83(11) 0.84(4) 0.86(1) 0.84(6) 0.84(7) 0.83(10) 0.85(2) 0.84(9) 0.84(5) 0.83(12) 0.84(8) 0.85(3) 0.74(13) 0.74(13)
banknote 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 0.82(13) 0.82(13)
bank 0.76(5) 0.76(8) 0.76(1) 0.76(2) 0.76(10) 0.76(4) 0.76(9) 0.76(3) 0.76(11) 0.76(7) 0.76(6) 0.76(12) 0.71(13) 0.71(13)
biodeg 0.84(5) 0.84(7) 0.85(1) 0.84(3) 0.83(10) 0.82(12) 0.83(8) 0.85(2) 0.84(3) 0.84(6) 0.83(8) 0.83(11) 0.63(13) 0.63(13)
car 0.87(5) 0.91(1) 0.86(6) 0.83(12) 0.87(3) 0.85(8) 0.86(7) 0.87(4) 0.89(2) 0.83(11) 0.84(10) 0.84(9) 0.29(13) 0.29(13)
census-income 0.77(7) 0.51(12) 0.77(1) 0.77(3) 0.77(2) 0.77(4) 0.77(9) 0.51(11) 0.77(5) 0.77(8) 0.77(6) 0.75(10) 0.42(13) 0.42(13)
cmc 0.54(9) 0.55(3) 0.53(12) 0.54(8) 0.54(7) 0.55(2) 0.55(1) 0.55(4) 0.55(5) 0.53(10) 0.54(6) 0.53(11) 0.47(13) 0.47(13)
dota2 0.59(3) 0.59(7) 0.59(6) 0.6(1) 0.59(9) 0.59(12) 0.59(10) 0.59(11) 0.59(8) 0.59(4) 0.59(5) 0.59(2) 0.47(14) 0.48(13)
drug-consumption 0.23(2) 0.2(10) 0.2(13) 0.22(3) 0.23(1) 0.21(8) 0.21(5) 0.21(9) 0.21(6) 0.22(4) 0.21(7) 0.2(14) 0.2(11) 0.2(11)
fertility 0.45(1) 0.45(1) 0.45(1) 0.45(1) 0.45(1) 0.45(1) 0.45(1) 0.44(11) 0.45(1) 0.45(1) 0.45(1) 0.44(11) 0.4(13) 0.4(13)
IndianLiverPatient 0.66(4) 0.71(1) 0.69(2) 0.65(8) 0.65(6) 0.64(12) 0.67(3) 0.65(6) 0.66(5) 0.64(10) 0.65(9) 0.64(11) 0.63(13) 0.63(13)
iris 0.98(1) 0.98(1) 0.98(1) 0.98(1) 0.98(1) 0.98(1) 0.98(1) 0.96(10) 0.96(10) 0.98(1) 0.96(10) 0.98(1) 0.96(10) 0.96(10)
krkopt 0.19(2) 0.18(4) 0.2(1) 0.14(11) 0.16(7) 0.16(6) 0.16(5) 0.18(3) 0.14(9) 0.14(12) 0.14(9) 0.15(8) 0.12(13) 0.12(13)
letter-recognition 0.65(2) 0.65(4) 0.66(1) 0.65(5) 0.65(5) 0.65(5) 0.65(5) 0.65(3) 0.65(5) 0.65(5) 0.65(5) 0.65(5) 0.65(5) 0.65(5)
magic04 0.85(4) 0.85(1) 0.85(7) 0.85(5) 0.84(9) 0.85(3) 0.84(8) 0.85(2) 0.84(11) 0.85(6) 0.84(10) 0.83(12) 0.65(13) 0.65(13)
ml-prove 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(12) 0.7(13) 0.7(13)
musk1 0.88(5) 0.86(12) 0.87(9) 0.88(6) 0.89(3) 0.88(4) 0.88(7) 0.86(11) 0.87(8) 0.89(2) 0.9(1) 0.86(10) 0.37(13) 0.37(13)
musk2 0.94(4) 0.91(12) 0.94(8) 0.94(6) 0.94(7) 0.95(3) 0.94(10) 0.94(5) 0.94(9) 0.95(2) 0.94(11) 0.95(1) 0.46(13) 0.46(13)
optdigits 0.95(3) 0.95(6) 0.96(2) 0.95(8) 0.95(4) 0.95(7) 0.95(5) 0.96(1) 0.94(12) 0.95(9) 0.94(11) 0.94(10) 0.73(13) 0.73(13)
page-blocks 0.83(1) 0.79(6) 0.76(12) 0.77(9) 0.8(4) 0.8(3) 0.81(2) 0.79(5) 0.76(11) 0.77(7) 0.77(10) 0.77(8) 0.76(13) 0.76(13)
parkinsons 0.75(2) 0.75(1) 0.73(6) 0.73(3) 0.73(3) 0.72(7) 0.73(3) 0.65(14) 0.7(10) 0.72(7) 0.72(7) 0.67(13) 0.67(11) 0.67(11)
pendigits 0.94(3) 0.95(2) 0.94(4) 0.93(9) 0.94(6) 0.95(1) 0.94(7) 0.94(5) 0.92(12) 0.93(10) 0.92(11) 0.94(8) 0.8(13) 0.8(13)
segmentation 0.91(4) 0.9(6) 0.91(5) 0.9(10) 0.92(1) 0.9(8) 0.91(2) 0.91(3) 0.89(12) 0.9(9) 0.89(11) 0.9(7) 0.82(13) 0.82(13)
sensorless 0.96(6) 0.97(1) 0.95(10) 0.95(9) 0.96(4) 0.96(5) 0.96(8) 0.96(3) 0.96(7) 0.95(12) 0.95(11) 0.96(2) 0.8(13) 0.8(13)
tae 0.36(6) 0.32(12) 0.32(8) 0.42(3) 0.3(14) 0.32(7) 0.32(8) 0.42(4) 0.38(5) 0.3(13) 0.47(1) 0.44(2) 0.32(8) 0.32(8)
wine 0.98(2) 0.98(2) 0.98(2) 0.98(2) 0.96(13) 0.98(11) 0.96(13) 0.98(11) 0.98(2) 0.98(2) 1.0(1) 0.98(2) 0.98(2) 0.98(2)
yeast 0.47(1) 0.45(9) 0.45(8) 0.46(6) 0.46(3) 0.45(10) 0.45(11) 0.43(14) 0.46(2) 0.44(13) 0.44(12) 0.46(7) 0.46(4) 0.46(4)
Average rank 3.6 4.9 5.1 5.4 5.7 5.9 5.9 5.9 6.3 6.8 7.2 7.6 11.7 11.7

Table 5: Proposed heuristics against state-of-the-art selection schemes based on macro-F1; the ranks are in parentheses. The table continues in Table 6.
agr-rnd–rnd agr-fit–fit agr-fit–rnd rnd-rnd sim-fit–rnd agr-rnd–fit prs-fit–rnd prs-rnd–rnd ads-rnd–rnd sim-rnd–rnd ads-fit–fit
ad 0.94(5) 0.94(6) 0.93(9) 0.93(13) 0.93(11) 0.93(15) 0.93(14) 0.93(11) 0.96(1) 0.93(7) 0.93(8)
adult 0.79(2) 0.79(11) 0.79(4) 0.79(3) 0.79(8) 0.79(9) 0.79(1) 0.79(5) 0.79(13) 0.79(6) 0.79(15)
agaricus-lepiota 0.68(7) 0.69(2) 0.68(4) 0.68(11) 0.68(6) 0.69(1) 0.68(3) 0.68(9) 0.68(13) 0.68(10) 0.68(15)
aps-failure 0.83(10) 0.84(7) 0.84(3) 0.86(1) 0.84(4) 0.83(8) 0.84(6) 0.83(9) 0.84(5) 0.83(13) 0.82(14)
banknote 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1)
bank 0.76(8) 0.76(4) 0.76(10) 0.76(1) 0.76(5) 0.76(3) 0.76(6) 0.76(7) 0.76(12) 0.76(9) 0.75(14)
biodeg 0.84(5) 0.85(2) 0.84(10) 0.85(1) 0.84(4) 0.84(9) 0.85(2) 0.82(14) 0.82(15) 0.84(8) 0.83(13)
car 0.87(4) 0.86(5) 0.91(1) 0.86(7) 0.83(12) 0.87(2) 0.87(3) 0.85(8) 0.8(15) 0.83(11) 0.86(6)
census-income 0.77(4) 0.76(6) 0.51(12) 0.77(1) 0.77(2) 0.51(13) 0.51(11) 0.77(3) 0.75(8) 0.77(5) 0.5(14)
cmc 0.54(7) 0.53(11) 0.55(2) 0.53(15) 0.54(6) 0.53(12) 0.55(3) 0.55(1) 0.53(9) 0.53(10) 0.54(8)
dota2 0.59(3) 0.59(10) 0.59(6) 0.59(5) 0.6(1) 0.59(13) 0.59(7) 0.59(11) 0.59(9) 0.59(4) 0.59(14)
drug-consumption 0.23(1) 0.22(5) 0.2(12) 0.2(14) 0.22(2) 0.21(6) 0.21(9) 0.21(8) 0.21(10) 0.22(3) 0.2(13)
fertility 0.45(1) 0.45(1) 0.45(1) 0.45(1) 0.45(1) 0.45(1) 0.44(12) 0.45(1) 0.45(1) 0.45(1) 0.45(1)
IndianLiverPatient 0.66(7) 0.67(4) 0.71(1) 0.69(2) 0.65(9) 0.67(5) 0.65(8) 0.64(13) 0.66(6) 0.64(11) 0.68(3)
iris 0.98(1) 0.98(1) 0.98(1) 0.98(1) 0.98(1) 0.98(1) 0.96(13) 0.98(1) 0.96(13) 0.98(1) 0.98(1)
krkopt 0.19(5) 0.19(3) 0.18(8) 0.2(1) 0.14(13) 0.18(9) 0.18(6) 0.16(10) 0.2(2) 0.14(14) 0.19(4)
letter-recognition 0.65(4) 0.66(2) 0.65(6) 0.66(1) 0.65(11) 0.66(3) 0.65(5) 0.65(11) 0.65(11) 0.65(11) 0.65(9)
magic04 0.85(6) 0.85(1) 0.85(2) 0.85(9) 0.85(7) 0.85(4) 0.85(3) 0.85(5) 0.84(12) 0.85(8) 0.84(11)
ml-prove 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1) 1.0(1)
musk1 0.88(8) 0.88(5) 0.86(15) 0.87(12) 0.88(9) 0.88(10) 0.86(14) 0.88(7) 0.89(3) 0.89(2) 0.87(11)
musk2 0.94(4) 0.94(8) 0.91(15) 0.94(7) 0.94(6) 0.93(10) 0.94(5) 0.95(3) 0.94(9) 0.95(2) 0.93(11)
optdigits 0.95(3) 0.95(5) 0.95(7) 0.96(2) 0.95(9) 0.95(4) 0.96(1) 0.95(8) 0.92(12) 0.95(10) 0.92(13)
page-blocks 0.83(1) 0.79(6) 0.79(5) 0.76(13) 0.77(10) 0.81(2) 0.79(4) 0.8(3) 0.76(11) 0.77(8) 0.76(12)
parkinsons 0.75(6) 0.73(8) 0.75(4) 0.73(8) 0.73(7) 0.73(8) 0.65(15) 0.72(11) 0.75(4) 0.72(11) 0.76(2)
pendigits 0.94(5) 0.95(3) 0.95(2) 0.94(6) 0.93(10) 0.94(4) 0.94(7) 0.95(1) 0.93(13) 0.93(11) 0.94(8)
segmentation 0.91(3) 0.91(5) 0.9(8) 0.91(4) 0.9(13) 0.91(6) 0.91(2) 0.9(10) 0.91(1) 0.9(12) 0.91(7)
sensorless 0.96(7) 0.97(2) 0.97(1) 0.95(12) 0.95(10) 0.96(8) 0.96(4) 0.96(6) 0.95(11) 0.95(14) 0.96(5)
tae 0.36(8) 0.36(7) 0.32(13) 0.32(12) 0.42(3) 0.34(9) 0.42(4) 0.32(10) 0.37(5) 0.3(15) 0.44(1)
wine 0.98(1) 0.98(1) 0.98(1) 0.98(1) 0.98(1) 0.98(1) 0.98(12) 0.98(12) 0.98(1) 0.98(1) 0.98(1)
yeast 0.47(1) 0.45(8) 0.45(6) 0.45(5) 0.46(3) 0.44(12) 0.43(15) 0.45(7) 0.46(2) 0.44(11) 0.45(10)
Average rank 4.3 4.7 5.7 5.7 6.2 6.3 6.6 6.9 7.6 7.7 8.2

Table 6: Proposed heuristics against state-of-the-art selection schemes based on macro-F1; the ranks are in parentheses.
fit-fit nvs-rnd ads-rnd–rnd* ads-fit–fit*
ad 0.93(9) 0.94(4) 0.95(2) 0.94(3)
adult 0.79(7) 0.79(10) 0.79(14) 0.79(12)
agaricus-lepiota 0.68(14) 0.68(5) 0.68(8) 0.68(12)
aps-failure 0.85(2) 0.83(11) 0.83(12) 0.81(15)
banknote 1.0(1) 1.0(1) 1.0(1) 1.0(1)
bank 0.76(11) 0.76(2) 0.75(15) 0.75(13)
biodeg 0.83(11) 0.84(7) 0.83(12) 0.84(6)
car 0.84(9) 0.81(14) 0.83(10) 0.81(13)
census-income 0.75(7) 0.38(15) 0.74(9) 0.74(10)
cmc 0.53(14) 0.53(13) 0.54(4) 0.54(5)
dota2 0.59(2) 0.59(12) 0.59(8) 0.59(15)
drug-consumption 0.2(15) 0.2(11) 0.21(7) 0.22(4)
fertility 0.44(12) 0.45(1) 0.44(12) 0.44(12)
IndianLiverPatient 0.64(12) 0.65(9) 0.62(15) 0.63(14)
iris 0.98(1) 0.96(13) 0.98(1) 0.98(1)
krkopt 0.15(11) 0.18(7) 0.15(12) 0.13(15)
letter-recognition 0.65(11) 0.65(7) 0.65(8) 0.65(10)
magic04 0.83(15) 0.84(10) 0.84(13) 0.84(14)
ml-prove 1.0(14) 1.0(1) 1.0(1) 1.0(14)
musk1 0.86(13) 0.9(1) 0.88(5) 0.89(4)
musk2 0.95(1) 0.92(13) 0.91(14) 0.93(12)
optdigits 0.94(11) 0.95(6) 0.92(14) 0.92(15)
page-blocks 0.77(9) 0.78(7) 0.76(15) 0.76(14)
parkinsons 0.67(14) 0.67(13) 0.76(2) 0.78(1)
pendigits 0.94(9) 0.92(15) 0.93(14) 0.93(12)
segmentation 0.9(9) 0.89(15) 0.89(14) 0.9(11)
sensorless 0.96(3) 0.92(15) 0.95(13) 0.96(9)
tae 0.44(2) 0.3(14) 0.37(6) 0.32(11)
wine 0.98(1) 0.98(1) 0.96(14) 0.96(14)
yeast 0.46(4) 0.45(9) 0.44(13) 0.43(14)
Average rank 8.5 8.7 9.6 10.2

Table 7: Performance using macro-F1 (with ranks) of: LogisticRegression (LR), LinearSVC (LSVC), SVC, SGDClassifier (SGD), PassiveAggressiveClassifier (PA), DecisionTreeClassifier (DT), ExtraTreesClassifier (ET), RandomForestClassifier (RF), AdaBoostClassifier (AB) and GradientBoostingClassifier (GB). The beginning of the table appears in Table 3.
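The performance figure reported throughout Tables 5-7 is the macro-F1 score: the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its frequency, which matters on the imbalanced datasets in the benchmark. A minimal sketch, not code from the paper, equivalent to scikit-learn's `f1_score(y_true, y_pred, average="macro")`:

```python
# Illustrative sketch (not the paper's code): macro-F1 as the
# unweighted mean of per-class F1 scores.

def macro_f1(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy three-class example: per-class F1s are 0.5, 0.8 and 2/3.
print(round(macro_f1([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0]), 4))  # 0.6556
```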
LR KN AB SVC LSVC NB NC PA PER NBB SGD
ad 0.94(9) 0.9(16) 0.93(13) 0.79(18) 0.88(17) 0.7(20) 0.77(19) 0.5(21) 0.46(22) 0.92(15) 0.46(22)
adult 0.64(15) 0.63(16) 0.79(7) 0.44(23) 0.58(19) 0.64(14) 0.46(22) 0.61(17) 0.52(21) 0.68(13) 0.55(20)
agaricus-lepiota 0.62(10) 0.53(18) 0.2(23) 0.56(15) 0.64(9) 0.57(13) 0.46(19) 0.56(14) 0.66(8) 0.58(12) 0.55(17)
aps-failure 0.79(14) 0.78(16) 0.8(11) 0.53(21) 0.79(13) 0.63(20) 0.77(17) 0.78(15) 0.49(23) 0.65(19) 0.49(22)
banknote 0.99(14) 1.0(10) 1.0(1) 1.0(1) 0.99(16) 0.82(21) 0.69(23) 0.99(13) 0.98(18) 0.82(22) 0.97(20)
bank 0.63(16) 0.65(15) 0.7(10) 0.47(23) 0.51(22) 0.67(14) 0.54(20) 0.53(21) 0.59(17) 0.59(18) 0.55(19)
biodeg 0.84(3) 0.77(17) 0.78(15) 0.78(16) 0.79(14) 0.72(19) 0.62(21) 0.57(22) 0.72(20) 0.74(18) 0.55(23)
car 0.26(23) 0.74(12) 0.71(13) 0.69(14) 0.26(22) 0.32(20) 0.37(17) 0.41(16) 0.32(19) 0.34(18) 0.3(21)
census-income 0.68(11) 0.68(12) 0.73(8) 0.48(20) 0.56(18) 0.59(17) 0.63(14) 0.45(22) 0.66(13) 0.61(15) 0.61(16)
cmc 0.48(13) 0.49(12) 0.5(11) 0.54(3) 0.48(15) 0.46(17) 0.35(20) 0.17(23) 0.28(21) 0.44(19) 0.26(22)
dota2 0.59(6) 0.52(17) 0.58(11) 0.59(9) 0.35(21) 0.56(14) 0.5(19) 0.35(22) 0.41(20) 0.56(12) 0.34(23)
drug-consumption 0.16(15) 0.16(14) 0.13(23) 0.13(21) 0.14(17) 0.14(18) 0.19(10) 0.14(19) 0.2(9) 0.23(3) 0.13(22)
fertility 0.45(4) 0.45(4) 0.52(3) 0.45(4) 0.45(4) 0.41(21) 0.6(1) 0.38(23) 0.41(21) 0.44(16) 0.43(20)
IndianLiverPatient 0.5(18) 0.57(12) 0.59(7) 0.43(20) 0.41(21) 0.57(10) 0.56(13) 0.41(22) 0.47(19) 0.41(22) 0.58(9)
iris 0.88(19) 0.98(8) 0.94(14) 1.0(1) 0.9(18) 0.96(10) 0.98(2) 0.64(20) 0.53(22) 0.13(23) 0.56(21)
krkopt 0.18(13) 0.66(5) 0.1(18) 0.58(7) 0.16(14) 0.13(16) 0.12(17) 0.05(21) 0.08(19) 0.04(22) 0.08(20)
letter-recognition 0.71(9) 0.94(3) 0.19(21) 0.97(1) 0.6(16) 0.64(15) 0.59(17) 0.45(19) 0.36(20) 0.08(22) 0.45(18)
magic04 0.75(15) 0.76(14) 0.82(11) 0.41(22) 0.66(17) 0.65(18) 0.63(20) 0.48(21) 0.63(19) 0.39(23) 0.67(16)
ml-prove 1.0(1) 0.94(21) 1.0(1) 0.99(18) 1.0(1) 1.0(16) 0.73(23) 1.0(1) 0.99(19) 0.85(22) 1.0(15)
musk1 0.88(6) 0.88(6) 0.88(8) 0.37(22) 0.86(12) 0.77(18) 0.68(20) 0.81(14) 0.57(21) 0.72(19) 0.78(17)
musk2 0.91(14) 0.92(11) 0.9(15) 0.73(21) 0.9(16) 0.75(20) 0.61(23) 0.85(17) 0.83(18) 0.66(22) 0.76(19)
optdigits 0.95(10) 0.98(3) 0.53(23) 0.64(22) 0.93(13) 0.79(21) 0.89(18) 0.93(14) 0.91(17) 0.84(20) 0.92(16)
page-blocks 0.79(8) 0.71(13) 0.46(19) 0.35(21) 0.61(16) 0.65(14) 0.22(22) 0.5(18) 0.36(20) 0.19(23) 0.5(17)
parkinsons 0.76(7) 0.67(14) 0.82(4) 0.49(17) 0.3(22) 0.67(12) 0.6(16) 0.21(23) 0.41(21) 0.43(18) 0.43(18)
pendigits 0.89(14) 0.98(2) 0.55(22) 0.08(23) 0.81(18) 0.82(17) 0.77(19) 0.86(15) 0.84(16) 0.6(21) 0.77(20)
segmentation 0.9(10) 0.8(13) 0.33(23) 0.36(22) 0.42(20) 0.79(14) 0.69(15) 0.56(18) 0.57(17) 0.4(21) 0.45(19)
sensorless 0.5(15) 0.11(22) 0.33(17) 0.26(18) 0.61(14) 0.76(13) 0.07(23) 0.19(20) 0.2(19) 0.48(16) 0.18(21)
tae 0.34(15) 0.38(10) 0.37(13) 0.38(9) 0.3(19) 0.17(22) 0.31(17) 0.24(20) 0.38(11) 0.13(23) 0.2(21)
wine 0.98(12) 0.73(15) 0.98(4) 0.21(21) 0.71(17) 0.96(13) 0.72(16) 0.28(20) 0.44(19) 0.18(22) 0.46(18)
yeast 0.31(13) 0.28(15) 0.3(14) 0.26(16) 0.05(18) 0.54(4) 0.05(20) 0.05(19) 0.03(22) 0.12(17) 0.02(23)
Average rank 11.7 12.2 12.8 15.6 16.0 16.0 17.4 18.3 18.4 18.5 19.2

Table 8: Performance using time (with ranks) of: LogisticRegression (LR), LinearSVC (LSVC), SVC, SGDClassifier (SGD), PassiveAggressiveClassifier (PA), DecisionTreeClassifier (DT), ExtraTreesClassifier (ET), RandomForestClassifier (RF), AdaBoostClassifier (AB) and GradientBoostingClassifier (GB). Part 1.
NBB DT LR AB PA KN ET LSVC
ad 6.16e-03(6) 7.58e-03(10) 7.17e-03(8) 7.49e-03(9) 5.34e-03(3) 5.82e-03(5) 5.06e-03(2) 8.79e-03(14)
adult 1.24e-04(3) 1.33e-04(6) 1.33e-04(7) 1.58e-04(11) 1.25e-04(4) 1.41e-04(9) 1.58e-04(10) 2.46e-04(15)
agaricus-lepiota 5.03e-04(11) 2.75e-04(4) 5.45e-04(12) 4.77e-04(6) 4.92e-04(8) 2.84e-04(5) 2.72e-04(3) 6.6e-04(14)
aps-failure 2.19e-01(5) 2.19e-01(4) 2.24e-01(9) 2.22e-01(7) 2.34e-01(11) 2.33e-01(10) 2.39e-01(12) 2.21e-01(6)
banknote 5.53e-04(9) 4.86e-04(7) 2.39e-04(4) 4.16e-04(5) 7.93e-04(14) 1.79e-04(1) 4.63e-04(6) 6.67e-04(10)
bank 1.19e-04(6) 1.43e-04(9) 1.30e-04(8) 1.47e-04(10) 1.06e-04(2) 1.60e-04(13) 1.52e-04(12) 2.45e-04(15)
biodeg 2.66e-04(5) 1.24e-04(1) 7.29e-04(15) 7.27e-04(13) 2.49e-04(4) 6.82e-04(11) 2.70e-04(6) 3.34e-04(7)
car 7.45e-05(5) 6.55e-05(2) 7.88e-05(6) 1.75e-04(13) 6.96e-05(3) 7.08e-05(4) 8.94e-05(8) 2.82e-04(14)
census-income 3.72e-04(2) 3.74e-04(3) 5.14e-04(12) 5.11e-04(11) 4.14e-04(7) 1.39e-03(15) 3.69e-04(1) 7.54e-04(14)
cmc 1.86e-04(6) 1.27e-04(3) 8.77e-04(12) 7.63e-04(10) 1.36e-04(4) 1.05e-03(15) 1.57e-04(5) 3.50e-04(8)
dota2 4.43e-04(5) 5.23e-04(11) 4.28e-04(3) 4.95e-04(8) 4.21e-04(1) 1.13e-03(15) 4.99e-04(9) 7.27e-04(14)
drug-consumption 2.47e-04(4) 2.7e-04(6) 3.68e-04(12) 3.86e-04(13) 2.89e-04(7) 3.05e-04(9) 3.07e-04(10) 7.5e-04(14)
fertility 2.57e-04(6) 1.47e-04(1) 1.82e-04(2) 1.59e-03(15) 1.89e-04(4) 5.13e-04(8) 6.62e-04(12) 1.86e-04(3)
IndianLiverPatient 1.94e-04(8) 2.48e-04(12) 1.05e-04(5) 4.38e-04(15) 5.56e-05(1) 1.46e-04(7) 2.10e-04(9) 4.21e-04(14)
iris 3.76e-04(7) 8.10e-05(1) 3.48e-04(6) 1.87e-03(13) 1.52e-04(4) 5.32e-04(8) 1.58e-04(5) 1.95e-03(16)
krkopt 6.54e-05(4) 5.31e-05(2) 1.21e-04(11) 1.48e-04(12) 7.25e-05(10) 6.8e-05(7) 6.64e-05(6) 1.21e-03(13)
letter-recognition 5.26e-05(1) 5.95e-05(3) 4.87e-04(12) 2.31e-04(11) 9.84e-05(9) 1.93e-04(10) 8.62e-05(6) 2.01e-03(14)
magic04 7.55e-05(7) 8.07e-05(10) 7.87e-05(9) 1.65e-04(14) 7.82e-05(8) 7.39e-05(6) 7.14e-05(5) 1.64e-04(13)
ml-prove 2.63e-04(5) 2.75e-04(8) 2.15e-04(2) 5.17e-04(14) 2.41e-04(4) 3.87e-04(11) 2.67e-04(6) 4.67e-04(13)
musk1 1.20e-03(8) 1.08e-03(7) 1.31e-03(12) 2.24e-03(14) 1.28e-03(11) 1.03e-03(4) 8.95e-04(2) 1.43e-03(13)
musk2 6.9e-04(2) 7.28e-04(3) 1.22e-03(12) 1.45e-03(13) 8.74e-04(5) 1.21e-03(11) 6.25e-04(1) 8.15e-04(4)
optdigits 3.51e-04(4) 3.64e-04(6) 7.34e-04(14) 4.47e-04(10) 1.92e-04(1) 5.26e-04(12) 3.87e-04(9) 5.22e-04(11)
page-blocks 7.74e-05(9) 6.02e-05(5) 1.66e-04(12) 1.4e-04(11) 7.09e-05(8) 4.50e-05(2) 6.11e-05(6) 3.38e-04(13)
parkinsons 2.35e-04(3) 1.65e-04(1) 2.11e-04(2) 1.03e-03(16) 3.41e-04(6) 7.01e-04(13) 4.48e-04(11) 2.96e-04(4)
pendigits 9.51e-05(4) 9.49e-05(3) 3.25e-04(13) 1.9e-04(11) 9.90e-05(6) 1.34e-04(10) 9.83e-05(5) 3.53e-04(14)
segmentation 1.22e-03(12) 8.18e-04(4) 1.05e-03(9) 1.25e-03(14) 1.13e-03(10) 9.49e-04(6) 9.67e-04(8) 1.23e-03(13)
sensorless 2.60e-04(6) 3.32e-04(9) 2.46e-03(13) 5.18e-04(11) 2.74e-04(7) 4.88e-04(10) 1.51e-04(2) 6.67e-03(15)
tae 3.98e-04(10) 2.64e-04(4) 4.72e-04(11) 1.38e-03(14) 6.79e-04(12) 2.76e-04(6) 3.68e-04(9) 3.5e-04(8)
wine 1.18e-04(1) 3.02e-04(7) 3.37e-04(8) 1.02e-03(15) 6.90e-04(14) 1.31e-04(4) 6.17e-04(13) 4.03e-04(10)
yeast 2.62e-04(11) 2.02e-04(4) 2.69e-04(12) 2.16e-04(7) 2.06e-04(6) 2.05e-04(5) 2.44e-04(9) 8.52e-04(15)
Average time per sample 0.0078 0.0078 0.0082 0.0083 0.0083 0.0084 0.0084 0.0085
Average rank 5.8 5.2 9.1 11.5 6.5 8.4 6.9 11.7

Table 9: Performance using time (with ranks) of: LogisticRegression (LR), LinearSVC (LSVC), SVC, SGDClassifier (SGD), PassiveAggressiveClassifier (PA), DecisionTreeClassifier (DT), ExtraTreesClassifier (ET), RandomForestClassifier (RF), AdaBoostClassifier (AB) and GradientBoostingClassifier (GB). Part 2.
RF SGD GB PER NB NC MLP SVC
ad 7.62e-03(11) 5.78e-03(4) 7.76e-03(12) 4.97e-03(1) 6.58e-03(7) 8.48e-03(13) 1.10e-02(16) 1.03e-02(15)
adult 1.58e-04(12) 1.23e-04(2) 2.17e-04(13) 1.31e-04(5) 1.40e-04(8) 1.17e-04(1) 2.29e-04(14) 1.38e-02(16)
agaricus-lepiota 2.71e-04(1) 4.87e-04(7) 1.07e-03(16) 4.97e-04(9) 2.71e-04(2) 4.97e-04(10) 9.10e-04(15) 6.49e-04(13)
aps-failure 2.43e-01(13) 2.52e-01(14) 2.22e-01(8) 2.78e-01(16) 2.8e-01(18) 2.78e-01(17) 2.62e-01(15) 2.96e-01(19)
banknote 7.02e-04(11) 7.97e-04(15) 5.26e-04(8) 7.45e-04(12) 1.81e-04(3) 1.80e-04(2) 3.11e-03(16) 7.53e-04(13)
bank 1.2e-04(7) 1.17e-04(5) 1.95e-04(14) 1.15e-04(4) 1.01e-04(1) 1.10e-04(3) 1.48e-04(11) 1.30e-02(16)
biodeg 6.77e-04(9) 2.36e-04(3) 5.86e-04(8) 2.36e-04(2) 6.83e-04(12) 7.28e-04(14) 1.53e-03(16) 6.82e-04(10)
car 8.90e-05(7) 1.00e-04(10) 5.86e-04(15) 1.03e-04(11) 3.65e-05(1) 9.79e-05(9) 5.24e-03(16) 1.39e-04(12)
census-income 4.52e-04(9) 4.07e-04(4) 5.30e-04(13) 4.09e-04(6) 4.07e-04(5) 4.47e-04(8) 4.98e-04(10) 1.08e-01(17)
cmc 8.76e-04(11) 1.89e-04(7) 5.84e-04(9) 5.81e-05(1) 9.60e-04(14) 6.01e-05(2) 2.33e-03(16) 9.2e-04(13)
dota2 5.14e-04(10) 4.45e-04(6) 7.14e-04(13) 4.26e-04(2) 4.56e-04(7) 4.37e-04(4) 5.82e-04(12) 3.82e-02(16)
drug-consumption 1.77e-04(1) 2.34e-04(2) 1.83e-03(15) 2.42e-04(3) 2.61e-04(5) 2.95e-04(8) 2.86e-03(16) 3.36e-04(11)
fertility 8.51e-04(13) 1.93e-04(5) 1.00e-03(14) 5.13e-04(7) 5.28e-04(10) 5.15e-04(9) 2.94e-03(16) 6.23e-04(11)
IndianLiverPatient 2.26e-04(10) 5.91e-05(2) 3.7e-04(13) 8.41e-05(3) 1.24e-04(6) 8.68e-05(4) 1.24e-03(16) 2.4e-04(11)
iris 8.33e-04(11) 1.45e-04(2) 1.86e-03(12) 1.51e-04(3) 8.22e-04(10) 1.92e-03(15) 1.89e-03(14) 5.40e-04(9)
krkopt 6.59e-05(5) 6.95e-05(9) 2.48e-03(16) 6.83e-05(8) 5.09e-05(1) 6.19e-05(3) 2.34e-03(15) 1.63e-03(14)
letter-recognition 5.61e-05(2) 7.25e-05(5) 4.15e-03(16) 7.20e-05(4) 8.98e-05(7) 9.04e-05(8) 2.48e-03(15) 1.06e-03(13)
magic04 8.97e-05(11) 6.80e-05(4) 2.21e-04(15) 6.76e-05(3) 4.79e-05(1) 5.07e-05(2) 1.47e-04(12) 4.46e-03(16)
ml-prove 3.34e-04(10) 2.00e-04(1) 5.86e-04(16) 2.38e-04(3) 2.71e-04(7) 3.03e-04(9) 4.43e-04(12) 5.45e-04(15)
musk1 1.21e-03(10) 1.07e-03(6) 2.56e-03(15) 9.35e-04(3) 8.49e-04(1) 1.05e-03(5) 2.78e-03(16) 1.21e-03(9)
musk2 8.82e-04(6) 9.37e-04(10) 1.83e-03(14) 8.97e-04(7) 9.07e-04(9) 9.05e-04(8) 4.20e-03(16) 2.92e-03(15)
optdigits 2.37e-04(2) 3.58e-04(5) 4.13e-03(16) 3.78e-04(8) 2.48e-04(3) 3.68e-04(7) 6.09e-04(13) 1.33e-03(15)
page-blocks 9.29e-05(10) 5.12e-05(3) 7.04e-04(15) 5.32e-05(4) 4.16e-05(1) 7.07e-05(7) 6.82e-04(14) 1.92e-03(16)
parkinsons 3.97e-04(8) 4.14e-04(10) 9.03e-04(15) 4.11e-04(9) 3.13e-04(5) 5.81e-04(12) 8.42e-04(14) 3.94e-04(7)
pendigits 8.28e-05(1) 1.14e-04(9) 1.17e-03(15) 1.11e-04(8) 1.01e-04(7) 9.39e-05(2) 2.92e-04(12) 1.22e-03(16)
segmentation 7.14e-04(2) 1.25e-03(15) 4.14e-03(16) 7.36e-04(3) 8.34e-04(5) 6.83e-04(1) 9.66e-04(7) 1.2e-03(11)
sensorless 2.91e-04(8) 2.22e-04(5) 4.21e-03(14) 1.83e-04(3) 2.15e-04(4) 1.35e-04(1) 1.09e-03(12) 1.22e-02(16)
tae 8.21e-04(13) 1.51e-04(3) 2.38e-03(16) 1.48e-04(2) 2.75e-04(5) 1.44e-04(1) 1.74e-03(15) 2.81e-04(7)
wine 2.88e-04(6) 1.25e-04(3) 2.30e-03(16) 1.22e-04(2) 3.88e-04(9) 5.04e-04(11) 5.90e-04(12) 2.83e-04(5)
yeast 2.31e-04(8) 1.2e-04(2) 1.31e-03(16) 2.44e-04(10) 1.54e-04(3) 1.15e-04(1) 5.97e-04(14) 4.02e-04(13)
Average time per sample 0.0087 0.0089 0.0091 0.0097 0.0099 0.0099 0.0105 0.0172
Average rank 7.9 5.9 13.8 5.4 5.9 6.6 13.9 13.0

Table 10: Performance using time (with ranks) of: LogisticRegression (LR), LinearSVC (LSVC), SVC, SGDClassifier (SGD), PassiveAggressiveClassifier (PA), DecisionTreeClassifier (DT), ExtraTreesClassifier (ET), RandomForestClassifier (RF), AdaBoostClassifier (AB) and GradientBoostingClassifier (GB). Part 3.
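The "Average time per sample" rows in the time tables can be read as a system's measured wall-clock time on a dataset divided by the number of samples processed. A minimal sketch of that measurement, an assumption about the setup rather than the paper's actual benchmarking harness:

```python
# Illustrative sketch (an assumption, not the paper's harness):
# elapsed wall-clock seconds per sample for one run of a system.
import time

def time_per_sample(run, samples):
    """Time one call to `run` over `samples`, normalized per sample."""
    start = time.perf_counter()
    run(samples)  # stand-in for a classifier's full run
    return (time.perf_counter() - start) / len(samples)

# Toy stand-in for a trained model processing 1000 samples.
seconds = time_per_sample(lambda xs: [x * x for x in xs], list(range(1000)))
```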
EvoDAG rnd-rnd EvoDAG fit-fit EvoDAG agr-rnd EvoDAG nvs-rnd EvoDAG ads-rnd autosklearn tpot
ad 3.25e-01(17) 5.17e-01(18) 8.48e-01(19) 4.91e+00(22) 2.4e+00(21) 1.57e+00(20) 2.82e+02(23)
adult 2.16e-01(19) 3.47e-01(20) 8.44e-01(21) 2.05e-01(18) 2.40e+00(22) 1.10e-01(17) 9.85e+00(23)
agaricus-lepiota 4.56e-01(17) 5.27e-01(18) 8.36e-01(20) 3.53e+00(21) 6.17e+00(22) 6.32e-01(19) 1.14e+01(23)
aps-failure 1.39e-01(2) 1.23e-01(1) 5.71e-01(21) 3.70e-01(20) 1.99e+00(22) 1.58e-01(3) 3.38e+01(23)
banknote 2.82e-02(17) 1.34e-01(19) 8.38e-02(18) 4.25e-01(20) 1.81e+00(21) 3.75e+00(22) 4.43e+00(23)
bank 1.4e-01(18) 1.95e-01(20) 5.21e-01(21) 1.80e-01(19) 1.91e+00(22) 1.14e-01(17) 1.40e+01(23)
biodeg 4.36e-01(18) 3.86e-01(17) 8.12e-01(19) 2.42e+00(20) 3.02e+00(21) 4.87e+00(22) 2.42e+01(23)
car 3.18e-01(17) 5.5e-01(18) 8.44e-01(19) 2.85e+00(20) 4.83e+00(22) 2.97e+00(21) 3.07e+01(23)
census-income 1.66e-01(19) 1.45e-01(18) 5.09e-01(21) 4.78e-01(20) 1.79e+00(22) 1.85e-02(16) 1.15e+01(23)
cmc 3.28e-01(17) 5.04e-01(18) 6.93e-01(19) 2.75e+00(20) 3.29e+00(21) 3.49e+00(22) 2.37e+01(23)
dota2 3.29e-01(19) 2.59e-01(18) 6.62e-01(21) 5.35e-01(20) 1.98e+00(22) 3.95e-02(17) 1.14e+01(23)
drug-consumption 2.38e-01(17) 2.65e-01(18) 4.31e-01(19) 2.25e+00(20) 4.76e+00(22) 2.73e+00(21) 5.53e+01(23)
fertility 1.24e+00(18) 1.65e+00(19) 1.20e+00(17) 3.01e+00(20) 3.94e+00(21) 5.21e+01(22) 7.42e+01(23)
IndianLiverPatient 3.07e-01(17) 5.61e-01(18) 1.06e+00(19) 3.17e+00(20) 3.41e+00(21) 8.83e+00(22) 1.87e+01(23)
iris 4.76e-01(17) 7.5e-01(18) 9.98e-01(19) 1.69e+00(20) 4.77e+00(21) 3.43e+01(22) 5.08e+01(23)
krkopt 1.28e+00(18) 1.05e+00(17) 1.39e+00(19) 3.39e+00(20) 1.08e+01(21) 5.21e+01(22) 7.76e+01(23)
letter-recognition 1.46e+00(18) 1.19e+00(17) 1.75e+00(19) 3.07e+00(20) 1.18e+01(21) 5.21e+01(23) 3.92e+01(22)
magic04 3.95e-01(18) 4.19e-01(19) 1.30e+00(20) 4.38e+00(22) 2.91e+00(21) 2.70e-01(17) 5.18e+01(23)
ml-prove 5.73e-02(17) 1.83e-01(19) 1.38e-01(18) 4.78e-01(20) 8.52e-01(22) 7.84e-01(21) 1.69e+01(23)
musk1 6.55e-01(18) 6.22e-01(17) 9.56e-01(19) 3.10e+00(20) 3.48e+00(21) 1.08e+01(22) 2.82e+02(23)
musk2 2.77e-01(17) 3.08e-01(18) 7.44e-01(19) 3.75e+00(22) 2.57e+00(21) 7.8e-01(20) 5.68e+01(23)
optdigits 7.58e-01(17) 1.54e+00(20) 9.40e-01(18) 3.27e+00(21) 6.66e+00(22) 9.41e-01(19) 6.32e+01(23)
page-blocks 1.46e-01(17) 1.77e-01(18) 3.3e-01(19) 1.86e+00(21) 3.01e+00(22) 9.39e-01(20) 3.98e+01(23)
parkinsons 5.73e-01(17) 6.22e-01(18) 8.49e-01(19) 2.55e+00(20) 4.75e+00(21) 2.66e+01(22) 3.37e+01(23)
pendigits 6.80e-01(18) 1.84e+00(20) 1.24e+00(19) 2.95e+00(21) 1.22e+01(22) 4.8e-01(17) 5.68e+01(23)
segmentation 8.57e-01(17) 1.23e+00(19) 1.09e+00(18) 2.95e+00(20) 6.92e+00(21) 1.71e+01(22) 1.03e+02(23)
sensorless 9.22e-01(18) 1.48e+00(19) 2.16e+00(20) 2.26e+00(21) 1.55e+01(22) 8.83e-02(17) 2.23e+01(23)
tae 1.03e+00(17) 1.64e+00(18) 1.64e+00(19) 3.65e+00(20) 9.55e+00(21) 3.43e+01(22) 8.20e+01(23)
wine 6.02e-01(17) 7.02e-01(18) 1.04e+00(19) 1.82e+00(20) 3.87e+00(21) 2.93e+01(22) 8.69e+01(23)
yeast 4.68e-01(17) 5.61e-01(18) 8.27e-01(19) 3.02e+00(20) 8.11e+00(22) 3.46e+00(21) 6.25e+01(23)
Average time per sample 0.5102 0.683 0.9104 2.3754 5.0449 11.5265 57.6855
Average rank 17.0 17.8 19.2 20.3 21.5 19.7 23.0