Improving generalisation of AutoML systems with dynamic fitness evaluations
Abstract

A common problem machine learning developers are faced with is overfitting, that is, fitting a pipeline so closely to the training data that the performance degrades for unseen data. Automated machine learning aims to free (or at least ease) the developer from the burden of pipeline creation, but this overfitting problem can persist. In fact, this can become more of a problem as we look to iteratively optimise the performance of an internal cross-validation (most often k-fold). While this internal cross-validation hopes to reduce this overfitting, we show we can still risk overfitting to the particular folds used. In this work, we aim to remedy this problem by introducing dynamic fitness evaluations which approximate repeated k-fold cross-validation, at little extra cost over single k-fold, and far lower cost than typical repeated k-fold. The results show that, when time equated, the proposed fitness function results in significant improvement over the current state-of-the-art baseline method which uses an internal single k-fold. Furthermore, the proposed extension is very simple to implement on top of existing evolutionary computation methods, and can provide an essentially free boost in generalisation/testing performance.

Benjamin Patrick Evans, Bing Xue, and Mengjie Zhang
School of Engineering and Computer Science, Victoria University of Wellington, New Zealand
{benjamin.evans,bing.xue,mengjie.zhang}@ecs.vuw.ac.nz

1 Introduction

With a rising demand for machine learning coming from a variety of application areas, machine learning talent is struggling to keep up.
This has spurred the development of Automated Machine Learning (AutoML), which hopes to save time and effort on repetitive tasks in ML [1] by allowing data scientists to work on other important components such as "developing meaningful hypotheses" or "communication of results" [2]. The usefulness of k-fold cross-validation (CV) has recently been doubted in model evaluation research [3, 4]; however, iterative improvement on an internal single k-fold CV remains at the core of many AutoML optimisation problems.

In this work, we aim to introduce an efficient approach to approximating repeated k-fold CV (r × k-fold), which has been shown to offer improved error estimation over typical k-fold CV [3, 4]. This is achieved by proposing a novel dynamic fitness function, which adjusts the calculation at each generation in an effort to prevent overfitting to any one static function. The fitness of an individual is then measured as the individual's average performance throughout its existence (i.e. averaged over the individual's lifetime). The proposed approach also does so at far lower computational cost than typical repeated k-fold CV, by utilising the generation mechanism of evolutionary learning. From an evolutionary learning perspective, the proposed approach can be seen as a form of regularisation, which prefers younger individuals throughout the evolutionary process. From a statistical perspective, the dynamic fitness function can be seen as improving the robustness of the approximation of the true testing performance (i.e. improved generalisation).

The motivation of this work is that current approaches to automated machine learning risk overfitting due to iterative improvement over a fixed fitness function. The goal of automated machine learning is to improve the unseen/generalisation performance of a pipeline, so mitigating this overfitting is extremely important.
The main contribution of this work is a novel idea of fitness, which serves as a regularisation technique while aiming to approximate repeated k-fold CV, and thus helps improve generalisation performance. We experimentally show this to be useful in automated machine learning, but the usefulness also holds for many evolutionary computation (EC) techniques which repeatedly optimise a fixed fitness function in an attempt to improve the unseen performance, particularly in large search spaces plagued with local optima. The proposed extension is simple to implement and can serve as a nearly computationally free improvement to many existing EC methods.

The remainder of the paper is organised as follows: Section 2 provides an overview of related AutoML works, Section 3 outlines the newly proposed method, Section 4 compares the new fitness function to the current baseline, Section 5 analyses these differences in depth, and Section 6 provides the conclusions and outlines future work.

2 Background and Related Work

Automated Machine Learning is a new research area, which essentially uses machine learning to perform machine learning [5]. The cyclic definition may be confusing, but the idea is simple: automate the creation of machine learning pipelines by treating the construction as an optimisation problem. The goal is to be able to replace the difficult process of selecting an appropriate pipeline with an automatic approach, where all the user needs to do is specify a dataset and an amount of time to train for, and an appropriate pipeline is returned automatically.

The current top-performing approaches to AutoML are based on EC methods (e.g. TPOT [6]) or Bayesian optimisation (e.g. auto-sklearn [7], Auto-WEKA [8, 9]). Research has found no significant difference in the performance of such methods [10], and as such the aforementioned methods can all be considered current state-of-the-art approaches.
Here we focus on the EC methods, due to the population mechanism which allows for the proposed extensions without drastically increasing computational cost.

Evolutionary Computation (EC) is an area of nature-inspired techniques that approximate a global search. The search space is effectively (but not exhaustively) explored and exploited using a combination of mutation and crossover operators. Mutation operators randomly modify an individual, while crossover operators combine individuals to produce offspring (children). In this sense, EC techniques can be considered a guided population-based extension of random search [11].

TPOT [6, 12, 2] is an example of an EC technique based on Genetic Programming [13] and represents the current state-of-the-art EC-based approach to AutoML. Machine learning pipelines are represented as tree structures, where the root node of the tree is an estimator and all other nodes are preprocessing steps. Preprocessing steps can perform transformations such as feature selection, principal component analysis, scaling, feature construction, etc. Fitness is measured by two objectives, the score and the complexity. The score is maximised, and the complexity is minimised, using NSGA-II. Here, our main focus is on improving the score, so we refer to fitness as just objective 1 for simplicity, but note that the multiple objectives are still optimised in Section 3, and we analyse the size in more depth in Section 5.

An area related to AutoML is neural architecture search (NAS), which is essentially AutoML for deep neural networks. Real et al. [14] propose an EC technique for NAS which uses a novel approach for measuring fitness. Rather than fitness being measured as the performance or loss of a neural network, the fitness is just a measure of age. The younger an individual, the fitter it is considered. As a result, newer models are favoured in the evolutionary process.
This is referred to as "ageing evolution" or "regularised evolution". This may work well for neural networks, considering a good random initialisation can result in good performance "by chance". However, what is more important are neural networks that retrain well (removing the luck of random initialisation). In this sense, with ageing evolution, only models which retrain well can persist. With general AutoML systems, many of the components are deterministic or at least not overly sensitive to the randomisation (e.g. random forests), so the ageing component is not as important directly. However, the idea of retraining well is an important consideration. For example, the fitness is computed as the average performance over an internal k-fold CV on the training set, yet a pipeline may just perform well on this set of folds but not necessarily on another random set of folds or, more importantly, the unseen test set. We adopt this idea of "retraining well" by introducing a form of repeated k-fold CV, and introduce a novel concept loosely based on age. This is described in more detail in Section 3.

It is important to mention that the goal here is not to compare neural approaches (i.e. NAS) with more general classification pipelines (i.e. AutoML), but rather to improve an existing approach to general classification pipelines. In this work, we look at adopting ideas of regularisation and a dynamic fitness function, implementing these into a current state-of-the-art AutoML system, TPOT, to investigate whether these ideas can improve the performance of TPOT.

3 Proposed Approach

We propose a new method where the fitness function is dynamic and the performance is averaged over the lifetime of an individual. This is shown in Fig. 1. From this, we can see that at each generation, the fitness of an individual can change.
This is in contrast to the typical approach, where an individual has a fixed fitness value.

[Figure 1 flowchart: (a.) proposed calculation, where the training data is shuffled with seed = generation number, the k-fold performance is accumulated into a life performance, and fitness = life performance / number of generations existed; (b.) standard calculation, where the training data is shuffled with a fixed seed and fitness = average result over a single k-fold.]

Figure 1: A comparison of the newly proposed fitness calculation (a.) vs the standard fitness calculation (b.). In the standard calculation (b.), the fitness is measured once and the function is fixed across all generations (the average internal test accuracy from a single k-fold split). In the proposed approach (a.), the fitness is dynamic and changes throughout an individual's lifetime. The fitness is then the average performance throughout an individual's lifetime (i.e. an approximation to repeated k-fold CV).

For an individual to be competitive, it must therefore have performed highly throughout its existence. Younger (newer) individuals have a higher chance of survival, as their performance is less thoroughly evaluated than that of their predecessors (fewer repetitions). This is based on the assumption that individuals created from crossover or mutation on well-performing individuals are more likely to be high performing than a randomly generated individual. This is a fair assumption, as evolutionary computation in general is based on this idea. If the assumption did not hold, then we would be better off performing a random search at every generation and keeping only the best. The result is that an individual created randomly (or a close descendant of a random individual) will be more thoroughly evaluated over the entire evolutionary process than an individual existing later in the process.
An individual which is generated late in the evolution requires fewer evaluations, as it is the offspring or mutation of an individual which has already performed well on these previous evaluations.

There are two ways to think about this process, one from an evolutionary perspective and one from a statistical perspective. These are examined in the following sections.

3.1 Evolutionary Perspective

From an evolutionary standpoint, individuals have a lifespan (maximum age), and this lifespan is based on the performance of an individual. If an individual performs well throughout its life, then the lifespan is high and it persists through generations. However, if an individual performs poorly in some (or all) stages of its life, then it dies out and is unable to keep spreading its genes into later generations. In this sense, the lifespan is dynamic and changes throughout an individual's life based on how it performs.

3.2 Statistical Perspective

For a given dataset D, the data is first split into a training set D_train and a test set D_test. D_train is then given to the AutoML process, and D_test is not seen until after the learning/optimisation has finished. From the training set, an internal CV is performed. D_train is split into k = 5 equally sized folds F_i, i ∈ [1, 2, 3, 4, 5], where the distribution of the predictor values D_train[y] is proportionate in each fold F_i (i.e. stratified). Each fold is then used as an internal testing set exactly once (note: this is not D_test; it is a synthetic test set made from D_train), with the remaining folds becoming the internal training set. The performance of an individual i is then measured as the mean performance (in this case f1-score, discussed in Section 4.1) across the k folds, which we represent as x̄_i. x̄_i measures how well an individual performs on the given folds and is used as an estimate of how the individual will perform on D_test, i.e. µ_i.
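As a concrete illustration, this internal stratified k-fold estimate x̄ can be sketched with scikit-learn. This is a minimal sketch with a stand-in dataset and estimator, not the paper's TPOT pipelines:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Stand-in training data and pipeline (any scikit-learn estimator works here).
X_train, y_train = make_classification(n_samples=300, n_classes=3,
                                       n_informative=5, random_state=0)
pipeline = LogisticRegression(max_iter=1000)

def internal_cv_score(est, X, y, k=5, seed=0):
    """Mean weighted f1 over a stratified k-fold split of the training set,
    i.e. the x-bar estimate of unseen performance."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        est.fit(X[train_idx], y[train_idx])
        preds = est.predict(X[test_idx])
        scores.append(f1_score(y[test_idx], preds, average="weighted"))
    return float(np.mean(scores))

x_bar = internal_cv_score(pipeline, X_train, y_train)
```

Note that changing `seed` changes the fold assignment, and hence the value of x̄; this sensitivity is exactly what the following discussion turns on.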
We refer to this process as single k-fold CV. With TPOT (and other AutoML systems), the goal becomes to optimise x̄. This is achieved with selection that ranks individuals based on x̄. The problem is that optimising x̄_i does not necessarily optimise µ_i (the classical definition of overfitting), as x̄ often has a high variance. Although x̄ itself is an average intended to help mitigate overfitting to a single fold, we still risk overfitting to the specific k folds, since we iteratively try to improve on the exact folds used (over potentially hundreds or thousands of generations). That is, the maximum achieved x̄ increases monotonically throughout evolution, without necessarily resulting in an increase in µ. This can be seen easily if we consider the selection of folds F to be noisy or unrepresentative of data seen in D_test.

The main approach to fixing this in the model evaluation literature is repeated k-fold cross-validation, in an effort to reduce the variance and improve the stability of single k-fold. Zhang and Yang [3] suggest repeated k-fold over single k-fold if the primary goal is "prediction error estimation"; a similar sentiment is shared by Krstajic et al. [4], who conclude "selection and assessment of predictive models require repeated cross-validation". These concerns become even more important when we are iteratively improving on a single k-fold, as is done in AutoML, since the risk increases with each generation. To integrate this repeated cross-validation into AutoML, at each fitness evaluation, rather than performing single k-fold CV, we would perform r × k-fold validation, where r is the repeating factor (r > 1).
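In scikit-learn terms, this r × k-fold procedure is available directly as `RepeatedStratifiedKFold`. A minimal sketch, again with a stand-in dataset and estimator rather than an actual AutoML pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
est = DecisionTreeClassifier(random_state=0)

r, k = 5, 5  # repeating factor and number of folds
rskf = RepeatedStratifiedKFold(n_splits=k, n_repeats=r, random_state=0)
scores = []
for train_idx, test_idx in rskf.split(X, y):
    est.fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], est.predict(X[test_idx]),
                           average="weighted"))

# r * k fold scores in total; their mean is the repeated-CV estimate,
# which is more stable (lower variance) than a single k-fold mean.
mean_score = float(np.mean(scores))
```

The cost is r model fits per fold position, per fitness evaluation; this per-individual overhead is what the lifetime-averaging scheme proposed below avoids.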
How ever, with AutoML, 6 function ev aluations are already exp ensive (training and ev aluating a mo del on eac h fold), and rep eating this for every individual would lead to an increase in computation by a factor of r , and also requires deciding on a v alue for r , a v alue which is too high means unnecessary computation p er individual and thus less time for guided evolutionary search for b etter-p erforming individuals, and a v alue to o lo w we risk the ov erfitting discussed abov e for single k -fold. Instead, we propose a rep etition metho d which integrates nicely with EC tec hniques, where r do es not need to be specified and has a lo w er computational cost than typical r × k -fold. A t each generation, a new rep etition is p erformed (new selection of F) for the p opulation, and the p erformance av eraged ov er the lifetime for each individual. 3.3 No v el Fitness F unction The k ey con tribution of this work is a nov el fitness function. A flo w c hart is giv en in Fig. 1 which sho ws the new calculation (a.) compared to the original approac h (b.). Mathematically , the p erformance of an individual is given in Eq. (1). c represen ts a particular class, C the set of all classes, and | c | the n um ber of instances in class c . This is the w eigh ted f1-score, although the fitness calculation is indep endent of the particular p erformance (or scoring) function used. per f or mance = P c ∈ C | c | × 2 × precision × r ecall ( precision + r ecall P c ∈ C | c | (1) The fitness is then meas ured as the a v erage p erformance ov er the lifetime of an individual, as is shown in Eq. (2). Note that there are tw o ob jectiv es, with the second ob jective (complexity) remains the same as the original measure (n um b er of comp onents in the pip eline). 
    fitness = ( [ Σ_{i = birth}^{death} performance_i ] / [ Σ_{i = birth}^{death} 1 ], complexity )    (2)

Since a Pareto front of solutions is maintained throughout evolution, this frontier must be cleared at each generation, to remove the saving of individuals which happened to perform well at only a single point in time (and not in general). The simplified pseudo-code is given in Algorithm 1. The model chosen from the population is the one in the frontier with the highest objective 1 score at the end of evolution. The function set, terminal set, and evolutionary parameters all remain the same as in the original TPOT, with a full description given in [6]. For this reason, these are not expanded here.

Algorithm 1: Pseudo-code for the algorithm

    def evaluate(individuals, seed):
        for ind in individuals:
            score = k_fold(ind, training_data, seed)
            if ind.scores is None:
                ind.scores = []
            ind.scores += [score]
            ind.fitness = (mean(ind.scores), length(ind))

    def evolve(population_size):
        population = [random_individual() for _ in range(population_size)]
        evaluate(population, seed=0)
        for gen in generations:
            offspring = apply_genetic_operators(population)
            evaluate(offspring, seed=gen)
            population = NSGA_II(population + offspring, population_size)
            pareto_front = frontier(population)
        model = max(pareto_front)

3.3.1 Computational Cost

For the single k-fold (default), the total number of models trained is k × generations × |offspring|, where in this case (the default behaviour) |offspring| = |population|. For repeated k-fold, r × k × generations × |offspring| evaluations would be performed. For the proposed approach, k × generations × (|offspring| + |population|) evaluations are performed, which can be rewritten as k × generations × |offspring| × 2, since by default |offspring| = |population|.
We can see that for r > 2, the proposed method becomes more efficient in terms of the number of model evaluations for a given number of generations and population size. In the case where |offspring| ≠ |population|, then as long as |offspring| + |population| ≤ r × |offspring|, the proposed method requires fewer evaluations. This reduction in computation comes from the fact that individuals are only evaluated throughout their lifetime, and not for generations before and after they were alive. For example, if there are 50 total generations, and an individual is created in generation 5 and dies in generation 10, then r × k-fold will only be repeated r = 5 times, not r = 50. Therefore, the proposed method is both more computationally feasible (for any r > 2, in terms of the total number of folds evaluated) and also removes the need to specify an r value, which may potentially waste computational time.

3.3.2 Regularisation

This proposed idea, where behaviour is averaged over a lifetime, can be seen as a form of regularisation. Real et al. [14] define regularisation in the broader sense to be "additional information that prevents overfitting to training noise". We can see that, in this sense, averaging performance over the lifetime of an individual is a type of regularisation. The additional information comes from the randomisation/dynamic nature of the fitness function. The regularisation effect happens because for an individual to be selected it must either a.) perform well across random repeated CV, or b.) be a modification of an individual which itself performed well across random repeated CV. This removes (or at least mitigates) the risk, present with the original (static) fitness function, of an individual only performing well on the specific set of folds used throughout the entire evolutionary process. This is visualised in Fig. 2.
From this figure, we can see we risk selecting a particular model only because it performs well on a specific set of randomly chosen folds (i.e. for a given seed for k-fold CV), and not in general. For example, in 13 of the 30 cases (with r = 30), Fig. 2 (b) would have a higher fitness; in 17 of the cases, Fig. 2 (a) would have a higher fitness. Taking the average over all repetitions helps to prevent the selection of a model that has overfit too closely to a given set of folds, and thus aims to regularise the model.

[Figure 2 plot: two panels, (a.) Individual One with lifetime average score 0.8097 and (b.) Individual Two with lifetime average score 0.8106, with per-seed scores ranging roughly from 0.76 to 0.84.]

Figure 2: A visual overview of the effect repeated k-fold can have on the fitness. The x-axis represents various runs of k-fold CV (with different seeds). Grey lines represent the result for a particular seed. The blue line represents the average over all seeds (i.e. r × k-fold). Red asterisks indicate performing the best of the two models for a given seed. This particular scenario was constructed for demonstration purposes.

4 Comparisons

4.1 Setup

For comparisons, we begin with the 42 datasets chosen in [10] for AutoML benchmarking. However, we find many of these do not generate results within the allowed computational budget. Datasets on which at least two generations had not been performed before the time limit was reached were excluded, leaving 28 datasets. These datasets were excluded because the effects of the dynamic fitness function would not be evident after only a single generation (it would function the same as a fixed fitness function). We use the most recent version of TPOT (#8b71687) as the baseline and compare it to the proposed method, which is the same version of TPOT with an updated fitness function. The comparisons are all run on equivalent hardware, using 2 cores and the specified amount of training time (1 hour, 3 hours, or 6 hours). All code is written and run in Python 3.7.
As we are interested in the effect of the new fitness function alone (and not different optimisation methods, search spaces, etc.), we only compare to TPOT. For example, comparing to Auto-WEKA we would be comparing entirely different search spaces; likewise, comparing to NAS approaches we would be comparing neural networks vs "traditional" classification algorithms, and comparing to auto-sklearn we would be comparing different optimisers (EC vs Bayesian). There have also already been several studies comparing the various AutoML algorithms [10, 1, 15], so the goal is not to compare these algorithms again, or to propose an entirely new method, but rather to investigate the usefulness of a dynamic fitness function in TPOT. For these reasons, the only variation between the two methods is the fitness function, to ensure any differences in performance are a direct result of the new fitness function.

All parameters are fixed to the default values. The exception to this is the scoring function, which is accuracy by default. Here, we used the weighted f1-score for both methods instead, as we cannot assume equal class distributions as is done with accuracy (the default). Again, we would like to reiterate that the proposed method is robust to the selection of the scoring function, and changing out the scoring function in the fitness calculation is trivial. The underlying search spaces and parameters are therefore equal for both methods. TPOT uses the static (default) fitness, whereas the proposed method uses the new fitness described in Section 3. 5 × 2-cv is used to generate the results, where the results are presented as mean ± std. General significance testing is performed using the Wilcoxon signed-rank test [16], with α = 0.05, pairing each dataset between the two methods as suggested in [17].
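This paired test can be run directly with scipy. A minimal sketch, fed here with a handful of the 1-hour per-dataset means from Table 1 purely for illustration (the paper's tests use all available datasets):

```python
from scipy.stats import wilcoxon

# A subset of per-dataset mean f1-scores (1-hour budget), paired by dataset.
proposed = [86.65, 98.98, 76.79, 98.34, 88.14, 83.36, 99.31, 97.11, 89.39]
baseline = [86.71, 98.44, 72.60, 93.18, 83.84, 82.30, 99.24, 96.75, 89.26]

# Paired Wilcoxon signed-rank test across datasets (not per-dataset tests).
stat, p = wilcoxon(proposed, baseline)

# Bonferroni correction for the three time budgets tested (1h, 3h, 6h).
p_corrected = min(p * 3, 1.0)
significant = p_corrected < 0.05
```

The pairing across datasets (rather than per-dataset tests) follows the recommendation in [17], and the Bonferroni factor matches the three tests reported in the next section.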
We do not provide per-dataset significance testing, as we are interested in the general performance of the proposed method, and the increased likelihood of false results makes such per-dataset comparisons heavily doubted for general comparisons [18]. Likewise, when discussing wins/losses/draws we do not count significance, as "counting only significant wins and losses does not make the tests more but rather less reliable, since it draws an arbitrary threshold of p < 0.05 between what counts and what does not" [18]. As we are providing three tests, the p values are also adjusted with the Bonferroni correction (i.e. multiplied by 3).

In this section, we look at what impact this new idea of fitness can have on the resulting pipelines, with all other factors fixed.

4.2 Results

We run each method for 1 hour, 3 hours, and 6 hours. Doing so ensures the results do not just hold true at a specific point in time, and also allows us to consider if there are any effects over time.

Table 1: Average weighted f1-scores, scaled to [0, 100] for readability. Presented as mean ± standard deviation from 5×2 cv. A blank (grey) cell indicates the methods only had time to perform a single generation (or another problem occurred), so comparisons would be meaningless. The final row indicates p-values from the Wilcoxon signed-rank test (as described in Section 4.1). Green indicates that the proposed method is significantly better than the baseline at α = 0.05.
                 1 Hour                          3 Hours                         6 Hours
                 Proposed        TPOT            Proposed        TPOT            Proposed        TPOT
adult            86.65 ± 0.20    86.71 ± 0.29    86.68 ± 0.26    86.73 ± 0.23    86.71 ± 0.18    86.68 ± 0.32
anneal           98.98 ± 0.66    98.44 ± 0.68    99.02 ± 0.48    98.89 ± 0.74    98.41 ± 0.89    98.92 ± 0.40
apsfailure                                                                       99.34 ± 0.04    99.34 ± 0.04
arrhythmia       69.55 ± 3.52    68.84 ± 3.38    68.98 ± 2.71    69.15 ± 3.04    69.99 ± 2.72    69.84 ± 3.86
australian       86.11 ± 1.05    86.15 ± 1.17    86.29 ± 1.19    85.95 ± 0.95    86.43 ± 1.30    85.47 ± 0.96
bank-marketing   90.06 ± 0.17    86.72 ± 0.37    90.16 ± 0.30    86.73 ± 0.31    90.20 ± 0.21    87.19 ± 0.54
blood            76.79 ± 1.81    72.60 ± 3.13    76.76 ± 1.89    74.15 ± 3.59    76.84 ± 1.76    73.27 ± 4.08
car              98.34 ± 0.82    93.18 ± 2.83    98.64 ± 0.62    93.22 ± 3.28    98.27 ± 0.72    93.14 ± 1.87
cnae-9           94.41 ± 0.50    94.48 ± 0.62    94.39 ± 0.80    94.44 ± 0.91    94.99 ± 0.40    94.37 ± 0.63
connect-4                                        84.24 ± 0.36    72.44 ± 1.69    84.05 ± 0.39    71.77 ± 0.33
credit-g         73.02 ± 1.53    73.02 ± 1.84    72.76 ± 1.46    72.85 ± 1.70    73.59 ± 1.44    72.56 ± 1.72
dilbert                                                                          96.04 ± 0.57    96.36 ± 0.55
helena                                                                           30.07 ± 0.76    30.00 ± 0.71
higgs                                            72.03 ± 0.39    72.00 ± 0.32    72.15 ± 0.21    72.10 ± 0.30
jannis                                                                           70.01 ± 0.51    70.41 ± 0.50
jasmine          80.06 ± 1.38    80.14 ± 0.96    80.82 ± 0.97    80.72 ± 0.52    81.36 ± 0.64    81.37 ± 0.63
jungle           88.14 ± 1.31    83.84 ± 0.87    90.41 ± 1.57    84.21 ± 1.01    93.04 ± 1.92    85.07 ± 1.55
kc1              83.36 ± 1.50    82.30 ± 1.05    83.68 ± 1.24    82.39 ± 0.76    83.53 ± 1.64    82.37 ± 1.31
kr-vs-kp         99.31 ± 0.15    99.24 ± 0.29    99.41 ± 0.22    98.82 ± 0.79    99.40 ± 0.19    99.31 ± 0.16
mfeat-factors    97.11 ± 0.55    96.75 ± 0.45    97.55 ± 0.28    97.07 ± 0.39    97.45 ± 0.21    97.46 ± 0.54
miniboone                                        94.21 ± 0.07    94.23 ± 0.07    94.24 ± 0.08    94.27 ± 0.08
nomao                                            96.67 ± 0.18    96.61 ± 0.19    96.79 ± 0.10    96.60 ± 0.18
numerai                                          51.73 ± 0.16    51.70 ± 0.26    51.73 ± 0.13    51.78 ± 0.14
phoneme          89.39 ± 0.35    89.26 ± 0.61    89.61 ± 0.41    89.55 ± 0.65    89.65 ± 0.35    89.58 ± 0.65
segment          92.93 ± 1.09    92.87 ± 0.79    93.34 ± 0.94    92.83 ± 0.95    93.21 ± 0.65    92.84 ± 0.62
shuttle          99.95 ± 0.03    99.97 ± 0.01    99.97 ± 0.01    99.97 ± 0.01    99.97 ± 0.01    99.97 ± 0.01
sylvine          94.86 ± 0.33    94.78 ± 0.38    95.31 ± 0.40    95.22 ± 0.47    95.63 ± 0.59    95.42 ± 0.65
vehicle          81.04 ± 2.58    80.13 ± 1.99    80.58 ± 1.36    80.68 ± 2.22    81.17 ± 1.62    80.68 ± 2.29
Significance     p = 0.01455                     p = 0.01173                     p = 0.01068

From looking at the results in Table 1, we can see that at 1 hour, the proposed method's average score is better on 13 of the datasets and worse on 6 of the datasets, and on 7 datasets the results were not generated in time. In general, the proposed method performs significantly better when considering a paired Wilcoxon signed-rank test, as shown in Table 1. We can see that in cases where the proposed method beats the original, it does so by a much larger margin than when the original beats the proposed, which is reflected in the statistical test by the very small p values indicating high significance.

We can see similar results when considering the 3-hour runs. On 17 of the datasets, the proposed method has a higher average score than the original; on 9 of the datasets, it has a lower average score. Again, when viewing the significance tests in Table 1, the proposed method does significantly better than the baseline in general.

Similar results are also seen at the 6-hour point. The proposed method had a higher average score on 19 of the datasets, and a lower average score on 9 of the datasets.

From this, we can conclude the proposed fitness function provides a significant improvement over the single k-fold fitness, and this is reflected throughout the several time points trialled. There is no reason to believe the patterns would be different at longer time frames. In fact, the proposed method should, in theory, perform better as time goes on (less overfitting).

5 Further Analysis

In this section, we analyse the results from Section 4 in more depth.
We consider some of the underlying characteristics, to understand what effect the new fitness function has on the resulting models. For this, we use the results of the 6-hour run from the trials in Section 4, as this gives the largest set of results (more datasets) and allows us to potentially find trends over a longer period of time.

5.1 Age and Generations

The age of an individual often gets little consideration in EC algorithms, in favour of just analysing fitness (or performance). However, there is some existing research into age. For example, [19] show that considering an age-layered population (which regularly updates the oldest models with new randomised ones) can help to avoid local optima by promoting diversity in the population. [14] also make interesting discoveries when using age alone as a measure of fitness, rather than the performance. They found improved results due to the implicit regularisation of the individuals, as it means individuals were retraining well to persist in the population. It is clear that age can be a useful characteristic for helping to improve the performance of EC techniques, and one of the ideas behind the proposed fitness function (average performance over lifetime) is that it will become increasingly difficult for an older individual to exist throughout generations, which also serves to diversify the population by "clearing" out older individuals.

Table 2: An analysis of resulting characteristics. Full descriptions for each column are given in Section 5. The main conclusions we can see are that the proposed method results in far younger best individuals on average, and closer predictions to the true testing score. The number of generations was significantly lower than the original, but this was expected due to the extra cost of repeated CV. General significance testing is performed in the final row. Green indicates significant at α = 0.05.
Age and generations are rounded to the nearest integer for presentation, but not for significance testing.

                 Age                    Generations            Difference                  Complexity
Dataset          Proposed   TPOT        Proposed   TPOT        Proposed      TPOT          Proposed     TPOT
adult            0 ± 0      2 ± 1       5 ± 4      6 ± 3       0.32 ± 0.13   0.36 ± 0.23   1.50 ± 0.67  2.00 ± 1.26
anneal           1 ± 0      79 ± 55     119 ± 42   129 ± 47    1.00 ± 0.58   0.54 ± 0.33   2.57 ± 1.59  2.12 ± 0.60
apsfailure       1 ± 0      2 ± 1       2 ± 0      2 ± 0       0.04 ± 0.03   0.04 ± 0.02   1.38 ± 0.70  1.40 ± 0.92
arrhythmia       1 ± 0      5 ± 3       24 ± 8     17 ± 6      5.36 ± 4.23   6.02 ± 4.60   3.10 ± 1.14  2.20 ± 1.25
australian       1 ± 0      322 ± 189   312 ± 115  451 ± 201   2.96 ± 1.86   5.33 ± 1.50   4.70 ± 1.95  3.80 ± 1.89
bank-marketing   1 ± 0      4 ± 4       9 ± 2      30 ± 11     0.23 ± 0.25   0.43 ± 0.45   2.70 ± 0.78  4.70 ± 1.00
blood            32 ± 95    421 ± 312   486 ± 112  757 ± 315   3.10 ± 2.34   7.50 ± 4.54   4.00 ± 1.41  4.20 ± 1.99
car              1 ± 0      55 ± 47     62 ± 18    189 ± 82    0.72 ± 0.41   4.29 ± 2.81   4.80 ± 1.40  5.70 ± 1.62
cnae-9           3 ± 9      4 ± 2       29 ± 16    15 ± 8      0.62 ± 0.41   1.26 ± 1.01   3.60 ± 1.28  3.20 ± 1.40
connect-4        0 ± 0      2 ± 1       3 ± 1      3 ± 1       0.52 ± 0.36   4.88 ± 0.55   1.40 ± 0.49  1.90 ± 0.30
credit-g         25 ± 74    117 ± 103   239 ± 138  291 ± 117   3.36 ± 2.41   6.62 ± 2.65   4.10 ± 1.22  5.50 ± 1.96
dilbert          1 ± 0      2 ± 1       2 ± 0      2 ± 0       0.89 ± 0.34   0.71 ± 0.21   1.60 ± 0.66  1.10 ± 0.30
helena           1 ± 0      1 ± 0       1 ± 0      1 ± 0       0.59 ± 0.35   0.69 ± 0.27   1.57 ± 0.73  1.43 ± 0.49
higgs            1 ± 0      2 ± 1       3 ± 0      3 ± 1       0.37 ± 0.23   0.40 ± 0.27   1.90 ± 0.83  1.80 ± 0.75
jannis           0 ± 0      1 ± 1       1 ± 0      1 ± 0       0.35 ± 0.18   0.32 ± 0.32   1.70 ± 1.00  1.80 ± 0.40
jasmine          1 ± 2      5 ± 5       22 ± 6     26 ± 8      1.35 ± 0.59   1.26 ± 0.91   3.00 ± 0.63  3.40 ± 0.66
jungle           1 ± 0      3 ± 3       17 ± 2     15 ± 1      0.88 ± 0.33   8.36 ± 1.57   3.10 ± 0.83  3.00 ± 1.34
kc1              0 ± 0      113 ± 114   136 ± 48   269 ± 115   2.30 ± 1.67   2.72 ± 1.66   4.90 ± 2.70  4.50 ± 1.20
kr-vs-kp         0 ± 0      20 ± 45     44 ± 32    46 ± 52     0.26 ± 0.15   0.91 ± 0.76   3.30 ± 1.27  4.20 ± 1.17
mfeat-factors    1 ± 0      3 ± 2       14 ± 1     15 ± 5      0.69 ± 0.40   0.55 ± 0.59   2.67 ± 0.67  3.20 ± 1.33
miniboone        0 ± 0      2 ± 1       2 ± 1      3 ± 0       0.07 ± 0.04   0.09 ± 0.06   1.80 ± 1.08  1.30 ± 0.46
nomao            0 ± 0      2 ± 1       6 ± 2      7 ± 2       0.18 ± 0.11   1.92 ± 0.29   2.11 ± 0.74  2.50 ± 0.67
numerai          1 ± 0      2 ± 2       7 ± 2      7 ± 1       0.39 ± 0.15   0.38 ± 0.20   3.70 ± 0.90  2.90 ± 0.83
phoneme          1 ± 0      35 ± 28     131 ± 29   139 ± 50    0.54 ± 0.49   0.92 ± 0.72   4.30 ± 1.27  4.40 ± 2.01
segment          1 ± 0      41 ± 47     90 ± 42    94 ± 49     1.09 ± 0.93   1.52 ± 1.05   3.50 ± 1.63  3.40 ± 1.50
shuttle          0 ± 0      2 ± 2       14 ± 1     15 ± 2      0.01 ± 0.01   0.01 ± 0.01   2.90 ± 1.45  2.70 ± 0.90
sylvine          1 ± 0      13 ± 12     74 ± 15    76 ± 16     0.62 ± 0.34   0.63 ± 0.43   5.50 ± 1.43  4.40 ± 1.28
vehicle          1 ± 0      61 ± 136    101 ± 52   132 ± 130   3.81 ± 2.37   4.88 ± 2.29   4.33 ± 1.49  5.60 ± 1.43
Significance     p = 0                  p = 0.00313             p = 0.00105                p = 0.69008

Looking at Table 2, we can see that this is, in fact, the case: the age of the best resulting individuals in the proposed method is often either 0 (i.e. generated in the final generation) or 1 (generated in the second-to-last generation). The notable exceptions are the blood and credit-g datasets. On the blood dataset, the average age was 32. However, the number of generations here was also the highest (486), and the age is still far lower than that of the original method, meaning the individuals are still relatively young. Likewise, with the credit-g dataset, the average age is 25, but this is also far lower than the average age of 117 from the baseline. When compared to the baseline, the resulting models are far younger in general. While this isn't necessarily useful on its own, we can see by the results in Table 1 that this youthfulness has proven useful. The relative age (percentage of generations) for the two methods is shown in Fig. 3.

[Figure 3: Relative Age. Relative age (0–100) of the best individual on each dataset, for the proposed method and the original.]

Rather than analysing age directly (as is done in [19, 14]), the proposed approach made it more difficult for older individuals to persist by averaging the performance over the lifetime.
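The relative age plotted in Fig. 3 is simply the best individual's age expressed as a percentage of the generations elapsed. A minimal sketch, using the blood row of Table 2 as the worked example:

```python
def relative_age(age, generations):
    """Age of the best individual as a percentage of generations elapsed."""
    return 100.0 * age / generations

# blood dataset (Table 2): proposed best individual aged 32 after 486
# generations, versus the original's aged 421 after 757 generations.
print(round(relative_age(32, 486), 1))   # proposed: ~6.6%
print(round(relative_age(421, 757), 1))  # original: ~55.6%
```

Even on the dataset where the proposed method's best individual is oldest in absolute terms, it is still far younger relative to the search length than the baseline's.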
Despite each using a different mechanism to bias towards younger individuals, the results here confirm the usefulness of age, which is consistent with the observations in both [19, 14].

5.2 Approximation of µ (Difference)

The overall goal of the fitness function is to maximise the true test score µ. Of course, we cannot do this directly, so we use x̄ as a proxy for µ. How good this proxy is, is given in Table 2 as "Difference". This is measured directly as |x̄ − µ|, to quantify how much "overfitting" is occurring. The ideal difference is thus 0, with the worst being 100. Of course, this is not a perfect measure, as we could have something such as x̄ = 0.01µ, which would be a good proxy but have a large difference. We assume this cannot occur, since both scoring functions are the same, just computed on different sets of data. However, there could also be more complex underlying relationships, which would not be found with this difference measure. Therefore, this measure alone should be interpreted with caution, but when paired with the testing results in Table 1 we can get a better understanding of the approximation. For example, we could have a perfect proxy (x̄ = µ, difference = 0), but if µ is very low then that is not ideal. Pairing the approximation with the true testing accuracy in Table 1, we can see both a closer approximation of µ than the original method and also higher resulting testing accuracies (both statistically significant). This confirms that our approximation of repeated k-fold CV is useful for getting a less biased estimate of µ. This is very important, since the goal of AutoML is to improve µ indirectly by improving x̄ (as we cannot directly optimise the testing performance), so achieving an unbiased estimate assists this goal.
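The "Difference" column is just this absolute gap between the internal CV estimate x̄ and the held-out test score µ; a minimal sketch (the example scores are illustrative, on the paper's 0–100 scale):

```python
def proxy_gap(cv_score, test_score):
    """|x̄ − µ|: gap between the internal CV estimate and the true test score."""
    return abs(cv_score - test_score)

print(proxy_gap(95.4, 95.1))  # small gap: little overfitting to the folds
print(proxy_gap(99.0, 90.5))  # large gap: the CV estimate is optimistic
```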
5.3 Complexity

TPOT (and by extension the proposed method) already uses NSGA-II to balance the complexity of the pipelines against their performance, where the goal is to minimise complexity and maximise performance (f1-score). In this work, we focus particularly on improving the performance of the pipelines and, as such, all discussion up until this point has focused on classification performance. However, an important concern is that this does not come at the expense of an increase in complexity.

Therefore, we analyse whether the new regularised evolution has any additional effect on the size of the pipelines. This is shown in the final columns of Table 2, which give the average size of the best resulting individual from each run of the proposed method and each run of the baseline method across every dataset. No statistically significant difference in size was found, which is reassuring. This means dynamically changing one objective while leaving the other fixed had no negative impact on the fixed objective. It also means good individuals which occurred later in the evolution were no more likely to be larger than individuals which occurred early, which is what can often be seen in single-objective GP (see "bloat" [20]).

To validate the claims above that no negative effect was seen on complexity, we also perform an additional comparison considering both objectives. In Fig. 4, we visualise a dominance plot. A method dominates another method on a particular dataset if at least one resulting objective (complexity or performance) is strictly better than the other method's corresponding objective, and all other objectives are at least as good as the other method's. We can see that on 11 datasets, the proposed method dominates the baseline. On 5 datasets, the baseline dominates the proposed. On the remaining datasets, neither method dominates the other (i.e.
one objective was better, but the other was worse – a trade-off). Furthermore, in the majority of the cases (4 out of 5) where the proposed method is dominated, these are on the simpler problems (i.e. close to a perfect test performance with a complexity of 1), whereas on the more difficult problems the improvements become more apparent.

As a result, we can conclude that no negative effect can be seen on the complexity (i.e. no increase), but a positive effect can be seen on the performance with the newly proposed fitness function, particularly on more complex datasets. The result is improved pipeline performance at no increase in pipeline complexity.

[Figure 4: Dominance Plot. Each point represents the average result on a dataset (complexity vs. score). Purple points are the proposed method, and orange points are the baseline (original) method. Lines pair datasets between methods. A green line indicates the proposed method dominates the baseline (11 datasets). A red line indicates the baseline dominates the proposed (5 datasets). A grey line indicates no dominance (12 datasets), i.e. one method achieved better in objective 1 but the other method did better in objective 2.]

6 Conclusions and Future Work

In this work, we proposed a novel fitness function which can be used to improve the generalisation ability of AutoML, by serving as an implicit form of regularisation. The fitness function is dynamic and changes at each generation. The fitness of an individual is then measured as its average throughout the individual's lifetime. We implemented this new fitness in place of the standard fitness evaluations in the current state-of-the-art AutoML method TPOT, and showed that we achieve a significant improvement over the standard (static) fitness function in general, on all time-equated comparisons.
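The dynamic fitness recapped above can be sketched in a few lines: a fresh k-fold split is drawn every generation, and an individual's fitness is the running mean of its scores over every generation (and hence every split) it has survived. This is a minimal illustrative sketch — the `Individual` class and `make_folds` helper are simplifications for exposition, not TPOT's actual implementation:

```python
import random

def make_folds(n_samples, k, rng):
    """Draw a fresh random k-fold split (a new split each generation)."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]  # k disjoint test folds

class Individual:
    """A pipeline with a lifetime-average (dynamic) fitness."""

    def __init__(self, pipeline):
        self.pipeline = pipeline
        self.age = 0          # generations survived so far
        self.score_sum = 0.0  # sum of per-generation k-fold CV scores

    def update_fitness(self, cv_score):
        """Fold this generation's CV score into the lifetime average.

        An older individual must keep scoring well on every new split
        to remain competitive, which approximates repeated k-fold CV
        at little extra cost over a single k-fold."""
        self.age += 1
        self.score_sum += cv_score
        return self.score_sum / self.age
```

Each generation, the evolutionary loop would evaluate every individual on the new folds and rank the population by the value `update_fitness` returns; survivors accumulate scores across many splits, while newly created individuals start from a single split.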
The improvement in performance is due to the fact that the new fitness function approximates repeated k-fold CV, which helps prevent the overfitting that can occur due to iterative improvements over a limited number of folds, while also avoiding the manual specification of a repetition factor r. We empirically show this works well for AutoML problems, but the proposed fitness function is general enough to be applied to any EC method with a static fitness function as a "free" improvement to help generalisation, particularly in large searches plagued with local optima (such as AutoML).

For further work, there is already much research into model evaluation schemes [21, 4, 3, 18, 22, 23, 17, 24]; however, a thorough analysis of the impact these would have in AutoML has yet to be conducted and is beyond the scope of this paper. For example, should we be checking whether improvements are statistically significant at each generation? If not, are there better ways to improve our approximation of µ for comparing these methods directly?

Another potential research direction is based on ensemble learning. Here, we average performance over the lifetime of an individual by altering the fitness function at each generation to ensure generalisation performance. An alternative approach could store the best individual from each generation (and thus the best individual for each split of the data), and then use the best resulting individuals from every generation as an ensemble. In this sense, a "free" ensemble could be constructed, but as a result, the final pipelines would be far more complex (an ensemble with size equal to the number of generations). Other methods could also be considered based on this idea, where an ensemble could be easily constructed due to the randomness in the fitness function (which indirectly creates diversity).

References

[1] A. Truong, A. Walters, J. Goodsitt, K.
Hines, B. Bruss, and R. Farivar, "Towards automated machine learning: Evaluation and comparison of AutoML approaches and tools," arXiv preprint arXiv:1908.05557, 2019.

[2] T. Le, W. Fu, and J. Moore, "Scaling tree-based automated machine learning to biomedical big data with a feature set selector," Bioinformatics (Oxford, England), 2019.

[3] Y. Zhang and Y. Yang, "Cross-validation for selecting a model selection procedure," Journal of Econometrics, vol. 187, no. 1, pp. 95–112, 2015.

[4] D. Krstajic, L. J. Buturovic, D. E. Leahy, and S. Thomas, "Cross-validation pitfalls when selecting and assessing regression and classification models," Journal of Cheminformatics, vol. 6, no. 1, p. 10, 2014.

[5] B. Evans, "Population-based ensemble learning with tree structures for classification," 2019.

[6] R. S. Olson, N. Bartley, R. J. Urbanowicz, and J. H. Moore, "Evaluation of a tree-based pipeline optimization tool for automating data science," in Proceedings of the Genetic and Evolutionary Computation Conference 2016, ser. GECCO '16. New York, NY, USA: ACM, 2016, pp. 485–492. [Online]. Available: http://doi.acm.org/10.1145/2908812.2908918

[7] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter, "Efficient and robust automated machine learning," in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2962–2970. [Online]. Available: http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf

[8] L. Kotthoff, C. Thornton, H. H. Hoos, F. Hutter, and K. Leyton-Brown, "Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA," The Journal of Machine Learning Research, vol. 18, no. 1, pp. 826–830, 2017.

[9] C. Thornton, F. Hutter, H. H. Hoos, and K.
Leyton-Brown, "Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms," in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013, pp. 847–855.

[10] P. Gijsbers, E. LeDell, S. Poirier, J. Thomas, B. Bischl, and J. Vanschoren, "An open source AutoML benchmark," arXiv preprint [cs.LG], 2019, accepted at the AutoML Workshop at ICML 2019.

[11] L. I. Kuncheva and J. C. Bezdek, "Nearest prototype classification: Clustering, genetic algorithms, or random search?" IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 28, no. 1, pp. 160–164, 1998.

[12] R. S. Olson, R. J. Urbanowicz, P. C. Andrews, N. A. Lavender, L. C. Kidd, and J. H. Moore, Applications of Evolutionary Computation: 19th European Conference, EvoApplications 2016, Porto, Portugal, March 30 – April 1, 2016, Proceedings, Part I. Springer International Publishing, 2016, ch. Automating Biomedical Data Science Through Tree-Based Pipeline Optimization, pp. 123–137.

[13] J. R. Koza, "Genetic programming," 1997.

[14] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, "Regularized evolution for image classifier architecture search," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 4780–4789.

[15] I. Guyon, L. Sun-Hosoya, M. Boullé, H. J. Escalante, S. Escalera, Z. Liu, D. Jajetic, B. Ray, M. Saeed, M. Sebag et al., "Analysis of the AutoML challenge series 2015–2018," in Automated Machine Learning. Springer, 2019, pp. 177–219.

[16] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945. [Online]. Available: http://www.jstor.org/stable/3001968

[17] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, no. Jan, pp. 1–30, 2006.
[18] T. G. Dietterich, "Approximate statistical tests for comparing supervised classification learning algorithms," Neural Computation, vol. 10, no. 7, pp. 1895–1923, 1998.

[19] G. S. Hornby, "ALPS: the age-layered population structure for reducing the problem of premature convergence," in Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation. ACM, 2006, pp. 815–822.

[20] P. A. Whigham and G. Dick, "Implicitly controlling bloat in genetic programming," IEEE Transactions on Evolutionary Computation, vol. 14, no. 2, pp. 173–190, 2009.

[21] Y. Bengio and Y. Grandvalet, "No unbiased estimator of the variance of k-fold cross-validation," Journal of Machine Learning Research, vol. 5, no. Sep, pp. 1089–1105, 2004.

[22] C. Nadeau and Y. Bengio, "Inference for the generalization error," in Advances in Neural Information Processing Systems, 2000, pp. 307–313.

[23] R. R. Bouckaert and E. Frank, "Evaluating the replicability of significance tests for comparing learning algorithms," in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2004, pp. 3–12.

[24] G. Vanwinckelen and H. Blockeel, "On estimating model accuracy with repeated cross-validation," in BeneLearn 2012: Proceedings of the 21st Belgian-Dutch Conference on Machine Learning, 2012, pp. 39–44.