Reinforcement-based Simultaneous Algorithm and its Hyperparameters Selection
JMLR: Workshop and Conference Proceedings 60:1–10, 2016. ACML 2016.

Valeria Efimova (efimova@rain.ifmo.ru)
Andrey Filchenkov (afilchenkov@corp.ifmo.ru)
Anatoly Shalyto (shalyto@mail.ifmo.ru)
ITMO University, St. Petersburg, Kronverksky Pr. 49

Abstract

Many algorithms for data analysis exist, especially for classification problems. To solve a data analysis problem, a proper algorithm should be chosen, and its hyperparameters should be selected as well. In this paper, we present a new method for the simultaneous selection of an algorithm and its hyperparameters. To do so, we reduce this problem to the multi-armed bandit problem: we consider each algorithm as an arm, and a hyperparameter search run for a fixed time as the corresponding arm play. We also suggest a problem-specific reward function. We performed experiments on 10 real datasets and compared the suggested method with the existing one implemented in Auto-WEKA. The results show that our method is significantly better in most cases and never worse than Auto-WEKA.

Keywords: algorithm selection, hyperparameter optimization, multi-armed bandit, reinforcement learning

1. Introduction

The goal of supervised learning is to find a data model for a given dataset that allows making the most accurate predictions. Many learning algorithms exist for building such models, especially for classification. These algorithms show varying performance on different tasks, which prevents the use of a single universal algorithm to build a data model for all existing datasets. Moreover, the performance of most of these algorithms depends on hyperparameters, the selection of which dramatically affects the resulting quality.
Automated simultaneous selection of a learning algorithm and its hyperparameters is a sophisticated problem. Usually, this problem is divided into two subproblems that are solved independently: algorithm selection and hyperparameter optimization. The first is to select an algorithm from a set of algorithms (an algorithm portfolio). The second is to find the best hyperparameters for the preselected algorithm.

The first subproblem is typically solved by practitioners by testing each algorithm in the portfolio with prechosen hyperparameters. Other methods are also in use, such as selecting algorithms randomly, by heuristics, or using k-fold cross-validation (Rodriguez et al., 2010). The last method, however, requires running and then comparing all the algorithms, and the other methods are not universally applicable. Nevertheless, this subproblem has been in the scope of research interest for decades. Decision rules were used in early papers on algorithm selection from a portfolio (Aha, 1992); as an example, such rules were created to choose from 8 algorithms in (Ali and Smith, 2006).

© 2016 V. Efimova, A. Filchenkov & A. Shalyto.

Nowadays, more effective approaches exist, such as meta-learning (Giraud-Carrier et al., 2004; Abdulrahman et al., 2015). This approach reduces the algorithm selection problem to a supervised learning problem. It requires a training set of datasets $D$. For each $d \in D$, a meta-feature vector is evaluated. Meta-features are useful characteristics of datasets, such as the number of categorical or numerical features of an object $x \in d$, the size of $d$, and many others (Filchenkov and Pendryak, 2015; Castiello et al., 2005). After that, all the algorithms are run on every dataset $d \in D$, and class labels are formed based on empirical risk evaluation.
Then a meta-classifier is learnt on the prepared data, with datasets as objects and the best algorithms as labels. It is worth noting that this problem is better solved as a learning-to-rank problem (Brazdil et al., 2003; Sun and Pfahringer, 2013).

The second subproblem is hyperparameter optimization: finding the hyperparameter vector of a learning algorithm that leads to the best performance of this algorithm on a given dataset. For example, the hyperparameters of the Support Vector Machine (SVM) include the kernel function and its own hyperparameters; for a neural network, they include the number of hidden layers and the number of neurons in each of them. In practice, algorithm hyperparameters are usually chosen manually (Hutter et al., 2015). Sometimes the selection problem can be reduced to a simple optimization problem (primarily for statistical and regression algorithms), as, for instance, in (Strijov and Weber, 2010), but this method is not universally applicable. Since hyperparameter optimization of classification algorithms is often performed manually, it requires a lot of time and does not always lead to acceptable performance. Several algorithms solve the second subproblem automatically: Grid Search (Bergstra and Bengio, 2012), Random Search (Hastie et al., 2005), Stochastic Gradient Descent (Bottou, 1998), the Tree-structured Parzen Estimator (Bergstra et al., 2011), and Bayesian optimization, including Sequential Model-Based Optimization (SMBO) (Snoek et al., 2012). In (Hutter et al., 2011), Sequential Model-based Algorithm Configuration (SMAC), based on the SMBO framework, is introduced. Another idea, implemented in (Mantovani et al., 2015), is to predict the best hyperparameter vector with a meta-learning approach.
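Of the automatic methods listed above, Random Search is the simplest to illustrate. The sketch below is generic and hypothetical (the toy loss surface and the search ranges are not from the paper); it only shows the sample-evaluate-keep-best loop that the more sophisticated optimizers refine:

```python
import random

def random_search(space, evaluate, n_trials=100, seed=0):
    """Sample hyperparameter vectors uniformly and keep the best one."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        # Draw one value per hyperparameter from its (low, high) range.
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        loss = evaluate(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

# Hypothetical stand-in for the empirical risk of an SVM-like model,
# minimal near C = 1.0, gamma = 0.1 (purely illustrative).
def toy_loss(p):
    return (p["C"] - 1.0) ** 2 + (p["gamma"] - 0.1) ** 2

space = {"C": (0.01, 10.0), "gamma": (0.001, 1.0)}
params, loss = random_search(space, toy_loss, n_trials=500)
```

Grid Search differs only in how `params` is generated; SMBO methods such as SMAC replace the uniform sampling with a model-guided proposal.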
A reinforcement-based approach was used in (Jamieson and Talwalkar, 2015) to operate several optimization threads with different settings.

A solution for the simultaneous selection of an algorithm and its hyperparameters is important for machine learning applications, but only a few papers are devoted to this search, and they consider only special cases. One possible solution is to build a huge set of algorithms with prechosen hyperparameters and select from it. This solution was implemented in (Leite et al., 2012), in which a set of about 300 algorithms with fixed hyperparameters was used. However, such a pure algorithm selection approach cannot provide any assurance about the quality of these algorithms on a new problem: the set may simply not include a hyperparameter vector with the best performance for one of the presented learning algorithms.

Another possible solution is sequential hyperparameter optimization for every learning algorithm in the portfolio, followed by selection of the best of them. This solution is implemented in the Auto-WEKA library (Thornton et al., 2012): it chooses one of 27 base learning algorithms, 10 meta-algorithms, and 2 ensemble algorithms, and optimizes its hyperparameters with the SMAC method simultaneously and automatically. The method is described in detail in (Thornton et al., 2012). Using it takes enormous time and may be referred to as exhaustive search (while, in fact, it is not, due to the infinity of the hyperparameter spaces).

The goal of this work is to suggest a method for simultaneous selection of a learning algorithm and its hyperparameters that is faster than the exhaustive search without degrading the quality of the found solution.
To do so, we use a multi-armed bandit-based approach.

The remainder of this paper is organized as follows. In Section 2, we describe in detail the learning algorithm and hyperparameter selection problem and its two subproblems. The suggested method, based on the multi-armed bandit problem, is presented in Section 3. In Section 4, experiment results are presented and discussed. Section 5 concludes. This paper extends a paper accepted to the International Conference on Intelligent Data Processing: Theory and Applications 2016.

2. Problem Statement

Let $\Lambda$ be the hyperparameter space related to a learning algorithm $A$. We denote the algorithm with a prechosen hyperparameter vector $\lambda \in \Lambda$ as $A_\lambda$.

Here is the formal description of the algorithm selection problem. We are given a set of algorithms with chosen hyperparameters $\mathcal{A} = \{A^1_{\lambda_1}, \dots, A^m_{\lambda_m}\}$ and a learning dataset $D = \{d_1, \dots, d_n\}$, where $d_i = (x_i, y_i)$ is a pair consisting of an object and its label. We should choose a parametrized algorithm $A^*_{\lambda^*}$ that is the most effective with respect to a quality measure $Q$. Algorithm efficiency is appraised by partitioning the dataset into learning and test sets, followed by empirical risk estimation on the test set:

$$Q(A_\lambda, D) = \frac{1}{|D|} \sum_{x \in D} L(A_\lambda, x),$$

where $L(A_\lambda, x)$ is a loss function on object $x$, which is usually $L(A_\lambda, x) = [A_\lambda(x) \neq y(x)]$ for classification problems. The algorithm selection problem is thus stated as the empirical risk minimization problem:

$$A^*_{\lambda^*} \in \operatorname*{argmin}_{A^j_{\lambda_j} \in \mathcal{A}} Q(A^j_{\lambda_j}, D).$$

Hyperparameter optimization is the process of selecting hyperparameters $\lambda^* \in \Lambda$ of a learning algorithm $A$ to optimize its performance. Therefore, we can write:

$$\lambda^* \in \operatorname*{argmin}_{\lambda \in \Lambda} Q(A_\lambda, D).$$

In this paper, we consider simultaneous algorithm selection and hyperparameter optimization.
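The two definitions above can be made concrete in a few lines. A minimal sketch, assuming toy threshold classifiers as the parametrized algorithms (the candidate set and test data here are illustrative, not from the paper):

```python
def empirical_risk(predict, test_set):
    """Q(A_lambda, D): fraction of test objects the parametrized
    algorithm misclassifies (0-1 loss averaged over D)."""
    errors = sum(1 for x, y in test_set if predict(x) != y)
    return errors / len(test_set)

# Two toy parametrized "algorithms": threshold classifiers whose
# single hyperparameter (the threshold) is already fixed.
candidates = {
    "A1(thr=0.3)": lambda x: int(x > 0.3),
    "A2(thr=0.7)": lambda x: int(x > 0.7),
}
test_set = [(0.1, 0), (0.4, 1), (0.6, 1), (0.9, 1)]

# Algorithm selection: argmin of empirical risk over the portfolio.
best = min(candidates, key=lambda name: empirical_risk(candidates[name], test_set))
```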
We are given a learning algorithm set $\mathcal{A} = \{A^1, \dots, A^k\}$. Each learning algorithm $A^i$ is associated with a hyperparameter space $\Lambda_i$. The goal is to find an algorithm $A^*_{\lambda^*}$ minimizing the empirical risk:

$$A^*_{\lambda^*} \in \operatorname*{argmin}_{A^j \in \mathcal{A},\ \lambda \in \Lambda_j} Q(A^j_\lambda, D).$$

We assume that hyperparameter optimization is performed by a sequential hyperparameter optimization process. Let us give a formal description. A sequential hyperparameter optimization process for a learning algorithm $A^i$ is a mapping

$$\pi_i\big(t, A^i, \{\lambda^i_j\}_{j=0}^{k}\big) \to \lambda^i_{k+1} \in \Lambda_i.$$

It is a hyperparameter optimization method run on the learning algorithm $A^i$ with time budget $t$; it also stores the best hyperparameter vectors $\{\lambda^i_j\}_{j=0}^{k}$ found within the previous $k$ iterations. All of the hyperparameter optimization methods listed in the introduction can be described as a sequential hyperparameter optimization process, for instance, Grid Search or any member of the SMBO algorithm family, including the SMAC method used in this paper.

Suppose that a sequential hyperparameter optimization process $\pi_i$ is associated with each learning algorithm $A^i$. Then the previous problem can be solved by running all these processes. However, a new problem arises: minimizing the time needed to find the best algorithm. In practice, there is a similar problem that is more interesting in practical terms: finding the best algorithm within a fixed time. Let us describe it formally. Let $T$ be the time budget for finding the best algorithm $A^*_{\lambda^*}$. We should split $T$ into intervals $T = t_1 + \dots + t_m$ such that running each process $\pi_i$ with time budget $t_i$ yields minimal empirical risk:

$$\min_j Q(A^j_{\lambda_j}, D) \xrightarrow{(t_1, \dots, t_m)} \min, \quad \text{where } A^j \in \mathcal{A},\ \lambda_j = \pi_j(t_j, A^j, \emptyset),\ t_1 + \dots + t_m = T,\ t_i \ge 0\ \forall i.$$

3. Suggested Method

In this problem, the key resource is the hyperparameter optimization time limit $T$. Let us split it into $q$ equal small intervals $t$ and call them time budgets. Now we can solve the time budget assignment problem.

Let us look at our problem in a different way. For each time interval, we should choose, before the interval starts, a process to be run during this interval. The quality that will be reached by an algorithm on a given dataset is a priori unknown. On the one hand, the time spent searching hyperparameters of not-the-best learning algorithms is subtracted from the time available to improve hyperparameters of the best learning algorithm. On the other hand, if all the time is spent tuning a single algorithm, we may miss better algorithms. Thus, since there is no marginal solution, the problem is to find a tradeoff between exploration (assigning time for tuning hyperparameters of different algorithms) and exploitation (assigning time for tuning hyperparameters of the current best algorithm). Finding this tradeoff is the classical problem in reinforcement learning, a special case of which is the multi-armed bandit problem (Sutton and Barto, 1998). We cannot assume that there is a hidden process of state transformation that affects the performance of algorithms; thus, we may assume that the environment is static.

The multi-armed bandit problem is a problem in which there are $N$ bandit arms. Playing each of the arms grants a certain reward, chosen according to an unknown probability distribution specific to this arm. At each iteration $k$, an agent chooses an arm $a_i$ and gets a reward $r(i, k)$. The agent's goal is to minimize the total loss by time $T$.
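Under this framing, each sequential hyperparameter optimization process becomes an arm, and assigning one time budget to it is a play. A minimal sketch of that interface, with random sampling standing in for a real optimizer such as SMAC (the class name, the toy quality curves, and the round-robin schedule are all hypothetical):

```python
import random

class SequentialHPOProcess:
    """pi_i(t, A_i, history) -> lambda: one arm of the bandit.
    Each play runs the underlying optimizer for a fixed budget t
    and records the best hyperparameter vector found so far."""

    def __init__(self, name, quality_fn, seed=0):
        self.name = name
        self.quality_fn = quality_fn      # empirical risk of a sampled config
        self.rng = random.Random(seed)
        self.history = []                 # best risk after each play
        self.best_risk = float("inf")

    def run_for(self, budget):
        # Stand-in for "optimize for t seconds": draw `budget` random
        # configurations and keep the best (a real process would be SMAC).
        for _ in range(budget):
            lam = self.rng.random()
            risk = self.quality_fn(lam)
            if risk < self.best_risk:
                self.best_risk = risk
        self.history.append(self.best_risk)
        return self.best_risk

# Split the total budget T into q equal plays; here they are spent
# round-robin (the suggested method chooses arms with a bandit solver).
T, q = 60, 12
t = T // q
processes = [
    SequentialHPOProcess("A1", lambda lam: abs(lam - 0.8)),
    SequentialHPOProcess("A2", lambda lam: 0.3 + abs(lam - 0.5)),
]
for k in range(q):
    processes[k % len(processes)].run_for(t)
best = min(processes, key=lambda p: p.best_risk)
```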
In this paper, we use the following algorithms for solving this problem (Sutton and Barto, 1998):

1. $\varepsilon$-greedy: on each iteration, the average reward $\bar{r}_{a,t}$ is estimated for each arm $a$. Then the agent plays the arm with the maximal average reward with probability $1 - \varepsilon$, and a random arm with probability $\varepsilon$. If each arm is played an infinite number of times, the average reward converges to the real reward with probability 1.

2. UCB1: initially, the agent plays each arm once. On iteration $t$, it plays the arm $a_t$ such that:

$$a_t \in \operatorname*{argmax}_{i = 1..N} \left( \bar{r}_{i,t} + \sqrt{\frac{2 \ln t}{n_i}} \right),$$

where $\bar{r}_{i,t}$ is the average reward for arm $i$, and $n_i$ is the number of times arm $i$ was played.

3. Softmax: initially, the agent plays each arm once. On iteration $t$, it plays arm $a_i$ with probability:

$$p_{a_i} = \frac{e^{\bar{r}_i / \tau}}{\sum_{j=1}^{N} e^{\bar{r}_j / \tau}},$$

where $\tau$ is a positive temperature parameter.

In this paper, we associate arms with the sequential hyperparameter optimization processes $\{\pi_i(t, A^i, \{\lambda_k\}_{k=0}^{q}) \to \lambda^i_{q+1} \in \Lambda_i\}_{i=0}^{m}$ for the learning algorithms $\mathcal{A} = \{A^1, \dots, A^m\}$. After playing arm $i = a_k$ at iteration $k$, we assign time budget $t$ to the process $\pi_{a_k}$ to optimize hyperparameters. When the time budget runs out, we receive a hyperparameter vector $\lambda^i_k$. Finally, when the selected process stops, we evaluate the result using the empirical risk estimate for process $\pi_i$ at iteration $k$, that is, $Q(A^i_{\lambda^i_k}, D)$.

The algorithm, which we name MASSAH (Multi-Armed Simultaneous Selection of an Algorithm and its Hyperparameters), is presented in Algorithm 1. There, $MABSolver$ implements a multi-armed bandit problem solution, and $getConfig(i)$ is a function that returns $A^i_{\lambda_q}$, the best configuration found for algorithm $A^i$ after $q$ iterations.

The question we need to answer is how to define a reward function.
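The three selection rules can be written down directly from the formulas above; this is a generic sketch with arm statistics kept as plain lists, not code from the paper:

```python
import math
import random

rng = random.Random(42)

def epsilon_greedy(avg_reward, eps):
    """Play the best-average arm with prob 1-eps, a random arm with prob eps."""
    if rng.random() < eps:
        return rng.randrange(len(avg_reward))
    return max(range(len(avg_reward)), key=lambda i: avg_reward[i])

def ucb1(avg_reward, plays, t):
    """Average reward plus the exploration bonus sqrt(2 ln t / n_i)."""
    return max(range(len(avg_reward)),
               key=lambda i: avg_reward[i] + math.sqrt(2 * math.log(t) / plays[i]))

def softmax(avg_reward, tau):
    """Sample an arm with probability proportional to exp(r_i / tau)."""
    weights = [math.exp(r / tau) for r in avg_reward]
    total = sum(weights)
    u, acc = rng.random() * total, 0.0
    for i, w in enumerate(weights):
        acc += w
        if u <= acc:
            return i
    return len(weights) - 1
```

A smaller temperature `tau` makes Softmax behave more greedily; a larger one approaches uniform random play.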
The first (and simplest) way is to define the reward as the difference between the current empirical risk and the optimal empirical risk found during previous iterations. However, this has several disadvantages. When the optimization process finds hyperparameters that lead to almost optimal algorithm performance, the reward becomes extremely small. Also, such a reward function does not seem to be a good option for MABs, since the probability distribution would depend on the number of iterations. In order to find a reward function whose corresponding probability distribution does not change while the algorithm runs, we apply a little trick: instead of defining the reward function itself, we define an average reward function. In order to do so, we use features of the SMAC algorithm.

Let us describe the SMAC algorithm. At each iteration, a set of currently optimal hyperparameter vectors is known for each algorithm. A local search is applied to find hyperparameter vectors that differ in one position from an optimal vector and improve algorithm quality. These hyperparameter vectors are added to the set, together with some random hyperparameter vectors. Then the selected configurations (the algorithms with their hyperparameters) are sorted by expected improvement (EI), and some of the best configurations are run.

Algorithm 1: MASSAH
  Data: D — the given dataset; q — the number of iterations; t — the time budget for one
        iteration; {π_i}, i = 1, ..., N — sequential hyperparameter optimization processes.
  Result: A_λ — an algorithm with chosen hyperparameters.
  for i = 1, ..., N do
      λ_i ← π_i(t, A^i, ∅)
      e_i ← Q(A^i_{λ_i}, D)
  end
  best_err ← min_{i=1,...,N} e_i
  best_proc ← argmin_{i=1,...,N} e_i
  for j = 1, ..., q do
      i ← MABSolver({π_i}, i = 1, ..., N)
      λ_i ← π_i(t, A^i, {λ_k}_{k=1}^j)
      e_i ← Q(A^i_{λ_i}, D)
      if e_i < best_err then
          best_err ← e_i
          best_proc ← i
      end
  end
  return getConfig(π_{best_proc})

As in SMAC, we use the empirical risk expectation at iteration $k$: $\mathbb{E}_{(k)}(Q(A^i_{\lambda^i_k}, D))$, where $Q(A^i_{\lambda^i_k}, D)$ is the empirical risk value reached by process $\pi_i$ on dataset $D$ at iteration $k$. Note that process $\pi_i$ optimizes hyperparameters for empirical risk minimization, but the multi-armed bandit problem is a maximization problem. Therefore, we define the average reward function as:

$$\bar{r}_{i,(k)} = \frac{Q_{\max} - \mathbb{E}_{(k)}(Q(A^i_{\lambda^i_k}, D))}{Q_{\max}},$$

where $Q_{\max}$ is the maximal empirical risk that was achieved on the given dataset.

4. Experiments

Since Auto-WEKA implements the only existing solution, we choose it for comparison. Experiments were performed on 10 different real datasets with a predefined split into training and test data from the UCI repository.¹ The characteristics of these datasets are presented in Table 1. The suggested approach allows using any hyperparameter optimization method; in order to perform the comparison properly, we use the SMAC method, which is used by Auto-WEKA. We consider 6 well-known classification algorithms: k Nearest Neighbors (4 categorical and 1 numerical hyperparameters), Support Vector Machine (4 and 6), Logistic Regression (0 and 1), Random Forest (2 and 3), Perceptron (5 and 2), and C4.5 Decision Tree (6 and 2).

1. http://www.cs.ubc.ca/labs/beta/Projects/autoweka/datasets/

Table 1: Datasets description.
Dataset         Categorical  Numerical  Number of  Objects in    Objects in
                features     features   classes    training set  test set
Dexter          0            20000      2          420           180
German Credit   13           7          2          700           300
Dorothea        0            100000     2          805           345
Yeast           0            8          10         1039          445
Secom           0            590        2          1097          470
Semeion         0            256        10         1116          477
Car             6            0          4          1210          518
KR-vs-KP        36           0          2          2238          958
Waveform        0            40         3          3500          1500
Shuttle         38           192        2          35000         15000

As we previously stated, we are given time $T$ to find the solution of the main problem. The suggested method requires splitting $T$ into small equal intervals $t$; at each iteration, one such interval is given to a selected process $\pi_i$. We compared the method's performance for different time budget values $t$ to find the optimal value, considering budgets from 10 to 60 seconds with a 3-second step. We ran the suggested method on 3 of the datasets described above (Car, German Credits, KR-vs-KP) with 4 multi-armed bandit problem solutions (UCB1, 0.4-greedy, 0.6-greedy, Softmax), running each configuration 3 times. The results show no regularity, so we fix the time budget $t$ at 30 seconds.

In the quality comparison, we consider the suggested method with different multi-armed bandit problem solutions: UCB1, 0.4-greedy, 0.6-greedy, and Softmax with the naïve reward function, and two solutions, $UCB1_{E(Q)}$ and $Softmax_{E(Q)}$, with the suggested reward function. The time budget per iteration is $t = 30$ seconds; the general time limit is $T = 3$ hours $= 10800$ seconds. We ran each configuration 12 times with random seeds of the SMAC algorithm. Auto-WEKA is also limited to 3 hours and selects one of the algorithms we specified above.

The experiment results are shown in Table 2. They show that the suggested method is significantly better than Auto-WEKA in most of the cases over all 10 datasets, because its variations reach the smallest empirical risk.
There is no fundamental difference between the results of the suggested method's variations. Nevertheless, the algorithms $UCB1_{E(Q)}$ and $Softmax_{E(Q)}$, which use the suggested reward function, achieved the smallest empirical risk in most cases.

The experiment results show that the suggested approach improves on the existing solution of the simultaneous learning algorithm and hyperparameters selection problem. Moreover, the suggested approach does not impose restrictions on the hyperparameter optimization process, so the search is performed over the entire hyperparameter space of each learning algorithm. It is significant that the suggested method selects a learning algorithm with hyperparameters whose quality is not worse than the Auto-WEKA outcome quality.

Table 2: Comparison of Auto-WEKA and the suggested methods for selecting a classification algorithm and its hyperparameters for the given dataset. We performed 12 independent runs of each configuration and report the smallest empirical risk Q achieved by Auto-WEKA and the suggested method's variations. Entries that are minimal for the given dataset (bold in the original) are marked with an asterisk.

Dataset         AutoWEKA  UCB1     0.4-greedy  0.6-greedy  Softmax  UCB1_E(Q)  Softmax_E(Q)
Car             0.3305    0.1836*  0.1836*     0.1836*     0.1836*  0.1836*    0.1836*
Yeast           34.13     29.81*   29.81*      33.65       29.81*   29.81*     29.81*
KR-vs-KP        0.2976    0.1488*  0.1488*     0.1488*     0.1488*  0.1488*    0.1488*
Semeion         4.646     1.786*   1.786*      1.786*      1.786*   1.786*     1.786*
Shuttle         0.00766   0.0115   0.0115      0.00766     0.0115   0.0076*    0.0076*
Dexter          7.143     2.38     2.381       2.381       2.381    2.381      0.16*
Waveform        11.28     8.286*   8.286*      8.286*      8.286*   8.286*     8.286*
Secom           4.545     3.636*   4.545       4.545       3.636*   3.636*     3.636*
Dorothea        6.676     4.938    4.958       4.938       4.938    4.32       2.469*
German Credits  19.29     14.29*   14.29*      15.71       14.29*   14.29*     14.29*

We claim that the suggested method is statistically not worse than Auto-WEKA. To show this, we carried out the Wilcoxon signed-rank test. In the experiments, we use 10 datasets, which yields an appropriate number of pairs; the other assumptions of the Wilcoxon test also hold. Therefore, we have 6 test checks: a comparison of Auto-WEKA against each variation of the suggested method. Since the number of samples is 10, the results are meaningful when the sum of untypical ranks satisfies $T < T_{0.01} = 5$. We consider a minimization problem, so we test only the best of the 12 runs for each dataset. Finally, we obtain $T = 3$ for the $\varepsilon$-greedy algorithms and $T = 1$ for the others. This confirms the statistical significance of the obtained results.

5. Conclusions

In this paper, we suggest and examine a new solution for the problem of simultaneous selection of an algorithm and its hyperparameters. The proposed approach is based on a multi-armed bandit problem solution. We suggest a new reward function that exploits properties of the hyperparameter optimization method; this function is better than the naïve one for applying multi-armed bandit problem solutions to the main problem. The experiment results show that the suggested method outperforms the existing method implemented in Auto-WEKA.

The suggested method can be improved by applying meta-learning in order to evaluate algorithm quality on a given dataset before running any algorithm. This evaluation can be used as prior knowledge of an algorithm's reward. Moreover, we can add a context vector to the hyperparameter optimization process and use solutions of a contextual multi-armed bandit problem.
We can select some datasets by meta-learning, then get the empirical risk estimate and use it as context.

Acknowledgments

The authors would like to thank Vadim Strijov and the anonymous reviewers for useful comments. The research was supported by the Government of the Russian Federation (grant 074-U01) and the Russian Foundation for Basic Research (project no. 16-37-60115).

References

Salisu Mamman Abdulrahman, Pavel Brazdil, Jan N. van Rijn, and Joaquin Vanschoren. Algorithm selection via meta-learning and sample-based active testing. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases; International Workshop on Meta-Learning and Algorithm Selection. University of Porto, 2015.

David W. Aha. Generalizing from case studies: A case study. In Proc. of the 9th International Conference on Machine Learning, pages 1–10, 1992.

Shawkat Ali and Kate A. Smith. On learning algorithm selection for classification. Applied Soft Computing, 6(2):119–138, 2006.

James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1):281–305, 2012.

James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546–2554, 2011.

Léon Bottou. Online learning and stochastic approximations. On-line Learning in Neural Networks, 17(9):142, 1998.

Pavel B. Brazdil, Carlos Soares, and Joaquim Pinto Da Costa. Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results. Machine Learning, 50(3):251–277, 2003.

Ciro Castiello, Giovanna Castellano, and Anna Maria Fanelli. Meta-data: Characterization of input features for meta-learning. In International Conference on Modeling Decisions for Artificial Intelligence, pages 457–468. Springer, 2005.

Andrey Filchenkov and Arseniy Pendryak. Datasets meta-feature description for recommending feature selection algorithm. In Artificial Intelligence and Natural Language and Information Extraction, Social Media and Web Search FRUCT Conference (AINL-ISMW FRUCT), 2015, pages 11–18. IEEE, 2015.

Christophe Giraud-Carrier, Ricardo Vilalta, and Pavel Brazdil. Introduction to the special issue on meta-learning. Machine Learning, 54(3):187–193, 2004.

Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer, 27(2):83–85, 2005.

Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization, pages 507–523. Springer, 2011.

Frank Hutter, Jörg Lücke, and Lars Schmidt-Thieme. Beyond manual tuning of hyperparameters. KI-Künstliche Intelligenz, 29(4):329–337, 2015.

Kevin Jamieson and Ameet Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. JMLR, 41:240–248, 2015.

Rui Leite, Pavel Brazdil, and Joaquin Vanschoren. Selecting classification algorithms with active testing. In Machine Learning and Data Mining in Pattern Recognition, pages 117–131. Springer, 2012.

Rafael Gomes Mantovani, André L. D. Rossi, Joaquin Vanschoren, André Carlos Ponce de Leon Carvalho, et al. Meta-learning recommendation of default hyper-parameter values for SVMs in classification tasks. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases; International Workshop on Meta-Learning and Algorithm Selection. University of Porto, 2015.

Juan D. Rodriguez, Aritz Perez, and Jose A. Lozano. Sensitivity analysis of k-fold cross validation in prediction error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3):569–575, 2010.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.

Vadim Strijov and Gerhard Wilhelm Weber. Nonlinear regression model generation using hyperparameter optimization. Computers & Mathematics with Applications, 60(4):981–988, 2010.

Quan Sun and Bernhard Pfahringer. Pairwise meta-rules for better meta-learning-based algorithm ranking. Machine Learning, 93(1):141–161, 2013.

Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, 1998.

Chris Thornton, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Auto-WEKA: Automated selection and hyper-parameter optimization of classification algorithms. CoRR, abs/1208.3719, 2012.