Reinforcement Learning Approach for Parallelization in Filters Aggregation Based Feature Selection Algorithms
Authors: Ivan Smetannikov (ismetannikov@corp.ifmo.ru), Ilya Isaev (isaev@rain.ifmo.ru), Andrey Filchenkov (afilchenkov@corp.ifmo.ru), ITMO University, St. Petersburg, Kronverksky Pr. 49
JMLR: Workshop and Conference Proceedings 60:1-10, 2016. ACML 2016.

Abstract

One of the classical problems in machine learning and data mining is feature selection. A feature selection algorithm is expected to be quick, and at the same time it should show high performance. The MeLiF algorithm effectively solves this problem using ensembles of ranking filters. This article describes two different ways to improve MeLiF performance with parallelization. Experiments show that the proposed schemes significantly improve algorithm performance and increase feature selection quality.

Keywords: machine learning, feature selection, rank aggregation, multi-armed bandit, parallel computation, MeLiF, MeLiF+, PQMeLiF, MAMeLiF.

1. Introduction

Almost all business and scientific problems nowadays involve processing huge amounts of data with machine learning algorithms. Due to its universal applicability, machine learning has become one of the most promising and researched scientific domains. In particular, it has applications in bioinformatics (Bolón-Canedo et al., 2014; Saeys et al., 2007), as giant amounts of data about gene expression of different organisms are obtained in this field. In order to filter data noise and reduce model complexity, it is necessary to select the most relevant features. Techniques and methods achieving this goal are called feature selection.

Gene expression data can enable researchers to spot which DNA pieces are responsible for reactions to a particular environment change or for some internal processes of an organism. The main problem met in processing such data is the high dimensionality of instances. Gene expression datasets often have a high number of features and a relatively low number of objects. For a dataset with these properties, it is very hard to build a model that fits the data well.

A feature selection algorithm meets several requirements. It is expected to work fast and show good performance. However, no universal algorithm for feature selection exists. Wrappers (Kohavi and John, 1997) are a family of methods based on searching for an optimal feature subset that maximizes the effectiveness of a preselected classifier. Such a problem statement leads to high performance of a found solution. However, the size of the search space grows exponentially with the instance dimensionality. This fact makes wrappers rarely applicable in bioinformatics, as the number of features in datasets can be up to hundreds of thousands. In these cases, other feature selection algorithms known as filters (Sánchez-Maroño et al., 2007) are used. Filters are based on estimation of feature importance. Filters usually perform worse than wrappers, but they are much faster. A special group of feature selection methods are embedded selectors (Lal et al., 2006), which use particular properties of a selected classifier.
Ensembling, which is the process of building a combination of several simple algorithms, is a widely used technique in machine learning (Bolón-Canedo et al., 2012). The MeLiF algorithm, proposed in (Smetannikov and Filchenkov, 2016), applies ensembling to feature selection. This algorithm tries to find a linear combination of basic ranking filters that selects the most relevant features of the dataset. A ranking filter itself consists of two separate parts: a feature importance measure and a cutting rule. Basically, MeLiF tunes the coefficients of a linear combination of feature importance measures. This process involves classifier training, evaluation, and comparison with the ranking filters themselves, thus making it comparatively slow. This is why parallelization can be really handy and helpful for improving the algorithm's computational time.

The simplest parallelization scheme, called MeLiF+, is described in (Isaev and Smetannikov, 2016). The main disadvantage of this naïve scheme is that it does not scale well. It starts the search for the best coefficient vector from several starting points, using a separate thread for each point. When one of these optimization processes ends, its thread simply stops and its resources stay unreleased, so they cannot be used for further work. Thus, it is not useful to allocate a lot of resources to this process, as most of them will stay unused.

To overcome this problem, it is necessary to use the cores of the processing server more effectively. While processing, MeLiF visits a lot of points in the linear space, so we can process points using a task executor. This research proposes two different approaches to using parallel coordinate descent in building ensembles of ranking filters, called PQMeLiF and MAMeLiF. The first algorithm stores points that should be processed in a priority queue. The second algorithm solves the parallelization problem by reducing it to the multi-armed bandit problem.

The remainder of the paper is organized as follows: Section 2 describes the MeLiF algorithm, Section 3 contains the proposed parallelization schemes, Section 4 outlines the experimental setup, Section 5 contains experiment results, and finally Section 6 contains the conclusion. This paper is a version of the paper accepted to the 5th International Young Scientists Conference in HPC and Simulation.

2. Linear combination of ranking filters

A ranking filter f is a pair ⟨m, κ⟩, where m is a feature importance measure and κ is a cutting rule. For each object feature, m returns its importance for label prediction. For a sorted list of features, κ cuts off the worst ones. The core idea of MeLiF is to use several ranking filters f_1, ..., f_N and merge them into a single ranking filter by finding the most effective linear combination of their feature importance measures. This combination is a new feature importance measure, while the cutting rule can be inherited from the basic ranking filters (usually, it is chosen empirically). Any performance measure may be used to evaluate a ranking filter's effectiveness. In this paper, we use classifier effectiveness estimated with the F1 score. Thus, MeLiF simply optimizes a function in the N-dimensional space, where N is the number of basic ranking filters.
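To make the aggregation concrete, the following is a minimal sketch in Python (NumPy/SciPy) of how importance scores from several basic measures can be normalized, combined linearly with a coefficient vector, and cut to the top-m features. The function names are ours, not the authors' code; Spearman correlation is used here only as an example measure (it is one of the basic filters later used in the experiments).

import numpy as np
from scipy.stats import spearmanr

def spearman_importance(X, y):
    # |Spearman rank correlation| between each feature column and the labels,
    # one possible feature importance measure m.
    scores = []
    for j in range(X.shape[1]):
        rho, _ = spearmanr(X[:, j], y)
        scores.append(abs(rho))
    return np.array(scores)

def normalize(scores):
    # Min-max normalization so that measures on different scales are comparable.
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)

def aggregate(measures, coeffs, X, y):
    # New importance measure: a linear combination of the basic measures.
    scores = [normalize(m(X, y)) for m in measures]
    return np.dot(np.array(coeffs), np.vstack(scores))

def cut_top_m(agg_scores, m):
    # Cutting rule kappa: keep the indices of the m most important features.
    return np.argsort(agg_scores)[::-1][:m]

A point in the MeLiF search space is simply the vector coeffs; moving it changes which features survive the cut.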
Evaluation of this function is comparatively costly (we need to run a classifier). However, the dimensionality of the search space is smaller by an order of magnitude than the number of features. This detail allows classifying an algorithm from the MeLiF family as a hybrid of filter and wrapper, which inherits filter speed and wrapper focus on resulting performance.

The algorithm is parametrized with the following hyperparameters:
• δ ∈ R, the value of the grid spacing;
• P, the set of starting points;
• evaluate, the function for classifier effectiveness evaluation at a given point in the search space.

The original MeLiF performs coordinate descent in the search space. It has been observed during experiments that the best option is this particular choice of starting points: (1, 0, ..., 0), (0, 1, ..., 0), ..., (0, 0, ..., 1), corresponding to using only one basic ranking filter, and (1, 1, ..., 1), corresponding to the equally weighted combination of all the basic ranking filters. For each point it reaches, MeLiF tries to shift each coordinate value by +δ and −δ. Then it evaluates the effectiveness of each resulting point. If the evaluation result is greater than the current maximum, the algorithm makes this point the current maximum and restarts the search from its first coordinate. If all coordinates have been shifted by +δ and −δ and no quality improvement is observed, the algorithm stops.

For each point obtained during the coordinate descent, the algorithm measures the value of the resulting linear combination of basic filters for each feature in the dataset. After that, the results are sorted, and the algorithm selects the m topmost features. They are used to train and test a particular classifier. The classification quality is treated as the point score. It is cached and compared to the scores of other points.

3. MeLiF parallel optimizations

In this paper, we propose two parallelization schemes for MeLiF and show that some of their configurations have a speed improvement growing linearly with the number of processors. Furthermore, the proposed schemes show equal or even better performance quality in comparison with the single-threaded version of MeLiF.

The first proposed parallelization scheme is named PQMeLiF. In this name, PQ stands for priority queue. This algorithm is a variation of the best-first search algorithm (Russell and Norvig, 2009). The algorithm stores points that should be processed in a priority queue. On each iteration, the algorithm polls a point from the queue, calculates its score, and puts all its unvisited neighbors back into the queue with priority equal to the calculated score. Before the algorithm starts, the starting points are put into the queue with the maximum priority, which is 1.0. This ensures that all of them will be processed at the very beginning, so all the starting points will be taken into account simultaneously. Unlike MeLiF and MeLiF+, PQMeLiF makes it possible to tune the halting criterion to find a trade-off between feature selection quality and algorithm performance: we can limit the number of points that should be processed. Experiments determined an optimal number that enables the algorithm to perform better than the original MeLiF while remaining as fast as possible.
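All MeLiF variants described below score a point of the search space in the same way: aggregate the basic measures with the point's coefficients, keep the m best features, and cross-validate a classifier on them. The following is a minimal sketch of such an evaluation step, reusing aggregate and cut_top_m from the sketch in Section 2; scikit-learn's SVC is used here only as a stand-in for the WEKA SVM actually used in the experiments, and all names are ours, not the authors' code.

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def evaluate(coeffs, measures, X, y, m=100, folds=5):
    # Score one point of the search space: the paper fixes m = 100 selected
    # features and uses 5-fold cross-validation with the F1 score.
    agg = aggregate(measures, coeffs, X, y)      # from the Section 2 sketch
    selected = cut_top_m(agg, m)
    clf = SVC(kernel="poly", C=1.0)              # stand-in for the WEKA SVM, C = 1
    return cross_val_score(clf, X[:, selected], y, cv=folds, scoring="f1").mean()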
Algorithm 1 MeLiF pseudocode
Require: starting points P, δ, evaluate
  q* = 0
  p* = ∅
  for p : P do
    q = evaluate(p)
    if q > q* then
      p* = p; q* = q
    end if
  end for
  smthChanged = true
  while smthChanged do
    smthChanged = false
    for dim : 1 .. size(p*) do
      p+ = p* with coordinate dim shifted by +δ
      q+ = evaluate(p+)
      if q+ > q* then
        q* = q+; p* = p+; smthChanged = true; break
      end if
      p− = p* with coordinate dim shifted by −δ
      q− = evaluate(p−)
      if q− > q* then
        q* = q−; p* = p−; smthChanged = true; break
      end if
    end for
  end while
  return p*, q*

Algorithm 2 PQMeLiF pseudocode
Require: starting points P, δ, evaluate, T
  q = PriorityBlockingQueue
  p* = ∅
  for p : P do
    enqueue(q, p, 1.0)
  end for
  for each of T threads, run:
    while not mustStop() do
      p = dequeue(q)
      score = evaluate(p)
      updateBestScore(score, p)
      neighbours = getNeighbours(p, δ)
      for p2 : neighbours do
        enqueue(q, p2, score)
      end for
    end while
  return p*

The experiments described in the next two sections showed that PQMeLiF performed much better than MeLiF and MeLiF+. However, they also uncovered its drawbacks. The algorithm initially starts working with only the set of starting points in the queue, the number of which is fixed and equals the number of basic ranking filters plus one. So, if a server has more cores, the extra cores stay unused until new points are added to the queue. A possible solution to this problem is to process more starting points in order to keep all the server cores busy. This drawback can be overcome with our next algorithm.

The main idea of this algorithm is to consider the problem of selecting new points as a reinforcement learning problem, in which we need to find a trade-off between exploration (of new areas in the search space) and exploitation (evaluating points in areas where we have already found many good points) (Sutton and Barto, 1998). In order to apply this idea, we adapted the well-known UCB1 algorithm (Auer et al., 2002) to our parallelization problem by reducing it to the multi-armed bandit problem. This reduction was performed in the following way. First, we split the search space into different areas and associate each area with an arm. Then, evaluating a point in an area is understood as playing the corresponding arm. The reward obtained by such a play is the score of the evaluated point (Bubeck and Slivkins, 2012; Desautels et al., 2014).

The problem we faced during implementation of a reinforcement learning algorithm in a multi-agent environment is delayed feedback. Basically, when some thread needs to select which arm to use, there is no information about the results of other threads at that moment, so it is forced to make a decision with a lack of information. The authors of (Joulani et al., 2013) proposed to make a decision based only on results that have already been computed and provided a theoretical proof that the error function is additive and depends on the duration of the delay.

Algorithm 3 MAMeLiF pseudocode
Require: starting points P, δ, evaluate, T
  points = splitSearchSpace(P)
  queues = []
  q* = 0
  p* = ∅
  for p : points do
    q = PriorityBlockingQueue
    enqueue(q, p, 1.0)
    queues = queues + q
  end for
  for each of T threads, run:
    while not mustStop() do
      qNE = findNonEmpty(queues)
      q = findBestQ(qNE)
      p = dequeue(q)
      score = evaluate(p)
      updateBestScore(score, p)
      neighbours = getNeighbours(p, δ)
      for p2 : neighbours do
        enqueue(q, p2, score)
      end for
    end while
  return p*

4. Experimental setup

As a classifier, we used an SVM with polynomial kernel and soft margin parameter C = 1, implemented in the WEKA library. We used 5-fold cross-validation. The number of selected features was constant: m = 100. We ran our experiments on a machine with the following characteristics: 32-core CPU AMD Opteron 6272 @ 2.1 GHz, 128 GB RAM. We used K = 50 threads, where K = 2pf, p is the number of starting points, and f is the number of folds (here p = 5, the four basic filters plus the equal-weights point, and f = 5, so K = 2 · 5 · 5 = 50). As the basic filters, we used Spearman Rank Correlation, Symmetric Uncertainty, Fit Criterion, and VDM (Auffarth et al., 2010). We also executed MeLiF and MeLiF+ and recorded the work time and the point with the best classification result.

We used 36 datasets of different sizes from the following archives: GEO; Broad Institute Cancer Program Data Sets; Kent Ridge Bio-Medical Dataset; Feature Selection Datasets at Arizona State University; RSCTC 2010 Discovery Challenge. All these datasets are DNA-microarray datasets with a high number of features (from a few thousand up to a few dozen thousand) and a comparatively low number of objects (less than a few hundred).

5. Results

We used several different configurations of the halting criteria in our algorithms, described below. For each dataset, we conducted 10 experiments in total:
• B: basic one-thread MeLiF;
• P: naive parallelization method MeLiF+;
• PQ75: PQMeLiF limited to visit 75 points;
• PQ100: PQMeLiF limited to visit 100 points;
• PQ125: PQMeLiF limited to visit 125 points;
• PQrel: PQMeLiF that stops when no quality increase was registered in the previous 32 points;
• MA75: MAMeLiF limited to visit 75 points;
• MA100: MAMeLiF limited to visit 100 points;
• MA125: MAMeLiF limited to visit 125 points;
• MArel: MAMeLiF that stops when no quality increase was registered in the previous 32 points.

In addition, each configuration halted immediately if it reached the highest possible score of 1.0.
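To illustrate the reduction, the following is a minimal sketch of UCB1-style arm selection over search-space regions. It is our own simplified illustration, not the authors' implementation: the names Arm, select_arm, and record are hypothetical, and the counters reflect only evaluations that have already finished, which is the decision-with-available-feedback behaviour suggested by (Joulani et al., 2013) for the delayed-feedback setting.

import math
from dataclasses import dataclass

@dataclass
class Arm:
    # One region of the search space; in MAMeLiF each region owns a priority queue.
    plays: int = 0            # completed evaluations in this region
    total_reward: float = 0.0 # sum of F1 scores obtained in this region

    def mean(self):
        return self.total_reward / self.plays if self.plays else 0.0

def select_arm(arms):
    # UCB1: play every arm once, then pick the arm maximizing
    # mean reward + sqrt(2 * ln(total plays) / plays of this arm).
    for i, arm in enumerate(arms):
        if arm.plays == 0:
            return i
    total = sum(a.plays for a in arms)
    ucb = [a.mean() + math.sqrt(2.0 * math.log(total) / a.plays) for a in arms]
    return max(range(len(arms)), key=lambda i: ucb[i])

def record(arm, score):
    # Called when evaluate() for a point of this region finishes.
    arm.plays += 1
    arm.total_reward += score

In MAMeLiF, selecting an arm corresponds to choosing which region's non-empty priority queue to poll next, and the F1 score of the evaluated point is the reward.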
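For the PQrel and MArel configurations, the relative stopping rule can be implemented as a simple sliding counter. The following sketch is ours, not the authors' code; the class name and the window size parameter are illustrative, with 32 being the window reported above.

class RelativeStopCondition:
    # Stop when none of the last `window` evaluated points improved the best score.
    def __init__(self, window=32):
        self.window = window
        self.best = float("-inf")
        self.since_improvement = 0

    def update(self, score):
        if score > self.best:
            self.best = score
            self.since_improvement = 0
        else:
            self.since_improvement += 1

    def must_stop(self, best_possible=1.0):
        # Also halt immediately once the highest possible score is reached.
        return self.best >= best_possible or self.since_improvement >= self.window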
The results of the basic MeLiF and MeLiF+ are presented in Table 1, together with the best configurations of PQMeLiF and MAMeLiF, which are PQ75 and MArel respectively. As can be seen from Table 1, PQMeLiF and MAMeLiF strongly outperform MeLiF+, resulting in approximately linear scaling over the number of computational cores. In addition, the new methods on average give a small boost to the original MeLiF feature selection quality.

Table 1: Algorithms comparison. For each dataset, the first four columns give the running time and the last four give the F1 score, for MeLiF, MeLiF+, PQ75 and MArel respectively.

Dataset      | Time: MeLiF  MeLiF+  PQ75  MArel | F1: MeLiF  MeLiF+  PQ75   MArel
Arizona1     |       558    85      85    117   |     0.833  0.833   0.833  0.833
Arizona5     |       219    67      37    48    |     0.768  0.79    0.786  0.773
Breast       |       161    24      27    17    |     0.844  0.822   0.812  0.802
CNS          |       33     6       7     6     |     0.742  0.791   0.899  0.83
Data train0  |       172    71      28    32    |     0.853  0.853   0.849  0.839
Data train1  |       180    60      32    33    |     0.866  0.901   0.877  0.876
Data4 train  |       513    124     73    87    |     0.823  0.823   0.775  0.775
Data5 train  |       370    70      59    58    |     0.847  0.847   0.901  0.886
Data6 train  |       381    69      65    64    |     0.835  0.835   0.859  0.869
DLBCL        |       65     13      12    19    |     0.799  0.734   0.8    0.761
GDS2771      |       299    81      42    42    |     0.798  0.798   0.801  0.783
GDS2819 1    |       303    39      15    17    |     1      1       1      1
GDS2819 2    |       436    149     60    80    |     0.948  0.981   0.957  0.921
GDS2901      |       88     17      4     4     |     1      1       1      1
GDS2960      |       33     7       5     6     |     0.99   0.99    0.977  0.977
GDS2961      |       49     13      8     5     |     0.86   0.86    0.829  0.784
GDS2962      |       45     11      8     7     |     0.877  0.914   0.924  0.883
GDS3116      |       142    23      30    32    |     0.852  0.852   0.868  0.853
GDS3257      |       131    17      8     9     |     1      1       1      1
GDS3929      |       376    74      45    32    |     0.809  0.809   0.81   0.774
GDS4103      |       265    71      54    51    |     0.933  0.933   0.923  0.923
GDS4109      |       142    38      24    20    |     0.936  0.936   0.924  0.947
GDS4222      |       454    84      73    93    |     0.974  0.974   0.97   0.97
GDS4318      |       275    64      40    53    |     0.923  0.923   0.97   0.942
GDS4336      |       200    66      30    20    |     0.928  0.928   0.916  0.916
GDS4431      |       537    100     85    134   |     0.827  0.827   0.817  0.817
GDS4600      |       472    124     94    114   |     0.983  0.983   0.979  0.979
GDS4837 1    |       413    130     57    51    |     0.916  0.916   0.828  0.809
GDS4837 3    |       316    48      54    51    |     0.96   0.96    0.969  0.967
GDS4901      |       220    60      44    30    |     0.931  0.966   0.919  0.913
GDS4968 0    |       226    40      37    38    |     0.905  0.905   0.913  0.907
GDS4968 1    |       224    40      40    36    |     0.923  0.946   0.939  0.932
GDS5037 0    |       243    140     49    62    |     0.825  0.857   0.867  0.867
GDS5037 2    |       293    69      47    73    |     0.756  0.756   0.789  0.78
GDS5047      |       185    41      9     11    |     1      1       1      1
GDS5083      |       195    60      29    40    |     0.862  0.862   0.872  0.847
Leuk 3c0     |       34     5       7     7     |     0.989  0.989   0.986  0.986
Leuk 3c1     |       33     5       8     8     |     0.981  0.981   0.98   0.98
Ovarian      |       192    23      9     11    |     1      1       1      1
plySRBCT     |       17     3       0     1     |     1      1       1      1
prostate     |       93     34      16    15    |     0.919  0.932   0.927  0.921

6. Conclusion

In this paper, we presented two parallelization schemes for the MeLiF algorithm, which searches for the best combination of simple feature selection algorithms. Experiments showed that these two schemes, namely PQMeLiF and MAMeLiF, demonstrated linear speed improvement over the number of used cores compared to single-threaded MeLiF, without loss in classification quality. Furthermore, these algorithms sometimes showed better feature selection quality. This can be explained by the fact that these methods search through more points than the original MeLiF method, which happens due to the delay in thread synchronization.

As future research, we will try to estimate the search space split size in the MAMeLiF method depending on the number of cores. A proper split can lead to better computational results. Also, we will run more tests with different system configurations in order to find Amdahl's optimum for each algorithm.

Acknowledgments

The research was supported by the Government of the Russian Federation (grant 074-U01) and the Russian Foundation for Basic Research (project no. 16-37-60115).

References

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

Benjamin Auffarth, Maite López, and Jesús Cerquides. Comparison of redundancy and relevance measures for feature selection in tissue classification of CT images. In Industrial Conference on Data Mining, pages 248-262. Springer, 2010.
Verónica Bolón-Canedo, Noelia Sánchez-Maroño, and Amparo Alonso-Betanzos. An ensemble of filters and classifiers for microarray data classification. Pattern Recognition, 45(1):531-539, 2012.

Verónica Bolón-Canedo, Noelia Sánchez-Maroño, Amparo Alonso-Betanzos, José Manuel Benítez, and Francisco Herrera. A review of microarray datasets and applied feature selection methods. Information Sciences, 282:111-135, 2014.

Sébastien Bubeck and Aleksandrs Slivkins. The best of both worlds: Stochastic and adversarial bandits. In COLT, 2012.

Thomas Desautels, Andreas Krause, and Joel W. Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research, 15(1):3873-3923, 2014.

Ilya Isaev and Ivan Smetannikov. MeLiF+: Optimization of filter ensemble algorithm with parallel computing. In IFIP International Conference on Artificial Intelligence Applications and Innovations, pages 341-347. Springer, 2016.

Pooria Joulani, András György, and Csaba Szepesvári. Online learning under delayed feedback. In ICML, pages 1453-1461, 2013.

Ron Kohavi and George H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1):273-324, 1997.

Thomas Navin Lal, Olivier Chapelle, Jason Weston, and André Elisseeff. Embedded methods. In Feature Extraction, pages 137-165. Springer, 2006.

Stuart Jonathan Russell and Peter Norvig. Artificial Intelligence: A Modern Approach, volume 2. Prentice Hall, 2009.

Yvan Saeys, Iñaki Inza, and Pedro Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507-2517, 2007.

Noelia Sánchez-Maroño, Amparo Alonso-Betanzos, and María Tombilla-Sanromán. Filter methods for feature selection: a comparative study. In International Conference on Intelligent Data Engineering and Automated Learning, pages 178-187. Springer, 2007.

Ivan Smetannikov and Andrey Filchenkov. MeLiF: Filter ensemble learning algorithm for gene selection. Advanced Science Letters, (to be published), 2016.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.