Neural Network Conversion of Machine Learning Pipelines

JMLR: Workshop and Conference Proceedings 1:1–7, 2018
ICML 2018 AutoML Workshop

Man-Ling Sung (sammi.sung@raytheon.com)
Jan Silovsky (jan.silovsky@raytheon.com)
Man-hung Siu (man-hung.siu@raytheon.com)
Herbert Gish (herb.gish@raytheon.com)
Chinnu Pittapally (chinnu.pittapally@raytheon.com)
Raytheon BBN Technologies, 10 Moulton Street, Cambridge, MA 02138

Abstract

Transfer learning and knowledge distillation have recently gained a lot of attention in the deep learning community. One transfer approach, student-teacher learning, has been shown to successfully create "small" student neural networks that mimic the performance of much bigger and more complex "teacher" networks. In this paper, we investigate an extension to this approach and transfer from a non-neural-based machine learning pipeline as teacher to a neural network (NN) student, which would allow for joint optimization of the various pipeline components and a single unified inference engine for multiple ML tasks. In particular, we explore replacing the random forest classifier by transfer learning to a student NN. We experimented with various NN topologies on 100 OpenML tasks in which random forest has been one of the best solutions. Our results show that for the majority of the tasks, the student NN can indeed mimic the teacher if one can select the right NN hyper-parameters. We also investigated the use of random forest for selecting the right NN hyper-parameters.

Keywords: transfer learning, student-teacher learning, neural network

1. Introduction

Our goal is to explore the use of neural networks as replacements for ML pipelines, or portions of these pipelines. We will accomplish this replacement by having the neural network (NN) learn from the original pipeline.
This goes beyond the more usual motivation for employing the student-teacher paradigm, which goes from large NNs to smaller, more deployable ones, e.g., Ba and Caruana (2014); Hinton et al. (2015). While smaller and more deployable are welcome characteristics, we have additional considerations. One consideration is that our converted components may be part of a larger network, and chaining various converted components to form a larger neural network will simplify the joint optimization of all parts of our system.

Mapping various systems into neural networks can have several additional benefits. Specialized hardware, such as GPUs, can enhance performance, and a neural network may have better generalization performance than the original systems. Moreover, we expect our systems to operate in dynamic environments, and having a unified approach to these changes can enhance the capabilities of these more deployable systems. In particular, we can use standard methods for regularizing these networks, which ties in with generalization capabilities, as well as methods for adapting the networks to changing conditions.

© 2018 M.-L. Sung, J. Silovsky, M.-h. Siu, H. Gish & C. Pittapally.

In this work, our focus is on converting other classifiers into NN classifiers. We understand that a neural network is not necessarily the best classifier in all situations, especially in small data problems. However, our goal is not to exceed the performance of the teacher but rather to attempt to match it. In some situations it may be necessary to rely mainly on the function approximation capabilities of neural networks, and at other times we may need to train the neural network using methods of data augmentation. By data augmentation we envision modeling the feature space based on the original training data and generating new samples based on this model.
The samples, coupled with the labels provided by the teacher, provide additional training for the neural network.

In some cases, the teacher brings more to the knowledge transfer problem than just generating classification responses. In some particular cases, we can have knowledge of the decision boundary in some form as well as metadata regarding the structure of the particular classifier. For example, for random forest classifiers it has been shown (Sethi (1990), Biau et al. (2016)) that they can be restructured as multi-layered neural networks. Additionally, Wang et al. (2017) show how to create random forest inspired neural network architectures. In our current work, these characteristics are not exploited.

In the following, we discuss in greater detail the student-teacher approach, which for us is conversion to a neural network, followed by a discussion of our experimental results.

2. Neural Network Conversion

2.1. Student-Teacher Knowledge Distillation Learning

The goal of knowledge distillation is to transfer knowledge acquired by a "teacher" to a "student" such that the student can perform as well as or better than the teacher. Typically, the teacher is a complex system, either with a large number of parameters and/or an ensemble of classifiers, while the student is relatively smaller to enable efficient inference, e.g. in Hinton et al. (2015).

For Inductive Transfer Learning as defined in Pan and Yang (2010), consider training data $T = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where the $x$'s are the input features with corresponding labels $y$. The teacher model $M$ is trained using $T$. $M$ can generate the labels of a new training set $T' = \{(x'_1, \hat{y}_1), \ldots, (x'_n, \hat{y}_n)\}$, where $\{\hat{y}\}$ is the set of label posteriors generated by $M$. $\{x'\}$ in $T'$ and $\{x\}$ in $T$ can be different. In Li et al. (2014), $\{x'\}$ includes additional unlabeled data, and in Cui et al. (2017), $\{x'\}$ and $\{x\}$ are generated by different feature extractors.

2.2. Student-Teacher over Different System Types

In the typical distillation framework, both teacher and student are neural networks, or, as in Tyukina et al. (2017), transfer across very specific systems is possible by tapping into the internal states of the teacher systems. However, the student-teacher formulation in Section 2.1 can be generalized to distill between two different system types with the following considerations.

1. Trainability: The student system can be trained using label posteriors $\hat{y}$.
2. Feature Handling: The student system can process the type of input feature $x$.
3. Student Complexity: The student system should have enough capacity to learn the decision boundaries of the teacher system. For example, a linear classifier used as a student will not be able to mimic the classification decisions of a deep neural network classifier.

Other than the particular type of student system (i.e. neural networks or random forests), the hyper-parameters of the student and the amount of available training data can have a significant impact on distillation effectiveness.

2.3. Training Data for Student

The quality of the transfer depends heavily on the amount of training data available and the complexity of the student model. As noted above, the student can be trained with a different data set from the one used for training the teacher. While it can be difficult or expensive to obtain manually annotated data, the annotation needed for student training, $\{\hat{y}\}$, can easily be generated using the teacher model. Thus, extending $T'$ only involves obtaining more $x'$. This can be accomplished by 1) collecting more unlabeled data, which is feasible for many problems; 2) using $T$ to estimate the input feature distribution, $P(x)$, and then sampling from it.
Here, $P(x)$ can be estimated using either parametric models, such as GMM, or non-parametric models, such as KNN or any kernel-based distribution estimators; 3) assuming $P(x)$ to be a uniform distribution and sampling from it. Such a $P(x)$ can be suboptimal, as discussed in Scholkopf et al. (2012), but can be useful as a smoothing function.

2.4. Initial Approach

In this paper we focus on a set of random forest teachers and our ability to match the random forest performance with NN classifiers. We selected random forest classifiers based on their reputation for providing the best performance on a wide range of problems and also their widespread use. We perform this exploration on a standard set of problems provided by OpenML. In addition to student-teacher performance comparisons, we also investigate ways to determine the best choice of neural network architecture and hyper-parameters to employ on particular problems.

3. Experiments

3.1. OpenML

OpenML (Open Machine Learning), founded by Vanschoren et al. (2014), is a platform for sharing datasets, ready-to-use models, and problems in machine learning. It provides cross-language APIs that facilitate the reproduction and comparison of different machine learning architectures. There are 4 main organization groups:

1. Data: collection of data sets available for definition of ML problems;
2. Task: a formulation of a ML problem and specification of evaluation criteria;
3. Flow: describes a particular solution as a composition of primitives/modules performing various tasks, e.g. feature extraction, normalization, classification, etc.
4. Run: describes a particular configuration of a flow, most importantly the hyperparameters of individual primitives. Hence, multiple Runs can be associated with an identical Flow and yield different performance.

3.2. Experimental setup

First, we identified a Flow, Olson (2017), employing random forest as the backend classifier, which was evaluated for many tasks. The flow we found was composed of three sklearn primitives: preprocessing.imputation.Imputer, decomposition.pca.PCA and ensemble.forest.RandomForestClassifier. Next, we selected 100 Tasks based on best Runs. A Student system was built simply by substituting the random forest (RF) classifier by a Multi-Layer Perceptron (MLP) [1]. For each Task, we used an identical set of 600 different configurations of MLPs acting as different Students. Table 1 tabulates the parameters modified in our configurations. For parameters not listed, sklearn's MLP defaults are used. By bottleneck, we refer to the middle layer in systems having 3 or more layers, and the relative size presented in Table 1 is relative to the standard layer size in the networks [2].

| Layers | Nodes in layer | Rel. bottleneck size | Activation | Init. learning rate |
|---|---|---|---|---|
| 1, 2, 3, 4, 5 | 10, 25, 100, 200, 400 | 0.2, 0.5, 1.0 | relu, tanh | 1e-2, 1e-3, 1e-4, 1e-5 |

Table 1: Overview of different configurations of MLP Students

3.3. Student-Teacher Knowledge Transfer

In this study, we focused on knowledge transfer using the original training inputs. Thus, the Student model $M'$ is trained with training data $T' = \{(x_1, \hat{y}_1), \ldots, (x_n, \hat{y}_n)\}$, where $\hat{y} = M(x)$. The OpenML experiments are designed as 10-fold cross-validation and we followed this experimental setup. This means that for each task, 10 different RF Teachers were trained and the knowledge transfer was applied independently for 10 MLP Students with a particular configuration (one of the 600). The final task accuracy is then simply an average over the 10 folds.

Fig. 1 illustrates the performance difference between the random forest Teachers and the MLP Students. The best performing MLP configuration is considered for each task.
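The configuration grid of Table 1 and the Student construction of Sections 3.2 and 3.3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the helper names (`hidden_layer_sizes`, `student_configs`, `fit_student`) are hypothetical, the choice of "middle" layer for an even number of layers is a guess, the Student is fitted on hard Teacher labels because sklearn's `MLPClassifier` expects class labels rather than posteriors, and `SimpleImputer` stands in for the older `Imputer` primitive named in the Flow.

```python
from itertools import product

# Parameter values taken from Table 1.
LAYERS = (1, 2, 3, 4, 5)
NODES = (10, 25, 100, 200, 400)
BOTTLENECK = (0.2, 0.5, 1.0)
ACTIVATION = ("relu", "tanh")
LEARNING_RATE = (1e-2, 1e-3, 1e-4, 1e-5)


def hidden_layer_sizes(n_layers, n_nodes, rel_bottleneck):
    """Hidden-layer widths; the middle layer is scaled when n_layers >= 3."""
    sizes = [n_nodes] * n_layers
    if n_layers >= 3:
        # Assumption: with an even layer count, index n_layers // 2 is "the middle".
        sizes[n_layers // 2] = max(1, round(n_nodes * rel_bottleneck))
    return tuple(sizes)


def student_configs():
    """Enumerate every Table 1 combination as MLPClassifier keyword arguments."""
    grid = product(LAYERS, NODES, BOTTLENECK, ACTIVATION, LEARNING_RATE)
    return [
        {
            "hidden_layer_sizes": hidden_layer_sizes(n_layers, n_nodes, rel),
            "activation": activation,
            "learning_rate_init": lr,
        }
        for n_layers, n_nodes, rel, activation, lr in grid
    ]


def fit_student(teacher, X, config):
    """Fit an imputer + PCA + MLP Student on labels produced by a fitted Teacher."""
    # Imported lazily so the grid helpers above work without scikit-learn.
    from sklearn.decomposition import PCA
    from sklearn.impute import SimpleImputer
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline

    y_hat = teacher.predict(X)  # assumption: hard Teacher labels, not posteriors
    student = make_pipeline(SimpleImputer(), PCA(), MLPClassifier(**config))
    return student.fit(X, y_hat)
```

Under this reading, the bottleneck factor has no effect on 1- and 2-layer networks, so some of the 600 grid entries describe the same network; the paper does not say how those cases were handled.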
Over all tasks, 55% of Students perform equally well or better than the Teacher. On average, the performance of Students is worse by 2.66%. In terms of the median, the Students perform as well as the Teachers (0.01% better). The shift between the average and the median is caused by a few outliers, as shown on the right side of the figure. We plan to further investigate why the MLP performs so poorly on the few outliers.

For some tasks, the Student surprisingly outperforms the Teacher by a larger margin. We attribute this partly to natural statistical variations, and partly to the fact that RF partitions the feature space into rectangular regions while the MLP has a smoother decision boundary, which may fit certain problems better.

1. We relied on sklearn's implementation of the MLP classifier, Pedregosa et al. (2011).
2. A NN with 3 layers, 100 nodes per layer and a relative bottleneck size of 0.5 has (100,50,100) nodes in its hidden layers.

Figure 1: Histogram of differences between RF and MLP accuracies on 100 tasks

3.4. Student Versatility and Complementarity

Having a large number of MLP student configurations (hereafter, we refer to these student configurations as Students) is impractical, and we expect many to have similar performance across Tasks. It is desirable to keep only a smaller set of complementary Students, i.e. student configurations with high performance across many Tasks. Fig. 2 depicts how varying the number of Student candidates affects the performance across Tasks. Candidate sets of each size were formed by removing the systems with the least contribution to the overall performance. We found that the single best system [3] turns out to be very versatile across Tasks, as it performs only 0.9% worse on average compared to the choice of the best Student out of the full inventory of 600 Students.
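One way to read the candidate-set reduction described above is as greedy backward elimination over a table of per-task accuracies. The sketch below assumes that reading; the function names and the `acc[task][student]` layout are hypothetical, and "contribution" is interpreted as the drop in mean best-in-set accuracy when a Student is removed, which the paper does not spell out.

```python
def oracle_accuracy(acc, subset):
    """Mean over tasks of the best accuracy reachable within `subset`."""
    return sum(max(row[s] for s in subset) for row in acc) / len(acc)


def prune_students(acc, target_size):
    """Greedy backward elimination down to `target_size` Students.

    acc[t][s] is the cross-validated accuracy of Student s on task t.
    """
    subset = list(range(len(acc[0])))
    while len(subset) > target_size:
        # Drop the Student whose removal leaves the highest oracle accuracy,
        # i.e. the one contributing least to the overall performance.
        drop = max(
            subset,
            key=lambda s: oracle_accuracy(acc, [t for t in subset if t != s]),
        )
        subset.remove(drop)
    return subset


# Toy accuracies for 3 tasks (rows) and 3 Students (columns).
acc = [
    [0.90, 0.80, 0.70],
    [0.60, 0.85, 0.70],
    [0.50, 0.80, 0.95],
]
print(prune_students(acc, 2))  # [1, 2]
print(prune_students(acc, 1))  # [1]
```

In the toy table, Student 1 is never the best on tasks 0 and 2 but is never far behind either, so it survives as the single most versatile configuration.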
However, as shown in Fig. 2, picking from 20 Students reduces the gap by half to 0.45%.

3.5. Automatic Student Selection

While we can rely on cross-validation experiments to select the best Student, it may still not be feasible to train multiple Students in some practical applications. Ideally we would be able to automatically select the best Student candidate based on characteristics of the Data, Task and the Teacher. We carried out a set of experiments using random forest for selecting the best Student candidate. Intuitively, the complexity of selecting the best Student grows with the number of Student candidates, and the complexity is further accentuated by the limited number of training samples (100 samples corresponding to the 100 Tasks). The RF system for automatic Student selection was trained with a 10-fold cross-validation over the Tasks.

As input features to this system, we used metadata characterizing the datasets as provided by OpenML (Ope (2018)). We excluded features corresponding to the performance of other reference classifiers, e.g. nearest neighbor. As a result, our input feature vectors were formed by 74 coefficients reflecting various dataset qualities and quantities.

Fig. 2 shows the comparison of the choice of the best candidate from the set of a particular size with the automatic choice done by the random forest. We conclude that the automatic Student selection fails to select the best Student candidates. Our reasoning is that the metadata provided by OpenML for dataset characterization are not suitable for automatic system selection, and the performance is also affected by the small number of samples available.

3. A DNN with two hidden layers (400,400), relu activation function and initial learning rate of 1e-2.

Figure 2: All accuracies are computed using cross-validation. The top line "MLP" is obtained by selecting the best (out of 600) Student per task. The "SubsetMLP" and "AutoRF" curves show the performance of selecting from a subset of size N.

4. Conclusions and Future Work

There are multiple benefits in being able to represent machine learning pipelines for various datasets and tasks in a unified framework based on neural networks. In this work, we first laid out a solution for conversion of generic machine learning pipelines into neural networks. We view the conversion as a multi-stage process where parts of the original pipeline are first converted separately before joint optimization can be done. We then focused on the conversion of the back-end classifier, represented by random forest, into a NN. We showed that NNs learned employing the student-teacher concept performed generally as well as the original random forests, with a few outliers. While NNs with many different configurations were initially considered, we showed that the number of NN configurations can be significantly reduced without harming the performance. Finally, we investigated the possibility of using a random forest for automatic selection of the best NN configuration based on the characteristics of the data. In contrast to using a single best configuration, this automatic selection leads to only a marginal improvement for very small sets of Students, and the performance deteriorates as the number of Students grows. We attribute this mainly to the lack of relevant information in the metadata which is used as input to the automatic selection system, and the lack of training samples.
Experimental work presented in this paper represents just an initial step in our effort, and many aspects of our proposed solution will have to be further investigated in the future, such as substitution of various parts of generic ML pipelines (including feature extraction or transformation), augmentation of training data, end-to-end joint optimization and automatic selection of the best NN configuration for substitution.

Acknowledgement

This work is sponsored by the Air Force Research Laboratory (AFRL) and DARPA.

References

List of openml data attributes and measure, 2018. URL https://www.openml.org/search?type=measure. Accessed: 2018-05-21.

Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in neural information processing systems, pages 2654–2662, 2014.

G. Biau, E. Scornet, and J. Welbl. Neural random forests. In arXiv preprint arXiv:1604.07143, 2016.

Jia Cui, Brian Kingsbury, Bhuvana Ramabhadran, George Saon, Tom Sercu, Kartik Audhkhasi, Abhinav Sethy, Markus Nussbaum-Thom, and Andrew Rosenberg. Knowledge distillation across ensembles of multilingual models for low-resource languages. In ICASSP, 2017.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In arXiv, 2015.

Jinyu Li, Rui Zhao, Jui-Ting Huang, and Yifan Gong. Learning small-size dnn with output-distribution-based criteria. In Interspeech, 2014.

Randal Olson. Openml flow id: 5909, 2017. URL https://www.openml.org/f/5909.

Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Bernhard Scholkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In ICML, 2012.

I. K. Sethi. Entropy nets: from decision trees to neural networks. Proceedings of the IEEE, 78(10):1605–1613, 1990.

Ivan Yu Tyukina, Alexander N. Gorbana, Konstantin I. Sofeikova, and Ilya Romanenko. Knowledge transfer between artificial intelligence systems. In arXiv, 2017.

Joaquin Vanschoren, Jan N Van Rijn, Bernd Bischl, and Luis Torgo. Openml: networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2):49–60, 2014.

Suhang Wang, Charu Aggarwal, and Huan Liu. Using a random forest to inspire a neural network and improving on it. In SIAM International Conference on Data Mining, 2017.
