Evolutionary Synthesis of Deep Neural Networks via Synaptic Cluster-driven Genetic Encoding

Mohammad Javad Shafiee, Alexander Wong
Department of Systems Design Engineering, University of Waterloo
Waterloo, Ontario, Canada
{mjshafiee, a28wong}@uwaterloo.ca

Abstract

There has been significant recent interest in achieving highly efficient deep neural network architectures. A promising paradigm for achieving this is the concept of evolutionary deep intelligence, which attempts to mimic biological evolution processes to synthesize highly efficient deep neural networks over successive generations. An important aspect of evolutionary deep intelligence is the genetic encoding scheme used to mimic heredity, which can have a significant impact on the quality of offspring deep neural networks. Motivated by the neurobiological phenomenon of synaptic clustering, we introduce a new genetic encoding scheme where synaptic probability is driven towards the formation of a highly sparse set of synaptic clusters. Experimental results for the task of image classification demonstrate that offspring networks synthesized using this synaptic cluster-driven genetic encoding scheme can achieve state-of-the-art performance while having network architectures that are not only significantly more efficient (with a ∼125-fold decrease in synapses for MNIST) compared to the original ancestor network, but also tailored for GPU-accelerated machine learning applications.

1 Introduction

There has been strong interest in obtaining highly efficient deep neural network architectures that maintain strong modeling power for applications such as self-driving cars and smartphone applications, where the available computing resources are practically limited to a combination of low-power, embedded GPUs and CPUs with limited memory and computing power. The optimal brain damage method [1] was one of the first approaches in this area, where synapses were pruned based on their strengths. Gong et al. [2] proposed a network compression framework where vector quantization was leveraged to shrink the storage requirements of deep neural networks. Han et al. [3] utilized pruning, quantization, and Huffman coding to further reduce the storage requirements of deep neural networks. Hashing is another trick, utilized by Chen et al. [4], to compress the network into a smaller amount of storage space. Low-rank approximation [5, 6] and sparsity learning [7-9] are other strategies used to sparsify deep neural networks.

Recently, Shafiee et al. [10] tackled this problem in a very different manner by proposing a novel framework for synthesizing highly efficient deep neural networks via the idea of evolutionary synthesis. Differing significantly from past attempts at leveraging evolutionary computing methods such as genetic algorithms for creating neural networks [11, 12], which attempted to create neural networks with high modeling capabilities in a direct but highly computationally expensive manner, the proposed evolutionary deep intelligence approach mimics biological evolution mechanisms such as random mutation, natural selection, and heredity to synthesize successive generations of deep neural networks with progressively more efficient network architectures.
The architectural traits of ancestor deep neural networks are encoded via probabilistic 'DNA' sequences, with new offspring networks possessing diverse network architectures synthesized stochastically based on the 'DNA' from the ancestor networks and computational environmental factor models, thus mimicking random mutation, heredity, and natural selection. These offspring networks are then trained, much like one would train a newborn, and have more efficient, more diverse network architectures while achieving powerful modeling capabilities.

An aspect of evolutionary deep intelligence that is particularly interesting and worth deeper investigation is the genetic encoding scheme used to mimic heredity, which can have a significant impact on the way architectural traits are passed down from generation to generation, and thus on the quality of descendant deep neural networks. A more effective genetic encoding scheme can facilitate better transfer of important genetic information from ancestor networks, allowing for the synthesis of even more efficient and powerful deep neural networks in the next generation. As such, a deeper investigation and exploration into the incorporation of synaptic clustering into the genetic encoding scheme can be potentially fruitful for synthesizing highly efficient deep neural networks that not only improve memory and storage requirements but are also tailored for devices designed for highly parallel computations, such as embedded GPUs.

In this study, we introduce a new synaptic cluster-driven genetic encoding scheme for synthesizing highly efficient deep neural networks over successive generations. This is achieved through the introduction of a multi-factor synapse probability model where the synaptic probability is a product of both the probability of synthesis of a particular cluster of synapses and the probability of synthesis of a particular synapse within the synapse cluster. This genetic encoding scheme effectively promotes the formation of synaptic clusters over successive generations while also promoting the formation of highly efficient deep neural networks.

2 Methodology

The proposed genetic encoding scheme decomposes synaptic probability into a multi-factor probability model, where the architectural traits of a deep neural network are encoded probabilistically as a product of the probability of synthesis of a particular cluster of synapses and the probability of synthesis of a particular synapse within the synapse cluster.

Cluster-driven Genetic Encoding. Let the network architecture of a deep neural network be expressed by $\mathcal{H}(N, S)$, with $N$ denoting the set of possible neurons and $S$ the set of possible synapses in the network. Each neuron $n_i \in N$ is connected to neuron $n_j \in N$ via a set of synapses $\bar{s} \subset S$, such that the synaptic connectivity $s_i \in S$ is associated with a weight $w_i \in W$ denoting its strength. The architectural traits of a deep neural network in generation $g$ can be encoded by a conditional probability given its architecture at the previous generation $g-1$, denoted by $P(\mathcal{H}_g \mid \mathcal{H}_{g-1})$, which can be treated as the probabilistic 'DNA' sequence of a deep neural network.
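As a concrete (and purely illustrative) picture of this encoding, the minimal sketch below stores each layer's synaptic strengths alongside a boolean existence mask; none of these names come from the paper or any released code, and the layout is only one plausible way to hold $\mathcal{H}(N, S)$ in memory.

```python
# Minimal, hypothetical representation of a network architecture H(N, S):
# each layer keeps a tensor of synaptic strengths W and a boolean mask
# marking which of the possible synapses S actually exist this generation.
import numpy as np

class LayerGenome:
    def __init__(self, weights: np.ndarray):
        self.weights = weights                            # strengths w_i in W
        self.exists = np.ones(weights.shape, dtype=bool)  # synapses s_i in S

    def num_synapses(self) -> int:
        return int(self.exists.sum())

# A network H(N, S) is then just a list of LayerGenome objects; the
# probabilistic 'DNA' P(H_g | H_{g-1}) is realized in the sketches below as
# per-cluster and per-synapse probabilities computed from generation g-1.
```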
Without loss of generality, based on the assumption that synaptic connectivity characteristics in an ancestor network are desirable traits to be inherited by descendant networks, one can instead encode the genetic information of a deep neural network by the synaptic probability $P(S_g \mid W_{g-1})$, where $w_{k,g-1} \in W_{g-1}$ encodes the synaptic strength of each synapse $s_{k,g} \in S_g$. In the proposed genetic encoding scheme, we wish to take into consideration and incorporate the neurobiological phenomenon of synaptic clustering [13-17], where the probability of synaptic co-activation increases for correlated synapses encoding similar information that are close together on the same dendrite. To explore the idea of promoting the formation of synaptic clusters over successive generations while also promoting the formation of highly efficient deep neural networks, the following multi-factor synaptic probability model is introduced:

$$P(S_g \mid W_{g-1}) = \prod_{c \in C} \Big[ P(\bar{s}_{g,c} \mid W_{g-1}) \cdot \prod_{i \in c} P(s_{g,i} \mid w_{g-1,i}) \Big] \qquad (1)$$

where the first factor (first conditional probability) models the probability of the synthesis of a particular cluster of synapses, $\bar{s}_{g,c}$, while the second factor models the probability of a particular synapse, $s_{g,i}$, within synaptic cluster $c$. More specifically, the probability $P(\bar{s}_{g,c} \mid W_{g-1})$ represents the likelihood that a particular synaptic cluster, $\bar{s}_{g,c}$, is synthesized as part of the network architecture in generation $g$ given the synaptic strengths in generation $g-1$. For example, in a deep convolutional neural network, the synaptic cluster $c$ can be any subset of synapses, such as a kernel or a set of kernels within the deep neural network. The probability $P(s_{g,i} \mid w_{g-1,i})$ represents the likelihood of existence of synapse $i$ within cluster $c$ in generation $g$ given its synaptic strength in generation $g-1$. As such, the proposed synaptic probability model not only promotes the persistence of strong synaptic connectivity in offspring deep neural networks over successive generations, but also promotes the persistence of strong synaptic clusters in offspring deep neural networks over successive generations.

Cluster-driven Evolutionary Synthesis. In the seminal paper on evolutionary deep intelligence by Shafiee et al. [10], the synthesis probability $P(\mathcal{H}_g)$ is composed of the synaptic probability $P(S_g \mid W_{g-1})$, which mimics heredity, and an environmental factor model $F(\mathcal{E})$, which mimics natural selection by introducing quantitative environmental conditions that offspring networks must adapt to:

$$P(\mathcal{H}_g) = F(\mathcal{E}) \cdot P(S_g \mid W_{g-1}) \qquad (2)$$

In this study, (2) is reformulated in a more general way to enable the incorporation of different quantitative environmental factors over both the synthesis of synaptic clusters and each synapse:

$$P(\mathcal{H}_g) = \prod_{c \in C} \Big[ F_c(\mathcal{E}) \, P(\bar{s}_{g,c} \mid W_{g-1}) \cdot \prod_{i \in c} F_s(\mathcal{E}) \, P(s_{g,i} \mid w_{g-1,i}) \Big] \qquad (3)$$

where $F_c(\cdot)$ and $F_s(\cdot)$ represent environmental factors enforced at the cluster and synapse levels, respectively.
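To make the synthesis step of (3) concrete, the following is a hedged sketch of how an offspring architecture could be sampled: a Bernoulli draw at the cluster level gates Bernoulli draws at the synapse level. The scalar environmental factors `F_c` and `F_s` and all function names are our illustrative assumptions; the paper does not prescribe this exact procedure.

```python
# Sketch of cluster-driven evolutionary synthesis per Eq. (3): sample each
# cluster first, then the synapses inside surviving clusters. Probabilities
# are assumed precomputed from the previous generation's weights.
import numpy as np

rng = np.random.default_rng(0)

def synthesize_offspring(cluster_prob, synapse_prob, F_c=0.8, F_s=0.8):
    """cluster_prob: (C,) array of P(s_bar_{g,c} | W_{g-1});
    synapse_prob: list of per-cluster arrays of P(s_{g,i} | w_{g-1,i})."""
    masks = []
    for c, p_cluster in enumerate(cluster_prob):
        # Cluster-level Bernoulli draw, scaled by the environmental factor F_c
        if rng.random() < F_c * p_cluster:
            # Synapse-level Bernoulli draws within a surviving cluster
            p = np.clip(F_s * np.asarray(synapse_prob[c]), 0.0, 1.0)
            masks.append(rng.random(p.shape) < p)
        else:
            # Dropping a cluster removes all of its synapses at once
            masks.append(np.zeros(np.shape(synapse_prob[c]), dtype=bool))
    return masks
```

Gating synapse draws behind a cluster draw is what makes the sparsity structured: entire kernels vanish together rather than individual weights disappearing at random.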
Realization of Cluster-driven Genetic Encoding. In this study, a simple realization of the proposed cluster-driven genetic encoding scheme is presented to demonstrate its benefits. Here, since we wish to promote the persistence of strong synaptic clusters in offspring deep neural networks over successive generations, the probability of the synthesis of a particular cluster of synapses, $\bar{s}_{g,c}$, is modeled as

$$P(\bar{s}_{g,c} = 1 \mid W_{g-1}) = \exp\!\left(\frac{\sum_{i \in c} \lfloor w_{g-1,i} \rfloor}{Z}\right) - 1 \qquad (4)$$

where $\lfloor \cdot \rfloor$ denotes the truncation of a synaptic weight and $Z$ is a normalization factor chosen to make (4) a probability, $P(\bar{s}_{g,c} \mid W_{g-1}) \in [0, 1]$. The truncation of synaptic weights in the model reduces the influence of very weak synapses within a synaptic cluster on the genetic encoding process. The probability of a particular synapse, $s_{g,i}$, within synaptic cluster $c$, denoted by $P(s_{g,i} = 1 \mid w_{g-1,i})$, can be expressed as

$$P(s_{g,i} = 1 \mid w_{g-1,i}) = \exp\!\left(\frac{w_{g-1,i}}{z}\right) - 1 \qquad (5)$$

where $z$ is a layer-wise normalization constant. By incorporating both of the aforementioned probabilities in the proposed scheme, the relationships amongst synapses as well as their individual synaptic strengths are taken into consideration in the genetic encoding process.
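A minimal sketch of this realization follows, treating each convolutional kernel as one cluster. Two details are our assumptions rather than the paper's: we use weight magnitudes (the paper writes $w_{g-1,i}$ directly), and we pick the normalizers $Z$ and $z$ so that the largest argument equals $\ln 2$, which keeps $\exp(\cdot) - 1$ inside $[0, 1]$; the paper only states that the normalizers make (4) and (5) valid probabilities.

```python
# Sketch of Eqs. (4)-(5): cluster probability from truncated kernel weights,
# synapse probability from raw weights with a layer-wise normalizer z.
import numpy as np

def truncate(w, ndigits=1):
    # Floor-style truncation of weight magnitudes, damping the contribution
    # of very weak synapses to the cluster probability (cf. Eq. (4))
    return np.trunc(np.abs(w) * 10**ndigits) / 10**ndigits

def cluster_probs(kernels):
    """kernels: (K, kh, kw, cin) weights of one conv layer, one cluster per kernel."""
    sums = truncate(kernels).sum(axis=(1, 2, 3))
    Z = sums.max() / np.log(2.0) + 1e-12   # assumed normalization choice
    return np.exp(sums / Z) - 1.0          # Eq. (4), values in [0, 1]

def synapse_probs(kernels):
    w = np.abs(kernels)
    z = w.max() / np.log(2.0) + 1e-12      # layer-wise normalizer (Eq. (5))
    return np.exp(w / z) - 1.0             # Eq. (5), values in [0, 1]
```

On a layer's generation-$(g-1)$ kernels, the outputs of `cluster_probs` and `synapse_probs` can be fed directly into the `synthesize_offspring` sketch above.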
3 Experimental Results

Evolutionary synthesis of deep neural networks across several generations was performed using the proposed genetic encoding scheme, and the resulting network architectures and accuracies were investigated using three benchmark datasets: MNIST [18], STL-10 [19], and CIFAR10 [20]. The LeNet-5 architecture [18] was selected as the network architecture of the original, first-generation ancestor network for MNIST and STL-10, while the AlexNet architecture [21] was utilized for the ancestor network for CIFAR10, with the first layer modified to use 5x5x3 kernels instead of 11x11x3 kernels given the smaller image size in CIFAR10. The environmental factor model imposed at different generations in this study is designed to form deep neural networks with progressively more efficient network architectures than their ancestor networks while maintaining modeling accuracy. More specifically, $F_c(\cdot)$ and $F_s(\cdot)$ are formulated in this study such that an offspring deep neural network should not have more than 80% of the total number of synapses in its direct ancestor network. Furthermore, each kernel in the deep neural network is considered a synaptic cluster in the synapse probability model. In other words, the probability of the synthesis of a particular synaptic cluster (i.e., $P(\bar{s}_{g,c} \mid W_{g-1})$) is modeled as the truncated summation of the weights within a kernel.

Results & Discussion. In this study, offspring deep neural networks were synthesized in successive generations until the drop in test accuracy of the offspring network exceeded 3%, so that we could better study the changes in architectural efficiency of the descendant networks over multiple generations. Table 1 shows the architectural efficiency (defined in this study as the total number of synapses of the original, first-generation ancestor network divided by that of the current synthesized network) versus the modeling accuracy at several generations for the three datasets. As observed in Table 1, the descendant network at the 13th generation for MNIST was a staggering ∼125-fold more efficient than the original, first-generation ancestor network without exhibiting a significant drop in test accuracy (∼1.7% drop). This trend is consistent with the STL-10 results, where the descendant network at the 10th generation was ∼56-fold more efficient than the original, first-generation ancestor network without a significant drop in test accuracy (∼1.2% drop). It is also worth noting that, since the training set of the STL-10 dataset is relatively small, the descendant networks at generations 2 to 8 actually achieved higher test accuracies than the original, first-generation ancestor network, which illustrates the improved generalizability of the descendant networks given that they had fewer parameters to train. Finally, for the case of CIFAR10, where a different network architecture was used (AlexNet), the descendant network at the 6th generation was ∼14.4-fold more efficient than the original ancestor network with a ∼2% drop in test accuracy, thus demonstrating the applicability of the proposed scheme to different network architectures.

Table 1: Architectural efficiency vs. test accuracy for different generations of synthesized networks. "Gen.", "A-E", and "ACC." denote generation, architectural efficiency, and accuracy, respectively.

MNIST
Gen.   A-E       ACC.
1      1.00X     0.9947
5      5.20X     0.9941
7      12.09X    0.9928
9      32.23X    0.9884
11     62.74X    0.9849
13     125.09X   0.9775

STL-10
Gen.   A-E       ACC.
1      1.00X     0.5774
3      2.37X     0.5933
5      5.81X     0.6039
7      14.99X    0.6051
9      38.22X    0.5744
10     56.27X    0.5658

CIFAR10
Gen.   A-E       ACC.
1      1.00X     0.8669
2      1.64X     0.8814
3      2.82X     0.8766
4      5.06X     0.8688
5      8.39X     0.8588
6      14.39X    0.8459

Table 2: Cluster efficiency of the convolutional layers (layers 1-3) and fully connected layer (layer 4) at the first and last reported generations of deep neural networks for MNIST and STL-10. Column 'E' shows the overall cluster efficiency of the synthesized deep neural networks.

MNIST
Gen.           ACC.     Layer 1   Layer 2   Layer 3   Layer 4   E
1 (baseline)   0.9947   1X        1X        1X        1X        1X
13             0.9775   3.6X      12.96X    8.83X     13.47X    9.71X

STL-10
Gen.           ACC.     Layer 1   Layer 2   Layer 3   Layer 4   E
1 (baseline)   0.5774   1X        1X        1X        1X        1X
10             0.5658   3.31X     8.46X     5.49X     6.56X     5.96X

Table 3: Cluster efficiency of the convolutional layers (layers 1-5) and fully connected layers (layers 6-7) at the first and last reported generations of deep neural networks for CIFAR10.

Gen.           ACC.     Layer 1   Layer 2   Layer 3   Layer 4   Layer 5   Layer 6   Layer 7   E
1 (baseline)   0.8669   1X        1X        1X        1X        1X        1X        1X        1X
6              0.8459   2.01X     3.10X     3.50X     3.40X     3.29X     2.97X     1.51X     2.82X
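For reference, both efficiency measures reported in Tables 1-3 are simple count ratios: architectural efficiency over synapses (as defined above) and cluster efficiency over kernels per layer (as defined in the next paragraph). The sketch below computes them under the illustrative mask-based data layout used earlier; the function and argument names are our own.

```python
# Sketch of the two efficiency metrics from Tables 1-3.
import numpy as np

def architectural_efficiency(ancestor_masks, offspring_masks):
    # Ratio of total synapse counts (Table 1, column "A-E")
    total = lambda masks: sum(int(np.sum(m)) for m in masks)
    return total(ancestor_masks) / total(offspring_masks)

def cluster_efficiency(ancestor_kernels_per_layer, offspring_kernels_per_layer):
    # Per-layer ratio of kernel counts (Tables 2-3); e.g. 20 ancestor kernels
    # reduced to 2 surviving kernels gives a 10X cluster efficiency
    return [a / o for a, o in zip(ancestor_kernels_per_layer,
                                  offspring_kernels_per_layer)]
```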
Embedded GPU Ramifications. Tables 2 and 3 show the per-layer cluster efficiency of the synthesized deep neural networks at the last generations, where cluster efficiency is defined in this study as the total number of kernels in a layer of the original, first-generation ancestor network divided by that of the current synthesized network. It can be observed that for MNIST, the cluster efficiency of the last-generation descendant network is ∼9.7X, which may result in a near 9.7-fold potential speed-up in running time on embedded GPUs by reducing the number of arithmetic operations ∼9.7-fold compared to the first-generation ancestor network, though computational overhead in other layers such as ReLU may reduce the actual speed-up. The potential speed-up of the last-generation descendant network for STL-10 is lower than for MNIST, with a reported cluster efficiency of ∼6X. Finally, the cluster efficiency of the last-generation descendant network for CIFAR10 is ∼2.8X, as shown in Table 3. These results demonstrate that the proposed genetic encoding scheme not only promotes the synthesis of deep neural networks that are highly efficient while maintaining modeling accuracy, but also promotes the formation of highly sparse synaptic clusters, making the synthesized networks highly tailored for devices designed for highly parallel computations, such as embedded GPUs.

Acknowledgments

This research has been supported by the Canada Research Chairs program, the Natural Sciences and Engineering Research Council of Canada (NSERC), and the Ministry of Research and Innovation of Ontario. The authors also thank Nvidia for the GPU hardware used in this study through the Nvidia Hardware Grant Program.

References

[1] Y. LeCun, J. S. Denker, S. A. Solla, R. E. Howard, and L. D. Jackel, "Optimal brain damage," in Advances in Neural Information Processing Systems (NIPS), 1989.
[2] Y. Gong, L. Liu, M. Yang, and L. Bourdev, "Compressing deep convolutional networks using vector quantization," CoRR, abs/1412.6115, 2014.
[3] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Advances in Neural Information Processing Systems (NIPS), 2015.
[4] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," CoRR, abs/1504.04788, 2015.
[5] Y. Ioannou, D. Robertson, J. Shotton, R. Cipolla, and A. Criminisi, "Training CNNs with low-rank filters for efficient image classification," arXiv preprint arXiv:1511.06744, 2015.
[6] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," arXiv preprint arXiv:1405.3866, 2014.
[7] J. Feng and T. Darrell, "Learning the structure of deep convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2749-2757.
[8] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, "Sparse convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 806-814.
[9] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," arXiv preprint arXiv:1608.03665, 2016.
[10] M. J. Shafiee, A. Mishra, and A. Wong, "Deep learning with Darwin: Evolutionary synthesis of deep neural networks," arXiv preprint arXiv:1606.04393, 2016.
[11] D. White and P. Ligomenides, "GANNet: A genetic algorithm for optimizing topology and weights in neural network design," in International Workshop on Artificial Neural Networks. Springer, 1993, pp. 322-327.
[12] K. O. Stanley and R. Miikkulainen, "Evolving neural networks through augmenting topologies," Evolutionary Computation, vol. 10, no. 2, pp. 99-127, 2002.
[13] O. Welzel, C. H. Tischbirek, J. Jung, E. M. Kohler, A. Svetlitchny, A. W. Henkel, J. Kornhuber, and T. W. Groemer, "Synapse clusters are preferentially formed by synapses with large recycling pool sizes," PLoS ONE, vol. 5, no. 10, p. e13514, 2010.
[14] G. Kastellakis, D. J. Cai, S. C. Mednick, A. J. Silva, and P. Poirazi, "Synaptic clustering within dendrites: An emerging theory of memory formation," Progress in Neurobiology, vol. 126, pp. 19-35, 2015.
[15] M. E. Larkum and T. Nevian, "Synaptic clustering by dendritic signalling mechanisms," Current Opinion in Neurobiology, vol. 18, no. 3, pp. 321-331, 2008.
[16] N. Takahashi, K. Kitamura, N. Matsuo, M. Mayford, M. Kano, N. Matsuki, and Y. Ikegaya, "Locally synchronized synaptic inputs," Science, vol. 335, no. 6066, pp. 353-356, 2012.
[17] J. Winnubst and C. Lohmann, "Synaptic clustering during development and learning: the why, when, and how," Frontiers in Molecular Neuroscience, vol. 5, 2015.
[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[19] A. Coates, A. Y. Ng, and H. Lee, "An analysis of single-layer networks in unsupervised feature learning," in International Conference on Artificial Intelligence and Statistics, 2011, pp. 215-223.
[20] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," 2009.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NIPS), 2012.