
Learning ELM network weights using linear discriminant analysis

Philip de Chazal¹, Jonathan Tapson¹ and André van Schaik¹

¹ The MARCS Institute, University of Western Sydney, Penrith NSW 2751, Australia
{p.dechazal, j.tapson, a.vanschaik}@uws.edu.au

Abstract. We present an alternative to the pseudo-inverse method for determining the hidden to output weight values for Extreme Learning Machines performing classification tasks. The method is based on linear discriminant analysis and provides Bayes optimal single point estimates for the weight values.

Keywords: Extreme learning machine, Linear discriminant analysis, Hidden to output weight optimization, MNIST database

1 Introduction

The Extreme Learning Machine (ELM) is a multi-layer feedforward neural network that offers fast training and flexible non-linearity for function and classification tasks. Its principal benefit is that the network parameters are calculated in a single pass during the training process [1]. In its standard form it has an input layer that is fully connected to a hidden layer with non-linear activation functions. The hidden layer is fully connected to an output layer with linear activation functions. The number of hidden units is often much greater than the number of input units, with a fan-out of 5 to 20 hidden units per input frequently used. A key feature of ELMs is that the weights connecting the input layer to the hidden layer are set to random values. This simplifies the training requirements to one of determining the hidden to output unit weights, which can be achieved in a single pass. By randomly projecting the inputs to a much higher dimensionality, it is possible to find a hyperplane which approximates a desired regression function, or represents a linearly separable classification problem [2].
A common way of calculating the hidden to output weights is to use the Moore-Penrose pseudo-inverse applied to the hidden layer outputs using labelled training data. In this paper we present an alternative method for hidden to output weight calculation for networks performing classification tasks. The advantage of our method over the pseudo-inverse method is that the weights are the best single point estimates from a Bayesian perspective for a linear output stage. Using the same network architecture and the same random values for the input to hidden layer weights, we applied the two weight calculation methods to the MNIST database and demonstrated that our method offers a performance advantage.

2 Methods

If we consider a particular sample of input data $\mathbf{x}_k \in \mathbb{R}^{L \times 1}$, where $k$ is a series index and $K$ is the length of the series, then the forward propagation of the local signals through the network can be described by:

$$y_{n,k} = \sum_{m=1}^{M} w^{(2)}_{n,m} \, g\!\left( \sum_{l=1}^{L} w^{(1)}_{m,l} x_{l,k} \right) \qquad (1)$$

where $\mathbf{y}_k \in \mathbb{R}^{N \times 1}$ is the output vector corresponding to the input vector $\mathbf{x}_k$; $l$ and $L$ are the input layer index and number of input features respectively; $m$ and $M$ are the hidden layer index and number of hidden units respectively; and $n$ and $N$ are the output layer index and number of output units respectively. $w^{(1)}$ and $w^{(2)}$ are the weights associated with the input to hidden layer and the hidden to output layer linear sums respectively. $g(\cdot)$ is the hidden layer non-linear activation function. With ELM, the $w^{(1)}$ are assigned randomly, which simplifies the training requirements to the task of optimising $w^{(2)}$ only. The choice of linear output neurons further simplifies the optimisation of $w^{(2)}$ to a single pass algorithm.
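As a concrete illustration of eq. (1), here is a minimal NumPy sketch of the ELM forward pass. The activation $g(\cdot)$ is assumed to be tanh and the layer sizes are illustrative choices of our own; the paper leaves both generic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (our own choice; the paper uses L = 784 MNIST pixels
# and a fan-out of 1-20 hidden units per input, i.e. M = fan-out * L).
L, M, K = 4, 20, 100          # input features, hidden units, samples

# Random input-to-hidden weights w(1), fixed once and never trained
# (uniform on [-0.5, 0.5], as in the experiments of Section 3).
W1 = rng.uniform(-0.5, 0.5, size=(M, L))

def hidden_outputs(X):
    """Hidden activations a_k = g(w(1) x_k) for all K samples at once.

    X has shape (L, K); the result A has shape (M, K).
    g is assumed to be tanh here; the paper leaves it generic.
    """
    return np.tanh(W1 @ X)

X = rng.standard_normal((L, K))
A = hidden_outputs(X)
```

Because $w^{(1)}$ is never trained, everything that follows operates only on the matrix of hidden activations `A`.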
The weight optimisation problem for $w^{(2)}$ can be stated as

$$y_{n,k} = \sum_{m=1}^{M} w^{(2)}_{n,m} a_{m,k}, \quad \text{where} \quad a_{m,k} = g\!\left( \sum_{l=1}^{L} w^{(1)}_{m,l} x_{l,k} \right). \qquad (2)$$

We can restate this as a matrix equation by using $\mathbf{W} \in \mathbb{R}^{N \times M}$ with elements $w^{(2)}_{n,m}$; $\mathbf{A} \in \mathbb{R}^{M \times K}$, in which each column contains the outputs of the hidden units at one instant in the series, $\mathbf{a}_k \in \mathbb{R}^{M \times 1}$; and the output $\mathbf{Y} \in \mathbb{R}^{N \times K}$, where each column contains the output of the network at one instant in the series, as follows:

$$\mathbf{Y} = \mathbf{W}\mathbf{A}. \qquad (3)$$

The optimisation problem involves determining the matrix $\mathbf{W}$ given a series of desired outputs for $\mathbf{Y}$ and a series of hidden layer outputs $\mathbf{A}$. We represent the desired outputs for $\mathbf{y}_k$ using the target vectors $\mathbf{t}_k \in \mathbb{R}^{N \times 1}$, where $t_{n,k}$ has value 1 in the row corresponding to the desired class and 0 for the other $N-1$ elements. For example, $\mathbf{t}_k = [0, 1, 0, 0]^T$ indicates the desired target is class 2 (of four classes). As above, we can restate the desired targets using a matrix $\mathbf{T} \in \mathbb{R}^{N \times K}$ where each column contains the desired targets of the network at one instant in the series. Substituting $\mathbf{T}$ in for the desired outputs $\mathbf{Y}$, the optimisation problem involves solving the following linear equation for $\mathbf{W}$:

$$\mathbf{T} = \mathbf{W}\mathbf{A}. \qquad (4)$$

2.1 Output weight calculation using the pseudo-inverse

In the ELM literature, $\mathbf{W}$ is often determined by taking the Moore-Penrose pseudo-inverse $\mathbf{A}^{+} \in \mathbb{R}^{K \times M}$ of $\mathbf{A}$ [3]. If the rows of $\mathbf{A}$ are linearly independent (which is normally true if $K > M$), then $\mathbf{W}$ may be calculated using

$$\mathbf{W} = \mathbf{T}\mathbf{A}^{+}, \quad \text{where} \quad \mathbf{A}^{+} = \mathbf{A}^{T} (\mathbf{A}\mathbf{A}^{T})^{-1}. \qquad (5)$$

This minimises the sum of squared errors between the network outputs $\mathbf{Y}$ and the desired outputs $\mathbf{T}$, i.e. $\mathbf{A}^{+}$ minimises

$$\sum_{k=1}^{K} \sum_{n=1}^{N} \left( y_{n,k} - t_{n,k} \right)^2 = \left\| \mathbf{Y} - \mathbf{T} \right\|_2^2. \qquad (6)$$

We refer to the pseudo-inverse method for output weight calculation as PI-ELM.
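A minimal sketch of the PI-ELM weight calculation of eq. (5), using synthetic hidden-layer outputs and labels since the point is only the linear algebra. `np.linalg.pinv` computes the Moore-Penrose pseudo-inverse, which reduces to $\mathbf{A}^{T}(\mathbf{A}\mathbf{A}^{T})^{-1}$ when the rows of $\mathbf{A}$ are linearly independent:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, K = 20, 3, 100                     # hidden units, classes, samples (K > M)

A = rng.standard_normal((M, K))          # hidden-layer outputs (synthetic here)
labels = rng.integers(0, N, size=K)

# One-hot target matrix T of eq. (4): t_{n,k} = 1 in the desired-class row.
T = np.zeros((N, K))
T[labels, np.arange(K)] = 1.0

# W = T A+ of eq. (5); pinv also copes with rank-deficient A, where the
# explicit A^T (A A^T)^{-1} formula would fail.
W = T @ np.linalg.pinv(A)
```

The same `W` is the least-squares solution of eq. (6), so it agrees with `np.linalg.lstsq` applied to the transposed system $\mathbf{A}^{T}\mathbf{W}^{T} = \mathbf{T}^{T}$.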
We note that in cases where the classification problem is ill-posed it may be necessary to regularise this solution, using standard methods such as Tikhonov regularisation (ridge regression).

2.2 Output weight calculation using linear discriminant analysis

In this paper we develop an alternative approach to estimating $\mathbf{W}$ based on a maximum likelihood estimator assuming a linear model. We refer to it as the LDA-ELM method, as it is equivalent to applying linear discriminant analysis to the hidden layer outputs. Our presentation is based on the notation of Ripley [4].

For an $N$-class problem, Bayes' rule states that the posterior probability $p_n$ of the $n$th class is related to its prior probability $\pi_n$ and its class density function $f(\mathbf{d}; \boldsymbol{\theta}_n)$ by

$$p_n = \frac{\pi_n f(\mathbf{d}; \boldsymbol{\theta}_n)}{\sum_{z=1}^{N} \pi_z f(\mathbf{d}; \boldsymbol{\theta}_z)} \qquad (7)$$

where $\mathbf{d}$ is the input data vector (in our case the hidden layer output) and $\boldsymbol{\theta}_n$ are the parameters of the class density function.

The class densities are modelled with a multivariate Gaussian model with common covariance $\boldsymbol{\Sigma}$ and class dependent mean vectors $\boldsymbol{\mu}_n$. Given an input vector $\mathbf{a}_k$, the class density is

$$f(\mathbf{a}_k; \boldsymbol{\theta}_n = \{\boldsymbol{\mu}_n, \boldsymbol{\Sigma}\}) = (2\pi)^{-M/2} \, |\boldsymbol{\Sigma}|^{-1/2} \exp\!\left( -\tfrac{1}{2} (\mathbf{a}_k - \boldsymbol{\mu}_n)^T \boldsymbol{\Sigma}^{-1} (\mathbf{a}_k - \boldsymbol{\mu}_n) \right). \qquad (8)$$

We set the dimension of the Gaussian model equal to the number of hidden units, so that $\mathbf{a}_k$ is as defined above for the hidden unit outputs and hence $\boldsymbol{\Sigma} \in \mathbb{R}^{M \times M}$ and $\boldsymbol{\mu}_n \in \mathbb{R}^{M \times 1}$.

To begin with, the training data is partitioned according to class membership, so that we have $K = \sum_{n=1}^{N} K_n$ labelled data vectors of hidden unit outputs, $\mathbf{a}_k^{(n)} \in \mathbb{R}^{M \times 1}$, $k = 1..K_n$, where all members $\mathbf{a}^{(n)}$ belong to class $n$.
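The class density of eq. (8) is the standard multivariate normal with a shared covariance; a direct NumPy transcription, for illustration only (the final method never needs to evaluate the density itself, since eq. (15) later absorbs everything into a linear function):

```python
import numpy as np

def class_density(a, mu, Sigma):
    """Evaluate f(a; mu, Sigma) of eq. (8): a multivariate Gaussian with
    class mean mu and common covariance Sigma, both of dimension M."""
    M = len(mu)
    d = a - mu
    norm = (2 * np.pi) ** (-M / 2) / np.sqrt(np.linalg.det(Sigma))
    # solve(Sigma, d) computes Sigma^{-1} d without forming the inverse.
    return norm * np.exp(-0.5 * d @ np.linalg.solve(Sigma, d))
```

For example, with $M = 1$, zero mean, and unit variance, the density at the mean is $1/\sqrt{2\pi}$.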
For a given set of hidden unit output data and class memberships, a likelihood function $l(\boldsymbol{\theta})$ is formed using

$$l(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \ldots, \boldsymbol{\theta}_N) = \prod_{n=1}^{N} \prod_{k=1}^{K_n} \pi_n f(\mathbf{a}_k^{(n)}; \boldsymbol{\theta}_n). \qquad (9)$$

Our aim is to find values of $\boldsymbol{\theta}_n$ that maximise $l(\boldsymbol{\theta})$ for a given set of training data. Equivalently, we can maximise the value of the log-likelihood:

$$L(\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_N) = \log l(\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_N) = \sum_{n=1}^{N} \sum_{k=1}^{K_n} \log f(\mathbf{a}_k^{(n)}; \boldsymbol{\theta}_n) + \sum_{n=1}^{N} K_n \log \pi_n. \qquad (10)$$

Substituting our multivariate Gaussian model for $f(\mathbf{a}_k^{(n)}; \boldsymbol{\theta}_n)$ we get

$$L(\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_N, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \sum_{k=1}^{K_n} \left( -\tfrac{M}{2} \log 2\pi - \tfrac{1}{2} \log |\boldsymbol{\Sigma}| - \tfrac{1}{2} (\mathbf{a}_k^{(n)} - \boldsymbol{\mu}_n)^T \boldsymbol{\Sigma}^{-1} (\mathbf{a}_k^{(n)} - \boldsymbol{\mu}_n) \right) + \sum_{n=1}^{N} K_n \log \pi_n. \qquad (11)$$

This is maximised when

$$\boldsymbol{\mu}_n = \frac{1}{K_n} \sum_{k=1}^{K_n} \mathbf{a}_k^{(n)}, \quad \text{and} \quad \boldsymbol{\Sigma} = \frac{1}{K} \sum_{n=1}^{N} \sum_{k=1}^{K_n} (\mathbf{a}_k^{(n)} - \boldsymbol{\mu}_n)(\mathbf{a}_k^{(n)} - \boldsymbol{\mu}_n)^T. \qquad (12)$$

Having determined the $\boldsymbol{\mu}_n$'s and $\boldsymbol{\Sigma}$ from the training data, we now need to find the values for $\mathbf{W}$. We begin by substituting (8) into (7), bringing the $\pi_n$ into the exponential function and removing the common numerator and denominator term $(2\pi)^{-M/2} |\boldsymbol{\Sigma}|^{-1/2}$, giving us

$$p_n = \frac{\exp\!\left( -\tfrac{1}{2} (\mathbf{a}_k - \boldsymbol{\mu}_n)^T \boldsymbol{\Sigma}^{-1} (\mathbf{a}_k - \boldsymbol{\mu}_n) + \log \pi_n \right)}{\sum_{z=1}^{N} \exp\!\left( -\tfrac{1}{2} (\mathbf{a}_k - \boldsymbol{\mu}_z)^T \boldsymbol{\Sigma}^{-1} (\mathbf{a}_k - \boldsymbol{\mu}_z) + \log \pi_z \right)}. \qquad (13)$$

After expanding the $-\tfrac{1}{2} (\mathbf{a}_k - \boldsymbol{\mu}_n)^T \boldsymbol{\Sigma}^{-1} (\mathbf{a}_k - \boldsymbol{\mu}_n)$ terms and removing the common $-\tfrac{1}{2} \mathbf{a}_k^T \boldsymbol{\Sigma}^{-1} \mathbf{a}_k$ from the numerator and denominator we get

$$p_n = \frac{\exp(y_n)}{\sum_{z=1}^{N} \exp(y_z)} \qquad (14)$$

where

$$y_n = \log \pi_n + \boldsymbol{\mu}_n^T \boldsymbol{\Sigma}^{-1} \mathbf{a}_k - \tfrac{1}{2} \boldsymbol{\mu}_n^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_n. \qquad (15)$$

Classification is performed by choosing the class with the highest value of $p_n$. As $p_n$ in (14) is a monotonic function of $y_n$ in (15), we can use either function when deciding our final class.
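The maximum likelihood estimates of eq. (12) are just per-class means and a pooled, class-centred covariance. A sketch (NumPy, with synthetic class-partitioned data of our own making):

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 5, 3                              # hidden units, classes
K_n = [40, 30, 30]                       # per-class counts; K = sum(K_n)
K = sum(K_n)

# Hidden-unit outputs already partitioned by class: parts[n] has shape (M, K_n[n]).
parts = [rng.standard_normal((M, k)) + n for n, k in enumerate(K_n)]

# mu_n = (1 / K_n) * sum_k a_k^(n), eq. (12)
mus = [P.mean(axis=1) for P in parts]

# Sigma = (1 / K) * sum_n sum_k (a - mu_n)(a - mu_n)^T, eq. (12):
# each class is centred on its own mean, then the scatter is pooled over classes.
Sigma = sum((P - mu[:, None]) @ (P - mu[:, None]).T
            for P, mu in zip(parts, mus)) / K
```

Note that a single covariance is shared across all classes; it is this pooling that makes the resulting discriminant linear in $\mathbf{a}_k$.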
We choose to use $y_n$ defined in (15), as it is a linear function of the input data vector $\mathbf{a}_k$, and it can be used to determine $\mathbf{W}$ for our network as follows:

$$\mathbf{W} = \begin{bmatrix} \log \pi_1 - \tfrac{1}{2} \boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 & \boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \\ \log \pi_2 - \tfrac{1}{2} \boldsymbol{\mu}_2^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2 & \boldsymbol{\mu}_2^T \boldsymbol{\Sigma}^{-1} \\ \vdots & \vdots \\ \log \pi_N - \tfrac{1}{2} \boldsymbol{\mu}_N^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_N & \boldsymbol{\mu}_N^T \boldsymbol{\Sigma}^{-1} \end{bmatrix}. \qquad (16)$$

Note that $\mathbf{W} \in \mathbb{R}^{N \times (M+1)}$, as a constant term has been introduced into the hidden to output layer weights (the first column of $\mathbf{W}$). If we want to determine the posterior probabilities, then we use (14) applied to the network outputs.

Summary of method

In summary, calculating $\mathbf{W}$ proceeds as follows:

(i) Partition the hidden unit output data according to class membership, so that we have $K = \sum_{n=1}^{N} K_n$ labelled data vectors, $\mathbf{a}_k^{(n)} \in \mathbb{R}^{M \times 1}$, $k = 1..K_n$, where all members $\mathbf{a}^{(n)}$ belong to class $n$.
(ii) Calculate $\boldsymbol{\mu}_n = \frac{1}{K_n} \sum_{k=1}^{K_n} \mathbf{a}_k^{(n)}$ and $\boldsymbol{\Sigma} = \frac{1}{K} \sum_{n=1}^{N} \sum_{k=1}^{K_n} (\mathbf{a}_k^{(n)} - \boldsymbol{\mu}_n)(\mathbf{a}_k^{(n)} - \boldsymbol{\mu}_n)^T$.
(iii) Set the prior probabilities $\pi_n$.
(iv) Calculate $\mathbf{W}$ using (16).

To classify new data we:

(i) Calculate the network output $\mathbf{y}$ in response to the hidden layer output $\mathbf{a}$ using $\mathbf{y} = \mathbf{W} \begin{bmatrix} 1 \\ \mathbf{a} \end{bmatrix}$.
(ii) (Optional) Calculate the posterior probabilities $p_n = \exp(y_n) / \sum_{z=1}^{N} \exp(y_z)$.
(iii) The final decision of the network is the output with the highest value of $y_n$ or, equivalently, $p_n$.

Combining classifiers

Equation (14) provides an easy way to combine the outputs of multiple classifiers. Once the posterior probabilities are calculated for each class for each classifier, we can form a combined posterior probability and choose the class with the highest combined posterior probability. There are many schemes for doing this [5], with unweighted averaging across the posterior probability outputs being one of the simplest.
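The summary steps above translate directly into code. A NumPy sketch that builds $\mathbf{W}$ from eq. (16) and classifies via eqs. (14)-(15); the helper names are our own:

```python
import numpy as np

def lda_elm_weights(mus, Sigma, priors):
    """Build W of eq. (16): row n is [log pi_n - 0.5 mu_n^T S^-1 mu_n, mu_n^T S^-1],
    so the constant terms occupy the first column and W is N x (M+1)."""
    Sinv = np.linalg.inv(Sigma)
    rows = []
    for mu, pi in zip(mus, priors):
        bias = np.log(pi) - 0.5 * mu @ Sinv @ mu
        rows.append(np.concatenate(([bias], mu @ Sinv)))
    return np.array(rows)

def classify(W, a):
    """y = W [1; a], i.e. eq. (15) for every class; posteriors via eq. (14)."""
    y = W @ np.concatenate(([1.0], a))
    p = np.exp(y - y.max())                # subtract max for numerical stability
    return y, p / p.sum()

# Two well-separated classes with identity covariance and equal priors.
mus = [np.array([2.0, 0.0]), np.array([0.0, 2.0])]
W = lda_elm_weights(mus, np.eye(2), [0.5, 0.5])
y, p = classify(W, np.array([2.0, 0.0]))   # a point at the class-1 mean
```

A point at a class mean should be assigned to that class, and the posteriors sum to one by construction.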
3 Experiments

We applied the LDA-ELM and PI-ELM weight calculation methods to the MNIST handwritten digit recognition problem [6]. Authors JT and AvS have previously reported good classification results using ELM on this database [2]. The database has 60,000 training and 10,000 testing examples. Each example is a 28×28 pixel, 256-level grayscale image of a handwritten digit between 0 and 9. The 10 classes are approximately equally distributed in the training and testing sets.

The ELM algorithms were applied directly to the unprocessed images, and we trained the networks by providing all data in batch mode. The random values for the input layer weights were uniformly distributed between -0.5 and 0.5. The prior probabilities for the 10 classes for LDA-ELM were each set to 0.1.

In order to perform a direct comparison of the two methods we used the following protocol. For fan-out of 1 to 20 hidden units per input, repeat 200 times:

(i) Assign random values to the input layer weights and determine the hidden layer outputs for the 60,000 training data examples.
(ii) Determine PI-ELM network weights using data from (i).
(iii) Determine LDA-ELM network weights using data from (i).
(iv) Evaluate both networks on the 10,000 test data examples and store the results.

We averaged the results for the 200 repeats of the experiment for each fan-out and compared the misclassification rates. These results are shown in Fig. 1 and Table 1.

Fig. 1. The error rate of the LDA-ELM and the PI-ELM on the MNIST database for fan-out varying between 1 and 20. All results at each fan-out are averaged from 200 repeats of the experiment.

The results show that the LDA-ELM outperforms the PI-ELM at every fan-out value.
The average performance benefit was a 3.1% decrease in the error rate of LDA-ELM, with a larger benefit at smaller fan-out values. Table 2 below shows that there is little extra computational requirement for the LDA-ELM method.

Table 1. The error rate (%) and percentage improvement of the LDA-ELM over the PI-ELM on the MNIST database. Results averaged from 200 repeats.

Fan-out         1     2     3     4     5     6     7     8     9    10    12    15    20
PI-ELM        7.20  5.21  4.32  3.80  3.45  3.20  3.00  2.86  2.74  2.63  2.49  2.31  2.15
LDA-ELM       6.84  5.03  4.17  3.68  3.35  3.11  2.92  2.78  2.66  2.55  2.42  2.25  2.08
% improvement  4.9   3.5   3.3   3.1   2.9   3.0   2.6   2.9   2.8   2.7   2.6   2.6   3.3

Table 2. Computation times (in seconds). The elapsed time is shown for training the PI-ELM and LDA-ELM networks on the 60,000 images from the MNIST database and testing on the 10,000 images, using MATLAB R2012a code running on a 2012 Sony Vaio Z series laptop with an Intel i7-2640M 2.8 GHz processor and 8 GB RAM.

Fan-out         1     2     3     4     5     6     7     8     9    10    12    15    20
PI-ELM        6.2  13.9  24.9  37.8  53.3  68.5  88.2   111   136   162   228   339   630
LDA-ELM       6.2  13.9  25.2  38.1  54.0  69.5  90.6   113   140   167   238   357   702

The last experiment we performed investigated combining multiple networks using the LDA-ELM by averaging posterior probabilities. We investigated using an ensemble number between 1 and 20 and repeated the training and testing 10 times at each ensemble number. We then averaged the results, which are shown in Fig. 2.

Fig. 2. The error rate of the LDA-ELM on the MNIST database at a fan-out of 20, with the ensemble number varying between 1 and 20. The result at each ensemble number is averaged from 10 repeats of the experiment.

The results shown in Fig. 2 demonstrate the benefit of combining multiple LDA-ELM networks on the MNIST database.
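The ensemble scheme used here, unweighted averaging of the posterior probabilities of eq. (14) across networks, is a one-liner. A sketch with made-up posteriors from three hypothetical networks:

```python
import numpy as np

def combine_posteriors(posteriors):
    """Average the per-network posterior vectors (rows) and pick the class
    with the highest combined posterior, as described in Section 2."""
    p = np.mean(posteriors, axis=0)
    return p, int(np.argmax(p))

# Posteriors for one test sample from three hypothetical LDA-ELM networks.
votes = np.array([[0.6, 0.3, 0.1],
                  [0.5, 0.4, 0.1],
                  [0.2, 0.7, 0.1]])
p, cls = combine_posteriors(votes)
```

In this example, two of the three networks favour class 1 (0-indexed class 0), but the third is confident enough about class 2 that the averaged posterior selects class 2, illustrating how averaging weighs confidence, not just votes.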
Combining two networks reduced the error rate from 2.08% to 1.86%, and adding more networks further reduced the error. The best error rate was 1.69%, achieved when 20 networks were combined.

4 Discussion

The results on the MNIST database shown in Fig. 1 suggest that there is a performance benefit to be gained by using the LDA-ELM output weight calculation over the PI-ELM method. As there is only a small extra computation overhead, we believe it is a viable alternative to the pseudo-inverse method, especially at small fan-out values.

Another benefit of the LDA-ELM is the ability to combine outputs from networks by combining the posterior probability estimates of the individual networks. When we applied this to the MNIST database we were able to reduce the error rate to 1.7%. This result is comparable to the best performance of most other 2 and 3 layer neural networks processing the raw data [7].

Further work will include comparing the two weight calculation methods on other publicly available databases, such as the abalone and iris data sets [8].

5 Conclusion

We have presented a new method for calculating the hidden to output weights for ELM networks performing classification tasks. The method is based on linear discriminant analysis and requires a modest amount of extra calculation time compared to the pseudo-inverse method (<12% for a fan-out ≤ 20). When applied to the MNIST database, the average misclassification rate improvement was 3.1% in comparison to the pseudo-inverse method for identically configured and initialised networks.

6 Bibliography

1. G.B. Huang, Q.Y. Zhu and C.K. Siew, "Extreme Learning Machine: Theory and Applications," Neurocomputing, vol. 70, pp. 489–501, 2006.
2. J. Tapson and A. van Schaik, "Learning the pseudoinverse solution to network weights," Neural Networks, vol. 45, pp. 94–100, Sep. 2013.
3. R. Penrose and J.A. Todd, "On best approximate solutions of linear matrix equations," Mathematical Proceedings of the Cambridge Philosophical Society, vol. 52, pp. 17–19, 1956.
4. B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge Univ. Press, 1996.
5. L.I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley Press, 2004.
6. Y. LeCun and C. Cortes, "The MNIST database of handwritten digits", Available at: http://yann.lecun.com/exdb/mnist.
7. Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proceedings of the IEEE, vol. 86(11), pp. 2278–2324, Nov. 1998.
8. Data sets from the UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/.
