
Learning ELM network weights using linear discriminant analysis

Philip de Chazal¹, Jonathan Tapson¹ and André van Schaik¹

¹ The MARCS Institute, University of Western Sydney, Penrith NSW 2751, Australia
{p.dechazal, j.tapson, a.vanschaik}@uws.edu.au

Abstract. We present an alternative to the pseudo-inverse method for determining the hidden to output weight values for Extreme Learning Machines performing classification tasks. The method is based on linear discriminant analysis and provides Bayes optimal single point estimates for the weight values.

Keywords: Extreme learning machine, Linear discriminant analysis, Hidden to output weight optimization, MNIST database

1 Introduction

The Extreme Learning Machine (ELM) is a multi-layer feedforward neural network that offers fast training and flexible non-linearity for function and classification tasks. Its principal benefit is that the network parameters are calculated in a single pass during the training process [1]. In its standard form it has an input layer that is fully connected to a hidden layer with non-linear activation functions. The hidden layer is fully connected to an output layer with linear activation functions. The number of hidden units is often much greater than the number of input units, with a fan-out of 5 to 20 hidden units per input frequently used. A key feature of ELMs is that the weights connecting the input layer to the hidden layer are set to random values. This simplifies the training requirements to one of determining the hidden to output unit weights, which can be achieved in a single pass. By randomly projecting the inputs to a much higher dimensionality, it is possible to find a hyperplane which approximates a desired regression function, or represents a linearly separable classification problem [2].
A common way of calculating the hidden to output weights is to use the Moore-Penrose pseudo-inverse applied to the hidden layer outputs using labelled training data. In this paper we present an alternative method for hidden to output weight calculation for networks performing classification tasks. The advantage of our method over the pseudo-inverse method is that the weights are the best single point estimates from a Bayesian perspective for a linear output stage. Using the same network architecture and the same random values for the input to hidden layer weights, we applied the two weight calculation methods to the MNIST database and demonstrated that our method offers a performance advantage.

2 Methods

If we consider a particular sample of input data $\mathbf{x}_k \in \mathbb{R}^{L \times 1}$, where $k$ is a series index and $K$ is the length of the series, then the forward propagation of the local signals through the network can be described by:

$$y_{n,k} = \sum_{m=1}^{M} w^{(2)}_{n,m} \, g\!\left( \sum_{l=1}^{L} w^{(1)}_{m,l} x_{l,k} \right) \qquad (1)$$

where $\mathbf{y}_k \in \mathbb{R}^{N \times 1}$ is the output vector corresponding to the input vector $\mathbf{x}_k$; $l$ and $L$ are the input layer index and number of input features respectively; $m$ and $M$ are the hidden layer index and number of hidden units respectively; and $n$ and $N$ are the output layer index and number of output units respectively. $w^{(1)}$ and $w^{(2)}$ are the weights associated with the input to hidden layer and the hidden to output layer linear sums respectively. $g(\cdot)$ is the hidden layer non-linear activation function. With ELM, the $w^{(1)}$ are assigned randomly, which simplifies the training requirements to the task of optimising $w^{(2)}$ only. The choice of linear output neurons further simplifies the optimisation of $w^{(2)}$ to a single pass algorithm.
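As a concrete illustration of eq. (1), here is a minimal NumPy sketch of the ELM forward pass. The activation $g(\cdot)$ is assumed to be tanh and the layer sizes are illustrative choices of our own; the paper leaves both generic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (our own choice; the paper uses L = 784 MNIST pixels
# and a fan-out of 1-20 hidden units per input, i.e. M = fan-out * L).
L, M, K = 4, 20, 100          # input features, hidden units, samples

# Random input-to-hidden weights w(1), fixed once and never trained
# (uniform on [-0.5, 0.5], as in the experiments of Section 3).
W1 = rng.uniform(-0.5, 0.5, size=(M, L))

def hidden_outputs(X):
    """Hidden activations a_k = g(w(1) x_k) for all K samples at once.

    X has shape (L, K); the result A has shape (M, K).
    g is assumed to be tanh here; the paper leaves it generic.
    """
    return np.tanh(W1 @ X)

X = rng.standard_normal((L, K))
A = hidden_outputs(X)
```

Because $w^{(1)}$ is never trained, everything that follows operates only on the matrix of hidden activations `A`.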
The weight optimisation problem for $w^{(2)}$ can be stated as

$$y_{n,k} = \sum_{m=1}^{M} w^{(2)}_{n,m} a_{m,k}, \quad \text{where} \quad a_{m,k} = g\!\left( \sum_{l=1}^{L} w^{(1)}_{m,l} x_{l,k} \right). \qquad (2)$$

We can restate this as a matrix equation by using $\mathbf{W} \in \mathbb{R}^{N \times M}$ with elements $w^{(2)}_{n,m}$; $\mathbf{A} \in \mathbb{R}^{M \times K}$, in which each column contains the outputs of the hidden units at one instant in the series, $\mathbf{a}_k \in \mathbb{R}^{M \times 1}$; and the output $\mathbf{Y} \in \mathbb{R}^{N \times K}$, where each column contains the output of the network at one instant in the series, as follows:

$$\mathbf{Y} = \mathbf{W}\mathbf{A}. \qquad (3)$$

The optimisation problem involves determining the matrix $\mathbf{W}$ given a series of desired outputs for $\mathbf{Y}$ and a series of hidden layer outputs $\mathbf{A}$. We represent the desired outputs for $\mathbf{y}_k$ using the target vectors $\mathbf{t}_k \in \mathbb{R}^{N \times 1}$, where $t_{n,k}$ has value 1 in the row corresponding to the desired class and 0 for the other $N-1$ elements. For example, $\mathbf{t}_k = [0, 1, 0, 0]^T$ indicates the desired target is class 2 (of four classes). As above, we can restate the desired targets using a matrix $\mathbf{T} \in \mathbb{R}^{N \times K}$ where each column contains the desired targets of the network at one instant in the series. Substituting $\mathbf{T}$ in for the desired outputs $\mathbf{Y}$, the optimisation problem involves solving the following linear equation for $\mathbf{W}$:

$$\mathbf{T} = \mathbf{W}\mathbf{A}. \qquad (4)$$

2.1 Output weight calculation using the pseudo-inverse

In the ELM literature, $\mathbf{W}$ is often determined by taking the Moore-Penrose pseudo-inverse $\mathbf{A}^{+} \in \mathbb{R}^{K \times M}$ of $\mathbf{A}$ [3]. If the rows of $\mathbf{A}$ are linearly independent (which is normally true if $K > M$), then $\mathbf{W}$ may be calculated using

$$\mathbf{W} = \mathbf{T}\mathbf{A}^{+}, \quad \text{where} \quad \mathbf{A}^{+} = \mathbf{A}^{T} (\mathbf{A}\mathbf{A}^{T})^{-1}. \qquad (5)$$

This minimises the sum of squared errors between the network outputs $\mathbf{Y}$ and the desired outputs $\mathbf{T}$, i.e. $\mathbf{A}^{+}$ minimises

$$\sum_{k=1}^{K} \sum_{n=1}^{N} \left( y_{n,k} - t_{n,k} \right)^2 = \left\| \mathbf{Y} - \mathbf{T} \right\|_2^2. \qquad (6)$$

We refer to the pseudo-inverse method for output weight calculation as PI-ELM.
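A minimal sketch of the PI-ELM weight calculation of eq. (5), using synthetic hidden-layer outputs and labels since the point is only the linear algebra. `np.linalg.pinv` computes the Moore-Penrose pseudo-inverse, which reduces to $\mathbf{A}^{T}(\mathbf{A}\mathbf{A}^{T})^{-1}$ when the rows of $\mathbf{A}$ are linearly independent:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, K = 20, 3, 100                     # hidden units, classes, samples (K > M)

A = rng.standard_normal((M, K))          # hidden-layer outputs (synthetic here)
labels = rng.integers(0, N, size=K)

# One-hot target matrix T of eq. (4): t_{n,k} = 1 in the desired-class row.
T = np.zeros((N, K))
T[labels, np.arange(K)] = 1.0

# W = T A+ of eq. (5); pinv also copes with rank-deficient A, where the
# explicit A^T (A A^T)^{-1} formula would fail.
W = T @ np.linalg.pinv(A)
```

The same `W` is the least-squares solution of eq. (6), so it agrees with `np.linalg.lstsq` applied to the transposed system $\mathbf{A}^{T}\mathbf{W}^{T} = \mathbf{T}^{T}$.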
We note that in cases where the classification problem is ill-posed it may be necessary to regularise this solution, using standard methods such as Tikhonov regularisation (ridge regression).

2.2 Output weight calculation using linear discriminant analysis

In this paper we develop an alternative approach to estimating $\mathbf{W}$ based on a maximum likelihood estimator assuming a linear model. We refer to it as the LDA-ELM method, as it is equivalent to applying linear discriminant analysis to the hidden layer outputs. Our presentation is based on the notation of Ripley [4].

For an $N$-class problem, Bayes' rule states that the posterior probability $p_n$ of the $n$th class is related to its prior probability $\pi_n$ and its class density function $f(\mathbf{d}; \boldsymbol{\theta}_n)$ by

$$p_n = \frac{\pi_n f(\mathbf{d}; \boldsymbol{\theta}_n)}{\sum_{z=1}^{N} \pi_z f(\mathbf{d}; \boldsymbol{\theta}_z)} \qquad (7)$$

where $\mathbf{d}$ is the input data vector (in our case the hidden layer output) and $\boldsymbol{\theta}_n$ are the parameters of the class density function.

The class densities are modelled with a multivariate Gaussian model with common covariance $\boldsymbol{\Sigma}$ and class dependent mean vectors $\boldsymbol{\mu}_n$. Given an input vector $\mathbf{a}_k$, the class density is

$$f(\mathbf{a}_k; \boldsymbol{\theta}_n = \{\boldsymbol{\mu}_n, \boldsymbol{\Sigma}\}) = (2\pi)^{-M/2} \, |\boldsymbol{\Sigma}|^{-1/2} \exp\!\left( -\tfrac{1}{2} (\mathbf{a}_k - \boldsymbol{\mu}_n)^T \boldsymbol{\Sigma}^{-1} (\mathbf{a}_k - \boldsymbol{\mu}_n) \right). \qquad (8)$$

We set the dimension of the Gaussian model equal to the number of hidden units, so that $\mathbf{a}_k$ is as defined above for the hidden unit outputs and hence $\boldsymbol{\Sigma} \in \mathbb{R}^{M \times M}$ and $\boldsymbol{\mu}_n \in \mathbb{R}^{M \times 1}$.

To begin with, the training data is partitioned according to class membership, so that we have $K = \sum_{n=1}^{N} K_n$ labelled data vectors of hidden unit outputs, $\mathbf{a}_k^{(n)} \in \mathbb{R}^{M \times 1}$, $k = 1..K_n$, where all members $\mathbf{a}^{(n)}$ belong to class $n$.
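The class density of eq. (8) is the standard multivariate normal with a shared covariance; a direct NumPy transcription, for illustration only (the final method never needs to evaluate the density itself, since eq. (15) later absorbs everything into a linear function):

```python
import numpy as np

def class_density(a, mu, Sigma):
    """Evaluate f(a; mu, Sigma) of eq. (8): a multivariate Gaussian with
    class mean mu and common covariance Sigma, both of dimension M."""
    M = len(mu)
    d = a - mu
    norm = (2 * np.pi) ** (-M / 2) / np.sqrt(np.linalg.det(Sigma))
    # solve(Sigma, d) computes Sigma^{-1} d without forming the inverse.
    return norm * np.exp(-0.5 * d @ np.linalg.solve(Sigma, d))
```

For example, with $M = 1$, zero mean, and unit variance, the density at the mean is $1/\sqrt{2\pi}$.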
For a given set of hidden unit output data and class memberships, a likelihood function $l(\boldsymbol{\theta})$ is formed using

$$l(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \ldots, \boldsymbol{\theta}_N) = \prod_{n=1}^{N} \prod_{k=1}^{K_n} \pi_n f(\mathbf{a}_k^{(n)}; \boldsymbol{\theta}_n). \qquad (9)$$

Our aim is to find values of $\boldsymbol{\theta}_n$ that maximise $l(\boldsymbol{\theta})$ for a given set of training data. Equivalently, we can maximise the value of the log-likelihood:

$$L(\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_N) = \log l(\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_N) = \sum_{n=1}^{N} \sum_{k=1}^{K_n} \log f(\mathbf{a}_k^{(n)}; \boldsymbol{\theta}_n) + \sum_{n=1}^{N} K_n \log \pi_n. \qquad (10)$$

Substituting our multivariate Gaussian model for $f(\mathbf{a}_k^{(n)}; \boldsymbol{\theta}_n)$ we get

$$L(\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_N, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \sum_{k=1}^{K_n} \left( -\tfrac{M}{2} \log 2\pi - \tfrac{1}{2} \log |\boldsymbol{\Sigma}| - \tfrac{1}{2} (\mathbf{a}_k^{(n)} - \boldsymbol{\mu}_n)^T \boldsymbol{\Sigma}^{-1} (\mathbf{a}_k^{(n)} - \boldsymbol{\mu}_n) \right) + \sum_{n=1}^{N} K_n \log \pi_n. \qquad (11)$$

This is maximised when

$$\boldsymbol{\mu}_n = \frac{1}{K_n} \sum_{k=1}^{K_n} \mathbf{a}_k^{(n)}, \quad \text{and} \quad \boldsymbol{\Sigma} = \frac{1}{K} \sum_{n=1}^{N} \sum_{k=1}^{K_n} (\mathbf{a}_k^{(n)} - \boldsymbol{\mu}_n)(\mathbf{a}_k^{(n)} - \boldsymbol{\mu}_n)^T. \qquad (12)$$

Having determined the $\boldsymbol{\mu}_n$'s and $\boldsymbol{\Sigma}$ from the training data, we now need to find the values for $\mathbf{W}$. We begin by substituting (8) into (7), bringing the $\pi_n$ into the exponential function and removing the common numerator and denominator term $(2\pi)^{-M/2} |\boldsymbol{\Sigma}|^{-1/2}$, giving us

$$p_n = \frac{\exp\!\left( -\tfrac{1}{2} (\mathbf{a}_k - \boldsymbol{\mu}_n)^T \boldsymbol{\Sigma}^{-1} (\mathbf{a}_k - \boldsymbol{\mu}_n) + \log \pi_n \right)}{\sum_{z=1}^{N} \exp\!\left( -\tfrac{1}{2} (\mathbf{a}_k - \boldsymbol{\mu}_z)^T \boldsymbol{\Sigma}^{-1} (\mathbf{a}_k - \boldsymbol{\mu}_z) + \log \pi_z \right)}. \qquad (13)$$

After expanding the $-\tfrac{1}{2} (\mathbf{a}_k - \boldsymbol{\mu}_n)^T \boldsymbol{\Sigma}^{-1} (\mathbf{a}_k - \boldsymbol{\mu}_n)$ terms and removing the common $-\tfrac{1}{2} \mathbf{a}_k^T \boldsymbol{\Sigma}^{-1} \mathbf{a}_k$ from the numerator and denominator we get

$$p_n = \frac{\exp(y_n)}{\sum_{z=1}^{N} \exp(y_z)} \qquad (14)$$

where

$$y_n = \log \pi_n + \boldsymbol{\mu}_n^T \boldsymbol{\Sigma}^{-1} \mathbf{a}_k - \tfrac{1}{2} \boldsymbol{\mu}_n^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_n. \qquad (15)$$

Classification is performed by choosing the class with the highest value of $p_n$. As $p_n$ in (14) is a monotonic function of $y_n$ in (15), we can use either function when deciding our final class.
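The maximum likelihood estimates of eq. (12) are just per-class means and a pooled, class-centred covariance. A sketch (NumPy, with synthetic class-partitioned data of our own making):

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 5, 3                              # hidden units, classes
K_n = [40, 30, 30]                       # per-class counts; K = sum(K_n)
K = sum(K_n)

# Hidden-unit outputs already partitioned by class: parts[n] has shape (M, K_n[n]).
parts = [rng.standard_normal((M, k)) + n for n, k in enumerate(K_n)]

# mu_n = (1 / K_n) * sum_k a_k^(n), eq. (12)
mus = [P.mean(axis=1) for P in parts]

# Sigma = (1 / K) * sum_n sum_k (a - mu_n)(a - mu_n)^T, eq. (12):
# each class is centred on its own mean, then the scatter is pooled over classes.
Sigma = sum((P - mu[:, None]) @ (P - mu[:, None]).T
            for P, mu in zip(parts, mus)) / K
```

Note that a single covariance is shared across all classes; it is this pooling that makes the resulting discriminant linear in $\mathbf{a}_k$.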
We choose to use $y_n$ defined in (15), as it is a linear function of the input data vector $\mathbf{a}_k$, and it can be used to determine $\mathbf{W}$ for our network as follows:

$$\mathbf{W} = \begin{bmatrix} \log \pi_1 - \tfrac{1}{2} \boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 & \boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \\ \log \pi_2 - \tfrac{1}{2} \boldsymbol{\mu}_2^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2 & \boldsymbol{\mu}_2^T \boldsymbol{\Sigma}^{-1} \\ \vdots & \vdots \\ \log \pi_N - \tfrac{1}{2} \boldsymbol{\mu}_N^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_N & \boldsymbol{\mu}_N^T \boldsymbol{\Sigma}^{-1} \end{bmatrix}. \qquad (16)$$

Note that $\mathbf{W} \in \mathbb{R}^{N \times (M+1)}$, as a constant term has been introduced into the hidden to output layer weights (the first column of $\mathbf{W}$). If we want to determine the posterior probabilities, then we use (14) applied to the network outputs.

Summary of method

In summary, calculating $\mathbf{W}$ proceeds as follows:

(i) Partition the hidden unit output data according to class membership, so that we have $K = \sum_{n=1}^{N} K_n$ labelled data vectors, $\mathbf{a}_k^{(n)} \in \mathbb{R}^{M \times 1}$, $k = 1..K_n$, where all members $\mathbf{a}^{(n)}$ belong to class $n$.
(ii) Calculate $\boldsymbol{\mu}_n = \frac{1}{K_n} \sum_{k=1}^{K_n} \mathbf{a}_k^{(n)}$ and $\boldsymbol{\Sigma} = \frac{1}{K} \sum_{n=1}^{N} \sum_{k=1}^{K_n} (\mathbf{a}_k^{(n)} - \boldsymbol{\mu}_n)(\mathbf{a}_k^{(n)} - \boldsymbol{\mu}_n)^T$.
(iii) Set the prior probabilities $\pi_n$.
(iv) Calculate $\mathbf{W}$ using (16).

To classify new data we:

(i) Calculate the network output $\mathbf{y}$ in response to the hidden layer output $\mathbf{a}$ using $\mathbf{y} = \mathbf{W} \begin{bmatrix} 1 \\ \mathbf{a} \end{bmatrix}$.
(ii) (Optional) Calculate the posterior probabilities $p_n = \exp(y_n) / \sum_{z=1}^{N} \exp(y_z)$.
(iii) The final decision of the network is the output with the highest value of $y_n$ or, equivalently, $p_n$.

Combining classifiers

Equation (14) provides an easy way to combine the outputs of multiple classifiers. Once the posterior probabilities are calculated for each class for each classifier, we can form a combined posterior probability and choose the class with the highest combined posterior probability. There are many schemes for doing this [5], with unweighted averaging across the posterior probability outputs being one of the simplest.
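The summary steps above translate directly into code. A NumPy sketch that builds $\mathbf{W}$ from eq. (16) and classifies via eqs. (14)-(15); the helper names are our own:

```python
import numpy as np

def lda_elm_weights(mus, Sigma, priors):
    """Build W of eq. (16): row n is [log pi_n - 0.5 mu_n^T S^-1 mu_n, mu_n^T S^-1],
    so the constant terms occupy the first column and W is N x (M+1)."""
    Sinv = np.linalg.inv(Sigma)
    rows = []
    for mu, pi in zip(mus, priors):
        bias = np.log(pi) - 0.5 * mu @ Sinv @ mu
        rows.append(np.concatenate(([bias], mu @ Sinv)))
    return np.array(rows)

def classify(W, a):
    """y = W [1; a], i.e. eq. (15) for every class; posteriors via eq. (14)."""
    y = W @ np.concatenate(([1.0], a))
    p = np.exp(y - y.max())                # subtract max for numerical stability
    return y, p / p.sum()

# Two well-separated classes with identity covariance and equal priors.
mus = [np.array([2.0, 0.0]), np.array([0.0, 2.0])]
W = lda_elm_weights(mus, np.eye(2), [0.5, 0.5])
y, p = classify(W, np.array([2.0, 0.0]))   # a point at the class-1 mean
```

A point at a class mean should be assigned to that class, and the posteriors sum to one by construction.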
3 Experiments

We applied the LDA-ELM and PI-ELM weight calculation methods to the MNIST handwritten digit recognition problem [6]. Authors JT and AvS have previously reported good classification results using ELM on this database [2]. The database has 60,000 training and 10,000 testing examples. Each example is a 28×28 pixel, 256-level grayscale image of a handwritten digit between 0 and 9. The 10 classes are approximately equally distributed in the training and testing sets.

The ELM algorithms were applied directly to the unprocessed images, and we trained the networks by providing all data in batch mode. The random values for the input layer weights were uniformly distributed between -0.5 and 0.5. The prior probabilities for the 10 classes for LDA-ELM were each set to 0.1.

In order to perform a direct comparison of the two methods we used the following protocol. For fan-out of 1 to 20 hidden units per input, repeat 200 times:

(i) Assign random values to the input layer weights and determine the hidden layer outputs for the 60,000 training data examples.
(ii) Determine PI-ELM network weights using data from (i).
(iii) Determine LDA-ELM network weights using data from (i).
(iv) Evaluate both networks on the 10,000 test data examples and store the results.

We averaged the results for the 200 repeats of the experiment for each fan-out and compared the misclassification rates. These results are shown in Fig. 1 and Table 1.

Fig. 1. The error rate of the LDA-ELM and the PI-ELM on the MNIST database for fan-out varying between 1 and 20. All results at each fan-out are averaged from 200 repeats of the experiment.

The results show that the LDA-ELM outperforms the PI-ELM at every fan-out value.
The average performance benefit was a 3.1% decrease in the error rate of LDA-ELM, with a larger benefit at smaller fan-out values. Table 2 below shows that there is little extra computational requirement for the LDA-ELM method.

Table 1. The error rate (%) and percentage improvement of the LDA-ELM over the PI-ELM on the MNIST database. Results averaged from 200 repeats.

Fan-out         1     2     3     4     5     6     7     8     9    10    12    15    20
PI-ELM        7.20  5.21  4.32  3.80  3.45  3.20  3.00  2.86  2.74  2.63  2.49  2.31  2.15
LDA-ELM       6.84  5.03  4.17  3.68  3.35  3.11  2.92  2.78  2.66  2.55  2.42  2.25  2.08
% improvement  4.9   3.5   3.3   3.1   2.9   3.0   2.6   2.9   2.8   2.7   2.6   2.6   3.3

Table 2. Computation times (in seconds). The elapsed time is shown for training the PI-ELM and LDA-ELM networks on the 60,000 images from the MNIST database and testing on the 10,000 images, using MATLAB R2012a code running on a 2012 Sony Vaio Z series laptop with an Intel i7-2640M 2.8 GHz processor and 8 GB RAM.

Fan-out         1     2     3     4     5     6     7     8     9    10    12    15    20
PI-ELM        6.2  13.9  24.9  37.8  53.3  68.5  88.2   111   136   162   228   339   630
LDA-ELM       6.2  13.9  25.2  38.1  54.0  69.5  90.6   113   140   167   238   357   702

The last experiment we performed investigated combining multiple networks using the LDA-ELM by averaging posterior probabilities. We investigated using an ensemble number between 1 and 20 and repeated the training and testing 10 times at each ensemble number. We then averaged the results, which are shown in Fig. 2.

Fig. 2. The error rate of the LDA-ELM on the MNIST database at a fan-out of 20, with the ensemble number varying between 1 and 20. The result at each ensemble number is averaged from 10 repeats of the experiment.

The results shown in Fig. 2 demonstrate the benefit of combining multiple LDA-ELM networks on the MNIST database.
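The ensemble scheme used here, unweighted averaging of the posterior probabilities of eq. (14) across networks, is a one-liner. A sketch with made-up posteriors from three hypothetical networks:

```python
import numpy as np

def combine_posteriors(posteriors):
    """Average the per-network posterior vectors (rows) and pick the class
    with the highest combined posterior, as described in Section 2."""
    p = np.mean(posteriors, axis=0)
    return p, int(np.argmax(p))

# Posteriors for one test sample from three hypothetical LDA-ELM networks.
votes = np.array([[0.6, 0.3, 0.1],
                  [0.5, 0.4, 0.1],
                  [0.2, 0.7, 0.1]])
p, cls = combine_posteriors(votes)
```

In this example, two of the three networks favour class 1 (0-indexed class 0), but the third is confident enough about class 2 that the averaged posterior selects class 2, illustrating how averaging weighs confidence, not just votes.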
Combining two networks reduced the error rate from 2.08% to 1.86%, and adding more networks further reduced the error. The best error rate was 1.69%, achieved when 20 networks were combined.

4 Discussion

The results on the MNIST database shown in Fig. 1 suggest that there is a performance benefit to be gained by using the LDA-ELM output weight calculation over the PI-ELM method. As there is only a small extra computation overhead, we believe it is a viable alternative to the pseudo-inverse method, especially at small fan-out values.

Another benefit of the LDA-ELM is the ability to combine outputs from networks by combining the posterior probability estimates of the individual networks. When we applied this to the MNIST database we were able to reduce the error rate to 1.7%. This result is comparable to the best performance of most other 2 and 3 layer neural networks processing the raw data [7].

Further work will include comparing the two weight calculation methods on other publicly available databases, such as the abalone and iris data sets [8].

5 Conclusion

We have presented a new method for calculating the hidden to output weights for ELM networks performing classification tasks. The method is based on linear discriminant analysis and requires a modest amount of extra calculation time compared to the pseudo-inverse method (<12% for a fan-out ≤ 20). When applied to the MNIST database, the average misclassification rate improvement was 3.1% in comparison to the pseudo-inverse method for identically configured and initialised networks.

6 Bibliography

1. G.B. Huang, Q.Y. Zhu and C.K. Siew, "Extreme Learning Machine: Theory and Applications," Neurocomputing, vol. 70, pp. 489–501, 2006.
2. J. Tapson and A. van Schaik, "Learning the pseudoinverse solution to network weights," Neural Networks, vol. 45, pp. 94–100, Sep. 2013.
3. R. Penrose and J.A. Todd, "On best approximate solutions of linear matrix equations," Mathematical Proceedings of the Cambridge Philosophical Society, vol. 52, pp. 17–19, 1956.
4. B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge Univ. Press, 1996.
5. L.I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley Press, 2004.
6. Y. LeCun and C. Cortes, "The MNIST database of handwritten digits", Available at: http://yann.lecun.com/exdb/mnist.
7. Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proceedings of the IEEE, vol. 86(11), pp. 2278–2324, Nov. 1998.
8. Data sets from the UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/.
