Generating Random Parameters in Feedforward Neural Networks with Random Hidden Nodes: Drawbacks of the Standard Method and How to Improve It

The standard method of generating random weights and biases in feedforward neural networks with random hidden nodes selects them both from the uniform distribution over the same fixed interval. In this work, we show the drawbacks of this approach an…

Authors: Grzegorz Dudek

Grzegorz Dudek
Electrical Engineering Department, Czestochowa University of Technology, Czestochowa, Poland, dudek@el.pcz.czest.pl

Abstract

The standard method of generating random weights and biases in feedforward neural networks with random hidden nodes selects them both from the uniform distribution over the same fixed interval. In this work, we show the drawbacks of this approach and propose a new method of generating random parameters. This method ensures that the most nonlinear fragments of the sigmoids, which are most useful in modeling target function nonlinearity, are kept inside the input hypercube. In addition, we show how to generate activation functions with uniformly distributed slope angles.

Keywords: Feedforward neural networks, Neural networks with random hidden nodes, Randomized learning algorithms

1. Introduction

Single-hidden-layer feedforward neural networks with random hidden nodes (FNNRHN) have become popular in recent years due to their fast learning speed, good generalization performance and ease of implementation. Additionally, these networks do not use a gradient descent method for learning, which is time consuming and sensitive to local minima of the error function (which is nonconvex in this case). In randomized learning, the weights and biases of the hidden nodes are selected at random from an interval [−u, u] and stay fixed. The optimization problem becomes convex and the output weights can be learned using a simple, scalable standard linear least-squares method [1]. The resulting FNN has a universal approximation capability when the random parameters are selected from a symmetric interval according to any continuous sampling distribution [2].
But how to select this interval and which distribution to use are open questions, and are considered to be the most important research gaps in randomized learning [3, 4].

✩ Supported by Grant 2017/27/B/ST6/01804 from the National Science Centre, Poland.
Preprint submitted to Journal of LaTeX Templates, September 18, 2019.

Typically, the hidden node weights and biases are both selected from a uniform distribution over the fixed interval [−1, 1], without scientific justification, regardless of the data, the problem to be solved, and the activation function type [5]. Some authors optimize the interval, looking for a u that ensures the best model performance [6, 7, 8, 9]. Recently developed methods [10, 11] propose more sophisticated approaches for generating random parameters, where the distribution of the activation functions in space is analyzed and their parameters are adjusted randomly to the data.

In this work, we show the drawbacks of the standard method of random parameter generation and propose its modification. We treat the weights and biases separately due to their different functions. The biases are generated on the basis of the weights and points selected randomly from the input space. The resulting sigmoids have their nonlinear fragments, which are most useful for modeling the target function (TF) fluctuations, inside the input hypercube. Moreover, we show how to generate the weights to produce sigmoids with slope angles distributed uniformly.

2. Generating sigmoids inside the input hypercube

Let us consider an approximation problem for a single-variable function of the form:

    g(x) = \sin(20 \cdot \exp x) \cdot x^2    (1)

To learn the FNNRHN we create a training set Φ containing N = 5000 points (x_l, y_l), where x_l ∼ U(0, 1) and y_l are calculated from (1) and then distorted by adding noise ξ ∼ U(−0.2, 0.2).
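This data-generation setup can be sketched as follows. This is a minimal illustration assuming NumPy; the variable names are my own, and the output normalization step is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5000

# Inputs drawn uniformly from [0, 1].
x = rng.uniform(0.0, 1.0, N)

# Target function (1): g(x) = sin(20 * exp(x)) * x^2.
y = np.sin(20.0 * np.exp(x)) * x**2

# Training targets: g(x) distorted by additive uniform noise U(-0.2, 0.2).
y_train = y + rng.uniform(-0.2, 0.2, N)
```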
A test set of the same size is created in the same manner but without noise. The output is normalized to the range [−1, 1].

Fig. 1 shows the results of fitting when using an FNNRHN with 100 sigmoid hidden nodes, where the weights and biases are randomly selected from U(−1, 1) and U(−10, 10). The bottom charts show the hidden node sigmoids whose linear combination forms the function fitting the data. This fitted function is shown as a solid line in the upper charts. As can be seen from the figure, for a, b ∈ [−1, 1] the sigmoids are flat and their distribution in the input interval [0, 1] (shown as a grey field) does not correspond to the TF fluctuations. This results in a very weak fit. When a, b ∈ [−10, 10] the sigmoids are steeper, but many of them have their steepest fragments, which are around their inflection points, outside of the input interval. The saturated fragments of these sigmoids, which are in the input interval, are useless for modeling nonlinear TFs. So, many of the 100 sigmoids are wasted. From this simple example it can be concluded that to get a parsimonious, flexible FNNRHN model, the sigmoids should be steep enough and their steepest fragments, around the inflection points, should be inside the input interval.

Figure 1: TF (1) fitting: fitted curves and the sigmoids constructing them for a, b ∼ U(−1, 1) (left panel) and for a, b ∼ U(−10, 10) (right panel).

Let us analyze how the inflection points are distributed in space when the weights and biases are selected from a uniform distribution over the interval [−u, u]. The sigmoid value at its inflection point χ is 0.5, thus:

    \frac{1}{1 + \exp(-(a \cdot \chi + b))} = 0.5    (2)

From this equation we obtain:

    \chi = -\frac{b}{a}    (3)

The distribution of the inflection point is the distribution of the ratio of two independent random variables, both having the uniform distribution a, b ∼ U(−u, u).
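The behaviour of this ratio can be checked numerically before deriving its density. A quick Monte Carlo sketch (assuming NumPy; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
u = 10.0            # any symmetric bound; the ratio -b/a does not depend on u
n_samples = 1_000_000

a = rng.uniform(-u, u, n_samples)
b = rng.uniform(-u, u, n_samples)

# Inflection point of the sigmoid 1 / (1 + exp(-(a*x + b))): chi = -b / a.
chi = -b / a

# Fraction of inflection points landing inside the input interval [0, 1].
p_inside = float(np.mean((chi >= 0.0) & (chi <= 1.0)))
print(p_inside)  # close to 0.25, consistent with a constant density of 1/4 on |chi| < 1
```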
In such a case, the probability density function (PDF) of χ is of the form:

    f(\chi) = \int_{-\infty}^{\infty} |a| f_A(a) f_B(a\chi)\,da
            = \begin{cases} \int_{-u}^{u} |a| f_A(a) f_B(a\chi)\,da & \text{for } |\chi| < 1 \\[4pt] \int_{-u/|\chi|}^{u/|\chi|} |a| f_A(a) f_B(a\chi)\,da & \text{for } |\chi| \ge 1 \end{cases}
            = \begin{cases} \frac{1}{4} & \text{for } |\chi| < 1 \\[4pt] \frac{1}{4\chi^2} & \text{for } |\chi| \ge 1 \end{cases}    (4)

where f_A and f_B are the PDFs of the weights and biases, respectively.

The left panel of Fig. 2 shows the PDF of χ. The same PDF can be obtained when a ∼ U(−u, u) and b ∼ U(0, u) (a case sometimes found in the literature). As can be seen from Fig. 2, the probability that the inflection point is inside the input interval (shown as a grey field) is 0.25. This means that most sigmoids have their steepest fragments, which are most useful for modeling TF fluctuations, outside of this interval.

Figure 2: PDF of χ when a, b ∼ U(−u, u) (left panel) and probability that χ belongs to H = [0, 1]^n depending on n (right panel).

For the multivariable case, when we consider n-dimensional sigmoids, the situation improves; see the right panel of Fig. 2. For n = 2, almost 46% of sigmoids have their inflection points in the input square. This percentage increases to more than 90% for n ≥ 7.

To obtain an n-dimensional sigmoid with one of its inflection points χ inside the input hypercube H = [x_{1,min}, x_{1,max}] × ... × [x_{n,min}, x_{n,max}], first we generate weights a = [a_1, a_2, ..., a_n]^T ∈ R^n. Then we set the sigmoid in such a way that χ is at some point x* from H. Thus:

    h(x^*) = \frac{1}{1 + \exp(-(a^T x^* + b))} = 0.5    (5)

From this equation we obtain:

    b = -a^T x^*    (6)

Point x* = [x*_1, ..., x*_n] can be selected as follows:

• It can be a point randomly selected from H: x*_j ∼ U(x_{j,min}, x_{j,max}), j = 1, 2, ..., n.
This method is suitable when the input points are evenly distributed in the hypercube H.

• It can be a randomly selected training point: x* = x_ξ ∈ Φ, where ξ ∼ U{1, ..., N}. This method distributes the sigmoids according to the data density, avoiding empty regions.

• It can be a prototype of a training point cluster: x* = p_i, where p_i is the prototype of the i-th cluster of x ∈ Φ. This method groups the training points into m = #nodes clusters. For each sigmoid a different prototype is taken as x*.

Figure 3: Relationship between a and α (left panel) and the PDF of α for different intervals for a (right panel).

3. Generating sigmoids with uniformly distributed slope angles

It should be noted that the weight a translates nonlinearly into the slope angle of a sigmoid. Let us analyze a sigmoid S which has its inflection point χ at x = 0. In such a case b = 0. The derivative of S at x = 0 is equal to the tangent of its slope angle α at χ:

    \tan \alpha = a h(x)(1 - h(x)) = a \cdot \frac{1}{1 + \exp(-(a \cdot 0 + 0))} \left(1 - \frac{1}{1 + \exp(-(a \cdot 0 + 0))}\right)    (7)

From this equation we obtain the relationship between the weight and the slope angle:

    \alpha = \arctan \frac{a}{4}    (8)

This relationship is depicted in Fig. 3, as well as the PDF of α when the weights a are generated from different intervals. Note that the relationship between a and α is highly nonlinear. The interval [−1, 1] for a corresponds to the interval [−14°, 14°] for α, so only flat sigmoids are obtainable in such a case. For a ∈ [−10, 10] we obtain α ∈ [−68.2°, 68.2°], and for a ∈ [−100, 100] we obtain α ∈ [−87.7°, 87.7°].
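The quoted angle ranges follow directly from relationship (8) and its inverse (9). A small sketch (assuming the standard library `math` module; function names are my own):

```python
import math

def slope_angle_deg(a):
    """Slope angle (degrees) of a sigmoid at its inflection point, alpha = arctan(a/4)."""
    return math.degrees(math.atan(a / 4.0))

def weight_from_angle(alpha_deg):
    """Inverse relationship (9): a = 4 * tan(alpha)."""
    return 4.0 * math.tan(math.radians(alpha_deg))

# Reproducing the intervals quoted above:
print(round(slope_angle_deg(1), 1))    # ~14.0 degrees: only flat sigmoids
print(round(slope_angle_deg(10), 1))   # ~68.2 degrees
print(round(slope_angle_deg(100), 1))  # ~87.7 degrees
```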
For narrow intervals for a, such as [−1, 1], the distribution of α is similar to a uniform one. When the interval for a is extended, the shape of the PDF of α changes: larger angles, near the bounds, become more probable than smaller ones. When a ∈ [−100, 100], more than 77% of sigmoids are inclined at an angle greater than 80°, so they are very steep. In such a case, there is a real threat of overfitting.

To generate sigmoids with uniformly distributed slope angles, first we generate |α| ∼ U(α_min, α_max) individually for them, where α_min ∈ (0°, 90°) and α_max ∈ (α_min, 90°). The border angles, α_min and α_max, can both be adjusted to the problem being solved. For highly nonlinear TFs, with strong fluctuations, only α_min can be adjusted, keeping α_max = 90°. Having the angles, we calculate the weights from (8):

    a = 4 \tan \alpha    (9)

For the multivariable case, we generate all n weights in this way, independently for each of the m sigmoids. This ensures random slopes (between α_min and α_max) for the multidimensional sigmoids in each of the n directions.

The proposed method of generating the random parameters of the hidden neurons is summarized in Algorithm 1. In this algorithm, the weights a can be generated randomly from U(−u, u) or, optionally, to ensure a uniform distribution of the sigmoid slope angles, they can be determined from slope angles generated randomly from U(α_min, α_max). The bounds u, α_min and α_max should be selected in cross-validation.

Algorithm 1: Generating Random Parameters of FNNRHN

Input:
  Number of hidden nodes m
  Number of inputs n
  Bounds for weights, u ∈ R, or optionally bounds for slope angles, α_min ∈ (0°, 90°) and α_max ∈ (α_min, 90°)
  Set of m points x* ∈ H: {x*_1, ..., x*_m}

Output:
  Weights A = \begin{bmatrix} a_{1,1} & \dots & a_{m,1} \\ \vdots & \ddots & \vdots \\ a_{1,n} & \dots & a_{m,n} \end{bmatrix}
  Biases b = [b_1, \dots, b_m]

Procedure:
  for i = 1 to m do
    for j = 1 to n do
      Choose randomly a_{i,j} ∼ U(−u, u), or optionally choose randomly α_{i,j} ∼ U(α_min, α_max) and calculate a_{i,j} = (−1)^q · 4 tan α_{i,j}, where q ∼ U{0, 1}
    end for
    Calculate b_i = −a_i^T x*_i
  end for

4. Simulation study

The results of TF (1) fitting when using the proposed method are shown in Fig. 4. In this case the weights were selected from U(−10, 10) and the biases were determined according to (6). As can be seen from this figure, all sigmoids have their inflection points inside H. The number of hidden nodes needed to achieve RMSE = 0.0084 is 35. To obtain a similar level of error we need over 60 nodes when using the standard method for generating the parameters.

Figure 4: TF (1) fitting: fitted curve and the sigmoids constructing it for the proposed method.

The following experiments concern multivariable function fitting. The TF in this case is defined as:

    g(\mathbf{x}) = \sum_{j=1}^{n} \sin(20 \cdot \exp x_j) \cdot x_j^2    (10)

The training set contains N points (x_l, y_l), where x_{l,j} ∼ U(0, 1) and y_l are calculated from (10), then normalized to the range [−1, 1] and distorted by adding noise ξ ∼ U(−0.2, 0.2). A test set of the same size is created in the same manner but without noise. The experiments were carried out for n = 2 (N = 5000), n = 5 (N = 20000) and n = 10 (N = 50000), using:

• SM – the standard method of generating both weights and biases from U(−u, u),

• PMu – the proposed method of generating weights from U(−u, u) and biases according to (6),

• PMα – the proposed method of generating slope angles from U(α_min, 90°), then calculating weights from (9) and biases from (6).

Fig. 5 shows the mean test errors over 100 trials for different node numbers.
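Algorithm 1 can be rendered as a short NumPy sketch. This is an illustration, not the authors' code: the function name, the vectorized form, and the (m, n) layout of the x* array are my own choices. It covers both variants (weights from U(−u, u), or weights from uniformly distributed slope angles via (9)), with biases from (6) in both cases.

```python
import numpy as np

def generate_random_parameters(m, n, x_star, u=None,
                               alpha_min=None, alpha_max=90.0, rng=None):
    """Sketch of Algorithm 1: weights A (n x m) and biases b (m,) of the hidden nodes.

    Pass either u (weights drawn from U(-u, u)) or alpha_min/alpha_max in degrees
    (weights from uniformly distributed slope angles, a = +/-4*tan(alpha)).
    x_star is an (m, n) array of points inside the input hypercube H.
    """
    if rng is None:
        rng = np.random.default_rng()
    if u is not None:
        A = rng.uniform(-u, u, size=(n, m))
    else:
        alpha = np.radians(rng.uniform(alpha_min, alpha_max, size=(n, m)))
        sign = rng.choice([-1.0, 1.0], size=(n, m))  # (-1)^q with q ~ U{0, 1}
        A = sign * 4.0 * np.tan(alpha)               # equation (9)
    # Biases from (6): b_i = -a_i^T x*_i places inflection point i at x*_i.
    b = -np.einsum('ij,ji->i', x_star, A)
    return A, b

# Usage: 100 one-dimensional sigmoids with x* drawn randomly from H = [0, 1].
rng = np.random.default_rng(0)
m, n = 100, 1
x_star = rng.uniform(0.0, 1.0, size=(m, n))
A, b = generate_random_parameters(m, n, x_star, alpha_min=30.0, rng=rng)
# Each inflection point chi_i = -b_i / a_i now equals x*_i, i.e. lies inside H.
```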
For each node number the optimal value of u or α_min was selected from u ∈ {1, 2, ..., 10, 20, 50, 100} and α_min ∈ {0°, 10°, ..., 80°}, respectively. As can be seen from Fig. 5, PMα leads to the best results in all cases. For n = 2 it needs fewer nodes to reach a lower error (0.0352) than PMu and SM. Interestingly, for higher dimensions, using too many nodes leads to an increase in the error for SM and PMu. This can be related to overfitting caused by the steep nodes generated by the standard method. At the same time, for PMα, where the node slope angles are distributed uniformly, a decrease in the error is observed. This issue needs to be explored in detail on a larger number of data sets.

Figure 5: RMSE depending on the number of nodes for n = 2, n = 5 and n = 10.

5. Conclusion

A drawback of the standard method of generating random hidden nodes in FNNs is that many sigmoids have their most nonlinear fragments outside of the input hypercube, especially in low-dimensional cases. So, they cannot be used for modeling the target function fluctuations. In this work, we propose a method of generating random parameters which ensures that all the sigmoids have their steepest fragments inside the input hypercube. In addition, we show how to determine the weights to ensure the sigmoids have uniformly distributed slope angles. This prevents overfitting, which can happen when weights are generated in the standard way, especially for highly nonlinear target functions.

References

[1] J. Principe, B. Chen, Universal approximation with convex optimization: gimmick or reality?, IEEE Computational Intelligence Magazine 10 (2) (2015) 68–77. doi:10.1109/MCI.2015.2405352.

[2] D. Husmeier, Random vector functional link (RVFL) networks, in: Neural Networks for Conditional Probability Estimation: Forecasting Beyond Point Predictions, Springer-Verlag London, 1999, Ch. 6, pp. 87–97. doi:10.1007/978-1-4471-0847-4_6.

[3] L. Zhang, P. Suganthan, A survey of randomized algorithms for training neural networks, Information Sciences 364–365 (2016) 146–155. doi:10.1016/j.ins.2016.01.039.

[4] W. Cao, X. Wang, Z. Ming, J. Gao, A review on neural networks with random weights, Neurocomputing 275 (2018) 278–287. doi:10.1016/j.neucom.2017.08.040.

[5] S. Scardapane, D. Comminiello, M. Scarpiniti, A. Uncini, A semi-supervised random vector functional-link network based on the transductive framework, Information Sciences 364–365 (2016) 156–166. doi:10.1016/j.ins.2015.07.060.

[6] D. Wang, M. Li, Stochastic configuration networks: Fundamentals and algorithms, IEEE Transactions on Cybernetics 47 (10) (2017) 3466–3479. doi:10.1109/TCYB.2017.2734043.

[7] M. Li, D. Wang, Insights into randomized algorithms for neural networks: Practical issues and common pitfalls, Information Sciences 382–383 (2017) 170–178. doi:10.1016/j.ins.2016.12.007.

[8] L. Zhang, P. Suganthan, A comprehensive evaluation of random vector functional link networks, Information Sciences 367–368 (2016) 1094–1105. doi:10.1016/j.ins.2015.09.025.

[9] F. Cao, D. Wang, H. Zhu, Y. Wang, An iterative learning algorithm for feedforward neural networks with random weights, Information Sciences 328 (2016) 546–557. doi:10.1016/j.ins.2015.09.002.

[10] G. Dudek, Generating random weights and biases in feedforward neural networks with random hidden nodes, Information Sciences 481 (2019) 33–56. doi:10.1016/j.ins.2018.12.063.

[11] G. Dudek, Improving randomized learning of feedforward neural networks by appropriate generation of random parameters, in: Advances in Computational Intelligence. IWANN 2019, Vol. 11506 of LNCS, Springer, 2019, pp. 517–530. doi:10.1007/978-3-030-20521-8_43.
