Generating Random Parameters in Feedforward Neural Networks with Random Hidden Nodes: Drawbacks of the Standard Method and How to Improve It

The standard method of generating random weights and biases in feedforward neural networks with random hidden nodes selects them both from the uniform distribution over the same fixed interval. In this work, we show the drawbacks of this approach an…

Authors: Grzegorz Dudek

Grzegorz Dudek
Electrical Engineering Department, Czestochowa University of Technology, Czestochowa, Poland, dudek@el.pcz.czest.pl

Abstract

The standard method of generating random weights and biases in feedforward neural networks with random hidden nodes selects them both from the uniform distribution over the same fixed interval. In this work, we show the drawbacks of this approach and propose a new method of generating random parameters. This method ensures that the most nonlinear fragments of the sigmoids, which are most useful in modeling target function nonlinearity, are kept inside the input hypercube. In addition, we show how to generate activation functions with uniformly distributed slope angles.

Keywords: Feedforward neural networks, Neural networks with random hidden nodes, Randomized learning algorithms

1. Introduction

Single-hidden-layer feedforward neural networks with random hidden nodes (FNNRHN) have become popular in recent years due to their fast learning speed, good generalization performance and ease of implementation. Additionally, these networks do not use a gradient descent method for learning, which is time consuming and sensitive to local minima of the error function (which is nonconvex in this case). In randomized learning, the weights and biases of the hidden nodes are selected at random from an interval [−u, u] and stay fixed. The optimization problem becomes convex and the output weights can be learned using a simple, scalable standard linear least-squares method [1]. The resulting FNN has a universal approximation capability when the random parameters are selected from a symmetric interval according to any continuous sampling distribution [2].
But how to select this interval and which distribution to use are open questions, and are considered to be the most important research gaps in randomized learning [3, 4].

✩ Supported by Grant 2017/27/B/ST6/01804 from the National Science Centre, Poland.
Preprint submitted to Journal of LaTeX Templates, September 18, 2019.

Typically, the hidden node weights and biases are both selected from a uniform distribution over the fixed interval [−1, 1], without scientific justification, regardless of the data, the problem to be solved, and the activation function type [5]. Some authors optimize the interval, looking for a u that ensures the best model performance [6, 7, 8, 9]. Recently developed methods [10, 11] propose more sophisticated approaches for generating random parameters, where the distribution of the activation functions in space is analyzed and their parameters are adjusted randomly to the data.

In this work, we show the drawbacks of the standard method of random parameter generation and propose its modification. We treat the weights and biases separately due to their different functions. The biases are generated on the basis of the weights and points selected randomly from the input space. The resulting sigmoids have their nonlinear fragments, which are most useful for modeling the target function (TF) fluctuations, inside the input hypercube. Moreover, we show how to generate the weights to produce sigmoids with slope angles distributed uniformly.

2. Generating sigmoids inside the input hypercube

Let us consider an approximation problem for a single-variable function of the form:

    g(x) = \sin(20 \cdot \exp x) \cdot x^2    (1)

To learn the FNNRHN we create a training set Φ containing N = 5000 points (x_l, y_l), where x_l ∼ U(0, 1) and y_l are calculated from (1) and then distorted by adding noise ξ ∼ U(−0.2, 0.2).
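This data-generation setup can be sketched as follows. This is a minimal illustration assuming NumPy; the variable names are my own, and the output normalization step is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5000

# Inputs drawn uniformly from [0, 1].
x = rng.uniform(0.0, 1.0, N)

# Target function (1): g(x) = sin(20 * exp(x)) * x^2.
y = np.sin(20.0 * np.exp(x)) * x**2

# Training targets: g(x) distorted by additive uniform noise U(-0.2, 0.2).
y_train = y + rng.uniform(-0.2, 0.2, N)
```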
A test set of the same size is created in the same manner but without noise. The output is normalized to the range [−1, 1].

Fig. 1 shows the results of fitting when using an FNNRHN with 100 sigmoid hidden nodes, where the weights and biases are randomly selected from U(−1, 1) and U(−10, 10). The bottom charts show the hidden node sigmoids whose linear combination forms the function fitting the data. This fitted function is shown as a solid line in the upper charts. As can be seen from the figure, for a, b ∈ [−1, 1] the sigmoids are flat and their distribution in the input interval [0, 1] (shown as a grey field) does not correspond to the TF fluctuations. This results in a very weak fit. When a, b ∈ [−10, 10] the sigmoids are steeper, but many of them have their steepest fragments, which are around their inflection points, outside of the input interval. The saturated fragments of these sigmoids, which are in the input interval, are useless for modeling nonlinear TFs. So, many of the 100 sigmoids are wasted. From this simple example it can be concluded that to get a parsimonious, flexible FNNRHN model, the sigmoids should be steep enough and their steepest fragments, around the inflection points, should be inside the input interval.

Figure 1: TF (1) fitting: fitted curves and the sigmoids constructing them for a, b ∼ U(−1, 1) (left panel) and for a, b ∼ U(−10, 10) (right panel).

Let us analyze how the inflection points are distributed in space when the weights and biases are selected from a uniform distribution over the interval [−u, u]. The sigmoid value at its inflection point χ is 0.5, thus:

    \frac{1}{1 + \exp(-(a \cdot \chi + b))} = 0.5    (2)

From this equation we obtain:

    \chi = -\frac{b}{a}    (3)

The distribution of the inflection point is the distribution of the ratio of two independent random variables, both having the uniform distribution a, b ∼ U(−u, u).
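The behaviour of this ratio can be checked numerically before deriving its density. A quick Monte Carlo sketch (assuming NumPy; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
u = 10.0            # any symmetric bound; the ratio -b/a does not depend on u
n_samples = 1_000_000

a = rng.uniform(-u, u, n_samples)
b = rng.uniform(-u, u, n_samples)

# Inflection point of the sigmoid 1 / (1 + exp(-(a*x + b))): chi = -b / a.
chi = -b / a

# Fraction of inflection points landing inside the input interval [0, 1].
p_inside = float(np.mean((chi >= 0.0) & (chi <= 1.0)))
print(p_inside)  # close to 0.25, consistent with a constant density of 1/4 on |chi| < 1
```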
In such a case, the probability density function (PDF) of χ is of the form:

    f(\chi) = \int_{-\infty}^{\infty} |a| f_A(a) f_B(a\chi)\,da
            = \begin{cases} \int_{-u}^{u} |a| f_A(a) f_B(a\chi)\,da & \text{for } |\chi| < 1 \\[4pt] \int_{-u/|\chi|}^{u/|\chi|} |a| f_A(a) f_B(a\chi)\,da & \text{for } |\chi| \ge 1 \end{cases}
            = \begin{cases} \frac{1}{4} & \text{for } |\chi| < 1 \\[4pt] \frac{1}{4\chi^2} & \text{for } |\chi| \ge 1 \end{cases}    (4)

where f_A and f_B are the PDFs of the weights and biases, respectively.

The left panel of Fig. 2 shows the PDF of χ. The same PDF can be obtained when a ∼ U(−u, u) and b ∼ U(0, u) (a case sometimes found in the literature). As can be seen from Fig. 2, the probability that the inflection point is inside the input interval (shown as a grey field) is 0.25. This means that most sigmoids have their steepest fragments, which are most useful for modeling TF fluctuations, outside of this interval.

Figure 2: PDF of χ when a, b ∼ U(−u, u) (left panel) and probability that χ belongs to H = [0, 1]^n depending on n (right panel).

For the multivariable case, when we consider n-dimensional sigmoids, the situation improves; see the right panel of Fig. 2. For n = 2, almost 46% of sigmoids have their inflection points in the input square. This percentage increases to more than 90% for n ≥ 7.

To obtain an n-dimensional sigmoid with one of its inflection points χ inside the input hypercube H = [x_{1,min}, x_{1,max}] × ... × [x_{n,min}, x_{n,max}], first we generate weights a = [a_1, a_2, ..., a_n]^T ∈ R^n. Then we set the sigmoid in such a way that χ is at some point x* from H. Thus:

    h(x^*) = \frac{1}{1 + \exp(-(a^T x^* + b))} = 0.5    (5)

From this equation we obtain:

    b = -a^T x^*    (6)

Point x* = [x*_1, ..., x*_n] can be selected as follows:

• It can be a point randomly selected from H: x*_j ∼ U(x_{j,min}, x_{j,max}), j = 1, 2, ..., n.
This method is suitable when the input points are evenly distributed in the hypercube H.

• It can be a randomly selected training point: x* = x_ξ ∈ Φ, where ξ ∼ U{1, ..., N}. This method distributes the sigmoids according to the data density, avoiding empty regions.

• It can be a prototype of a training point cluster: x* = p_i, where p_i is the prototype of the i-th cluster of x ∈ Φ. This method groups the training points into m = #nodes clusters. For each sigmoid a different prototype is taken as x*.

Figure 3: Relationship between a and α (left panel) and the PDF of α for different intervals for a (right panel).

3. Generating sigmoids with uniformly distributed slope angles

It should be noted that the weight a translates nonlinearly into the slope angle of a sigmoid. Let us analyze a sigmoid S which has its inflection point χ at x = 0. In such a case b = 0. The derivative of S at x = 0 is equal to the tangent of its slope angle α at χ:

    \tan \alpha = a h(x)(1 - h(x)) = a \cdot \frac{1}{1 + \exp(-(a \cdot 0 + 0))} \left(1 - \frac{1}{1 + \exp(-(a \cdot 0 + 0))}\right)    (7)

From this equation we obtain the relationship between the weight and the slope angle:

    \alpha = \arctan \frac{a}{4}    (8)

This relationship is depicted in Fig. 3, as well as the PDF of α when the weights a are generated from different intervals. Note that the relationship between a and α is highly nonlinear. The interval [−1, 1] for a corresponds to the interval [−14°, 14°] for α, so only flat sigmoids are obtainable in such a case. For a ∈ [−10, 10] we obtain α ∈ [−68.2°, 68.2°], and for a ∈ [−100, 100] we obtain α ∈ [−87.7°, 87.7°].
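The quoted angle ranges follow directly from relationship (8) and its inverse (9). A small sketch (assuming the standard library `math` module; function names are my own):

```python
import math

def slope_angle_deg(a):
    """Slope angle (degrees) of a sigmoid at its inflection point, alpha = arctan(a/4)."""
    return math.degrees(math.atan(a / 4.0))

def weight_from_angle(alpha_deg):
    """Inverse relationship (9): a = 4 * tan(alpha)."""
    return 4.0 * math.tan(math.radians(alpha_deg))

# Reproducing the intervals quoted above:
print(round(slope_angle_deg(1), 1))    # ~14.0 degrees: only flat sigmoids
print(round(slope_angle_deg(10), 1))   # ~68.2 degrees
print(round(slope_angle_deg(100), 1))  # ~87.7 degrees
```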
For narrow intervals for a, such as [−1, 1], the distribution of α is similar to a uniform one. When the interval for a is extended, the shape of the PDF of α changes: larger angles, near the bounds, become more probable than smaller ones. When a ∈ [−100, 100], more than 77% of sigmoids are inclined at an angle greater than 80°, so they are very steep. In such a case, there is a real threat of overfitting.

To generate sigmoids with uniformly distributed slope angles, first we generate |α| ∼ U(α_min, α_max) individually for them, where α_min ∈ (0°, 90°) and α_max ∈ (α_min, 90°). The border angles, α_min and α_max, can both be adjusted to the problem being solved. For highly nonlinear TFs, with strong fluctuations, only α_min can be adjusted, keeping α_max = 90°. Having the angles, we calculate the weights from (8):

    a = 4 \tan \alpha    (9)

For the multivariable case, we generate all n weights in this way, independently for each of the m sigmoids. This ensures random slopes (between α_min and α_max) for the multidimensional sigmoids in each of the n directions.

The proposed method of generating the random parameters of the hidden neurons is summarized in Algorithm 1. In this algorithm, the weights a can be generated randomly from U(−u, u) or, optionally, to ensure a uniform distribution of the sigmoid slope angles, they can be determined from slope angles generated randomly from U(α_min, α_max). The bounds u, α_min and α_max should be selected in cross-validation.

Algorithm 1: Generating Random Parameters of FNNRHN

Input:
  Number of hidden nodes m
  Number of inputs n
  Bounds for weights, u ∈ R, or optionally bounds for slope angles, α_min ∈ (0°, 90°) and α_max ∈ (α_min, 90°)
  Set of m points x* ∈ H: {x*_1, ..., x*_m}

Output:
  Weights A = \begin{bmatrix} a_{1,1} & \dots & a_{m,1} \\ \vdots & \ddots & \vdots \\ a_{1,n} & \dots & a_{m,n} \end{bmatrix}
  Biases b = [b_1, \dots, b_m]

Procedure:
  for i = 1 to m do
    for j = 1 to n do
      Choose randomly a_{i,j} ∼ U(−u, u), or optionally choose randomly α_{i,j} ∼ U(α_min, α_max) and calculate a_{i,j} = (−1)^q · 4 tan α_{i,j}, where q ∼ U{0, 1}
    end for
    Calculate b_i = −a_i^T x*_i
  end for

4. Simulation study

The results of TF (1) fitting when using the proposed method are shown in Fig. 4. In this case the weights were selected from U(−10, 10) and the biases were determined according to (6). As can be seen from this figure, all sigmoids have their inflection points inside H. The number of hidden nodes needed to achieve RMSE = 0.0084 is 35. To obtain a similar level of error we need over 60 nodes when using the standard method for generating the parameters.

Figure 4: TF (1) fitting: fitted curve and the sigmoids constructing it for the proposed method.

The following experiments concern multivariable function fitting. The TF in this case is defined as:

    g(\mathbf{x}) = \sum_{j=1}^{n} \sin(20 \cdot \exp x_j) \cdot x_j^2    (10)

The training set contains N points (x_l, y_l), where x_{l,j} ∼ U(0, 1) and y_l are calculated from (10), then normalized to the range [−1, 1] and distorted by adding noise ξ ∼ U(−0.2, 0.2). A test set of the same size is created in the same manner but without noise. The experiments were carried out for n = 2 (N = 5000), n = 5 (N = 20000) and n = 10 (N = 50000), using:

• SM – the standard method of generating both weights and biases from U(−u, u),

• PMu – the proposed method of generating weights from U(−u, u) and biases according to (6),

• PMα – the proposed method of generating slope angles from U(α_min, 90°), then calculating weights from (9) and biases from (6).

Fig. 5 shows the mean test errors over 100 trials for different node numbers.
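Algorithm 1 can be rendered as a short NumPy sketch. This is an illustration, not the authors' code: the function name, the vectorized form, and the (m, n) layout of the x* array are my own choices. It covers both variants (weights from U(−u, u), or weights from uniformly distributed slope angles via (9)), with biases from (6) in both cases.

```python
import numpy as np

def generate_random_parameters(m, n, x_star, u=None,
                               alpha_min=None, alpha_max=90.0, rng=None):
    """Sketch of Algorithm 1: weights A (n x m) and biases b (m,) of the hidden nodes.

    Pass either u (weights drawn from U(-u, u)) or alpha_min/alpha_max in degrees
    (weights from uniformly distributed slope angles, a = +/-4*tan(alpha)).
    x_star is an (m, n) array of points inside the input hypercube H.
    """
    if rng is None:
        rng = np.random.default_rng()
    if u is not None:
        A = rng.uniform(-u, u, size=(n, m))
    else:
        alpha = np.radians(rng.uniform(alpha_min, alpha_max, size=(n, m)))
        sign = rng.choice([-1.0, 1.0], size=(n, m))  # (-1)^q with q ~ U{0, 1}
        A = sign * 4.0 * np.tan(alpha)               # equation (9)
    # Biases from (6): b_i = -a_i^T x*_i places inflection point i at x*_i.
    b = -np.einsum('ij,ji->i', x_star, A)
    return A, b

# Usage: 100 one-dimensional sigmoids with x* drawn randomly from H = [0, 1].
rng = np.random.default_rng(0)
m, n = 100, 1
x_star = rng.uniform(0.0, 1.0, size=(m, n))
A, b = generate_random_parameters(m, n, x_star, alpha_min=30.0, rng=rng)
# Each inflection point chi_i = -b_i / a_i now equals x*_i, i.e. lies inside H.
```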
For each node number the optimal value of u or α_min was selected from u ∈ {1, 2, ..., 10, 20, 50, 100} and α_min ∈ {0°, 10°, ..., 80°}, respectively. As can be seen from Fig. 5, PMα leads to the best results in all cases. For n = 2 it needs fewer nodes to reach a lower error (0.0352) than PMu and SM. Interestingly, for higher dimensions, using too many nodes leads to an increase in the error for SM and PMu. This can be related to overfitting caused by the steep nodes generated by the standard method. At the same time, for PMα, where the node slope angles are distributed uniformly, a decrease in the error is observed. This issue needs to be explored in detail on a larger number of data sets.

Figure 5: RMSE depending on the number of nodes for n = 2, n = 5 and n = 10.

5. Conclusion

A drawback of the standard method of generating random hidden nodes in FNNs is that many sigmoids have their most nonlinear fragments outside of the input hypercube, especially in low-dimensional cases. So, they cannot be used for modeling the target function fluctuations. In this work, we propose a method of generating random parameters which ensures that all the sigmoids have their steepest fragments inside the input hypercube. In addition, we show how to determine the weights to ensure the sigmoids have uniformly distributed slope angles. This prevents overfitting, which can happen when weights are generated in the standard way, especially for highly nonlinear target functions.

References

[1] J. Principe, B. Chen, Universal approximation with convex optimization: gimmick or reality?, IEEE Computational Intelligence Magazine 10 (2) (2015) 68–77. doi:10.1109/MCI.2015.2405352.

[2] D. Husmeier, Random vector functional link (RVFL) networks, in: Neural Networks for Conditional Probability Estimation: Forecasting Beyond Point Predictions, Springer-Verlag London, 1999, Ch. 6, pp. 87–97. doi:10.1007/978-1-4471-0847-4_6.

[3] L. Zhang, P. Suganthan, A survey of randomized algorithms for training neural networks, Information Sciences 364–365 (2016) 146–155. doi:10.1016/j.ins.2016.01.039.

[4] W. Cao, X. Wang, Z. Ming, J. Gao, A review on neural networks with random weights, Neurocomputing 275 (2018) 278–287. doi:10.1016/j.neucom.2017.08.040.

[5] S. Scardapane, D. Comminiello, M. Scarpiniti, A. Uncini, A semi-supervised random vector functional-link network based on the transductive framework, Information Sciences 364–365 (2016) 156–166. doi:10.1016/j.ins.2015.07.060.

[6] D. Wang, M. Li, Stochastic configuration networks: Fundamentals and algorithms, IEEE Transactions on Cybernetics 47 (10) (2017) 3466–3479. doi:10.1109/TCYB.2017.2734043.

[7] M. Li, D. Wang, Insights into randomized algorithms for neural networks: Practical issues and common pitfalls, Information Sciences 382–383 (2017) 170–178. doi:10.1016/j.ins.2016.12.007.

[8] L. Zhang, P. Suganthan, A comprehensive evaluation of random vector functional link networks, Information Sciences 367–368 (2016) 1094–1105. doi:10.1016/j.ins.2015.09.025.

[9] F. Cao, D. Wang, H. Zhu, Y. Wang, An iterative learning algorithm for feedforward neural networks with random weights, Information Sciences 328 (2016) 546–557. doi:10.1016/j.ins.2015.09.002.

[10] G. Dudek, Generating random weights and biases in feedforward neural networks with random hidden nodes, Information Sciences 481 (2019) 33–56. doi:10.1016/j.ins.2018.12.063.

[11] G. Dudek, Improving randomized learning of feedforward neural networks by appropriate generation of random parameters, in: Advances in Computational Intelligence. IWANN 2019, Vol. 11506 of LNCS, Springer, 2019, pp. 517–530. doi:10.1007/978-3-030-20521-8_43.
