OCReP: An Optimally Conditioned Regularization for Pseudoinversion Based Neural Training
Rossella Cancelliere (a), Mario Gai (b), Patrick Gallinari (c), Luca Rubini (a)

(a) University of Turin, Dep. of Computer Sciences, C.so Svizzera 185, 10149 Torino, Italy
(b) National Institute of Astrophysics, Astrophys. Observ. of Torino, Pino T.se (TO), Italy
(c) Laboratory of Computer Sciences, LIP6, Univ. Pierre et Marie Curie, Paris, France

Abstract

In this paper we consider the training of single hidden layer neural networks by pseudoinversion, which, in spite of its popularity, is sometimes affected by numerical instability issues. Regularization is known to be effective in such cases, so we introduce, in the framework of Tikhonov regularization, a matricial reformulation of the problem which allows us to use the condition number as a diagnostic tool for the identification of instability. By imposing well-conditioning requirements on the relevant matrices, our theoretical analysis allows the identification of an optimal value for the regularization parameter from the standpoint of stability. We compare it with the value derived by cross-validation for overfitting control and optimisation of the generalization performance. We test our method on both regression and classification tasks. The proposed method is quite effective in terms of predictivity, often with some improvement in performance with respect to the reference cases considered. Thanks to the analytical determination of the regularization parameter, this approach dramatically reduces the computational load required by many other techniques.

Keywords: Regularization parameter, Condition number, Pseudoinversion, Numerical instability

1. Introduction

In past decades, Single Layer Feedforward Neural Network (SLFN) training was mainly accomplished by iterative algorithms involving the repetition of learning steps aimed at minimising the error functional over the space of network parameters. These techniques often gave rise to slow and computationally expensive methods. Researchers have therefore always been motivated to explore alternative algorithms, and recently some new techniques based on matrix inversion have been developed. In the literature, they were initially employed to train radial basis function neural networks (Poggio and Girosi, 1990a); the idea of using them also for different neural architectures was suggested, for instance, in (Cancelliere, 2001).

The work by Huang et al. (see for instance (Huang et al., 2006)) gave rise to great interest in the neural network community: they presented the technique of Extreme Learning Machine (ELM), by which SLFNs with randomly chosen input weights and hidden layer biases can learn sets of observations with a desired precision, provided that the activation functions in the hidden layer are infinitely differentiable. Besides, because of the use of linear output neurons, output weight determination reduces to the solution of a linear system, obtained via the Moore-Penrose generalised inverse (or pseudoinverse) of the hidden layer output matrix; iterative training is thus no longer required.
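To make the single-pass scheme concrete, the following is a minimal NumPy sketch (ours, not the authors' code; the uniform weight range and sigmoid activation anticipate the Sigm-unreg setting of section 5.2, and all function names are illustrative):

```python
import numpy as np

def train_slfn_pinv(X, T, M, seed=0):
    """Single-pass SLFN training by pseudoinversion (illustrative sketch).

    X: (N, L) input matrix; T: (N, Q) target matrix; M: number of hidden neurons.
    Input weights and biases are drawn at random once and never updated.
    """
    rng = np.random.default_rng(seed)
    L = X.shape[1]
    C = rng.uniform(-1.0, 1.0, size=(L, M))     # random input weights
    b = rng.uniform(-1.0, 1.0, size=M)          # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ C + b)))      # hidden layer output, N x M
    W = np.linalg.pinv(H) @ T                   # output weights: W* = H+ T
    return C, b, W
```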
Such techniques appear, however, to require more hidden units than conventional neural network training algorithms to achieve comparable accuracy, as discussed by Yu and Deng (Yu and Deng, 2012).

Many application-oriented studies in recent years have been devoted to the use of these single-pass techniques, which are easy to implement and computationally fast; some are described e.g. in (Nguyen et al., 2010; Kohno et al., 2010; Ajorloo et al., 2007). A yearly conference is currently being held on the subject, the International Conference on Extreme Learning Machines, and the method is currently dealt with in journal special issues, e.g. Soft Computing (Wang et al., 2012) and the International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems (Wang, 2013).

Because of the possible presence of singular and almost singular matrices, pseudoinversion is known to be a powerful but numerically unstable method; nonetheless, in the neural network community it is often used without singularity checks and evaluated through approximated methods. In this paper we improve on the theoretical framework, using singular value analysis to detect the occurrence of instability. Building on Tikhonov regularization, which is known to be effective in this context (Golub et al., 1999), we present a technique, named Optimally Conditioned Regularization for Pseudoinversion (OCReP), that replaces unstable, ill-posed problems with well-posed ones.

Our approach is based on the formal definition of a new matricial formulation that allows the use of the condition number as a diagnostic tool. In this context, an optimal value for the regularization parameter is analytically derived by imposing well-conditioning requirements on the relevant matrices.

The issue of regularization parameter choice has often been identified as crucial in the literature, and dealt with in a number of historical contributions: a conservative guess might put its published estimates at several dozens. Some of the most relevant works are mentioned in section 2, where the related theoretical background is recalled. Its determination, mainly aimed at overfitting control, has often been done either experimentally via cross-validation, requiring heavy computational training procedures, or analytically under specific conditions on the matrices involved, sometimes hardly applicable to real datasets, as discussed in section 2.

In section 3 we present the basic concepts concerning input and output weight setting, and we recall the main ideas on ill-posedness, regularization and condition number. In section 4 our matricial framework is introduced, and constraints on the condition number are imposed in order to derive the optimal value for the regularization parameter. In section 5 our diagnosis and control tool is tested on some applications selected from the UCI database and validated by comparison with the framework regularized via cross-validation and with the unregularized one. The same datasets are used in section 6 to test the technique's effectiveness: our performance is compared with those obtained in other regularized frameworks, originating in both statistical and neural domains.
2. Recap on ordinary least-squares and ridge regression estimators

As stated in the introduction, pseudoinversion-based neural training reduces output weight determination to the solution of linear systems: in this section we recall some general ideas on this issue, which in the next sections will be specialized to deal with SLFN training.

The estimate of $\beta$ through the ordinary least-squares (OLS) technique is a classical tool for solving the problem

$$Y = X\beta + \epsilon, \qquad (1)$$

where $Y$ and $\epsilon$ are column $n$-vectors, $\beta$ is a column $p$-vector and $X$ is an $n \times p$ matrix; $\epsilon$ is random, with expectation value zero and variance $\sigma^2$.

In (Hoerl, 1962) and (Hoerl and Kennard, 1970) the role of the ordinary ridge regression (ORR) estimator $\hat{\beta}(\lambda)$ as an alternative to the OLS estimator in the presence of multicollinearity is deeply analyzed. In statistics, multicollinearity (also collinearity) is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a non-trivial degree of accuracy. In this situation, the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data.

It is known in the literature that there exist estimates of $\beta$ with smaller mean square error (MSE) than the unbiased, or Gauss-Markov, estimate (Golub et al., 1979; Berger, 1976)

$$\hat{\beta}(0) = (X^T X)^{-1} X^T Y. \qquad (2)$$

Allowing for some bias may result in a significant variance reduction: this is known as the bias-variance dilemma (see e.g. (Tibshirani, 1996; Geman et al., 1992)), whose effects on output weight determination will be examined in more depth in section 3.2. Hereafter we focus on the one-parameter family of ridge estimates $\hat{\beta}(\lambda)$ given by

$$\hat{\beta}(\lambda) = (X^T X + n\lambda I)^{-1} X^T Y. \qquad (3)$$

It can be shown that $\hat{\beta}(\lambda)$ is also the solution to the problem of finding the minimum over $\beta$ of

$$\frac{1}{n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_2^2, \qquad (4)$$

which is known as the method of regularization in the approximation theory literature (Golub et al., 1979); building on it, we will develop the theoretical framework for our work in the next sections.

There has always been a substantial amount of interest in estimating a good value of $\lambda$ from the data: in addition to those already cited in this section, a non-exhaustive list of well-known or more recent papers is e.g. (Hoerl and Kennard, 1976; Lawless and Wang, 1976; McDonald and Galarneau, 1975; Nordberg, 1982; Saleh and Kibria, 1993; Kibria, 2003; Khalaf and Shukur, 2005; Mardikyan and Cetin, 2008).

A meaningful review of these formulations is provided in (Dorugade and Kashid, 2010). They first define the matrix $T$ such that $T^T X^T X T = \Lambda$, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \cdots, \lambda_p)$ contains the eigenvalues of the matrix $X^T X$; then they set $Z = XT$ and $\alpha = T^T \beta$, and show that a great number of different methods require the OLS estimates of $\alpha$ and $\sigma$,

$$\hat{\alpha} = (Z^T Z)^{-1} Z^T Y, \qquad (5)$$

$$\hat{\sigma}^2 = \frac{Y^T Y - \hat{\alpha}^T Z^T Y}{n - p - 1}, \qquad (6)$$

to define effective ridge parameter values. It is important to note that specific conditions on the data are often needed to evaluate these estimators.
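For concreteness, eqs. (2)-(3) and (5)-(6) translate into a few lines of NumPy; an illustrative sketch for a one-column $Y$ (function names are ours, not from the cited works):

```python
import numpy as np

def ols_estimate(X, Y):
    # Gauss-Markov / OLS estimate, eq. (2): (X^T X)^{-1} X^T Y
    return np.linalg.solve(X.T @ X, X.T @ Y)

def ridge_estimate(X, Y, lam):
    # Ridge estimate, eq. (3): (X^T X + n*lambda*I)^{-1} X^T Y
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ Y)

def alpha_sigma_estimates(X, Y):
    # OLS estimates of alpha and sigma^2, eqs. (5)-(6), with Z = X T_mat,
    # where the columns of T_mat diagonalize X^T X.  Y is a 1-D vector here.
    n, p = X.shape
    _, T_mat = np.linalg.eigh(X.T @ X)
    Z = X @ T_mat
    alpha = np.linalg.solve(Z.T @ Z, Z.T @ Y)
    sigma2 = (Y @ Y - alpha @ Z.T @ Y) / (n - p - 1)
    return alpha, sigma2
```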
In particular, this applies to the expressions of the ridge parameter proposed by (Kibria, 2003) and (Hoerl and Kennard, 1970), which share the characteristic of being functions of the ratio between $\hat{\sigma}^2$ and a function of $\hat{\alpha}$; they will be used for comparison with our proposed method in section 6.

The alternative technique of generalised cross-validation (GCV), proposed by (Golub et al., 1979), provides a good estimate of $\lambda$ from the data as the minimizer of

$$V(\lambda) = \frac{\frac{1}{n}\|(I - A(\lambda))Y\|_2^2}{\left[\frac{1}{n}\mathrm{Trace}(I - A(\lambda))\right]^2}, \qquad (7)$$

where

$$A(\lambda) = X(X^T X + n\lambda I)^{-1} X^T. \qquad (8)$$

This solution is particularly interesting, since it does not require an estimate of $\sigma^2$: because of this, it will be one term of comparison with our experimental results in section 6.

In the next section we will show how the problem of finding a good solution to (1) applies to the context of pseudoinversion-based neural training, specializing the relevant matrices involved to deal with this issue.

3. Main ideas on regularization and condition number theory

3.1. Generalised inverse matrix for weight setting

We deal with a standard SLFN with $L$ input neurons, $M$ hidden neurons and $Q$ output neurons, non-linear activation functions $\phi$ in the hidden layer and linear activation functions in the output layer.

Considering a dataset of $N$ distinct training samples $(x_j, t_j)$, where $x_j \in \mathbb{R}^L$ and $t_j \in \mathbb{R}^Q$, the learning process for a SLFN aims at producing the matrix of desired outputs $T \in \mathbb{R}^{N \times Q}$ when the matrix of all input instances $X \in \mathbb{R}^{N \times L}$ is presented as input.

As stated in the introduction, in the pseudoinverse approach the matrix of input weights and hidden layer biases is randomly chosen and no longer modified: we name it $C$. After $C$ has been fixed, the hidden layer output matrix $H = \phi(XC)$ is completely determined; we underline that, since $H \in \mathbb{R}^{N \times M}$ is in general rectangular, it is not invertible.

The use of linear output neurons allows us to determine the output weight matrix $W^*$ in terms of the OLS solution to the problem $T = HW + \epsilon$, in analogy with eq. (1). Therefore, from eq. (2) we have

$$W^* = (H^T H)^{-1} H^T T. \qquad (9)$$

According to (Penrose and Todd, 1956; Bishop, 2006),

$$W^* = H^+ T. \qquad (10)$$

$H^+$ is the Moore-Penrose pseudoinverse (or generalized inverse) of matrix $H$, and it minimises the cost functional

$$E_D = \|HW - T\|_2^2. \qquad (11)$$

Singular value decomposition (SVD) is a computationally simple and accurate way to compute the pseudoinverse (see for instance (Golub and Van Loan, 1996)), as follows. Every matrix $H \in \mathbb{R}^{N \times M}$ can be expressed as

$$H = U\Sigma V^T, \qquad (12)$$

where $U \in \mathbb{R}^{N \times N}$ and $V \in \mathbb{R}^{M \times M}$ are orthogonal matrices and $\Sigma \in \mathbb{R}^{N \times M}$ is a rectangular diagonal matrix (i.e. a matrix with $\sigma_{ih} = 0$ if $i \neq h$); its elements $\sigma_{ii} = \sigma_i$, called singular values, are non-negative. A common convention is to list the singular values in descending order, i.e.

$$\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_p > 0, \qquad (13)$$

where $p = \min\{N, M\}$, so that $\Sigma$ is uniquely determined.

The SVD of $H$ is then used to obtain the pseudoinverse matrix $H^+$:

$$H^+ = V\Sigma^+ U^T, \qquad (14)$$

where $\Sigma^+ \in \mathbb{R}^{M \times N}$ is again a rectangular diagonal matrix whose elements $\sigma_i^+$ are obtained by taking the reciprocal of each corresponding element: $\sigma_i^+ = 1/\sigma_i$ (see also (Rao and Mitra, 1971)). From eq. (9) we then have:

$$W^* = V\Sigma^+ U^T T. \qquad (15)$$
Remark. An interesting case occurs when only $k < p$ elements in eq. (13) are non-zero, i.e. $\sigma_{k+1} = \cdots = \sigma_p = 0$; in this case the rank of matrix $H$ is $k$ and $\Sigma^+$ is defined as:

$$\Sigma^+ = \mathrm{diag}(1/\sigma_1, \cdots, 1/\sigma_k, 0, \cdots, 0) \in \mathbb{R}^{M \times N}, \qquad (16)$$

as shown for instance in (Golub and Van Loan, 1996).

This is also often done in practice, for computational reasons, for elements smaller than a predefined threshold, thus actually computing an approximated version of the pseudoinverse matrix $H^+$. This approach is used by default, for example, in pseudoinverse evaluation by means of the Matlab pinv function (http://www.mathworks.com/help/matlab/ref/pinv.html), a tool widely used by many scientists, for example in the ELM context. Each time it is applied blindly, i.e. without having decided at what threshold to zero the small $\sigma_i$, an a priori uncontrolled approximation is introduced in the evaluation of $H^+$ (see the sketch following section 3.2 below).

3.2. Stability and generalization properties of regularization algorithms

A key property for any learning algorithm is stability: the learned mapping has to suffer only small changes in the presence of small perturbations (for instance the deletion of one example in the training set).

Another important property is generalization: the performance on the training examples (empirical error) must be a good indicator of the performance on future examples (expected error), that is, the difference between the two must be small. An algorithm that guarantees good generalization predicts well if its empirical error is small.

Many studies in the literature have dealt with the connection between stability and generalization: the notion of stability has been investigated by several authors, e.g. by Devroye and Wagner (Devroye and Wagner, 1979) and Kearns and Ron (Kearns and Ron, 1999). Poggio et al. in (Mukherjee et al., 2003) introduced a statistical form of leave-one-out stability, named $CVEEE_{loo}$, building on a cross-validation leave-one-out stability endowed with conditions on the stability of both expected and empirical errors; they demonstrated that this condition is necessary and sufficient for generalization and consistency of the class of empirical risk minimization (ERM) learning algorithms, and that it is also a sufficient condition for generalisation for non-ERM algorithms (see also (Poggio et al., 2004)).

To turn an originally unstable, ill-posed problem into a well-posed one, regularization methods of the form (4) are often used (Badeva and Morozov, 1991), and among them Tikhonov regularization is one of the most common (Tikhonov and Arsenin, 1977; Tikhonov, 1963). It minimises the error functional

$$E \equiv E_D + E_R = \|HW - T\|_2^2 + \|\Gamma W\|_2^2, \qquad (17)$$

obtained by adding to the cost functional $E_D$ in eq. (11) a penalty term $E_R$ that depends on a suitably chosen Tikhonov matrix $\Gamma$. This issue has been discussed in its applications to neural networks in (Poggio and Girosi, 1990b), and surveyed in (Girosi et al., 1995; Haykin, 1999).

Besides, Bousquet and Elisseeff (Bousquet and Elisseeff, 2002) proposed the notion of uniform stability to characterize the generalization properties of an algorithm. Their results state that Tikhonov regularization algorithms are uniformly stable and that uniform stability implies good generalization (Mukherjee et al., 2006).
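A minimal sketch of the thresholded pseudoinverse computation of eqs. (12)-(16) follows (ours, in NumPy; the default tolerance is an assumption modeled on the documented behaviour of Matlab's pinv, not taken from this paper):

```python
import numpy as np

def pinv_svd(H, tol=None):
    """Pseudoinverse via SVD with singular-value thresholding, eqs. (12)-(16).

    Singular values not exceeding `tol` are zeroed instead of inverted; this
    is the a-priori uncontrolled approximation discussed in section 3.1.
    Assumes a floating-point H.
    """
    U, s, Vt = np.linalg.svd(H, full_matrices=False)       # H = U Sigma V^T, eq. (12)
    if tol is None:
        tol = max(H.shape) * np.finfo(H.dtype).eps * s[0]  # pinv-like default (assumed)
    s_plus = np.where(s > tol, 1.0 / s, 0.0)               # Sigma+, eqs. (14) and (16)
    return Vt.T @ (s_plus[:, None] * U.T)                  # H+ = V Sigma+ U^T
```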
Regularization thus introduces a penalty function that not only improves stability, making the problem less sensitive to initial conditions, but is also important to contain model complexity, avoiding overfitting. The idea of penalizing by a square function of the weights is also well known in the neural literature as weight decay: a wide range of articles has been devoted to this argument, and more generally to the advantage of regularization for the control of overfitting. Among them we recall (Hastie et al., 2009; Tibshirani, 1996; Bishop, 2006; Girosi et al., 1995; Fu, 1998; Gallinari and Cibas, 1999).

A frequent choice is $\Gamma = \sqrt{\gamma}\, I$, to give preference to solutions with smaller norm (Bishop, 2006), so eq. (17) can be rewritten as

$$E \equiv E_D + E_R = \|HW - T\|_2^2 + \gamma\|W\|_2^2. \qquad (18)$$

We define $\hat{W} = \min_W(E)$ as the regularized solution of (18): it belongs to the family of ridge estimates described by eq. (3) and can be expressed as

$$\hat{W} = (H^T H + \gamma I)^{-1} H^T T \qquad (19)$$

or, as shown in (Fuhry and Reichel, 2012), as

$$\hat{W} = V D U^T T. \qquad (20)$$

$V$ and $U$ come from the singular value decomposition of $H$ (eq. (12)) and $D \in \mathbb{R}^{M \times N}$ is a rectangular diagonal matrix whose elements, built using the singular values $\sigma_i$ of matrix $\Sigma$, are:

$$D_i = \frac{\sigma_i}{\sigma_i^2 + \gamma}. \qquad (21)$$

We remark on the difference between the minima of the regularized and unregularized error functionals. Increasing values of the regularization parameter $\gamma$ induce a larger and larger departure of the former (eq. (19)) from the latter (eq. (9)). Thus, the regularization process increases the bias of the approximating solution and reduces its variance, as discussed regarding the bias-variance dilemma in section 2. A suitable value for the Tikhonov parameter $\gamma$ has therefore to derive from a compromise: it must be sufficiently large to control the approach to zero of $\sigma_i$ in eq. (21), while avoiding an excess of the penalty term in eq. (18). Its tuning is therefore crucial.

3.3. Condition number as a measure of ill-posedness

The condition number of a matrix $A \in \mathbb{R}^{N \times M}$ is the number $\mu(A)$ defined as

$$\mu(A) = \|A\|\,\|A^+\|, \qquad (22)$$

where $\|\cdot\|$ is any matrix norm. If the columns (rows) of $A$ are linearly independent, e.g. in the case of experimental data matrices, then $A^+$ is a left (right) inverse of $A$, i.e. $A^+ A = I_M$ ($A A^+ = I_N$). The Cauchy-Schwarz inequality in this case then provides $\mu(A) \geq 1$; besides, $\mu(A) \equiv \mu(A^+)$. Matrices are said to be ill-conditioned if $\mu(A) \gg 1$.

If the $\|\cdot\|_2$ norm is used, then

$$\mu(A) = \frac{\sigma_1(A)}{\sigma_p(A)}, \qquad (23)$$

where $\sigma_1$ and $\sigma_p$ are the largest and smallest singular values of $A$, respectively. From eq. (23) we can easily understand that large condition numbers $\mu(A)$ suggest the presence of very small singular values (i.e. of almost singular matrices), whose numerical inversion, required to evaluate $\Sigma^+$ and the unregularized solution $W^*$, is a cause of instability.

From numerical linear algebra we also know that if the condition number is large, the problem of finding least-squares solutions to the corresponding system of linear equations is ill-posed, i.e. even a small perturbation in the data can lead to huge perturbations in the entries of the solution (see (Golub and Van Loan, 1996)).
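As a quick diagnostic, eq. (23) is directly computable from the SVD; a minimal sketch (ours, in NumPy; equivalent to np.linalg.cond with the default 2-norm):

```python
import numpy as np

def condition_number(A):
    """2-norm condition number mu(A) = sigma_1 / sigma_p, eq. (23).

    Large values flag near-singular matrices whose pseudoinversion,
    eq. (14), is numerically unstable.
    """
    s = np.linalg.svd(A, compute_uv=False)   # singular values, descending order
    return s[0] / s[-1]
```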
According to (Mukherjee et al., 2006), the stability of Tikhonov regularization algorithms can also be characterized using the classical notion of condition number: our proposed regularization method fits within this context. We will see that it specifically aims at analytically determining the value of the $\gamma$ parameter that minimizes the conditioning of the regularized hidden layer output matrix, so that the solution $\hat{W}$ is stable in the sense of eq. (2.9) of (Mukherjee et al., 2006).

In the next section, we will derive the optimal value of the regularization parameter $\gamma$ according to this stability criterion (minimum condition number). The experimental results presented in sections 5 and 6 will evidence that our quest for stable solutions also allows us to achieve good generalization and predictivity. A comparison will be made for this purpose with the performance obtained when $\gamma$ is determined via the standard cross-validation approach, aimed at overfitting control and generalization performance optimization.

4. Conditioning of the regularized matricial framework

For convenient implementation of our diagnostics, and building on eq. (20), we propose an original matricial framework in which to develop our study tool, with the following definition.

Definition 1. We define the matrix

$$H_{reg} \equiv V D U^T \qquad (24)$$

as the regularized hidden layer output matrix of the neural network.

This allows us to rewrite eq. (20) as

$$\hat{W} = H_{reg}\, T, \qquad (25)$$

by similarity with eq. (9). By construction, $H_{reg}$ is decomposed into three matrices according to the SVD framework, and its singular values are provided by eq. (21) as a function of the singular values $\sigma_i$ of $H$.

This new regularized matricial framework makes it easier to compare the properties of $H_{reg}$ with those of the corresponding unregularized matrix $H^+$. In fact, when unregularized pseudoinversion is used, nothing prevents the occurrence of very small singular values that make the evaluation of $H^+$ numerically unstable (see eq. (14)). On the contrary, even in the presence of very small values $\sigma_i$ of the original unregularized problem, a careful choice of the parameter $\gamma$ allows us to tune the singular values $D_i$ of the regularized matrix $H_{reg}$, preventing numerical instability.

4.1. Condition number definition

According to eq. (23), we define the condition number of $H_{reg}$ as:

$$\mu(H_{reg}) = \frac{D_{max}}{D_{min}}, \qquad (26)$$

where $D_{max}$ and $D_{min}$ are the largest and smallest singular values of $H_{reg}$.

The shape of the functional relation $\sigma/(\sigma^2 + \gamma)$ that links regularized and unregularized singular values, defined through eq. (21), is shown in Fig. 1 for three different values of $\gamma$. The curves are non-negative, because $\sigma > 0$ and $\gamma > 0$, and have only one maximum, with coordinates $(\sqrt{\gamma},\, \frac{1}{2\sqrt{\gamma}})$. A few pairs of corresponding values $(D_i, \sigma_i)$ are marked by dots on each curve.

For the determination of $\mu(H_{reg})$ we are interested in evaluating $D_{max}$ and $D_{min}$ of $H_{reg}$ over the finite, discrete range $[\sigma_1, \sigma_2, \ldots, \sigma_p]$. The value $D_{max}$ is reached in correspondence to a given singular value of $H$, a priori not known, that we label $\sigma_{max}$, so that:

$$D_{max} = \frac{\sigma_{max}}{\sigma_{max}^2 + \gamma}. \qquad (27)$$

[Figure 1: Example of the regularized/unregularized singular value relationship via eq. (21); the horizontal axis is $\sigma$, the vertical axis is $\sigma/(\sigma^2+\gamma)$.]

The variation of $\gamma$ has the effect of changing the curve and shifting its maximum point within the interval $[\sigma_1, \sigma_p]$.
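Within this framework, eq. (26) is immediate to evaluate from the singular values of $H$; a minimal sketch under the same illustrative conventions as above:

```python
import numpy as np

def mu_h_reg(s, gamma):
    """Condition number of H_reg, eq. (26), from the singular values s of H.

    Each sigma_i is mapped to D_i = sigma_i / (sigma_i**2 + gamma) by the
    filter of eq. (21); mu(H_reg) is the ratio of the extreme D_i values.
    """
    D = s / (s**2 + gamma)
    return D.max() / D.min()
```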
Therefore, $\sigma_{max}$ can coincide with any singular value of $H$ from eq. (13), including the extreme ones. Conversely, we now demonstrate that $D_{min}$ can only be reached in correspondence to $\sigma_1$ or $\sigma_p$ (or both, when coincident).

Theorem 3.1. The minimum singular value $D_{min}$ of matrix $H_{reg}$ can only be reached in correspondence to the largest singular value $\sigma_1$ or to the smallest singular value $\sigma_p$ of the unregularized matrix $H$ (or both).

Proof. Without loss of generality, we can express $\gamma$ as a function of $\sigma_1 \sigma_p$, i.e. $\gamma = \beta\, \sigma_1 \sigma_p$, where $\beta$ is a real positive value. By replacement in eq. (21), we get

$$D_1 = \frac{1}{\sigma_1 + \beta\sigma_p}, \qquad D_p = \frac{1}{\sigma_p + \beta\sigma_1}.$$

To establish their ordering, we evaluate the difference $\Delta$ of their inverses:

$$\Delta = \frac{1}{D_1} - \frac{1}{D_p} = (\sigma_1 + \beta\sigma_p) - (\sigma_p + \beta\sigma_1) = (1 - \beta)(\sigma_1 - \sigma_p).$$

Recalling that $\sigma_1 - \sigma_p > 0$, we can distinguish three cases:

Case 1, $\beta > 1$ ($\gamma > \sigma_1\sigma_p$) $\rightarrow \Delta < 0 \rightarrow D_1 > D_p$. Because of the shape of the $D_i$ distribution, $D_p$ is also the minimum among all values $D_i$, so that $D_{min} \equiv D_p$.

Case 2, $\beta < 1$ ($\gamma < \sigma_1\sigma_p$) $\rightarrow \Delta > 0 \rightarrow D_1 < D_p$. Then $D_1$ is also the minimum among all values $D_i$, so that $D_{min} \equiv D_1$.

Case 3, $\beta = 1$ ($\gamma = \sigma_1\sigma_p$) $\rightarrow \Delta = 0 \rightarrow D_1 = D_p$. Thus $D_1$ and $D_p$ are both minima, so that $D_{min} \equiv D_1 = D_p$.

4.2. Condition number evaluation

The result of Theorem 3.1 allows us to find, according to eq. (26), the following expressions for $\mu(H_{reg})$:

Case 1, $\beta > 1$: $\mu(H_{reg}) = \dfrac{D_{max}}{D_p} = \dfrac{\sigma_{max}(\sigma_p + \beta\sigma_1)}{\sigma_{max}^2 + \beta\sigma_1\sigma_p}$

Case 2, $\beta < 1$: $\mu(H_{reg}) = \dfrac{D_{max}}{D_1} = \dfrac{\sigma_{max}(\sigma_1 + \beta\sigma_p)}{\sigma_{max}^2 + \beta\sigma_1\sigma_p}$

Case 3, $\beta = 1$: $\mu(H_{reg}) = \dfrac{D_{max}}{D_p} = \dfrac{D_{max}}{D_1} = \dfrac{\sigma_{max}(\sigma_p + \sigma_1)}{\sigma_{max}^2 + \sigma_1\sigma_p}$

Bearing in mind that well-conditioned problems are characterized by small condition numbers, we now look for the values of the parameter $\beta$ which, in the three cases above, make the regularized condition number smaller. In Case 1, $\mu(H_{reg})$ is an increasing function of $\beta$, so that in its domain, i.e. $(1, \infty)$, its minimum value is reached when $\beta \to 1^+$. On the contrary, in Case 2, $\mu(H_{reg})$ is a decreasing function of $\beta$, so that in its domain, i.e. $(0, 1)$, the minimum is reached when $\beta \to 1^-$. Fig. 2 shows the behaviour of the function over the whole domain.

[Figure 2: Regularized condition number $\mu(H_{reg})$ vs. $\beta$.]

Both cases have a common limit:

$$\lim_{\beta \to 1^+} \mu(H_{reg}) = \lim_{\beta \to 1^-} \mu(H_{reg}) = \frac{\sigma_{max}(\sigma_p + \sigma_1)}{\sigma_{max}^2 + \sigma_1\sigma_p}. \qquad (28)$$

This value is exactly that provided by Case 3, which can therefore be considered the best possible choice to minimize the condition number. Thus our quest for the best possible conditioning of the matrix $H_{reg}$ identifies an explicit optimal value for the regularization parameter $\gamma$:

$$\gamma = \sigma_1 \sigma_p. \qquad (29)$$

5. Simulation and Discussion

For the numerical experimentation, we use eight benchmark datasets from the UCI repository (Bache and Lichman, 2013), listed in Table 1. All simulations are carried out in the Matlab 7.3 environment.

The performance is assessed by statistics over a set of 50 different extractions of input weights, computing either the average RMSE (for regression tasks) or the average percentage misclassification rate (for classification tasks) on the test set. Either quantity is labeled "Err" in the tables summarising our results. The error standard deviation (labeled "Std") is also computed to evidence the dispersion of the experimental results.
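Before presenting the datasets, we note that the whole OCReP determination of the output weights, as tested in the following experiments, reduces to a few lines; a minimal NumPy sketch (ours, assuming a full-rank $H$ so that $\sigma_p > 0$):

```python
import numpy as np

def ocrep_output_weights(H, T):
    """Output weights with the optimally conditioned gamma of eq. (29).

    gamma = sigma_1 * sigma_p equalizes D_1 and D_p (Case 3 of Theorem 3.1),
    minimizing mu(H_reg); the weights then follow eqs. (20), (21) and (25).
    """
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    gamma = s[0] * s[-1]                        # optimal parameter, eq. (29)
    D = s / (s**2 + gamma)                      # regularized singular values, eq. (21)
    W_hat = Vt.T @ (D[:, None] * (U.T @ T))     # W_hat = H_reg T, eq. (25)
    return W_hat, gamma
```

A single SVD thus yields both the diagnostic and the solution, which is the source of the computational advantage discussed in section 5.3.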
Table 1: The UCI datasets and their characteristics

Dataset         Type            N. Instances   N. Attributes   N. Classes
Abalone         Regression      4177           8               -
Machine Cpu     Regression      209            6               -
Delta Ailerons  Regression      7129           5               -
Housing         Regression      506            13              -
Iris            Classification  150            4               3
Diabetes        Classification  768            8               2
Wine            Classification  178            13              3
Segment         Classification  2310           19              7

Our regularization strategy, labeled Optimally Conditioned Regularization for Pseudoinversion (OCReP), is verified by simulation against the common approach in which cross-validation is used i) to determine the regularization parameter $\gamma$ at a fixed, high number of hidden neurons, and ii) to also optimize the number of hidden neurons, respectively in sections 5.1 and 5.2. A discussion of the effectiveness of OCReP in terms of minimization of the condition number of the matrices involved is given in section 5.3.

5.1. OCReP performance assessment: fixed number of hidden units

In this section we compare OCReP with a regularization approach in which $\gamma$ is selected by a cross-validation scheme, which is typically used for the control of under/overfitting and the optimization of the model's generalization performance. A 70%/30% split between training and test set is applied; then a three-fold cross-validation search on the training set identifies the best $\gamma$ by best performance on the validation set, over the set of 50 values of $\gamma$ in $[10^{-25}, 10^{-24}, \cdots, 10^{25}]$.

For the sake of comparison, a fixed, high number of hidden units $M$ is used, selected according to the dimension and complexity of the datasets. For the three datasets Machine Cpu, Iris and Wine, the simulation is performed with 50 and 100 hidden neurons; for Abalone, Delta Ailerons, Housing and Diabetes we use 50, 100, 200 and 300 neurons; for Segment we use 1000 and 1500 units.

Figures 3 and 4 (respectively for regression and classification datasets) show the average test errors as a function of the sampled values of $\gamma$ (red dots); the standard deviation is shown as an error bar. Our proposed optimal $\gamma$ is evidenced as a blue circle, whereas the value of $\gamma$ selected by cross-validation is shown as a black square. The results refer in each case to the highest number of neurons experimented with. The horizontal axis has been zoomed in onto the region of interest, i.e. $[10^{-10}, 10^5]$.

[Figure 3: Test error trends for the regression datasets (Abalone, Delta Ailerons, Machine Cpu, Housing) as a function of the values of $\gamma$ over the selected cross-validation range (red dots): the cross-validation-selected $\gamma$ is the black square; the proposed $\gamma$ from OCReP is the blue circle.]

It may be noted that the performances of OCReP and cross-validation are comparable, and also close to the experimental minimum. This may be interpreted as good predictivity for both algorithms. Also, we remark that the error bars, i.e. the dispersion of the experimental results, are large for small values of $\gamma$, consistent with expectations of ineffective regularization.
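For reference, the cross-validation baseline of this section can be condensed into a short sketch (ours, not the authors' code: NumPy instead of Matlab, contiguous unshuffled folds, and a mean-squared validation error for both task types):

```python
import numpy as np

def select_gamma_cv(H, T, n_folds=3):
    """Grid-search the regularization parameter by k-fold cross-validation.

    Sketch of the section 5.1 baseline: for each candidate gamma, the
    regularized weights are computed on the training folds (eqs. (20)-(21))
    and the gamma with the lowest average validation error is retained.
    """
    gammas = 10.0 ** np.arange(-25.0, 26.0)      # candidate grid of section 5.1
    folds = np.array_split(np.arange(H.shape[0]), n_folds)
    best_gamma, best_err = None, np.inf
    for gamma in gammas:
        errs = []
        for k in range(n_folds):
            val = folds[k]
            trn = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            U, s, Vt = np.linalg.svd(H[trn], full_matrices=False)
            D = s / (s**2 + gamma)
            W = Vt.T @ (D[:, None] * (U.T @ T[trn]))
            errs.append(np.mean((H[val] @ W - T[val]) ** 2))
        if np.mean(errs) < best_err:
            best_err, best_gamma = np.mean(errs), gamma
    return best_gamma
```

Note that each candidate $\gamma$ requires its own decomposition per fold: this is exactly the $N_\gamma$-fold computational overhead, discussed in section 5.3, that the analytic choice of eq. (29) avoids.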
[Figure 4: Test error trends for the classification datasets (Iris, Wine, Diabetes, Segment) as a function of the values of $\gamma$ over the selected cross-validation range (red dots): the cross-validation-selected $\gamma$ is the black square; the proposed $\gamma$ from OCReP is the blue circle.]

The numerical results are reported in Tables 2, 3 and 4, according to the grouping based on the dimension and complexity of the datasets. For each dataset and selected number of hidden neurons $M$, the best test error is evidenced in bold whenever the difference is statistically significant (the Student's t-test has been used for assessing statistical significance, through the determination of confidence intervals at the 99% confidence level).

Table 2: Comparison of OCReP vs. cross-validation at a fixed number of hidden neurons for the small size datasets

M     Method                Iris    Wine    Machine Cpu
50    OCReP       Err.      1.51    2.98    31.21
                  Std       1.13    1.75    1.1
      cross-val.  Err.      2.13    3.37    31.1
                  Std       0.77    2.27    1.02
100   OCReP       Err.      2.53    1.39    34.13
                  Std       0.77    1.19    1.68
      cross-val.  Err.      2.17    1.88    30.94
                  Std       0.31    1.88    0.69

Table 3: Comparison of OCReP vs. cross-validation at a fixed number of hidden neurons for the large size dataset

M      Method                Segment
1000   OCReP       Err.      2.53
                   Std       0.77
       cross-val.  Err.      2.17
                   Std       0.31
1500   OCReP       Err.      4.41
                   Std       0.45
       cross-val.  Err.      3.97
                   Std       0.35

Table 4: Comparison of OCReP vs. cross-validation at a fixed number of hidden neurons for the medium size datasets. For Delta Ailerons, average errors and standard deviations have to be multiplied by $10^{-4}$.

M     Method                Abalone   Delta Ailerons   Housing   Diabetes
50    OCReP       Err.      2.22      1.64             5.54      26.01
                  Std       0.16      0.0051           0.12      0.604
      cross-val.  Err.      2.13      1.59             4.79      26.79
                  Std       0.017     0.0073           0.37      0.814
100   OCReP       Err.      2.15      1.62             5.17      25.66
                  Std       0.007     0.004            0.08      0.608
      cross-val.  Err.      2.11      1.58             4.49      25.71
                  Std       0.006     0.0036           0.28      0.608
200   OCReP       Err.      2.12      1.59             4.62      25.13
                  Std       0.003     0.0031           0.09      0.445
      cross-val.  Err.      2.11      1.61             4.30      25.79
                  Std       0.003     0.0096           0.27      0.443
300   OCReP       Err.      2.113     1.58             4.24      24.26
                  Std       0.03      0.0018           0.13      0.689
      cross-val.  Err.      2.114     1.60             4.18      25.66
                  Std       0.003     0.0042           0.23      0.456

Thus, for example, on Iris the best performance is achieved with 50 neurons by OCReP, and with 100 neurons by cross-validation. In some cases, e.g. Wine (50 neurons), there is no clear winner on statistical grounds, i.e. the best results are comparable within the errors.

From the above results it appears that, at a fixed number of hidden neurons, cross-validation has the better test error performance on a slightly higher number of datasets. However, it is important to evidence that the use of OCReP saves the hundreds of pseudoinversion steps required by cross-validation, which is a crucial issue for practical implementation.

5.2. OCReP performance assessment: variable number of hidden units

In order to pursue the double aim of performance and hidden unit optimization, a first interesting step is to look at the variation, as a function of the hidden layer dimension, of the error trends of unregularized models (i.e. models whose output weights are evaluated according to eq. (10)).
A setting widely used among researchers employing such techniques (see e.g. Helmy and Rasheed (2009); Huang et al. (2006)) is to use input weights distributed according to a random uniform distribution in the interval $(-1, 1)$, and sigmoidal activation functions for the hidden neurons: hereafter we name this framework Sigm-unreg.

[Figure 5: Test error trends for the regression datasets (Abalone, Delta Ailerons, Machine Cpu, Housing) as a function of the number of hidden neurons: OCReP vs. unregularized pseudoinversion (Sigm-unreg).]

Figures 5 and 6 show, respectively for regression and classification datasets, the average test error values (over 50 different input weight selections) for both OCReP (blue line) and Sigm-unreg (red line) as a function of the number of hidden nodes, which is gradually increased in unit steps. In all cases, after an initial decrease the Sigm-unreg test error increases significantly. On the contrary, the OCReP test error curves keep decreasing, albeit at a slower and slower rate, thus also showing a good capability of the method for overfitting control.

We now aim at comparing the results obtained when the trade-off value of $\gamma$ is searched for by cross-validation with the two different frameworks discussed so far, i.e. OCReP and Sigm-unreg. A 70%/30% split between training and test set is applied; we then perform a three-fold cross-validation for the selection of the number of hidden neurons $\bar{M}$ at which the minimum error is recorded, in all cases. Test errors are again evaluated as the average over 50 different random choices of input weights.

The numerical results of the simulation are presented in Tables 5 and 6, respectively for regression and classification tasks, with their standard deviations (Std) and $\bar{M}$. The best test errors are evidenced in bold whenever the difference between OCReP and cross-validation is statistically significant.

We see that our proposed regularization technique provides, for the regression datasets, performance comparable with the cross-validation option, and always a better performance (with statistical significance at the 99% level) with respect to the unregularized case. For the classification datasets, in three cases out of four OCReP provides a better performance with respect to cross-validation, and always a better performance with respect to the Sigm-unreg case. In all such cases, the statistical significance is at the 99% level. Also, in almost all cases smaller standard deviations are associated with the OCReP method, suggesting a lower sensitivity to the initial input weight conditions.

Table 5: Hidden layer optimization for the regression tasks. For Delta Ailerons, average errors and standard deviations have to be multiplied by $10^{-4}$.

Method                        Abalone   Housing   Delta Ailerons   Machine Cpu
OCReP             Err.        2.12      4.25      1.58             31.22
                  Std.        0.32      0.13      0.0048           0.78
                  $\bar{M}$   178       255       298              63
Cross-validation  Err.        2.11      4.19      1.58             31.51
                  Std.        0.0097    0.25      0.0036           1.25
                  $\bar{M}$   110       250       93               70
Sigm-unreg        Err.        2.14      4.73      1.62             34.44
                  Std.        0.014     0.20      0.57             2.89
                  $\bar{M}$   31        76        74               15

Table 6: Hidden layer optimization for the classification tasks.

Method                        Iris    Wine    Diabetes   Segment
OCReP             Err.        1.6     1.73    25.53      2.50
                  Std.        1.10    1.25    0.51       0.32
                  $\bar{M}$   67      91      291        760
Cross-validation  Err.        2.12    2.10    25.2       2.65
                  Std.        1.26    2.27    1.29       0.38
                  $\bar{M}$   14      137     25         620
Sigm-unreg        Err.        2.31    3.20    25.92      4.45
                  Std.        1.48    2.09    1.12       0.47
                  $\bar{M}$   67      91      291        760

5.3. Additional considerations

The proposed method OCReP presents, in our opinion, two features of interest: on one side its computational efficiency, and on the other side its optimal conditioning.
Our goal of optimal analytic determination of the regularization parameter $\gamma$ results in a dramatic improvement in computing requirements with respect to experimental tuning by search over a pre-defined large grid of $N_\gamma$ tentative values. In the latter case, for each choice of $\gamma$ over the selected range, at least one pseudoinversion is required for every output weight determination, thus increasing the computational load by a factor $N_\gamma$.

[Figure 6: Test error trends for the classification datasets (Iris, Wine, Diabetes, Segment) as a function of the number of hidden neurons: OCReP vs. unregularized pseudoinversion (Sigm-unreg).]

Besides, our method is designed explicitly for optimal conditioning. In our simulations, we verify that this goal is fulfilled by evaluating the average condition numbers of the hidden layer output matrices. The statistics are computed over 50 different configurations of input weights and a fixed number of hidden units, namely the largest used in section 5.1 for each dataset. The results are summarised in Tables 7 and 8, respectively for regression and classification datasets. On the first row of each table, we list the ratio of the average condition numbers of the matrices $H_{reg}$ and $H^+$, associated respectively with OCReP and Sigm-unreg, i.e. the regularized and unregularized approaches. On the second row, we list the ratio of the average condition numbers of the matrices $H_{reg}$ and $H_{CV}$, thus comparing our regularization approach with the more conventional one, the latter using cross-validation.

Table 7: Condition number comparison for the regression datasets

                                 Abalone   Housing   Delta Ailerons   Machine Cpu
$\mu(H_{reg})/\mu(H^+)$          0.0002    0.0008    0.00007          0.0001
$\mu(H_{reg})/\mu(H_{CV})$       0.8       0.3       0.3              0.1

Table 8: Condition number comparison for the classification datasets

                                 Iris      Wine    Diabetes   Segment
$\mu(H_{reg})/\mu(H^+)$          0.00002   0.005   0.0007     0.000005
$\mu(H_{reg})/\mu(H_{CV})$       0.2       0.4     0.1        0.2

Not surprisingly, our regularization method provides a significant improvement in conditioning with respect to the unregularized approach, as evidenced by ratio values much smaller than unity. Besides, OCReP also provides better conditioned matrices than those derived by selecting $\gamma$ through cross-validation, since the corresponding condition numbers are systematically smaller in the former case, sometimes by up to an order of magnitude.

6. Comparison with other approaches

Since the literature provides a host of different recipes for either the choice of the regularization parameter or the actual regularization algorithm, hereafter we focus on a few specific frameworks.

6.1. Other choices of regularization parameter

Among the approaches mentioned in section 2, we primarily select the technique of generalised cross-validation (GCV) from (Golub et al., 1979),
described by eqs. (7) and (8), for comparison with our method. The main motivation for this choice is its independence from the estimate of the error variance $\sigma^2$, a characteristic shared with our case. For each dataset, we select the same fixed numbers of hidden units as in section 5.1; then, for each case, eq. (7) is minimized over the set of 50 values of $\gamma$ in $[10^{-25}, 10^{-24}, \cdots, 10^{25}]$ and for 50 different configurations of input weights.

We evaluate the mean and standard deviation of the corresponding regularized test error, reported in Tables 9, 10 and 11. We recall that the tabulated error "Err" is either the average RMSE for regression tasks or the average misclassification rate for classification tasks; "Std" is the corresponding standard deviation. The performance comparison is based on statistical significance at the 99% level. Whenever GCV provides test error values statistically better than OCReP (listed in Tables 2, 3 and 4), they are marked in bold.

Table 9: GCV results at a fixed number of hidden neurons for the small datasets

M            Iris    Wine    Machine Cpu
50    Err.   2.47    3.66    33.03
      Std    1.06    2.42    1.27
100   Err.   3.06    3.77    36.06
      Std    1.08    2.44    1.13

Table 10: GCV results at a fixed number of hidden neurons for the large size dataset

M             Segment
1000   Err.   11.39
       Std    0.75
1500   Err.   14.72
       Std    0.803

Table 11: GCV results at a fixed number of hidden neurons for the medium size datasets. For Delta Ailerons, average errors and standard deviations have to be multiplied by $10^{-4}$.

M            Abalone   Housing   Delta Ailerons   Diabetes
50    Err.   2.13      4.89      1.60             25.2
      Std    0.017     0.45      0.0103           1.22
100   Err.   2.15      5.05      1.63             26.66
      Std    0.021     0.70      0.0297           1.39
200   Err.   2.32      6.78      1.74             27.73
      Std    0.10      2.35      0.0892           1.27
300   Err.   2.98      8.07      2.20             27.14
      Std    0.42      2.89      0.4054           1.15

We remark that in all cases listed in Tables 2 and 3, OCReP provides statistically better results than GCV. The situation for the medium size datasets evidences a somewhat mixed behaviour: with 50 hidden neurons GCV wins; with 100 neurons, for three out of four datasets (i.e. Abalone, Housing and Diabetes) the performance is statistically comparable. In all other cases of Table 4, OCReP again provides statistically better results than GCV.

We make two further comparisons, using the ridge estimates described in eq. (13) and eq. (9) of (Dorugade and Kashid, 2010), proposed respectively by (Kibria, 2003) and (Hoerl and Kennard, 1970):

$$\gamma_K = \frac{1}{p}\sum_{i=1}^{p}\frac{\hat{\sigma}^2}{\hat{\alpha}_i^2}, \qquad (30)$$

$$\gamma_{HK} = \frac{\hat{\sigma}^2}{\hat{\alpha}_{max}^2}. \qquad (31)$$
Our experimentation is carried out only for the regression datasets because the theoretical background of (Dorugade and Kashid, 2010), and of most of the other works referred to in section 2, directly applies to the case in which the quantity $Y$ in eq. (1) is a one-column matrix. In our formulation $Y$ is the desired target $T$, which is a one-column matrix only for regression tasks.

For each dataset we applied both methods, described by eqs. (30) and (31); we select the same fixed numbers of hidden units as in section 5.1 and perform 50 experiments with different configurations of input weights. Each pseudoinversion step is regularized, for each method, with the corresponding value of $\gamma$. We evaluate the mean and standard deviation of the regularized test errors, reported respectively in Tables 12 and 13. Whenever these methods provide test error values statistically better than OCReP (listed in Tables 2 and 4), they are marked in bold.

Table 12: Kibria estimate of the ridge parameter: results at a fixed number of hidden neurons for the regression datasets. For Delta Ailerons, average errors and standard deviations have to be multiplied by $10^{-4}$.

M            Abalone   Housing   Delta Ailerons   Machine Cpu
50    Err.   2.32      5.72      1.63             34.28
      Std    0.37      0.84      0.027            4.67
100   Err.   2.38      5.45      1.64             32.40
      Std    0.90      0.86      0.08             3.72
200   Err.   2.20      5.31      1.65             -
      Std    0.13      0.76      0.15             -
300   Err.   2.34      5.46      1.62             -
      Std    1.01      1.60      0.035            -

Table 13: Hoerl-Kennard estimate of the ridge parameter: results at a fixed number of hidden neurons for the regression datasets. For Delta Ailerons, average errors and standard deviations have to be multiplied by $10^{-4}$.

M            Abalone   Housing   Delta Ailerons   Machine Cpu
50    Err.   2.13      4.87      1.60             34.28
      Std    0.016     0.44      0.01             2.37
100   Err.   2.14      4.98      1.62             37.39
      Std    0.90      0.67      0.029            3.18
200   Err.   2.33      8.101     1.73             -
      Std    0.10      2.83      0.08             -
300   Err.   2.95      29.06     2.21             -
      Std    0.41      9.26      0.41             -

We remark that the method by Kibria obtains a better performance in two cases out of sixteen, while OCReP does in twelve cases out of sixteen. Besides, the method by Hoerl and Kennard obtains a better performance in three cases out of sixteen, while OCReP does in eight cases out of sixteen. For both methods, the better performance is achieved only in the case of M = 50 neurons. It may be noted that, with respect to processing requirements, OCReP has clear advantages, since it requires only one SVD step for each determination of $\gamma$, while the above two methods require a full spectral decomposition and an additional matrix inversion.

6.2. Alternative regularization methods

A first comparison can be made with the work by Huang et al. (Huang et al., 2012), whose Extreme Learning Machine (ELM) technique uses a cost parameter $C$ that can be considered as related to the inverse of our regularization parameter $\gamma$. As the authors state, in order to achieve good generalization performance, $C$ needs to be chosen appropriately. They do this by trying 50 different values of this parameter: $[2^{-24}, 2^{-23}, \cdots, 2^{24}, 2^{25}]$. A fair comparison can be made on our classification datasets, using their number of hidden neurons, i.e. 1000. Our optimal choice of $\gamma$ yields a better performance on all datasets (with statistical significance assessed at the same confidence level as in the previous experiments).

Table 14: Comparison between OCReP and ELM

Method             Iris    Wine    Diabetes   Segment
OCReP    Err.      2.22    1.28    21.06      3.40
         Std.      0.21    0.88    0.65       0.25
ELM      Err.      2.4     1.53    22.05      3.93
         Std       2.29    1.81    2.18       0.69

Deng et al. (Deng et al., 2009) propose a Regularized Extreme Learning Machine (hereafter, RELM) in which the regularization parameter is selected according to a similar criterion among 100 values: $[2^{-50}, 2^{-49}, \cdots, 2^{50}]$. Because their performance is optimized with respect to the number of hidden
neurons, for the sake of comparison we use the OCReP values from Table 6. We obtain a statistically significant better performance on the dataset Segment, while for Diabetes the RELM method performs better (see Table 15).

Table 15: Comparison between OCReP and RELM

Method                  Diabetes   Segment
OCReP   Err.            25.53      2.50
        Std.            0.51       0.32
        $\bar{M}$       291        760
RELM    Err.            21.81      4.49
        Std.            2.55       0.0074
        $\bar{M}$       15         200

Comparing our results on the common regression datasets with the alternative method TROP-ELM proposed by Miche et al. (Miche et al., 2011), we note that OCReP always achieves lower RMSE values (with statistical significance), as can be seen from Table 16. (In that work, performance and the related statistics are expressed in terms of MSE; we derived the corresponding RMSE for comparison with our results.) Besides, in our opinion our method is simpler, in the sense that it uses a single step of regularization rather than two.

Table 16: Comparison between OCReP and TROP-ELM. For Delta Ailerons, average errors and standard deviations have to be multiplied by $10^{-4}$.

Method                     Abalone   Delta Ailerons   Machine Cpu   Housing
OCReP      Err.            2.12      1.58             31.22         4.25
           Std.            0.32      0.0048           0.78          0.13
           $\bar{M}$       178       298              63            255
TROP-ELM   Err.            2.19      1.64             264.03        34.35
           $\bar{M}$       42        80               28            59

In (Martinez-Martinez et al., 2011), an algorithm is proposed for pruning ELM networks by using regularized regression methods: the crucial step of regularization parameter determination is solved by creating $K$ different models, each one based on a different value of this parameter, among which the best one is selected using a Bayesian information criterion. The authors state that a typical value for $K$ is 100; thus a heavy computational load is required, and the method is focused on regression tasks.

7. Conclusions

In the context of regularization techniques for single hidden layer neural networks trained by pseudoinversion, we provide an optimal value of the regularization parameter $\gamma$ by analytic derivation. This is achieved by defining a convenient regularized matricial formulation in the framework of Singular Value Decomposition, in which the regularization parameter is derived under the constraint of condition number minimization. The OCReP method has been tested on UCI datasets for both regression and classification tasks. In all cases, regularization implemented using the analytically derived $\gamma$ proves to be very effective in terms of predictivity, as evidenced by comparison with implementations of other approaches from the literature, including cross-validation. OCReP avoids the hundreds of pseudoinversions usually needed by most other methods, i.e. it is quite computationally attractive.
Acknowledgements

The activity has been partially carried out in the context of the Visiting Professor Program of the Gruppo Nazionale per il Calcolo Scientifico (GNCS) of the Italian Istituto Nazionale di Alta Matematica (INdAM). This work has been partially supported by ASI contracts (Gaia Mission - The Italian Participation to DPAC) I/058/10/0-1 and 2014-025-R.0.

References

Ajorloo, H., Manzuri-Shalmani, M. T., and Lakdashti, A. (2007). Restoration of damaged slices in images using matrix pseudoinversion. In Proceedings of the 22nd International Symposium on Computer and Information Sciences, pages 1-6.

Bache, K. and Lichman, M. (2013). UCI machine learning repository.

Badeva, V. and Morozov, V. (1991). Problèmes incorrectement posés: Théorie et applications en identification, filtrage optimal, contrôle optimal, analyse et synthèse de systèmes, reconnaissance d'images. Série Automatique. Masson.

Berger, J. (1976). Minimax estimation of a multivariate normal mean under arbitrary quadratic loss. J. Multivariate Analysis, 6, 256-264.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.

Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. J. Mach. Learn. Res., 2, 499-526.

Cancelliere, R. (2001). A high parallel procedure to initialize the output weights of a radial basis function or BP neural network. In Proceedings of the 5th International Workshop on Applied Parallel Computing, New Paradigms for HPC in Industry and Academia, PARA '00, pages 384-390, London, UK. Springer-Verlag.

Deng, W., Zheng, Q., and Chen, L. (2009). Regularised extreme learning machine. In Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining.

Devroye, L. P. and Wagner, T. (1979). Distribution-free performance bounds for potential function rules. Information Theory, IEEE Transactions on, 25(5), 601-604.

Dorugade, A. and Kashid, D. (2010). Alternative method for choosing ridge parameter for regression. Applied Mathematical Sciences, 4, 447-456.

Fu, W. (1998). Penalized regressions: the bridge vs. the lasso. Journal of Computational and Graphical Statistics, 7, 397-416.

Fuhry, M. and Reichel, L. (2012). A new Tikhonov regularization method. Numerical Algorithms, 59(3), 433-445.

Gallinari, P. and Cibas, T. (1999). Practical complexity control in multilayer perceptrons. Signal Processing, 74, 29-46.

Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1-58.

Girosi, F., Jones, M., and Poggio, T. (1995). Regularization theory and neural networks architectures. Neural Computation, 7, 219-269.

Golub, G., Heath, M., and Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21, 215-223.

Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA.

Golub, G. H., Hansen, P. C., and O'Leary, D. P. (1999). Tikhonov regularization and total least squares. SIAM J. Matrix Anal. Appl., 21(1), 185-194.

Hastie, T., Tibshirani, R., and Friedman, J. (2009).
The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.

Haykin, S. (1999). Neural Networks: A Comprehensive Foundation. International edition. Prentice Hall.

Helmy, T. and Rasheed, Z. (2009). Multi-category bioinformatics dataset classification using extreme learning machine. In Proceedings of the Eleventh Conference on Congress on Evolutionary Computation, CEC'09, pages 3234-3240, Piscataway, NJ, USA. IEEE Press.

Hoerl, A. (1962). Application of ridge analysis to regression problems. Chemical Engineering Progress, 58, 54-59.

Hoerl, A. and Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55-67.

Hoerl, A. and Kennard, R. (1976). Ridge regression: iterative estimation of the biasing parameter. Communications in Statistics, A5, 77-88.

Huang, G.-B., Zhu, Q.-Y., and Siew, C.-K. (2006). Extreme learning machine: theory and applications. Neurocomputing, 70(1), 489-501.

Huang, G.-B., Zhu, Q.-Y., and Siew, C.-K. (2012). Extreme learning machine for regression and multiclass classification. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, 42(2), 513-529.

Kearns, M. and Ron, D. (1999). Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6), 1427-1453.

Khalaf, G. and Shukur, G. (2005). Choosing ridge parameter for regression problem. Communications in Statistics - Theory and Methods, 34, 1177-1182.

Kibria, B. (2003). Performance of some new ridge regression estimators. Communications in Statistics - Simulation and Computation, 32, 419-435.

Kohno, K., Kawamoto, M., and Inouye, Y. (2010). A matrix pseudoinversion lemma and its application to block-based adaptive blind deconvolution for MIMO systems. Trans. Cir. Sys. Part I, 57(7), 1449-1462.

Lawless, J. and Wang, P. (1976). A simulation study of ridge and other regression estimators. Communications in Statistics, A5, 307-324.

Mardikyan, S. and Cetin, E. (2008). Efficient choice of biasing constant for ridge regression. International Journal of Contemporary Mathematical Sciences, 3, 527-547.

Martinez-Martinez, J., Escandell-Montero, P., Soria-Olivas, E., Martín-Guerrero, J., Magdalena-Benedito, R., and Gómez-Sanchis, J. (2011). Regularized extreme learning machine for regression problems. Neurocomputing, 74, 3716-3721.

McDonald, G. and Galarneau, D. (1975). A Monte Carlo evaluation of some ridge-type estimators. J. Amer. Statist. Assoc., 70, 407-416.

Miche, Y., van Heeswijk, M., Bas, P., Simula, O., and Lendasse, A. (2011). TROP-ELM: A double-regularized ELM using LARS and Tikhonov regularization. Neurocomputing, 74(16), 2413-2421.

Mukherjee, S., Niyogi, P., Poggio, T., and Rifkin, R. (2003). Statistical learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. CBCL Paper 223, Massachusetts Institute of Technology.

Mukherjee, S., Niyogi, P., Poggio, T., and Rifkin, R. (2006). Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25(1-3), 161-193.

Nguyen, T. D., Pham, H. T. B., and Dang, V. H. (2010).
An efficient pseudoinverse matrix-based solution for secure auditing. In Proceedings of the IEEE International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future. IEEE.

Nordberg, L. (1982). A procedure for determination of a good ridge parameter in linear regression. Communications in Statistics, A11, 285-309.

Penrose, R. and Todd, J. A. (1956). On best approximate solutions of linear matrix equations. Mathematical Proceedings of the Cambridge Philosophical Society, 52, 17-19.

Poggio, T. and Girosi, F. (1990a). Networks for approximation and learning. Proceedings of the IEEE, 78(9), 1481-1497.

Poggio, T. and Girosi, F. (1990b). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247(4945), 978-982.

Poggio, T., Rifkin, R., Mukherjee, S., and Niyogi, P. (2004). General conditions for predictivity in learning theory. Nature, 428, 419-422.

Rao, C. and Mitra, S. (1971). Generalized Inverse of Matrices and its Applications. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. Wiley.

Saleh, A. and Kibria, B. (1993). Performances of some new preliminary test ridge regression estimators and their properties. Communications in Statistics - Theory and Methods, 22, 2747-2764.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, 58, 267-288.

Tikhonov, A. and Arsenin, V. (1977). Solutions of Ill-posed Problems. Scripta Series in Mathematics. Winston, Washington DC.

Tikhonov, A. N. (1963). Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl., 4, 1035-1038.

Wang, X. (2013). Special issue on extreme learning machines with uncertainty. Int. J. Unc. Fuzz. Knowl. Based Syst., 21.

Wang, X.-Z., D., W., and Huang, G.-B. (2012). Special issue on extreme learning machines. Soft Computing, 16(9), 1461-1463.

Yu, D. and Deng, L. (2012). Efficient and effective algorithms for training single-hidden-layer neural networks. Pattern Recogn. Lett., 33(5), 554-558.