Training Recurrent Neural Networks as a Constraint Satisfaction Problem
Authors: Hamid Khodabandehlou and M. Sami Fadali
Abstract — This paper presents a new approach for training artificial neural networks using techniques for solving the constraint satisfaction problem (CSP). The quotient gradient system (QGS) is a trajectory-based method for solving the CSP. This study converts the training set of a neural network into a CSP and uses the QGS to find its solutions. The QGS finds the global minimum of the optimization problem by tracking trajectories of a nonlinear dynamical system and does not stop at a local minimum of the optimization problem. Lyapunov theory is used to prove the asymptotic stability of the solutions with and without the presence of measurement errors. Numerical examples illustrate the effectiveness of the proposed methodology and compare it to a genetic algorithm and error backpropagation.

Keywords — Neural Networks, Global Optimization, Quotient Gradient System, Modeling, Training

I. INTRODUCTION

After Minsky and Papert showed that the two-layer perceptron cannot approximate general functions [1], it took nearly a decade for researchers to show that multilayer feedforward neural networks are universal approximators [2]. Since then, neural networks have been successfully used in various science and engineering applications [1],[2]. However, training and learning the internal structure of neural networks has remained a challenging problem for researchers. Training neural networks requires solving a nonlinear, non-convex optimization problem, and researchers have proposed different approaches to solving it [3]. Classical optimization methods were the first methods used for training neural networks. The most widely used training algorithm is error backpropagation, which minimizes an error function using the steepest descent algorithm [4].
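As a point of reference for the steepest-descent update mentioned above, the following is a minimal sketch of gradient descent on a sum-of-squared-errors cost; the toy linear model, data, and learning rate are illustrative assumptions, not from the paper:

```python
import numpy as np

# Steepest-descent sketch of the update rule backpropagation applies to an
# error function E = 0.5 * sum(err**2); toy linear model, not the paper's network.
X = np.array([0.0, 0.5, 1.0])
y = 2.0 * X + 1.0                      # targets generated by a known line
w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    err = (w * X + b) - y              # prediction error
    w -= lr * (err * X).sum()          # dE/dw
    b -= lr * err.sum()                # dE/db
```

Because this toy cost is convex, the iteration recovers the generating parameters; the trapping-in-local-minima problem discussed next appears only for the non-convex costs of real networks.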
Although error backpropagation is easy to implement, it has all the disadvantages of Newton-based optimization algorithms, including a slow convergence rate and trapping in local minima. Local minima can decrease the generalization ability of the neural network [3],[4].

This work was supported by the National Science Foundation under Grant No. IIA-1301726. H. Khodabandehlou is a graduate student at the University of Nevada, Reno, Reno, NV 89557 USA (e-mail: hkhodabandehlou@nevada.unr.edu). M. Sami Fadali is a professor at the University of Nevada, Reno, Reno, NV 89557 USA (e-mail: fadali@unr.edu).

To cope with these deficiencies, researchers proposed other training methods such as supervised learning and global optimization approaches [5],[6],[7]. Supervised learning approaches learn the internal structure of the neural network while learning its internal weights. Learning the internal structure of the neural network makes these approaches more efficient and less reliant on parameters selected by the user [8],[9]. Researchers proposed different supervised learning methods such as the tiling algorithm, the cascade-correlation algorithm, step net, and the scaled conjugate algorithm, among others [9]. While in incremental supervised learning approaches the network size grows during the training phase, which may result in over-fitting, some supervised learning approaches prune the over-fitted network during training [10],[11],[12]. However, few of these methods have been successfully applied to large-scale practical problems [9]. This is in contrast to conjugate gradient methods, which are attractive for large-scale problems due to their fast convergence rate [13]. Quasi-Newton methods are a sophisticated alternative to conjugate gradient methods for supervised learning, although their reliance on exact approximation of the Hessian matrix makes them inefficient in some applications [14].
Global optimization methods are another alternative to cope with the deficiencies of Newton-based methods and learn the internal structure of neural networks. Genetic algorithms and simulated annealing have been widely used to train neural networks and optimize network structure [15],[16]. These approaches assume that the quality of the network is related to the network topology and parameters. Alopex is another global optimization approach, which trains the network using the correlation between changes in weights and changes in the error function. Due to the local computations of Alopex, it is more suitable for parallel computation [17]. Taboo search is another stochastic approach which has been frequently used to train neural networks. It can find the optimal or near-optimal solution of the optimization problem [18]. Implementation of taboo search is easier than that of most global optimization methods, and the method is generally applicable to a wide variety of optimization problems [19]. Researchers have also used combinations of global optimization methods for training neural networks. GA-SA is a combination of a genetic algorithm and simulated annealing. GA-SA uses a genetic algorithm to make simulated annealing faster and thereby reduce the training time [20],[21],[22]. NOVEL is another hybrid approach, which uses a trajectory-based method to find feasible regions of the solution space and then locates the local minima in the feasible regions by local search [23]. Although global optimization methods have been applied to training neural networks, there are other promising global optimization approaches that have not been used for neural network training. The quotient gradient system is a trajectory-based method to find feasible solutions of constraint satisfaction problems.
QGS searches for the feasible solutions of the CSP along the trajectories of a nonlinear dynamical system [24]. This paper exploits QGS to train artificial neural networks by transforming the training data set into a CSP, then transforming the resulting CSP into an unconstrained minimization problem. After constructing the unconstrained minimization problem, the nonlinear QGS dynamical system is defined. Using the fact that the equilibrium points of the QGS are local minima of the unconstrained minimization problem, a neural network can be trained by integrating the QGS over time until it reaches an equilibrium point. The method is easy to implement because constructing the nonlinear dynamical system is similar to deriving the equations of the steepest descent algorithm. The algorithm finds multiple local minima of the optimization problem by forward and backward integration of the QGS. This provides an easy and straightforward approach to finding multiple local minima of the optimization problem. However, like other global optimization methods, finding local minima takes more time than with Newton-based methods. Numerical examples show that QGS outperforms error backpropagation and a genetic algorithm, and the resulting network has better generalization capability. A preliminary version of the paper, which compares the method with error backpropagation, was presented in [26]. Solving the optimization problem with different initial points is one of the approaches to cope with local minima in Newton-based methods. However, the selected initial points may lie in the stability region of the same stable equilibrium point, which makes this approach inefficient. QGS uses backward integration to escape from the stability region of a stable equilibrium point, then enters the stability region of another equilibrium point with forward integration. This allows QGS to explore a bigger region in its search for local minima.
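The construction described above, integrating a gradient-like system built from the constraints until it reaches an equilibrium, can be sketched numerically. The toy constraint set, the Euler integrator, and all function names below are our own illustration under the assumption that the QGS is the negative gradient flow of half the squared constraint residual; they are not the paper's code:

```python
import numpy as np

def qgs_rhs(F, J):
    """QGS vector field x' = -J(x)^T F(x): gradient descent on V = 0.5*||F(x)||^2."""
    return lambda x: -J(x).T @ F(x)

def integrate(rhs, x0, step=1e-2, n_steps=20000, tol=1e-10):
    """Forward-Euler integration until the field (nearly) vanishes at an equilibrium."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        dx = rhs(x)
        if np.linalg.norm(dx) < tol:
            break
        x = x + step * dx
    return x

# Toy CSP: the circle x0^2 + x1^2 = 1 intersected with the line x0 = x1.
F = lambda x: np.array([x[0] ** 2 + x[1] ** 2 - 1.0, x[0] - x[1]])
J = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]])
x_star = integrate(qgs_rhs(F, J), x0=[0.9, 0.2])
```

From this initial point the trajectory settles on a feasible point of the toy CSP, the intersection point with both coordinates equal and positive.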
The simple implementation, along with the global optimization property of QGS, justifies its use as a new training method for artificial neural networks. The remainder of this paper is organized as follows: Section II presents the QGS methodology. Section III describes the structure of the neural network. The application of QGS to neural network training is presented in Section IV. Section V establishes the stability of the proposed method and examines the effect of input errors on its stability. Numerical examples are provided in Section VI, and Section VII presents the conclusion.

II. QUOTIENT GRADIENT SYSTEM

The CSP is an active field of research in artificial intelligence and operations research. Lee and Chiang [24] used the trajectories of a nonlinear dynamical system to find the solutions of the CSP. This section reviews their work, which forms the basis for our new approach to neural network training presented in Section IV. Consider a system of nonlinear equality and inequality constraints

$$h_i(x) = 0,\ i = 1,\dots,m, \qquad g_j(x) \le 0,\ j = 1,\dots,p \quad (1)$$

To guarantee the existence of a solution of this CSP, $h$ and $g$ are assumed to be smooth. The CSP can be transformed into the unconstrained optimization problem

$$\min_{x,s}\ V(x,s) = \tfrac{1}{2}\,\|F(x,s)\|^{2} \quad (2)$$

$$F(x,s) = \begin{bmatrix} h(x) \\ g(x) + s \circ s \end{bmatrix} \quad (3)$$

where the slack variable $s$ has been introduced to transform the inequality constraints into equality constraints. The global minimum of (2) is the optimal solution of the original CSP. The QGS is a nonlinear dynamical system of equations defined based on the constraint set as

$$\dot{x} = -DF(x)^{T} F(x) \quad (4)$$

where $DF(x)$ is the Jacobian of $F$ at $x$. Lee and Chiang showed that the stable equilibrium points of the QGS are local minima of the unconstrained minimization problem (2), which are possible feasible solutions of the original CSP. A solution $\Phi(t, x_0)$ of the QGS starting from the initial point $x_0$ at the initial time $t_0$ is called a trajectory or orbit. An equilibrium manifold $\Sigma$ is a path-connected component of the set $\{x : DF(x)^{T}F(x) = 0\}$. Assuming that $\Phi(t, x_0)$ is an orbit of the QGS, an equilibrium manifold $\Sigma$ of the QGS is stable if for every $\varepsilon > 0$ there exists $\delta > 0$ such that

$$x_0 \in B_{\delta}(\Sigma) \;\Rightarrow\; \Phi(t, x_0) \in B_{\varepsilon}(\Sigma) \ \ \forall t \ge 0 \quad (5)$$

where $B_{\varepsilon}(\Sigma) = \{x : \operatorname{dist}(x, \Sigma) < \varepsilon\}$.
If $\delta$ can be chosen such that

$$\lim_{t\to\infty} \operatorname{dist}\big(\Phi(t, x_0), \Sigma\big) = 0 \quad (6)$$

the equilibrium manifold is asymptotically stable. An equilibrium manifold which is not stable is unstable. An equilibrium manifold $\Sigma$ is pseudo-hyperbolic if, for every $x \in \Sigma$, the Jacobian of the QGS vector field at $x$ has no eigenvalues with a zero real part on the normal space of $\Sigma$ at $x$, and a neighborhood of $\Sigma$ exists on which the manifold is locally homeomorphic to a projection whose fiber dimension equals the dimension of the equilibrium manifold. The stability region $A(\Sigma)$ of the stable equilibrium manifold $\Sigma$ is an open, connected, and invariant set, defined as

$$A(\Sigma) = \Big\{x : \lim_{t\to\infty} \operatorname{dist}\big(\Phi(t, x), \Sigma\big) = 0\Big\} \quad (7)$$

The boundary of the stability region of a stable equilibrium manifold is the stability boundary and is denoted by $\partial A(\Sigma)$. The QGS is assumed to satisfy the following assumptions.

Assumptions: Let $\Sigma$ be a stable equilibrium manifold of the QGS.
(A1) If an equilibrium manifold $\Sigma'$ has nonempty intersection with $\partial A(\Sigma)$, then $\Sigma' \subset \partial A(\Sigma)$.
(A2) All the equilibrium manifolds on $\partial A(\Sigma)$ are pseudo-hyperbolic and have the same dimension.
(A3) The stable and unstable manifolds of the equilibrium manifolds on $\partial A(\Sigma)$ satisfy the transversality condition.
(A4) The function $V$ satisfies one of the following: (1) $V$ is a proper map; (2) for any orbit of the QGS and any closed set that contains no equilibrium manifold, the gradient $\nabla V$ of $V$ is bounded away from zero on that set.

The transversality condition of assumption A3 is defined as follows. Let $A$ and $B$ be manifolds in $\mathbb{R}^{n}$ of codimensions $a$ and $b$. We say that $A$ and $B$ intersect transversally if (i) for every $x \in A \cap B$ there exists an open neighborhood $U$ of $x$, and (ii) a system of functions $f_1,\dots,f_a$ for $A$ and $g_1,\dots,g_b$ for $B$, such that the set of gradients $\{\nabla f_i(x), \nabla g_j(x)\}$ is linearly independent for all $x \in A \cap B \cap U$ [25]. The following theorem assures the stability of the QGS and characterizes the stability boundary under assumptions A1-A4.

Theorem 1 [27]: Let $\Sigma$ be a stable equilibrium manifold of the QGS and suppose that assumptions A1-A4 hold.
Then we have the following:
1) The QGS is completely stable, i.e., every trajectory of the QGS converges to an equilibrium manifold.
2) Let $\{\Sigma_i\}$ be the set of all equilibrium manifolds on $\partial A(\Sigma)$. Then $\partial A(\Sigma) = \bigcup_i W^{s}(\Sigma_i)$, where $W^{s}(\Sigma_i)$ is the stable manifold of the pseudo-hyperbolic equilibrium manifold $\Sigma_i$ and is defined as

$$W^{s}(\Sigma_i) = \Big\{x : \lim_{t\to\infty} \operatorname{dist}\big(\Phi(t, x), \Sigma_i\big) = 0\Big\} \quad (8)$$

The next theorem shows that solving the CSP is equivalent to finding the stable equilibrium manifolds of the QGS.

Theorem 2 [24]: Consider the CSP and its associated quotient gradient system. If assumptions A1-A4 hold, then we have the following:
I. Each path component of the solution set of the CSP is a stable equilibrium manifold of the QGS.
II. If $\Sigma$ is a stable equilibrium manifold of the QGS, then $\Sigma$ consists of non-isolated local minima of the following minimization problem

$$\min_{x}\ V(x) = \tfrac{1}{2}\,\|F(x)\|^{2} \quad (9)$$

III. $\Sigma$ is a component of the solution set of the CSP if and only if $\Sigma$ is a stable equilibrium manifold of the QGS of the appropriate dimension.

A stable equilibrium manifold of the QGS may not be in the feasible region of the CSP. In such cases, the QGS must escape from this equilibrium manifold and enter the stability region of another stable equilibrium manifold. If the new equilibrium manifold is not in the feasible region, this process is repeated until the QGS enters the stability region of a feasible equilibrium manifold or until it satisfies a stopping criterion. Once a feasible manifold is reached, the QGS is integrated over time until an equilibrium point is reached. To escape from the basin of attraction of a stable equilibrium point, the QGS is integrated backward in time until an unstable point is reached. Thus, solving the optimization problem becomes a series of forward and backward integrations of the QGS until the stopping criterion is satisfied.

III. NEURAL NETWORKS

Function approximation is required in many fields of science and engineering.
Neural networks are general function approximators and have been successfully applied to different function approximation applications [2],[3]. Based on the nature of the application, researchers have developed different versions of neural networks, such as feedforward networks, recurrent neural networks, liquid state networks, and wavelet networks, among others [28]. In this study, we consider a three-layer, fully recurrent neural network with smooth activation functions. Fig. 1 illustrates the internal structure of the neural network. The network has input $u$, internal state $x$, and output $y$. The input-output equation of the network is described as (10). The network weight matrices have sizes that depend on the number of network inputs, outputs, and hidden-layer nodes. The cost function for training the neural network is the traditional sum of squared errors (SSE)

$$E = \sum_{k=1}^{N} \big(\hat{y}(k) - y(k)\big)^{2} \quad (11)$$

where $\hat{y}$ is the network output, $y$ is the measured output, and $N$ is the total number of training samples.

Fig. 1. Artificial neural network structure

IV. APPLYING QGS TO NEURAL NETWORK TRAINING

Solving the CSP is equivalent to solving the unconstrained minimization problem (2). QGS is a trajectory-based method to find the local minima of (2), which are the possible feasible solutions of the CSP. To train neural networks using QGS, we consider the training set as the equality constraints of the CSP, transform the CSP into an unconstrained minimization problem as in (2), and then use the second part of Lee and Chiang's work, namely that the equilibrium points of the QGS are local minima of the unconstrained minimization problem. If $N$ measurements are available, the CSP can be written as

$$\hat{y}(u_k; \theta) - y_k = 0, \quad k = 1, \dots, N \quad (12)$$

The network state vector $\theta$ includes all the network parameters, i.e., all entries of the weight matrices.
More specifically, the weight matrices are partitioned as in (13), and the parameter vector $\theta$ is defined as in (14). Since the training set does not contain any inequality constraints, slack variables are not needed. Using the training set, the QGS for training the neural network can be defined as

$$\dot{\theta} = -J(\theta)^{T} F(\theta) \quad (15)$$

where $F(\theta)$ is the vector of training residuals

$$F(\theta) = \big[\hat{y}(u_1;\theta) - y_1,\ \dots,\ \hat{y}(u_N;\theta) - y_N\big]^{T} \quad (16)$$

and $J(\theta)$ is its Jacobian. To train the neural network using QGS, we use the fact that the equilibrium points of the QGS are local minima of the unconstrained minimization problem. Therefore the algorithm needs to find an equilibrium point of the QGS, then escape from that equilibrium point and move toward another equilibrium point of the QGS. The first step is to integrate the QGS from a starting point, which need not be feasible, to find an equilibrium point. Next, we escape from the stability region of the stable equilibrium point to an unstable point by backward integration of the QGS in time. The eigenvalues of the Jacobian matrix can be used as a measure of stability and instability. The algorithm continues until it cannot find any new equilibrium point or until it satisfies the stopping criterion. Because neural network training has equilibrium points, which can be considered as zero-dimensional equilibrium manifolds, assumptions A1, A2, and A3 hold. When the activation function of the neural network is a one-to-one invertible function, $V$ is a proper map. Assumption A4 also holds because the QGS is asymptotically stable and $V$ is proper.

V. STABILITY

Any training algorithm must be stable, even in the presence of measurement errors and uncertainties. We use Lyapunov stability theory to prove the asymptotic stability of the equilibrium points, with and without measurement errors.

Theorem 3: The equilibrium points of the quotient gradient system are asymptotically stable.

Proof: Consider the Lyapunov function

$$V(\theta) = \tfrac{1}{2}\, F(\theta)^{T} F(\theta) \quad (17)$$

$V$ is a locally positive definite function of the state that is equal to zero at the global optima of the optimization problem.
Thus, $V$ is a locally positive definite function in the vicinity of each equilibrium point. The derivative of the Lyapunov function along the system trajectories is

$$\dot{V} = F(\theta)^{T} J(\theta)\,\dot{\theta} = -\big\|J(\theta)^{T} F(\theta)\big\|^{2} \le 0 \quad (18)$$

The derivative of the Lyapunov function is negative definite in the vicinity of each equilibrium point of the QGS, i.e., in the vicinity of each local minimum of the optimization problem. The Jacobian is positive definite in the vicinity of the equilibrium points because they are minima of the cost function. Therefore, all the equilibrium points of the QGS are locally asymptotically stable. ●

Under certain conditions, the equilibrium points are exponentially stable, as shown in the next theorem.

Theorem 4: The equilibrium points of the QGS are exponentially stable.

Proof: Consider the Lyapunov function of (17). When there are no repeated measurements, $J$ is full rank, and therefore $J J^{T}$ is a positive definite matrix. Let $\lambda_{\min}$ be the smallest eigenvalue of the positive definite matrix $J J^{T}$. The derivative of the Lyapunov function can be written as

$$\dot{V} = -F^{T} J J^{T} F \le -\lambda_{\min}\,\|F\|^{2} = -2\,\lambda_{\min} V \quad (19)$$

Therefore $V$ decays exponentially, and the equilibrium points of the QGS are exponentially stable. With the bounded input and output assumption, (29) yields that $J$ is bounded. Therefore the spectral radius, and consequently the smallest singular value, of $J J^{T}$ is finite.

Measurement errors and noise can make the measurements inaccurate and destabilize the system. Fortunately, the QGS can tolerate relatively large measurement errors. In neural networks, measurement errors lead to errors in the neural network inputs. Consider the QGS as a function of $\theta$ and $u$, i.e., $\dot{\theta} = f(\theta, u)$, and let the measurement errors change $u$ to $u + e$. Assuming that $e$ is small,

$$f(\theta, u + e) = f(\theta, u) + \frac{\partial f}{\partial u}\, e + \text{h.o.t.} \quad (20)$$

where h.o.t. denotes higher order terms.
For sufficiently small $e$, we can neglect the higher order terms and write

$$f(\theta, u + e) \approx f(\theta, u) + \frac{\partial f}{\partial u}\, e \quad (21)$$

Assuming that the activation functions of the neural network are continuously differentiable, $f$ is continuously differentiable and hence Lipschitz in $\theta$ and $e$ on a domain that contains the equilibrium point. Assume that the perturbation term satisfies the linear growth bound (22). To find a bound on the perturbation that guarantees stability, we need the following property of matrix norms.

Fact: For every $A \in \mathbb{R}^{m \times n}$,

$$\|A\|_{2} \le \|A\|_{F} \le \sqrt{r}\,\|A\|_{2} \quad (23)$$

where $\|A\|_{F}$ is the Frobenius norm of $A$ and $r$ is its rank.

Theorem 5: Assume that the input and output of the network are bounded, and the corresponding neural network has a given number of inputs and hidden-layer nodes and a given number of measurements. The equilibrium of the perturbed QGS is asymptotically stable if (24) holds.

Proof: Consider the Lyapunov function of (17). The derivative of $V$ including the perturbation is (25). For a negative definite $\dot{V}$, we need (26). This condition is satisfied if (27). Using the nonlinearity of (12) with a bounded output, the residual satisfies (28). The Jacobian of the hyperbolic tangent function gives (29). Using the fact above gives the 2-norm bound (30). Combining (29), (27), (26), and (22) gives (31). Using the input bound gives the condition (32) for a negative definite $\dot{V}$. ●

VI. SIMULATION RESULTS

To illustrate the effectiveness of QGS for training neural networks, we use a QGS-trained network for nonlinear system identification and compare the results with a genetic algorithm; for error backpropagation we use the results of [26]. The genetic algorithm optimization uses the MATLAB optimization toolbox with Roulette selection, adaptive feasible mutation, scattered crossover, and top fitness scaling to get the best results.

A. Example 1: Nonlinear system

Our first benchmark system is the second-order nonlinear system chosen from [31]. The input-output equation of the system is described as (33), where $u$ is the system input and $y$ is the system output.
The noise is zero-mean and normally distributed. The network is driven by the system input, and the measured system output is the target output for training. All the initial network parameter values are zero-mean and normally distributed. The optimal number of hidden-layer nodes and the total number of training samples were determined empirically. The activation function of the neural network is the hyperbolic tangent function

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \quad (34)$$

After initializing with random initial values, QGS finds local minima of the optimization problem. The local minimum with the best generalization capability is the global minimum or close to the optimal solution of the optimization problem. Table I summarizes the mean squared error (MSE) on the test data for the QGS network, the genetic algorithm network, and the error backpropagation network. The MSE of the QGS network is less than the MSE of the genetic algorithm network, and both networks outperform the backpropagation-trained network. Other simulation results that are not included here for brevity, including generalization errors, demonstrate that backpropagation gives much worse results than the two other networks. Hence, we do not include backpropagation in the remainder of this example.

Table I. Mean squared error

Training method   QGS       GA       EBP
MSE               0.00797   0.0082   0.0187

Fig. 2 shows the outputs of the system, the QGS-trained network, and the genetic-algorithm-trained network, and Fig. 3 shows the same outputs for the test data. While the training results are similar for both networks, Fig. 3 shows that the QGS-trained network has better generalization performance on random input test data and has a smaller generalization error.

Fig. 2. Outputs of the system and the trained neural networks for the training set

Fig. 4 shows the generalization error for the QGS-trained network and the genetic-algorithm-trained network.
While both networks have similar performance with the training data as input, Fig. 4 illustrates that the QGS-trained network has better generalization capability in terms of maximum generalization error percentage and mean squared error for the test data. The average absolute generalization error of the QGS-trained network is smaller than that of the genetic-algorithm-trained network.

Fig. 3. Outputs of the system and the trained neural networks for the test data

Fig. 4. Generalization error

B. Example 2: NARMA system

Our second benchmark system is a tenth-order nonlinear autoregressive moving average (NARMA) process [29]. The input-output equation of the system is described as (35), where $u$ is the system input and $y$ is the system output. The noise is zero-mean and normally distributed. The network is driven by the system input, and the measured system output is the target output for training. All the initial network parameter values are zero-mean and normally distributed. The optimal number of hidden-layer nodes and the total number of training samples were determined empirically. After initializing with random initial values, QGS finds local minima of the optimization problem. The local minimum with the best generalization capability is the global minimum or close to the optimal solution of the optimization problem. Table II summarizes the MSE for the test data for the QGS-trained network, the genetic-algorithm-trained network, and the error-backpropagation-trained network. The MSE of QGS is smaller than the MSE of the genetic-algorithm-trained network. Both networks outperform the error-backpropagation-trained network. As in Example 1, other simulation results are much worse for backpropagation than for the other two networks, and we omit backpropagation results for the remainder of this example.

Table II. Mean squared error

Training method   QGS      GA       EBP
MSE               0.0026   0.0038   0.0087

Fig. 5 shows the outputs of the system, the QGS-trained network, and the genetic-algorithm-trained network. Fig. 6 shows the same outputs for the test data. Fig. 5 shows that the QGS-trained network has better performance on the training data, and Fig. 6 shows that the QGS-trained network outperforms the genetic-algorithm-trained network on random input test data. Fig. 7 shows the generalization error for the QGS-trained network and for the genetic-algorithm-trained network. While both networks have similar performance with the training data as input, Fig. 6 and Fig. 7 show that the QGS-trained network has better generalization capability. The average absolute generalization error of the QGS-trained network is smaller than that of the genetic-algorithm-trained network.

Fig. 5. Outputs of the system and the trained neural networks for the training set

Fig. 6. Outputs of the system and the trained neural networks with the test data as input

Fig. 7. Generalization error

VII. CONCLUSION

In this study, we introduce a new training algorithm for neural networks using the QGS. The QGS uses trajectories of a nonlinear dynamical system to find local minima of the optimization problem. The local minimum with the best generalization capability is the global minimum of the optimization problem. Simulation results show that the QGS-trained network performs better than networks trained using a genetic algorithm and error backpropagation. In particular, QGS networks have better generalization properties and faster training time in comparison to the genetic algorithm, and they are more robust to errors in the inputs. In contrast to Newton-based methods, QGS does not need multiple initial values to find multiple local minima and does not need a huge number of measurements for training. Therefore, QGS is particularly suited to applications with a limited number of available input-output measurements.
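The claim that QGS finds multiple local minima without restarting from many initial values can be illustrated on a one-dimensional toy residual. The double-well example, the Euler integrator, and all names below are our own sketch of the forward/backward integration idea, not the paper's implementation:

```python
import numpy as np

def flow(rhs, x, step=1e-3, n_steps=50000, tol=1e-10):
    """Euler-integrate the scalar system x' = rhs(x) until it (nearly) stops moving."""
    for _ in range(n_steps):
        dx = rhs(x)
        if abs(dx) < tol:
            break
        x = x + step * dx
    return x

# Toy residual F(x) = x^2 - 1: V = 0.5*F^2 has minima at x = -1 and x = +1,
# separated by an unstable equilibrium at x = 0.
grad_V = lambda x: 2.0 * x * (x * x - 1.0)
forward = lambda x: -grad_V(x)     # QGS flow: descend into a stable equilibrium
backward = lambda x: grad_V(x)     # time-reversed flow: climb out of the basin

x1 = flow(forward, 0.6)            # first minimum, near +1
ridge = flow(backward, x1 - 1e-3)  # backward integration reaches the unstable point
x2 = flow(forward, ridge - 1e-2)   # step past the ridge, descend to the second minimum
```

A single forward pass finds one minimum; one backward pass followed by another forward pass finds the other, which is the alternating search the paper describes.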
Future work will exploit the projected gradient system (PGS) [30], together with the QGS, to develop a training algorithm for neural networks by searching for the local minima of an optimization problem.

REFERENCES
[1] M. Minsky and S. Papert, Perceptrons, Cambridge, MA: MIT Press, 1969.
[2] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359-366, 1989.
[3] I. E. Livieris and P. Pintelas, "A survey on algorithms for training artificial neural networks," Tech. Rep., Department of Mathematics, University of Patras, Patras, Greece, 2008.
[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in D. Rumelhart and J. McClelland, Eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition, ch. 8, Cambridge, MA: MIT Press, 1986.
[5] M. Gori and A. Tesi, "On the problem of local minima in backpropagation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 1, pp. 76-86, 1992.
[6] C. L. P. Chen and J. Luo, "Instant learning for supervised learning neural networks: a rank-expansion algorithm," IEEE International Conference on Neural Networks, 1994.
[7] P. J. Werbos, "Backpropagation: past and future," in Proc. ICNN-88, pp. 343-353, San Diego, CA, USA, 1988.
[8] V. P. Plagianakos, D. G. Sotiropoulos, and M. N. Vrahatis, "Automatic adaptation of learning rate for backpropagation neural networks," in N. E. Mastorakis, Ed., Recent Advances in Circuits and Systems, pp. 337-341, 1998.
[9] A. Ribert, E. Stocker, Y. Lecourtier, and A. Ennaji, "A survey on supervised learning by evolving multi-layer perceptrons," IEEE International Conference on Computational Intelligence and Multimedia Applications, pp. 122-126, 1999.
[10] M. Mezard and J. P. Nadal, "Learning in feedforward layered networks: the tiling algorithm," Journal of Physics A, vol. 22, pp. 2191-2204, 1989.
[11] S. E. Fahlman and C. Lebiere, "The cascade-correlation learning architecture," in D. S. Touretzky et al., Eds., Advances in Neural Information Processing Systems, vol. 2, pp. 524-532, 1989.
[12] S. Knerr, L. Personnaz, and G. Dreyfus, "Single-layer learning revisited: a stepwise procedure for building and training a neural network," in Neurocomputing, NATO ASI Series F, vol. 68, Springer, 1990, pp. 41-50.
[13] M. F. Moller, "A scaled conjugate gradient algorithm for fast supervised learning," Neural Networks, vol. 6, pp. 525-533, 1993.
[14] J. Nocedal and Y. Yuan, "Analysis of a self-scaling quasi-Newton method," Mathematical Programming, vol. 61, pp. 19-37, 1993.
[15] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, 3rd ed., Springer, 1996.
[16] S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, pp. 671-680, 1983.
[17] K. P. Unnikrishnan and K. P. Venugopal, "Alopex: a correlation-based learning algorithm for feedforward and recurrent neural networks," Neural Computation, vol. 6, pp. 469-490, 1994.
[18] D. Cvijovic and J. Klinowski, "Taboo search: an approach to the multiple minima problem," Science, vol. 267, pp. 664-666, 1995.
[19] R. Battiti and G. Tecchiolli, "The reactive tabu search," ORSA Journal on Computing, vol. 6, no. 2, pp. 126-140, 1994.
[20] R. Battiti and G. Tecchiolli, "Reactive search, a history-sensitive heuristic for MAX-SAT," ACM Journal of Experimental Algorithmics, vol. 2, article 2, 1994.
[21] D. E. Goldberg, "A note on Boltzmann tournament selection for genetic algorithms and population-oriented simulated annealing," Complex Systems, vol. 4, pp. 445-460, 1990.
[22] M. K. Sen and P. L. Stoffa, "Comparative analysis of simulated annealing and genetic algorithms: theoretical aspects and asymptotic convergence," Geophysics, 1993.
[23] H. Khodabandehlou and M. Sami Fadali, "Echo state versus wavelet neural networks: comparison and application to nonlinear system identification," IFAC-PapersOnLine, vol. 50, no. 1, pp. 2800-2805, 2017, https://doi.org/10.1016/j.ifacol.2017.08.630
[24] J. Lee and H. D. Chiang, "Quotient gradient methods for solving constraint satisfaction problems," IEEE Int. Symp. on Circuits and Systems, Australia, 2001.
[25] H. Th. Jongen, P. Jonker, and F. Twilt, Nonlinear Optimization in Finite Dimensions: Morse Theory, Chebyshev Approximation, Transversality, Flows, Parametric Aspects, Kluwer Academic, 2000.
[26] H. Khodabandehlou and M. Sami Fadali, "A quotient gradient method to train artificial neural networks," in Proc. Int. Joint Conf. Neural Networks (IJCNN), Anchorage, USA, 2017.
[27] J. Lee and H. D. Chiang, "Stability regions of non-hyperbolic dynamical systems: theory and optimal estimation," IEEE International Symposium on Circuits and Systems, pp. 28-31, 2001.
[28] L. Ljung, "Identification of nonlinear systems," IEEE 9th International Conference on Control, Automation, Robotics and Vision (ICARCV), 2006.
[29] H. Jaeger, "Adaptive nonlinear system identification with echo state networks," in Advances in Neural Information Processing Systems, vol. 15, Cambridge, MA: MIT Press, pp. 593-600, 2003.
[30] J. Lee and H. D. Chiang, "A dynamical trajectory-based methodology for systematically computing multiple optimal solutions of general nonlinear programming problems," IEEE Trans. Automatic Control, vol. 49, no. 6, pp. 888-899, 2004.
[31] H. Khodabandehlou and M. Sami Fadali, "Nonlinear system identification using neural networks and trajectory-based optimization," arXiv:1804.10346v2 [eess.SP], 2018.