Characterization of the convergence of stationary Fokker-Planck learning
Arturo Berrones
Posgrado en Ingeniería de Sistemas
Centro de Innovación, Investigación y Desarrollo en Ingeniería y Tecnología
Facultad de Ingeniería Mecánica y Eléctrica
Universidad Autónoma de Nuevo León
AP 126, Cd. Universitaria, San Nicolás de los Garza, NL 66450, México
arturo@yalma.fime.uanl.mx

Abstract

The convergence properties of the stationary Fokker-Planck algorithm for the estimation of the asymptotic density of stochastic search processes are studied. Theoretical and empirical arguments for the characterization of the convergence of the estimation in the case of separable and nonseparable nonlinear optimization problems are given. Some implications of the convergence of stationary Fokker-Planck learning for the inference of parameters in artificial neural network models are outlined.

Keywords: heuristics, optimization, stochastic search, statistical mechanics

1 Introduction

The optimization of a cost function which has a number of local minima is a relevant subject in all fields of science and engineering. In particular, most machine learning problems are stated as often complex optimization tasks [1]. A common setup consists in the definition of appropriate families of models that should be selected from data. The selection step involves the optimization of a certain cost or likelihood function, which is usually defined on a high-dimensional parameter space. In other approaches to learning, like Bayesian inference [16,14], the entire landscape generated by the optimization problem associated with a set of models, together with the data and the cost function, is relevant. Other areas in which global optimization plays a prominent role include operations research [12], optimal design in engineered systems [19] and many other important applications.

Stochastic strategies for optimization are essential to many of the heuristic techniques used to deal with complex, unstructured global optimization problems. Methods like simulated annealing [13,20,9,25] and evolutionary population-based algorithms [10,7,22,11,25] have proven to be valuable tools, capable of giving good-quality solutions at a relatively small computational effort. In population-based optimization, the search space is explored through the evolution of finite populations of points. The population alternates between periods of self-adaptation, in which particular regions of the search space are explored in an intensive manner, and periods of diversification, in which solutions incorporate the gained information about the global landscape. There is a large amount of evidence indicating that some exponents of population-based algorithms are among the most efficient global optimization techniques in terms of computational cost and reliability. These methods, however, are purely heuristic, and convergence to global optima is not guaranteed. Simulated annealing, on the other hand, is a method that statistically assures global optimality, but in a limit that is very difficult to accomplish in practice. In simulated annealing a single particle explores the solution space through a diffusive process.
In order to guarantee global optimality, the "temperature" that characterizes the diffusion should be lowered according to a logarithmic schedule [8]. This condition implies very long computation times.

In this contribution the convergence properties of an estimation procedure for the stationary density of a general class of stochastic search processes, recently introduced by the author [2], are explored. Through the estimation procedure, promising regions of the search space can be defined on a probabilistic basis. This information can then be used in connection with a locally adaptive stochastic or deterministic algorithm. Preliminary applications of this density estimation method in the improvement of nonlinear optimization algorithms can be found in [23]. Theoretical aspects of the foundations of the method, its links to statistical mechanics and the possible use of the density estimation procedure as a general diversification mechanism are discussed in [3]. In the next section we give a brief account of the basic elements of our stationary density estimation algorithm. Thereafter, theoretical and empirical evidence on the convergence of the density estimation is given. Besides global optimization, the density estimation approach may provide a novel technique for maximum likelihood estimation and Bayesian inference. This possibility, in the context of artificial neural network training, is outlined in Section 4. Final conclusions and remarks are presented in Section 5.

2 Fokker-Planck learning of the stationary probability density of a stochastic search

We now proceed with a brief account of the stationary density estimation procedure on which the present work is based. Consider the minimization of a cost function of the form $V(x_1, x_2, \dots, x_n, \dots, x_N)$ with a search space defined over $L_{1,n} \le x_n \le L_{2,n}$. A stochastic search process for this problem is modeled by

$$\dot{x}_n = -\frac{\partial V}{\partial x_n} + \varepsilon(t), \qquad (1)$$

where $\varepsilon(t)$ is an additive noise with zero mean. Equation (1), known as a Langevin equation in the statistical physics literature [17,26], captures the essential properties of a general stochastic search. In particular, the gradient term gives a mechanism for local adaptation, while the noise term provides a basic diversification strategy. Equation (1) can be interpreted as an overdamped nonlinear dynamical system composed of $N$ interacting particles in the presence of additive white noise. The stationary density estimation is based on an analogy with this physical system, considering reflecting boundary conditions. It follows that the stationary conditional density for particle $n$ satisfies the linear partial differential equation

$$D \frac{\partial p(x_n \mid \{x_{j \ne n} = x_j^*\})}{\partial x_n} + p(x_n \mid \{x_{j \ne n} = x_j^*\}) \frac{\partial V}{\partial x_n} = 0, \qquad (2)$$

which is a one-dimensional Fokker-Planck equation. An important consequence of Eq. (2) is that the marginal $p(x_n)$ can be sampled by drawing points from the conditional $p(x_n \mid \{x_{j \ne n} = x_j^*\})$ via Gibbs sampling [8]. Due to the linearity of the Fokker-Planck equation, a particular form of Gibbs sampling can be constructed, such that it is not only possible to sample the marginal density, but also to give an approximate analytical expression for it.
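For intuition only, the search process of Eq. (1) can be simulated directly; the SFP method itself never integrates these dynamics. A minimal Euler-Maruyama sketch, assuming a user-supplied gradient `grad_V` and choosing the noise amplitude $\sqrt{2D\,dt}$ per step so that the stationary density is proportional to $\exp(-V/D)$, consistent with Eq. (2) (all names illustrative):

```python
import numpy as np

def langevin_search(grad_V, x0, L1, L2, D=1.0, dt=1e-3, steps=100_000, rng=None):
    """Euler-Maruyama discretization of Eq. (1), dx_n/dt = -dV/dx_n + noise,
    with reflecting walls on [L1, L2]. The noise variance 2*D*dt per step
    makes the stationary density proportional to exp(-V/D), as in Eq. (2)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(steps):
        x += -grad_V(x) * dt + np.sqrt(2.0 * D * dt) * rng.standard_normal(x.shape)
        x = np.where(x < L1, 2.0 * L1 - x, x)   # reflect at the lower wall
        x = np.where(x > L2, 2.0 * L2 - x, x)   # reflect at the upper wall
    return x
```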
From Eq. (2) follows a linear second-order differential equation for the cumulative distribution $y(x_n \mid \{x_{j\ne n} = x_j^*\}) = \int_{-\infty}^{x_n} p(x'_n \mid \{x_{j \ne n} = x_j^*\})\, dx'_n$,

$$\frac{d^2 y}{dx_n^2} + \frac{1}{D}\frac{\partial V}{\partial x_n}\frac{dy}{dx_n} = 0, \qquad y(L_{1,n}) = 0, \quad y(L_{2,n}) = 1. \qquad (3)$$

The boundary conditions $y(L_{1,n}) = 0$, $y(L_{2,n}) = 1$ come from the fact that the densities are normalized over the search space. Random deviates can be drawn from the density $p(x_n \mid \{x_{j\ne n} = x_j^*\})$ by the inversion method [6], based on the fact that $y$ is a uniformly distributed random variable in the interval $y \in [0, 1]$. Viewed as a function of the random variable $x_n$, $y(x_n \mid \{x_{j\ne n}\})$ can be approximated through a linear combination of functions from a complete set that satisfies the boundary conditions in the interval of interest,

$$\hat{y}(x_n \mid \{x_{j\ne n}\}) = \sum_{l=1}^{L} a_l \varphi_l(x_n). \qquad (4)$$

Choosing, for instance, a basis in which $\varphi_l(0) = 0$, the $L$ coefficients are uniquely defined by the evaluation of Eq. (3) at $L-1$ interior points. In this way, the approximation of $y$ is performed by solving a set of $L$ linear algebraic equations, involving $L-1$ evaluations of the derivative of $V$. The basic sampling procedure, which we will here call Stationary Fokker-Planck (SFP) sampling, is based on the iteration of the following steps:

1) Fix the variables $x_{j\ne n} = x_j^*$ and approximate $y(x_n \mid \{x_{j\ne n}\})$ by the use of formulas (3) and (4).

2) By the use of $\hat{y}(x_n \mid \{x_{j\ne n}\})$, construct a lookup table in order to generate a deviate $x_n^*$ drawn from the stationary distribution $p(x_n \mid \{x_{j\ne n} = x_j^*\})$.

3) Update $x_n = x_n^*$ and repeat the procedure for a new variable $x_{j\ne n}$.

An algorithm for the automatic learning of the equilibrium distribution of the diffusive search process described by Eq. (1) can be based on the iteration of the three steps of the SFP sampling. A convergent representation for $p(x_n)$ is obtained after taking the average of the coefficients $a_l$ in the expansion (4) over the iterations. In order to see this, consider the expressions for the marginal density and the conditional distribution,

$$p(x_n) = \int p(x_n \mid \{x_{j\ne n}\})\, p(\{x_{j\ne n}\})\, d\{x_{j\ne n}\}, \qquad (5)$$

$$y(x_n \mid \{x_{j\ne n}\}) = \int_{-\infty}^{x_n} p(x'_n \mid \{x_{j\ne n}\})\, dx'_n. \qquad (6)$$

From the last two equations it follows that the marginal $y(x_n)$ is given by the expected value of the conditional $y(x_n \mid \{x_{j\ne n}\})$ over the set $\{x_{j\ne n}\}$,

$$y(x_n) = E_{\{x_{j\ne n}\}}\left[y(x_n \mid \{x_{j\ne n}\})\right]. \qquad (7)$$

All the information on the set $\{x_{j\ne n}\}$ is stored in the coefficients of the expansion (4). Therefore

$$\langle \hat{y} \rangle = \sum_{l=1}^{L} \langle a_l \rangle \varphi_l(x_n) \to y(x_n), \qquad (8)$$

where the brackets represent the average over the iterations of the SFP sampling.
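A minimal sketch of steps 1) and 2), assuming the sine basis that will be introduced in Eq. (9) below; the function names and the lookup-table resolution are illustrative, and the derivative `dV` is expected to accept an array of points. The first $L-1$ rows impose Eq. (3) at the interior points and the last row imposes the normalization $\hat{y}(L_{2,n}) = 1$ (the basis already enforces $\hat{y}(L_{1,n}) = 0$):

```python
import numpy as np

def sfp_coefficients(dV, L1, L2, L, D):
    """Step 1: approximate y(x_n | {x_j}) by collocation on Eq. (3),
    using the sine basis of Eq. (9)."""
    freq = (2.0 * np.arange(1, L + 1) - 1.0) * np.pi / (2.0 * (L2 - L1))
    xk = L1 + (L2 - L1) * np.arange(1, L) / L          # L-1 interior points
    sin_k = np.sin(np.outer(xk - L1, freq))
    cos_k = np.cos(np.outer(xk - L1, freq))
    A = np.empty((L, L))
    # rows 0..L-2: y'' + (1/D) (dV/dx_n) y' = 0 at the interior points
    A[:-1, :] = -freq**2 * sin_k + (dV(xk) / D)[:, None] * (freq * cos_k)
    A[-1, :] = np.sin(freq * (L2 - L1))                # normalization y(L2) = 1
    b = np.zeros(L); b[-1] = 1.0
    return np.linalg.solve(A, b)                       # the a_l of Eq. (4)

def sfp_draw(a, L1, L2, rng, grid=1000):
    """Step 2: draw x* ~ p(x_n | {x_j}) by inverting y through a lookup
    table, using that y is uniform on [0, 1]."""
    xs = np.linspace(L1, L2, grid)
    freq = (2.0 * np.arange(1, len(a) + 1) - 1.0) * np.pi / (2.0 * (L2 - L1))
    y = np.sin(np.outer(xs - L1, freq)) @ a
    # np.interp assumes y is increasing on the grid, which holds when the
    # approximation of the cumulative distribution is accurate
    return np.interp(rng.uniform(), y, xs)
```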
3 Convergence of stationary Fokker-Planck learning

The marginals $p(x_n) = dy(x_n)/dx_n$ give the probability of finding a diffusive particle in any region $[x_n, x_n + dx_n]$ inside the search interval $[L_{1,n}, L_{2,n}]$, under the action of the cost function. Convergence of the stationary density estimation procedure depends on: i) the existence of the stationary state; ii) the convergence of the SFP sampling. Conditions for the existence of the stationary state for general multi-dimensional Fokker-Planck equations can be found in [17]. For our particular reflecting-boundary case, in which the cost function and the diffusion coefficient do not depend on time, the basic requirement is the absence of singularities in the cost function.

By the evaluation of Eq. (8) at each iteration of an SFP sampling, the stationary density associated with the stochastic search can be estimated, and the accuracy of the estimate improves over time. We call this procedure a Stationary Fokker-Planck Learning (SFPL) of a density. The convergence of the SFPL follows from the convergence of Gibbs sampling. It is known that under general conditions a Gibbs sampling displays geometric convergence [18,4]. Fast convergence is an important feature for the practical value of SFPL as a diversification mechanism in optimization problems. The rigorous study of the links between the geometric convergence conditions (stated in [18] as conditions on the kernel of a Markov chain) and SFPL applied to several classes of optimization problems should be a relevant research topic. At this point, some numerical experimentation on the convergence of SFPL is presented.

In what follows, the specific form of the expansion (4)

$$\hat{y} = \sum_{l=1}^{L} a_l \sin\!\left(\frac{(2l-1)\pi (x_n - L_{1,n})}{2(L_{2,n} - L_{1,n})}\right) \qquad (9)$$

is used. The estimation algorithm converges in one iteration for separable problems. A separable function is given by a linear combination of terms, where each term involves a single variable. Separable problems generate an uncoupled dynamics of the stochastic search described by Eq. (1). This behavior is illustrated by the minimization of Michalewicz's function, a common test function for global optimization algorithms [5]. Michalewicz's function in a two-dimensional search space is written as

$$V(x_1, x_2) = -\sin x_1 \left(\sin(x_1^2/\pi)\right)^{2m} - \sin x_2 \left(\sin(2 x_2^2/\pi)\right)^{2m}. \qquad (10)$$

The search space is $0 \le x_n \le \pi$. Michalewicz's function is interesting as a test function because for large values of $m$ the local behavior of the function gives little information on the location of the global minimum. For $m = 10$ the global minimum of the two-dimensional Michalewicz's function has been estimated as $V \sim -1.89$ and is roughly located around the point $(2.2, 1.5)$, as can be seen by plotting the function. The partial derivatives of function (10) with $m = 10$ have been evaluated for each variable at $L - 1$ equidistant points separated by intervals of size $h = \pi/L$. The resulting algebraic linear system has been solved by the LU decomposition algorithm [24]. In Fig. 1, Fig. 2 and Fig. 3 the functions $\hat{y}(x_1)$ and $\hat{y}(x_2)$ and their associated probability densities are shown. The densities have been estimated after a single iteration of SFPL. The densities $p(x_1)$ and $p(x_2)$ are straightforwardly calculated by taking the corresponding derivatives. In Fig. 1 a case with $D = 1$ and $L = 5$ is considered, while in Fig. 2 $D = 1$ and $L = 10$. In Fig. 3 a smaller randomness parameter is considered ($D = 0.4$), using $L = 20$. Notice that even when $D$ is high enough to allow an approximation of $y$ with the use of very few evaluations of the derivatives, the resulting densities give populations that represent the cost function landscape remarkably better than those that would be obtained by uniform deviates.
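Putting the pieces together, one SFPL sweep over the two-dimensional Michalewicz function can be sketched as follows, reusing the illustrative `sfp_coefficients` and `sfp_draw` helpers from the previous sketch and, purely for brevity, a numerical central-difference derivative in place of the analytical one used in the paper. Because the problem is separable, the conditionals do not depend on the fixed variable, so this single sweep already yields the marginals of Figs. 1-3.

```python
import numpy as np

m = 10
def V(x1, x2):                                   # Michalewicz's function, Eq. (10)
    return (-np.sin(x1) * np.sin(x1**2 / np.pi) ** (2 * m)
            - np.sin(x2) * np.sin(2 * x2**2 / np.pi) ** (2 * m))

def partial(f, h=1e-6):                          # numerical derivative, for brevity
    return lambda x: (f(x + h) - f(x - h)) / (2.0 * h)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, np.pi, size=2)              # initial point in [0, pi]^2
for n in (0, 1):                                 # one Gibbs sweep over both variables
    x_other = x[1 - n]
    cond_V = (lambda t: V(t, x_other)) if n == 0 else (lambda t: V(x_other, t))
    a = sfp_coefficients(partial(cond_V), 0.0, np.pi, L=20, D=0.4)
    x[n] = sfp_draw(a, 0.0, np.pi, rng)          # update x_n from its conditional
```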
The asymptotic convergence properties of the SFPL are now experimentally studied on the XOR optimization problem,

$$
\begin{aligned}
f ={} &\left\{ 1 + \exp\!\left( - \frac{x_7}{1 + \exp(-x_1 - x_2 - x_5)} - \frac{x_8}{1 + \exp(-x_3 - x_4 - x_6)} - x_9 \right) \right\}^{-2} \\
&+ \left\{ 1 + \exp\!\left( - \frac{x_7}{1 + \exp(-x_5)} - \frac{x_8}{1 + \exp(-x_6)} - x_9 \right) \right\}^{-2} \\
&+ \left\{ 1 - \left[ 1 + \exp\!\left( - \frac{x_7}{1 + \exp(-x_1 - x_5)} - \frac{x_8}{1 + \exp(-x_3 - x_6)} - x_9 \right) \right]^{-1} \right\}^{2} \\
&+ \left\{ 1 - \left[ 1 + \exp\!\left( - \frac{x_7}{1 + \exp(-x_2 - x_5)} - \frac{x_8}{1 + \exp(-x_4 - x_6)} - x_9 \right) \right]^{-1} \right\}^{2}.
\end{aligned}
$$

The XOR function is an archetypical example that displays many of the features encountered in the optimization tasks that arise in machine learning. This is a case with multiple local minima [21] and strong nonlinear interactions between the decision variables.
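The display above is the squared training error of a 2-2-1 sigmoidal network on the four XOR patterns: $x_1$-$x_4$ are the hidden-layer weights, $x_5$ and $x_6$ the hidden biases, $x_7$ and $x_8$ the output weights, and $x_9$ the output bias. A direct transcription (names illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xor_cost(x):
    """Squared error of a 2-2-1 sigmoid network on the four XOR patterns."""
    x1, x2, x3, x4, x5, x6, x7, x8, x9 = x
    def out(a, b):
        h1 = sigmoid(a * x1 + b * x2 + x5)       # hidden unit 1
        h2 = sigmoid(a * x3 + b * x4 + x6)       # hidden unit 2
        return sigmoid(x7 * h1 + x8 * h2 + x9)   # network output
    # targets: XOR(1,1)=0, XOR(0,0)=0, XOR(1,0)=1, XOR(0,1)=1
    return ((out(1, 1) - 0) ** 2 + (out(0, 0) - 0) ** 2
            + (out(1, 0) - 1) ** 2 + (out(0, 1) - 1) ** 2)
```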
In the experiment reported in Fig. 4 and Fig. 5, two independent trajectories are followed over successive iterations. The parameters of the SFP sampler are $D = 0.01$ and $L = 200$. In Fig. 4 the cost function values at the coordinates in which the marginals are maximum are reported. For each trajectory, an initial point is uniformly drawn from the search space. As can be seen, both trajectories converge to a similarly small value of the objective function. The average cost function value, which is estimated by the evaluation of the cost function on 100 points uniformly distributed over the search space, is 2. After 280 iterations, the differences between both trajectories are around 0.05% of the average cost function value. Moreover, the differences in objective value of the trajectories with respect to a putative global optimum

$$f(x^*) = 0.00026, \qquad (11)$$
$$x^* = (8.22885, -8.47952, -9.87758, 9.10184, -4.55215, -5.05978, 9.98956, 9.96857, -4.91623),$$

are $\le 0.117\%$ of the average cost after iteration 280. The putative global optimum in the search interval has been found by performing local search via steepest descent from a population of points drawn from the estimated density. In order to check statistical convergence, the following measures are introduced,

$$av = \frac{1}{N}\sum_{n=1}^{N} \langle x_n \rangle, \qquad s = \frac{1}{N}\sum_{n=1}^{N}\sqrt{\langle x_n^2\rangle - \langle x_n\rangle^2}, \qquad (12)$$

where the brackets in this case represent statistical moments of the estimated marginals. Under the expansion (9), all the necessary integrals are easily performed analytically.

Fig. 1. Evaluation of $y$ and $p$ by one iteration of the SFPL algorithm for Michalewicz's function, using $L = 5$ and $D = 1$. Despite the very low number of gradient evaluations used, the algorithm is capable of finding a probability structure that is consistent with the global properties of the cost function.

In the first graph of Fig. 5, the evolution over the iterations of the SFP sampler of $s$ and $av$ for two arbitrary and independent trajectories is plotted. A very fast convergence in the measure $av$ is evident.

Fig. 2. The same case reported in Fig. 1, but using $L = 10$.

The measure $s$ is further studied in the second graph of Fig. 5, where the difference in that measure between the two trajectories is followed over the iterations. The convergence is consistent with a geometric behavior over the first ($\sim 100$) iterations and shows an asymptotic power-law rate.

4 Maximum Likelihood Estimation and Bayesian Inference

Besides its applicability as a diversification strategy for local search algorithms, the fast convergence of the SFPL could be fruitful for giving efficient approaches to inference, for instance in the training of neural networks.

Fig. 3. Evaluation of $y$ and $p$ by one iteration of the SFPL algorithm for Michalewicz's function. In this case $L = 20$ and $D = 0.4$. With the increment in precision and the reduction of the randomness parameter, SFPL finds a probability density that is sharply peaked around the global minimum. Notice that the computational effort is still small, involving only 19 evaluations of the gradient.

From the point of view of statistical inference, the uncertainty about the unknown parameters of a learning machine is characterized by a posterior density for the parameters given the observed data [16,14]. The prediction of new data is then performed either by the maximization of this posterior (maximum likelihood estimation) or by an ensemble average over the posterior distribution (Bayesian inference). To be specific, suppose a system which generates an output $Y$ given an input $X$, such that the data is described by a distribution with first moment $E[Y(X)] = f(X, w)$ and diagonal covariance matrix $\sigma^2 I$. The problem is to estimate $f$ from a given set of observations $S$. The parameters could be, for instance, different neural network weights and architectures. The observed data defines an evidence for the different ensemble members, given by the posterior $p(w \mid S)$.

Fig. 4. Objective value of the point in which the marginals of the estimated density are maximum, for two independent trajectories.

In maximum likelihood estimation, training consists of finding a single set of optimal parameters that maximize $p(w \mid S)$. Bayesian inference, on the other hand, is based on the fact that the estimator of $f$ that minimizes the expected squared error under the posterior is given by [16]

$$\hat{Y} = \langle f(X) \rangle = \int dw\, f(X, w)\, P(w \mid S), \qquad (13)$$

so training is done by estimating this ensemble average.

In the SFPL framework proposed here, priors are always given by uniform densities. This choice involves very few prior assumptions, regarding only the assignment of reasonable intervals on which the components of $w$ lie. Under a uniform prior, and if the data present in the sample have been independently drawn, it turns out that the posterior is given by

$$p(w \mid S) \propto \exp(-V/D), \qquad (14)$$

where $D = 2\sigma^2$ and $V$ is the given loss function. The SFPL algorithm can therefore be directly applied in order to learn the marginals $p(w_n \mid S)$ of the posterior (14). By construction, these marginals will be properly normalized.
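Equation (14) is the only interface SFPL needs to the inference problem: for independently drawn data with Gaussian observation noise of variance $\sigma^2$, the squared-error loss plays the role of the cost function and $D = 2\sigma^2$ that of the randomness parameter. A minimal sketch, assuming a model `f(X, w)` and illustrative names:

```python
import numpy as np

def scaled_loss(w, X, Y, f, sigma2):
    """V(w)/D of Eq. (14): V is the squared-error loss over the sample and
    D = 2*sigma2, so that p(w|S) is proportional to exp(-V/D) under a
    uniform (box) prior. SFPL is then run with cost V and randomness
    parameter D to learn the posterior marginals p(w_n|S)."""
    V = np.sum((Y - f(X, w)) ** 2)
    return V / (2.0 * sigma2)
```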
It is now argued that SFPL can be used to efficiently perform maximum likelihood and Bayesian training. Consider again the XOR example.

Fig. 5. Statistical convergence of the stationary density estimation procedure on the XOR problem. The value of the average first moment and standard deviation of the estimated marginals from two independent trajectories is plotted on the left. The graph on the right shows the distance between both average standard deviations. This distance decays at a geometric rate over the first $\sim 100$ iterations. Asymptotically the distance behaves like a power law characterized by $|s_1 - s_2| \sim M^{-0.67}$, where $M$ is the number of iterations.

The associated density has been estimated assuming a prior density for each parameter $w_n$ over the interval $[-10, 10]$. The posterior density, on the other hand, is a consequence of the cost function given the set of training data. In Section 3 it has been shown that for nonseparable nonlinear cost functions like that of the XOR case, SFPL converges to a correct estimation of the marginal densities $p(w_n)$. Therefore, the maximization of the likelihood is reduced to $N$ line maximizations, where $N$ is the number of weights to be estimated. The advantage of this procedure in comparison with the direct maximization of $p(w \mid S)$ is evident. On the other hand, the SFP sampler itself is designed as a generator of deviates that are drawn from the stationary density. The average (13) can be approximated by

$$\hat{Y} = \langle f(X) \rangle \approx \sum_t f(X, w^{(t)}), \qquad (15)$$

without the need of direct simulations of the stochastic search, which are necessary in most other techniques [16].
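Since the SFP sampler outputs one weight vector per iteration, the ensemble average of Eq. (15) reduces to an empirical average over those draws; the sketch below writes it with the $1/T$ normalization of an empirical mean, and the interface is illustrative:

```python
import numpy as np

def bayes_predict(X, f, weight_draws):
    """Approximate Eq. (15): average the model over the weight vectors
    w^(t) produced by the SFP sampler, one per iteration. Returns the
    Bayes prediction and its spread (as in the Fig. 9 histograms)."""
    outputs = np.array([f(X, w) for w in weight_draws])
    return outputs.mean(axis=0), outputs.std(axis=0)
```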
In Fig. 6 the behavior of $p(w_n)$ for a particular weight is shown as the sample size increases. The parameters $L = 200$ and $D = 0.01$ are fixed. The two dotted lines correspond to cases with sample sizes of one and two, with inputs $(0,0)$ and $(0,0), (1,1)$ respectively. The resulting densities are almost flat in both situations. The dashed line corresponds to a sample size of three. The sample points are $(0,0), (1,1), (0,1)$. In this case the sample is large enough to give sufficient evidence to favor a particular region of the parameter domain. The solid line corresponds to the situation in which all four points of the data set are used for training. The resulting density is the sharpest.

The parameter $D$ is proportional to the noise strength in the stochastic search. It can be selected on the basis of a desired computational effort, as discussed in [3]. Figure 6 indicates that at a fixed noise level $D$, an increase of evidence implies a decrease in the uncertainty of the weights. This finding agrees with what is expected from the known theory of the statistical mechanics of neural networks [15], according to which the weight fluctuations decay as the data sample grows.

The performance of maximum likelihood and Bayesian training is reported in Fig. 7, using the complete sample for the inference of the weights. The standard deviation of the error of the networks instantiated at the inferred weights is reported at each iteration. The solid line corresponds to maximum likelihood training, which is essentially the same calculation already reported in Fig. 4.

Fig. 6. The probability density of a particular weight ($w_5$, a bias of one of the neurons in the hidden layer) of the ANN model for the XOR problem. The dotted lines correspond to cases with sample sizes of one and two. The dashed line is for the density that results from a sample of size three, while the case for a sample size of four is given by the solid line.

Fig. 7. Performance of maximum likelihood (solid) and Bayesian (dashed) training for the XOR problem, using stationary Fokker-Planck learning to estimate weight distributions.

The performance is very similar for Bayesian training, which corresponds to the dashed line. The estimation of the average (15) has been performed by evaluating the neural network on a weight vector drawn by the SFP sampler at each iteration. In this way, the number of terms in the sum of Eq. (15) is equal to the number of iterations of the SFP sampler.

Fig. 8. The difference in average cost function value for two independent trajectories in the robot arm problem. The ANN has been trained using 200 sample points.

A larger example involving noisy data is now presented. Consider the "robot arm problem", a benchmark that has already been used in the context of Bayesian inference [16]. The data set is generated by the following model for a robot arm:

$$y_1 = 2.0\cos(x_1) + 1.3\cos(x_1 + x_2) + e_1, \qquad (16)$$
$$y_2 = 2.0\sin(x_1) + 1.3\sin(x_1 + x_2) + e_2.$$

The inputs $x_1$ and $x_2$ represent joint angles, while the outputs $y_1$ and $y_2$ give the resulting arm positions. Following the experimental setup proposed in [16], the inputs are uniformly distributed in the intervals $x_1 \in [-1.932, -0.453] \cup [0.453, 1.932]$, $x_2 \in [0.534, 3.142]$. The noise terms $e_1$ and $e_2$ are Gaussian and white, with standard deviation 0.1. A sample of 200 points is generated using these prescriptions. A neural network with one hidden layer consisting of 16 hyperbolic tangent activation functions is trained on the generated sample using SFP learning, considering a squared error loss function. The same priors are assigned to all of the weights: uniform distributions in the interval $[-1, 1]$.
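The data set just described is straightforward to reproduce. A sketch of the generator under the stated prescriptions (function name illustrative):

```python
import numpy as np

def robot_arm_sample(n=200, rng=None):
    """Generate the robot-arm data set of Eq. (16) with the setup of [16]:
    x1 uniform on [-1.932, -0.453] U [0.453, 1.932], x2 on [0.534, 3.142],
    Gaussian noise of standard deviation 0.1 on both outputs."""
    rng = np.random.default_rng() if rng is None else rng
    sign = rng.choice([-1.0, 1.0], size=n)            # pick one of the two branches
    x1 = sign * rng.uniform(0.453, 1.932, size=n)
    x2 = rng.uniform(0.534, 3.142, size=n)
    y1 = 2.0 * np.cos(x1) + 1.3 * np.cos(x1 + x2) + 0.1 * rng.standard_normal(n)
    y2 = 2.0 * np.sin(x1) + 1.3 * np.sin(x1 + x2) + 0.1 * rng.standard_normal(n)
    return np.stack([x1, x2], axis=1), np.stack([y1, y2], axis=1)
```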
The average absolute difference in training errors for two independent trajectories is shown in Fig. 8, for a case in which $L = 300$, $D = 0.00125$ and $M = 300$ iterations. Taking into account that the expected equilibrium squared error is $err \approx D/2$, it turns out that the differences between both trajectories are of the same order of magnitude as the expected equilibrium error in about 10 iterations. During the course of the total of 300 iterations of SFP of one of the trajectories, the network has been evaluated at the test input $(-1.471, 0.752)$. The resulting Bayesian prediction is shown in Fig. 9 in the form of a pair of histograms. The output that would be given by the exact model (16) in the absence of noise is $(1.177, -2.847)$. The Bayesian prediction given by SFPL has its mean at $(1.24, -2.64)$ with a standard deviation of $\approx 0.12$ for each variable.

Fig. 9. Bayesian predictions for the robot arm problem at the test input $(-1.471, 0.752)$. The SFPL parameters are $L = 300$, $D = 0.00125$ and $M = 300$, using 200 sample points.

Therefore the output given by the underlying model is contained in the 95% confidence interval around the Bayes expectation. Consistent predictions can also be obtained with less precision and data. In Fig. 10 the histograms obtained for a case with a sample size of 50 points, $L = 200$ and $D = 0.01$ are shown. Although the Bayes prediction is more uncertain, it is still statistically consistent with the underlying process.

In the robot arm example the number of gradient evaluations needed to approximately converge to the equilibrium density in the $L = 300$ case was about $2(L-1)M \sim 5980$. This seems competitive with respect to previous approaches, like the hybrid Monte Carlo strategy introduced by Neal. The reader is referred to Neal's book [16] for a very detailed application of hybrid Monte Carlo to the robot arm problem. An additional advantage of the SFPL method lies in the fact that explicit expressions for the parameters' densities are obtained.

Fig. 10. Same case as in Fig. 9 but with SFPL parameters given by $L = 200$, $D = 0.01$, $M = 300$ and a sample size of 50 points.

Much more detailed experimentation is under current development. Additional studies regarding issues like the generalization of more complex ANN models under a limited amount of data, in the spirit of the general framework for Bayesian learning [16,14], are currently a work in progress by the author.

5 Conclusion

Theoretical and empirical evidence for the characterization of the convergence of the density estimation of stochastic search processes by the method of stationary Fokker-Planck learning has been presented. In the context of nonlinear optimization problems, the procedure turns out to converge in one iteration for separable problems and displays fast convergence for nonseparable cost functions. The possible applications of stationary Fokker-Planck learning in the development of efficient and reliable maximum likelihood and Bayesian ANN training techniques have been outlined.

Acknowledgement

This work was partially supported by the National Council of Science and Technology of Mexico under grant CONACYT J45702-A.

References

[1] K. P. Bennett, E. Parrado-Hernández, The Interplay of Optimization and Machine Learning Research, Journal of Machine Learning Research 7 (2006) 1265-1281.

[2] A. Berrones, Generating Random Deviates Consistent with the Long Term Behavior of Stochastic Search Processes in Global Optimization, in: Proc. IWANN 2007, Lecture Notes in Computer Science, Vol. 4507 (Springer, Berlin, 2007) 1-8.

[3] A. Berrones, Stationary probability density of stochastic search processes in global optimization, J. Stat. Mech. (2008) P01013.

[4] A. Canty, Hypothesis Tests of Convergence in Markov Chain Monte Carlo, Journal of Computational and Graphical Statistics 8 (1999) 93-108.

[5] R. Chelouah, P. Siarry, Tabu Search Applied to Global Optimization, European Journal of Operational Research 123 (2000) 256-270.

[6] L. Devroye, Non-Uniform Random Variate Generation (Springer, Berlin, 1986).

[7] A. E. Eiben, J. E. Smith, Introduction to Evolutionary Computing (Springer, Berlin, 2003).

[8] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Machine Intell. 6 (1984) 721-741.
[9] S. Geman, C. R. Hwang, Diffusions for Global Optimization, SIAM J. Control Optim. 24 5 (1986) 1031-1043.

[10] D. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley, 1989).

[11] A. Hertz, D. Kobler, A framework for the description of evolutionary algorithms, European Journal of Operational Research 126 (2000) 1-12.

[12] http://www.informs.org/

[13] S. Kirkpatrick, C. D. Gelatt Jr., M. P. Vecchi, Optimization by Simulated Annealing, Science 220 (1983) 671-680.

[14] D. J. C. MacKay, A practical Bayesian framework for backpropagation networks, Neural Computation 4 3 (1992) 448-472.

[15] D. Malzahn, M. Opper, Statistical Mechanics of Learning: A Variational Approach for Real Data, Phys. Rev. Lett. 89 10 (2002) 108302.

[16] R. M. Neal, Bayesian Learning for Neural Networks (Springer, Berlin, 1996).

[17] H. Risken, The Fokker-Planck Equation (Springer, Berlin, 1984).

[18] G. O. Roberts, N. G. Polson, On the Geometric Convergence of the Gibbs Sampler, J. R. Statist. Soc. B 56 2 (1994) 377-384.

[19] P. Y. Papalambros, D. J. Wilde, Principles of Optimal Design: Modeling and Computation (Cambridge University Press, 2000).

[20] P. Parpas, B. Rustem, E. N. Pistikopoulos, Linearly Constrained Global Optimization and Stochastic Differential Equations, Journal of Global Optimization 36 2 (2006) 191-217.

[21] K. E. Parsopoulos, M. N. Vrahatis, Recent approaches to global optimization problems through Particle Swarm Optimization, Natural Computing 1 (2002) 235-306.

[22] M. Pelikan, D. E. Goldberg, F. G. Lobo, A Survey of Optimization by Building and Using Probabilistic Models, Computational Optimization and Applications 21 1 (2002) 5-20.

[23] D. Peña, R. Sánchez, A. Berrones, Stationary Fokker-Planck Learning for the Optimization of Parameters in Nonlinear Models, in: Proc. MICAI 2007, Lecture Notes in Computer Science Vol. 4827 (Springer, Berlin, 2007) 94-104.

[24] W. Press, S. Teukolsky, W. Vetterling, B. Flannery, Numerical Recipes in C++, the Art of Scientific Computing (Cambridge University Press, 2005).

[25] J. A. K. Suykens, H. Verrelst, J. Vandewalle, On-Line Learning Fokker-Planck Machine, Neural Processing Letters 7 2 (1998) 81-89.

[26] N. G. Van Kampen, Stochastic Processes in Physics and Chemistry (North-Holland, 1992).
