Reservoir-size dependent learning in analogue neural networks
Xavier Porte,* Louis Andreoli, Maxime Jacquot, Laurent Larger, and Daniel Brunner
Département d'Optique P. M. Duffieux, Institut FEMTO-ST, Université Bourgogne-Franche-Comté, CNRS UMR 6174, Besançon, France.

The implementation of artificial neural networks in hardware substrates is a major interdisciplinary enterprise. Well suited candidates for physical implementations must combine nonlinear neurons with dedicated and efficient hardware solutions for both connectivity and training. Reservoir computing addresses the problems related to network connectivity and training in an elegant and efficient way. However, important questions regarding the impact of reservoir size and learning routines on the convergence speed during learning remain unaddressed. Here, we study in detail the learning process of a recently demonstrated photonic neural network based on a reservoir. We use a greedy algorithm to train our neural network for the task of chaotic signal prediction and analyze the learning-error landscape. Our results unveil fundamental properties of the system's optimization hyperspace. In particular, we determine the convergence speed of learning as a function of reservoir size and find exceptional, close to linear scaling. This linear dependence, together with our parallel diffractive coupling, represents optimal scaling conditions for our photonic neural network scheme.

* javier.porte@femto-st.fr

I. INTRODUCTION

Nowadays, neural networks (NNs) still remain extensively emulated by traditional computers, which poses important challenges in terms of parallelization, energy efficiency and overall computing speed. Ultimately, full hardware integration of NNs, where nonlinear nodes, network connectivity and optimization through learning are implemented via dedicated functionalities of the substrate, is desirable. Optics-based solutions like optoelectronic [1, 2] or all-photonic [3-5] neural networks are of particular interest because they can avoid parallelization bottlenecks.

In the context of hardware implementation, reservoir computing (RC) appeared as a particularly well suited approach to train and operate NNs [6, 7]. The convenience of RC originates in a training that is restricted to the optimization of the readout weights, leaving the input as well as the internal connectivity between neurons unaffected. However, in physically implemented NNs the training itself is conditioned by the hardware structure. Here, one of the fundamental questions is how an error landscape is explored by a given learning algorithm when applied to a hardware NN. Understanding the topology of the cost function, its potential convexity or the presence of local minima is therefore of major importance.

We experimentally implement RC in an optoelectronic analogue NN and train it to predict chaotic time series via a greedy algorithm, analogously to [8]. We study in detail the error landscape, which we also refer to as the cost function, differentiating the features caused by topology from those originating in noise. The mapped landscape is rich in features, on average follows an exponential topology, and contains numerous local minima with comparably good performance. We demonstrate that learning with our greedy algorithm converges systematically.
Moreover, we address the particular question of how the NN's rate of convergence and prediction error depend on the network size. This is the first time that the fundamental characteristics of greedy learning are explored in a noisy, physically implemented NN.

II. NEURAL NETWORK CONCEPT AND TRAINING

Our optoelectronic NN is composed of up to 961 neurons whose states are encoded in the pixels of a spatial light modulator (SLM). The neurons are connected among themselves via diffractive optical coupling, which is inherently parallel and scalable [9].

A. Experimental setup

The experimental implementation is schematically illustrated in Fig. 1(a). A laser diode of intensity |E_0^i|^2 illuminates the SLM, where the neurons' states x_i are encoded. The SLM is imaged on the camera (CAM) after passing through the polarizing beam splitter (PBS) and twice through the diffractive optical element (DOE). The information detected by the camera is used to drive the SLM, realizing the recurrent connectivity of the NN.

[Fig. 1(a), optical setup; labelled components: mirror, λ/4 and λ/2 wave plates, MO1-MO3, DMD, DOE, PBS, POL, DET, SLM, BS, LD, CAM, and the control PC providing feedback and information injection.]

FIG. 1. (a) Schematic of our recurrent neural network. (b) NN performance for the Mackey-Glass chaotic time series prediction task: the target chaotic time series, the reservoir's output and the prediction error are depicted in orange, blue and green, respectively.

The dynamical evolution of the recurrent NN is given by

\[
x_i(n+1) = \alpha\,|E_0^i|^2 \cos^2\!\left[\beta\,\frac{\alpha}{N}\,\Big|\sum_j W^{\mathrm{DOE}}_{i,j} E_j(n)\Big|^2 + \gamma\,W^{\mathrm{inj}}_i u(n+1) + \theta_i\right], \tag{1}
\]

where E_j(n) is the optical electric field of each neuron, β is the feedback gain, γ is the input injection gain, α is an empirical normalization parameter, and θ_i is the phase offset of each node. After optimization, the operational parameters [β, γ, α, θ_i] are kept constant. The control PC reads the camera output and sets the new state of the SLM following Eq. (1). Recurrence is established by the previous neuron states E_j(n), the external information is u(n+1), W^DOE is the neurons' internal recurrent coupling, and W^inj is the information injection matrix with random, independent and uniformly distributed weights between 0 and 1.

Following the RC principle, the training of the NN is restricted to the modification of the readout weights only. For that, the neurons of the recurrent NN, i.e. the SLM pixels, are imaged on a digital micro-mirror device (DMD) and focused on the surface of a photodiode. The output of the neural network, as measured by the photodiode, is

\[
y^{\mathrm{out}}(n+1) \propto \sum_{i}^{N} W^{\mathrm{DMD}}_i \left| E_0^i - E_i(n+1) \right|^2, \tag{2}
\]

where W^DMD_{i=1...N} are the optically implemented readout weights. The DMD mirrors can be flipped only between two positions, ±12°. Thus, the readout weights are strictly Boolean and physically correspond to the orientation of the mirrors towards or away from the photodiode. By choosing which mirrors are directed towards the detector, we choose the set of active neurons that contribute to the computation. Once trained, the DMD becomes a passive device, operating without bandwidth limitation or energy consumption [8].
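To make the update and readout rules concrete, the following is a minimal numerical sketch of Eqs. (1) and (2). It is not the experimental implementation: the network size, coupling matrix, illumination, input sequence and the identification of the fields E_j with the states x_j are simplifying assumptions made only for illustration (β and γ follow the values quoted in Sec. III).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and parameters -- illustrative assumptions only.
N = 64                                    # number of neurons (up to 961 in hardware)
beta, gamma, alpha = 0.8, 0.25, 1.0       # feedback gain, injection gain, normalization
theta = rng.uniform(0.0, np.pi, N)        # phase offset of each node
E0 = np.ones(N)                           # illumination per SLM pixel (assumed uniform)
W_doe = rng.normal(0.0, 1.0, (N, N)) / N  # stand-in for the diffractive coupling matrix
W_inj = rng.uniform(0.0, 1.0, N)          # random uniform injection weights, as in the text

def reservoir_step(E, u_next):
    """One iteration of Eq. (1): diffractively coupled fields are detected in
    intensity, injection and phase offsets are added, the SLM responds as cos^2."""
    coupling = np.abs(W_doe @ E) ** 2
    return alpha * np.abs(E0) ** 2 * np.cos(
        beta * (alpha / N) * coupling + gamma * W_inj * u_next + theta) ** 2

def readout(E_new, w_dmd):
    """Eq. (2): Boolean DMD weights select which neurons reach the photodiode."""
    return np.sum(w_dmd * np.abs(E0 - E_new) ** 2)

# Drive the reservoir with a placeholder input (not Mackey-Glass).
u = np.sin(0.1 * np.arange(200))
x = rng.uniform(0.0, 1.0, N)              # initial neuron states
w_dmd = rng.integers(0, 2, N)             # an arbitrary Boolean mirror configuration
for n in range(len(u) - 1):
    x = reservoir_step(x, u[n + 1])       # here the field E is approximated by x
    y_out = readout(x, w_dmd)
```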
B. Training with a greedy algorithm

Learning in our system optimizes the Boolean readout weights W^DMD_{i=1...N,k} during successive learning epochs k, such that the output gradually approximates the desired response. Our greedy learning algorithm explores the error landscape by favoring the selection of readout weights not yet tested. At each iteration, a vector W^select_k is calculated, assigning a new value

\[
W^{\mathrm{select}}_k = \mathrm{rand}(N) \cdot W^{\mathrm{bias}}
\]

to each readout weight position. Here, W^bias is a vector randomly initialized at epoch k = 1, and the function rand(N) creates N random numbers uniformly distributed between 0 and 1. At every learning epoch, the algorithm chooses the new DMD position l_k as the position of the maximum value of W^select_k, i.e. l_k = argmax(W^select_k), and then inverts the corresponding Boolean readout weight: W^DMD_{l_k,k+1} = ¬W^DMD_{l_k,k}. For each epoch k the normalized mean square error (NMSE) ε_k is calculated. The error is defined as a function of the normalized output ỹ^out(n+1) and the target signal T, both normalized by their standard deviation and with their offset subtracted:

\[
\epsilon_k = \frac{1}{T}\sum_{n=1}^{T}\left( T(n+1) - \tilde{y}^{\mathrm{out}}_k(n+1) \right)^2, \tag{3}
\]

where T is the length of the chaotic time series used for training. We train for one-step-ahead prediction, and the target signal T(n+1) is the injection signal one step ahead, u(n+2). The reward r(k) is calculated from the evolution of the performance with respect to the previous epoch:

\[
r(k) = \begin{cases} 1 & \text{if } \epsilon_k < \epsilon_{k-1} \\ 0 & \text{if } \epsilon_k \geq \epsilon_{k-1} \end{cases}. \tag{4}
\]

The modified configuration of the DMD is kept depending on the reward r(k):

\[
W^{\mathrm{DMD}}_{l_k,k} = r(k)\,W^{\mathrm{DMD}}_{l_k,k} + \left(1 - r(k)\right) W^{\mathrm{DMD}}_{l_k,k-1}. \tag{5}
\]

If the performance has not improved with respect to the previous configuration, the DMD mirror is flipped back. In order to favor the selection of readout weights not yet tested, we implement

\[
W^{\mathrm{bias}} = \frac{1}{N} + W^{\mathrm{bias}}, \qquad W^{\mathrm{bias}}_{l_k} = 0, \tag{6}
\]

where the values of W^bias are increased by 1/N at each epoch, after which the value assigned to the current position l_k is set to zero. Consequently, the bias of a previously modified weight increases linearly, approaching unity after k = N learning iterations.

Figure 1(b) shows an example of the NN's performance after training. The task is the prediction of the Mackey-Glass chaotic time series. The reservoir's output y(n) accurately predicts the next step of the chaotic time series, u(n+1).
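The full procedure of Eqs. (3)-(6) amounts to a compact loop, sketched below under stated assumptions. In the experiment the output y^out, and hence ε_k, is measured on the analogue hardware; here a recorded state matrix `states` stands in for the optical readout, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def nmse(target, output):
    """Eq. (3): mean square error between output and target, both
    normalized by their standard deviation and with offsets removed."""
    t = (target - target.mean()) / target.std()
    y = (output - output.mean()) / output.std()
    return np.mean((t - y) ** 2)

def greedy_train(states, target, n_epochs):
    """Greedy Boolean-weight training, Eqs. (3)-(6). `states` is a (T, N)
    array standing in for the measured neuron responses |E_0 - E(n+1)|^2."""
    N = states.shape[1]
    w_dmd = rng.integers(0, 2, N)             # initial mirror configuration
    w_bias = rng.uniform(0.0, 1.0, N)         # random bias vector at epoch k = 1
    best_err = nmse(target, states @ w_dmd)
    errors = [best_err]
    for k in range(n_epochs):
        w_select = rng.uniform(0.0, 1.0, N) * w_bias  # W_k^select = rand(N) * W^bias
        l_k = int(np.argmax(w_select))        # position tested in this epoch
        w_dmd[l_k] ^= 1                       # flip one Boolean readout weight
        err = nmse(target, states @ w_dmd)
        if err < best_err:                    # reward r(k) = 1: keep the flip
            best_err = err
        else:                                 # r(k) = 0: flip the mirror back
            w_dmd[l_k] ^= 1
        w_bias += 1.0 / N                     # Eq. (6): raise bias of untested weights
        w_bias[l_k] = 0.0                     # ...and reset the one just tested
        errors.append(best_err)
    return w_dmd, np.array(errors)
```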
III. RESULTS ON LEARNING AND ERROR LANDSCAPE

Two hundred points of the chaotic Mackey-Glass sequence are used as training signal, and the same sequence is repeated for every learning epoch k. At each epoch, the greedy learning modifies the value of at most a single readout weight entry, hence it shifts the system's position within the error landscape by a distance of at most one. This is the maximal Hamming distance associated with every learning step. An optimization path is a descent trajectory from the starting configuration of the readout weights towards a minimum. The optimized and system parameters are β = 0.8, γ = 0.25, θ_0 = 0.44π, θ_0 + Δθ_0 = 0.95π, μ = 0.45, N = 961. The normalization parameter α represents the attenuation necessary to avoid saturating the camera.

A. Learning and topology

The descending trend of learning in this system was already introduced in [8]. Here we want to explore the systematic characteristics of greedy learning. Since the optimization path through the error landscape is randomized and subject to experimental noise, we study the statistical variability of learning by measuring twenty different learning curves with identical network parameters. In order to restrict our findings to the properties of our learning routine, we start all measurements at an identical position in the error landscape, i.e. with an identical initial W^DMD_{i,1}.

The results are depicted in Fig. 2(a). The twenty individual learning curves are presented in gray and their average is plotted in red. We observe that all curves converge towards a common minimum, after which they slightly increase. The average value of this minimum is ε^opt = (14.2 ± 1.5)·10^-3. The green line is the exponential fit to the average of the 20 learning curves. The blue line illustrates the system's testing error, which we determined using a set of 9000 data points not used during the training sequence. The testing error ε^test = 15·10^-3 matches the learning error excellently. From this result we conclude that no over-fitting is present, which we attribute to the role of noise in our analogue experimental NN.

As shown in more detail in the inset of Fig. 2(a), individual learning curves follow different trajectories, ranging from a rather smooth descent to paths including steep jumps. The large variability during the first learning epochs can be attributed to the experimental system's noise, because the initial DMD configuration is identical for all curves. However, the local variability decreases during the learning process and can therefore be related to particular topological properties of the individual error-landscape explorations rather than to noise.

In order to further study the topology of the optimization paths and the global characteristics of the descent trajectories, we calculate the gradient of the average descent. At each learning epoch, the local change in error is δε/δk = ε^min − ε_k, where ε^min is the smallest error previously achieved by the system. We define δε^+/δk (δε^−/δk) as containing all positive (negative) values of δε/δk. At each learning epoch we calculate the averages of δε^+/δk and δε^−/δk, which correspond to the average positive and negative gradients of the error landscape at learning epoch k. Data for the average positive (negative) gradients are shown in red (blue) in Fig. 2(b).

Both the red and the blue curve in Fig. 2(b) decrease exponentially, which one could expect given that they represent the derivative of the exponentially decaying learning curves. This means that, on average, the curvature of the error landscape follows a decreasing exponential. Moreover, a qualitatively different behaviour can be observed between the two gradients when concentrating on the last learning epochs, cf. the insets in Fig. 2(b). While the negative gradients follow a noisy convergence towards zero, the positive gradients experience a sudden change of trend after learning epoch k ≈ 950. This iteration corresponds to the average epoch at which the learning curves reach their minimum. We therefore consider the NN learning process as composed of two parts: first, an optimization path along which the error and its gradient decrease, until the system reaches a minimum of the error landscape and becomes trapped; then a sharp increase in the positive gradients, while at the same time the negative gradients drop below the noise level. From the clear trends of the positive gradients, we conclude that the optimization path around the minimum is not noise limited but defined by the topology of the error landscape. In fact, due to the rule of Eq. (6), the probability for a readout weight to be modified again increases linearly in steps of 1/N. Therefore, after N epochs, the selection of position l_k statistically repeats the selection sequence carried out during the first part of the optimization. Once trapped in a minimum, the sequence of tested dimensions reveals an inverse organization of the error gradient: consequently, the dimensions that contributed most strongly to reducing the error during optimization now lead to the least degradation of the performance.
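The gradient statistics of Fig. 2(b) can be extracted from recorded learning curves. Since the averaging is only described in words above, the following is one plausible reading of the definition δε/δk = ε^min − ε_k, assuming the curves are stored as a (runs × epochs) array:

```python
import numpy as np

def average_gradients(curves):
    """Split local error changes into positive and negative parts and average
    them over runs, epoch by epoch. `curves` is an (n_runs, n_epochs) array of
    errors eps_k; eps_min is the smallest error each run has reached so far."""
    running_min = np.minimum.accumulate(curves, axis=1)
    delta = running_min[:, :-1] - curves[:, 1:]   # eps_min - eps_k at each step
    pos = np.where(delta > 0, delta, np.nan)      # improvements
    neg = np.where(delta < 0, delta, np.nan)      # degradations
    # nanmean ignores epochs where a run contributed no value of that sign
    return np.nanmean(pos, axis=0), np.nanmean(neg, axis=0)
```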
FIG. 2. (a) Learning performance for the Mackey-Glass chaotic time series prediction. Red asterisks are the average of the 20 gray curves in the background, and the green line is its exponential fit. The inset illustrates the strong initial local variability, depicting the first 200 epochs of five exemplary learning curves. (b) Average of the positive (red) and negative (blue) error gradients of the 20 learning curves displayed in panel (a). Insets: zooms of the two curves around epoch k = 950.

B. Learning scalability

We now focus on an interesting characteristic that emerges from Fig. 2(a), where we found that the number of optimization epochs until convergence to best performance is similar for all learning curves. The 20 different curves converge on average within ∼961 epochs. Thus, the average number of iterations required to optimize the NN is of the order of the number of neurons. We now test whether this particular ratio is maintained when modifying the size of the NN. Figure 3 shows the optimal learning epoch for different network sizes ranging from 9 to 961 nodes. Impressively, the experimental results (blue circles) follow an almost perfectly linear dependence over three orders of magnitude: the slope of the linear fit in logarithmic scale is 1.08. Crucially, the prediction performance continuously improves for larger NNs, hence the optimization of all nodes remains relevant also for the largest system. The network of 9 neurons performs roughly 50 times worse than the network with 961 neurons.

FIG. 3. Scaling of the optimal learning epochs as a function of the number of neurons (double-logarithmic axes: number of nodes versus learning duration, each spanning 10^1 to 10^3). The red dashed line shows the polynomial fit to the data; the obtained coefficient of 1.08 indicates close-to-linear scaling.
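The exponent of 1.08 corresponds to the slope of a straight-line fit in double-logarithmic scale. A short sketch of that fit follows, with placeholder numbers rather than the measured data points of Fig. 3:

```python
import numpy as np

# Placeholder data, NOT the measurements of Fig. 3: network sizes and the
# epoch at which the corresponding learning curves reached their minimum.
sizes = np.array([9, 49, 169, 441, 961])
optimal_epochs = np.array([11, 50, 180, 470, 990])

# A power law  epochs ~ A * N^s  appears as a straight line in log-log scale;
# the slope s of a degree-1 polynomial fit is the scaling exponent
# (s close to 1 indicates linear scaling).
s, log_A = np.polyfit(np.log10(sizes), np.log10(optimal_epochs), 1)
print(f"scaling exponent s = {s:.2f}")
```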
IV. CONCLUSIONS

Our work addresses fundamental questions about the size-dependent performance of analogue NNs and about the topology of their learning-error landscape.

We have investigated the features of greedy learning in analogue NNs. We have shown that, applying our one-Boolean-step exploration algorithm, learning systematically converges towards similar minima. Nevertheless, different learning curves do not follow the same paths but topologically distinct trajectories. This suggests that the optimization paths are ultimately trapped in distinct local minima with comparable prediction performance. We have also experimentally demonstrated that the duration and effectiveness of the training are clearly correlated with the NN size. In particular, the number of epochs required for optimal learning scales almost perfectly linearly with the NN size. This is a crucial finding, which combines with the inherently parallel nature of diffractive coupling to boost the scalability of our photonic NN approach.

Acknowledgements

This work has been supported by the EUR EIPHI program (Contract No. ANR-17-EURE-0002), by the BiPhoProc ANR project (No. ANR-14-OHRI-0002-02), by the Volkswagen Foundation NeuroQNet project and by the ENERGETIC project of Bourgogne Franche-Comté. X.P. has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 713694 (MULTIPLY).

[1] Y. Paquot, F. Duport, A. Smerieri, J. Dambre, B. Schrauwen, M. Haelterman, and S. Massar, Scientific Reports 2 (2012), 10.1038/srep00287.
[2] L. Larger, M. C. Soriano, D. Brunner, L. Appeltant, J. M. Gutierrez, L. Pesquera, C. R. Mirasso, and I. Fischer, Optics Express 20, 3241 (2012).
[3] F. Duport, B. Schneider, A. Smerieri, M. Haelterman, and S. Massar, Optics Express 20, 22783 (2012).
[4] D. Brunner, M. C. Soriano, C. R. Mirasso, and I. Fischer, Nature Communications 4, 1364 (2013).
[5] G. Van der Sande, D. Brunner, and M. C. Soriano, Nanophotonics 6, 561 (2017).
[6] W. Maass, T. Natschläger, and H. Markram, Neural Computation 14, 2531 (2002).
[7] H. Jaeger and H. Haas, Science 304, 78 (2004).
[8] J. Bueno, S. Maktoobi, L. Froehly, I. Fischer, M. Jacquot, L. Larger, and D. Brunner, Optica 5, 756 (2018).
[9] D. Brunner, M. C. Soriano, and G. Van der Sande (Eds.), Photonic Reservoir Computing: Optical Recurrent Neural Networks (De Gruyter, 2019).