Backpropagation through nonlinear units for all-optical training of neural networks


Authors: Xianxin Guo, Thomas D. Barrett, Zhiming M. Wang, and A. I. Lvovsky

Xianxin Guo (1,2,3,*), Thomas D. Barrett (2,†), Zhiming M. Wang (1,‡), and A. I. Lvovsky (2,4,§)

(1) Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China
(2) University of Oxford, Clarendon Laboratory, Parks Road, Oxford OX1 3PU, UK
(3) Institute for Quantum Science and Technology, University of Calgary, Calgary, Canada, T2N 1N4
(4) Russian Quantum Center, Skolkovo, 143025, Moscow, Russia

(*) These authors contributed equally to this work; xianxin.guo@physics.ox.ac.uk
(†) These authors contributed equally to this work; thomas.barrett@physics.ox.ac.uk
(‡) zhmwang@uestc.edu.cn
(§) alex.lvovsky@physics.ox.ac.uk

Backpropagation through nonlinear neurons is an outstanding challenge to the field of optical neural networks and the major conceptual barrier to all-optical training schemes. Each neuron is required to exhibit a directionally dependent response to propagating optical signals, with the backwards response conditioned on the forward signal, which is highly non-trivial to implement optically. We propose a practical and surprisingly simple solution that uses saturable absorption to provide the network nonlinearity. We find that the backward-propagating gradients required to train the network can be approximated in a pump-probe scheme that requires only passive optical elements. Simulations show that, with readily obtainable optical depths, our approach can achieve equivalent performance to state-of-the-art computational networks on image classification benchmarks, even in deep networks with multiple sequential gradient approximations. This scheme is compatible with leading optical neural network proposals and therefore provides a feasible path towards end-to-end optical training.

I. INTRODUCTION

Machine learning (ML) is changing the way in which we approach complex tasks, with applications ranging from natural language processing [1] and image recognition [2] to artificial intelligence [3] and fundamental science [4, 5]. At the heart (or 'brain') of this revolution are artificial neural networks (ANNs), which are universal function approximators [6, 7] capable, in principle, of representing an arbitrary mapping of inputs to outputs. Remarkably, their function only requires two basic operations: matrix multiplication to communicate information between layers, and some nonlinear transformation of individual neuron states (the activation function). The former accounts for most of the computational cost associated with ML. This operation can, however, be readily implemented by leveraging the coherence and superposition properties of linear optics [8]. Optics is therefore an attractive platform for realising the next generation of neural networks, promising faster computation with low power consumption [9, 10].

Proposals for optical neural networks (ONNs) have been around for over thirty years [11, 12], and have been realised in both free-space [13-15] and integrated [9] settings. However, the true power of neural networks is not only that they can approximate arbitrary functions, but also that they can "learn" that approximation. The training of neural networks is, almost universally, achieved by the backpropagation algorithm [16].
Implementing this algorithm optically is challenging because it requires the response of the network's nonlinear elements to be different for light propagating forwards or backwards. Confronted with these challenges, existing ONNs are actually trained with, or heavily aided by, digital computers [9, 13, 15, 17]. As a result, the great advantages offered by optics remain largely unexploited. Developing an all-optically trained ONN to leverage these advantages remains an unsolved problem. Here, we address this challenge and present a practical training method capable of backpropagating the error signal through nonlinear neurons in a single optical pass.

The backpropagation algorithm aims to minimise a loss function that quantifies the divergence of the network's current performance from the ideal, via gradient descent [16]. To do so, the following steps are repeated until convergence: (1) forward propagation of information through the network; (2) evaluation of the loss function gradients with respect to the network parameters at the output layer; (3) backpropagation of these gradients to all previous layers; (4) parameter updates in the direction that maximally reduces the loss function. Forward propagation (step (1)) requires both the aforementioned matrix multiplication, which maps information between layers, and a suitable nonlinear activation function, which is applied individually to each neuron. Whilst this nonlinearity has so far been mostly applied digitally in hybrid optical-electronic systems [9, 17, 18] - at the cost of repeatedly measuring and generating the optical state - recent work has also realised optical nonlinearities [15, 19].

However, obtaining and backpropagating the loss-function gradients (steps (2)-(3)) remains an outstanding problem in an optical setting. Whilst backpropagating through the linear interconnection between layers is rather straightforward, as linear optical operations are naturally bidirectional, the nonlinearity of neurons is a challenge. This is because the backwards-propagating signal must be modulated by the derivatives of the activation function of each neuron at its current input value, and these derivatives are not readily available in an ONN.

In 1987, Wagner et al. suggested that a feedforward ONN could be implemented and trained by using Fabry-Perot etalons to approximate the required forwards and backwards response of a sigmoid nonlinearity [20]. However, this backpropagation approach was never realised, or even analysed in detail, largely due to its inherent experimental complexity, with a subsequent ONN demonstration instead using digitally calculated errors [21]. A further approach to an optically-trained feedforward network was proposed by Cruz-Cabrera et al. [22]. They used a highly non-standard network architecture that transforms a "continuum of neurons" (a wavefront) as it passes through a nonlinear crystal using cross-phase modulation with a secondary "weight" beam. In a proof-of-concept experiment, the learning of two-bit logic was demonstrated.
An additional challenge is to map from the gradients with respect to the (platform-agnostic) weight matrices to the physical parameters that control these matrices in a specific ONN platform. In 2018, Hughes et al. [17] proposed an elegant method to directly obtain the gradients of these control parameters by an additional forward-propagating step. However, this scheme assumes computing the derivatives of the activation functions digitally and applying them to the backpropagating signal electro-optically. An extensive review of these and other related works can be found in the Supplemental Material.

This work directly addresses the issue of optical backpropagation through nonlinear units in a manner that is both consistent with modern neural network architectures and compatible with leading ONN proposals. We consider an optical nonlinearity based on saturable absorption (SA) and show that, with the forward-propagating features and the backward-propagating errors taking the roles of pump and probe respectively, backpropagation can be realised using only passive optical elements. Our method is effective and surprisingly simple - with the required optical operations for both forwards and backwards propagation realised using the same physical elements. Simulations with physically realistic parameters show that the proposed scheme can train networks to performance levels equivalent to state-of-the-art ANNs. When combined with optical calculation of the error term at the output layer via interference, this presents a path to the all-optical training of ONNs.

II. IMPLEMENTING OPTICAL BACKPROPAGATION

We begin by recapping the operation of a neural network before discussing optical implementations. Seeded with data at the input layer (a^(0)), forward propagation maps the neuron activations from layer l - 1 to the neuron inputs at layer l as

    z_j^{(l)} = \sum_i w_{ji}^{(l)} a_i^{(l-1)}    (1)

via a weight matrix w^(l), before applying a nonlinear activation function individually to each neuron, a_j^{(l)} = g(z_j^{(l)}) (with subscripts labelling individual neurons). At the output layer we evaluate the loss function, L, and calculate its gradient with respect to the weights,

    \frac{\partial \mathcal{L}}{\partial w_{ji}^{(l)}} = \frac{\partial \mathcal{L}}{\partial z_j^{(l)}} \frac{\partial z_j^{(l)}}{\partial w_{ji}^{(l)}} = \delta_j^{(l)} a_i^{(l-1)},    (2)

where δ_j^{(l)} ≡ ∂L/∂z_j^{(l)} is commonly referred to as the 'error' at the j-th neuron in the l-th layer. From the chain rule we have

    \delta_j^{(l)} = \sum_k \frac{\partial \mathcal{L}}{\partial z_k^{(l+1)}} \frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}} = g'(z_j^{(l)}) \, \rho_j^{(l+1)},    (3)

where ρ_j^{(l+1)} = Σ_k δ_k^{(l+1)} w_{kj}^{(l+1)}. Given the error at the output layer, i.e. δ^(L), which is calculated directly from the loss function, the errors δ^(L-1), ..., δ^(1) for all preceding layers are sequentially found using Eq. (3). These errors, as well as the activations a^(l-1) of all neurons, allow one to find the gradients (2) of the loss function with respect to all the weights, and hence apply gradient descent.

Fig. 1. ONN with all-optical forward- and backward-propagation. a, A single ONN layer, which consists of weighted interconnections and a SA nonlinear activation function. The forward- (red) and backward-propagating (orange dashed) optical signals are tapped off by beam splitters and measured, using reference beams, E_ref, to find the neuron activations, a^(l), and errors, δ^(l). b, Error calculation at the output layer performed optically or digitally as described in the main text.
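To make the quantities in Eqs. (1)-(3) concrete, the per-layer operations can be summarised in a few lines of NumPy. This is a minimal illustrative sketch of the standard algorithm, not code from the authors' implementation; the function names and the generic activation g are ours.

```python
import numpy as np

def forward_layer(w, a_prev, g):
    """Eq. (1): weighted interconnection, then elementwise activation."""
    z = w @ a_prev                 # z_j = sum_i w_ji a_i
    return z, g(z)

def backward_layer(w_next, delta_next, z, g_prime):
    """Eq. (3): linear back-projection through the same weights,
    then modulation by the activation derivative."""
    rho = w_next.T @ delta_next    # rho_j = sum_k delta_k w_kj
    return g_prime(z) * rho        # delta_j = g'(z_j) rho_j

def weight_gradient(delta, a_prev):
    """Eq. (2): dL/dw_ji = delta_j a_i."""
    return np.outer(delta, a_prev)
```

In the optical scheme described next, the two matrix products are performed by the same physical interconnect in opposite directions; the factor g'(z) is the only operation without an obvious passive implementation.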
The transformation (1) is readily implemented as a linear optical (interferometric) operation, with the neurons represented by real-valued field amplitudes in different spatial modes [8]. Remarkably, calculating ρ^(l+1) in the right-hand side of the backpropagation equation (3) involves the same weight matrix, meaning that it can be implemented by physical backward propagation of an optical signal through the same linear optical arrangement [21], as shown in Fig. 1(a). However, multiplying this signal by the derivative of the activation function, g'(z^(l)), is a challenge without invoking digital electronics.

To address this challenge, we require an optical implementation of the activation function with the following features: (i) nonlinear response for the forward input; (ii) linear response for the backward input; (iii) modulation of the backward input with the derivative of the nonlinear function. While it is natural to use nonlinear optics for this purpose, it is difficult to satisfy the requirement that the unit must respond differently to forward- and backward-propagating light. Here, we show that this problem can be addressed using saturable absorption in the well-known pump-probe configuration.

Consider passing a strong pump, E_P, and a weak probe, E_pr, through a two-level medium (e.g. atomic vapour). The pump transmission is then a nonlinear function of the input,

    E_{P,out} = g(E_{P,in}) = \exp\!\left( \frac{-\alpha_0/2}{1+E_{P,in}^2} \right) E_{P,in},    (4)

where α_0 is the resonant optical depth and all fields are assumed to be normalised by the saturation threshold. Fig. 2(a) plots the pump transmission g(·) at α_0 of 1 and 30. High optical depth induces strong nonlinearity in the unsaturated region, and a sufficiently strong pump renders the medium nearly transparent in the saturated region. A suitably weak probe, on the other hand, does not modify the transmissivity of the atomic medium, and hence experiences linear absorption with the absorption coefficient determined by the pump,

    \frac{E_{pr,out}}{E_{pr,in}} = \exp\!\left( \frac{-\alpha_0/2}{1+E_{P,in}^2} \right).    (5)

Note that both beams are assumed to be resonant with the atomic transition and so, as the phase of the electric field is unchanged, we treat these as real-valued without a loss of generality. Therefore, with the pump and probe taking the roles of forward-propagating signal and backward-propagating error in an ONN, required features (i) and (ii) of our optical nonlinear unit are met.
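The pump and probe responses of Eqs. (4) and (5) are straightforward to evaluate numerically; a minimal sketch (fields normalised by the saturation threshold, as in the text; function names are ours):

```python
import numpy as np

def pump_transmission(E_in, alpha0):
    """Eq. (4): nonlinear transmission of the strong forward pump."""
    return np.exp(-0.5 * alpha0 / (1 + E_in**2)) * E_in

def probe_transmissivity(E_pump_in, alpha0):
    """Eq. (5): linear transmissivity experienced by a weak backward
    probe, set entirely by the co-propagating pump amplitude."""
    return np.exp(-0.5 * alpha0 / (1 + E_pump_in**2))
```

Evaluating pump_transmission over a range of inputs reproduces the behaviour of Fig. 2(a): strong nonlinearity in the unsaturated region and near-transparency deep in the saturated region.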
Condition (iii), however, remains to be satisfied. The derivative of the pump transmission is

    g'(E_{P,in}) = \left[ 1 + \frac{\alpha_0 E_{P,in}^2}{\left(1+E_{P,in}^2\right)^2} \right] \exp\!\left( \frac{-\alpha_0/2}{1+E_{P,in}^2} \right).    (6)

The derivatives at α_0 of 1 and 30 are plotted in Fig. 2(b).

Fig. 2. Saturable absorber response. The transmission (a) and the transmission derivative (b) of a SA unit with optical depths of 1 (left) and 30 (right), as defined by Eqs. (4) and (6), respectively. Also shown in (b) are the actual probe transmissions given by Eq. (5), which approximate the derivatives, with and without the rescaling. The scaling factors are 1.2 (left) and 2.5 (right). Region (i) is the unsaturated (nonlinear) region exhibiting strong nonlinearity, and region (ii) is the saturated (linear) region.

Our key insight is that in many instances the square-bracketed factor in Eq. (6) can be considered constant, in which case the backpropagation transmission of (5) is a good approximation of the desired response (6) up to a constant factor. Feature (iii) is then satisfied because a constant scaling of the network gradients can be absorbed into the learning rate. This may appear a coarse approximation; however, as we will see in the next section, it is only required to hold within the nonlinear region of the SA response, which is the case for our system [Fig. 2(b)].
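The quality of this constant-factor approximation can be checked directly: below, the exact derivative of Eq. (6) is compared with a rescaled probe transmissivity of Eq. (5) over the unsaturated region. The choice of window and the least-squares-style rescaling are our illustrative assumptions, not the procedure used in the paper.

```python
import numpy as np

def g_prime_exact(E, alpha0):
    """Eq. (6): exact derivative of the pump transmission."""
    T = np.exp(-0.5 * alpha0 / (1 + E**2))
    return (1 + alpha0 * E**2 / (1 + E**2)**2) * T

alpha0 = 10.0
E = np.linspace(-3.0, 3.0, 601)                 # nonlinear region (i), illustrative
probe = np.exp(-0.5 * alpha0 / (1 + E**2))      # Eq. (5)
scale = np.sum(probe * g_prime_exact(E, alpha0)) / np.sum(probe**2)
residual = np.max(np.abs(scale * probe - g_prime_exact(E, alpha0)))
print(f"best constant rescaling ~ {scale:.2f}, max residual ~ {residual:.2f}")
```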
The proposed scheme can be implemented on either integrated or free-space platforms. In the integrated setting, optical interference units that combine integrated phase-shifters and attenuators to realise intra-layer weights have been demonstrated [9], as has, separately, on-chip SA through atomic vapour [23, 24] and other nonlinear media [19, 25]. A free-space implementation of the required matrix multiplication can be achieved using a spatial light modulator (SLM) [8], with the nonlinear unit provided by a standard atomic vapour cell. In the integrated case, an additional nontrivial step to map the weight gradients (2) to suitable updates of the control parameters (i.e. phase-shifters and attenuators) is required; however, this challenge was recently addressed by Hughes et al. [17]. In a free-space implementation, by contrast, discrete blocks of SLM pixels directly control individual weights, so the update calculation is more straightforward (see Supplemental Material for details).

Regardless of the chosen platform, passive optical elements can only implement weighted connections that satisfy conservation of energy. For networks with a single layer of nonlinear activations, this is not a practical limitation, as the weight matrices can be effectively realised with normalised weights by correspondingly rescaling the neuron activations in the input layer. For deep networks with multiple layers, absorption through the vapour cell will reduce the field amplitude available to subsequent layers. This can be counteracted by inter-layer amplification using, for example, semiconductor optical amplifiers [26].

In our proposed ONN, the only parts that require electronics are (a) real-valued homo- or heterodyne measurements of the tapped-off neuron activations (a^(l)) and error terms (δ^(l)) at each layer, (b) generating the network input and reference beams, and (c) updating the weights. In practice, the update (c) is calculated not for each individual training set element, but as an average over multiple elements (a "mini-batch"), hence the speed of this operation is not critical for the ONN performance. Generating the inputs and targets is decoupled from the calculation performed by the ONN and requires fast optical modulators, which are abundant on the market.

Finally, the measurements (a) must be followed by calculating the product δ_j^(l) a_i^(l-1) and averaging over the mini-batch. This operation can be implemented using electronic gate arrays. For a network with L layers of N neurons, this requires 2LN measurements and LN² offline multiplications. Alternatively, the multiplication can be realised by direct optical interference of the two signals with each other, followed by intensity measurement. The optical multiplication would require phase stability between the forward- and backward-propagating beams and the additional overhead of 2LN² photodetectors, but eliminates the need for reference beams and offline multiplications.

The primary latencies associated with the optical propagation of the signal in the ONN are due to the bandwidths of the SAs and intra-layer amplifiers. Further processing speed limitations are present in the photodetection and multiplication of δ_j^(l) a_i^(l-1), as well as in the conversion of the computed weight matrix gradients to their actuators within the ONN [17]. This latter conversion, however, occurs once per training batch, so this limitation can be amortised by using large batches.

The remaining, not yet discussed, element of the ONN training is the calculation and re-injection of the error δ^(L) at the output layer, to initiate backpropagation. To implement this optically, we train the ONN with the mean-squared-error loss function,

    \mathcal{L} = \sum_i \frac{1}{2} \left( z_i^{(L)} - t_i \right)^2,    (7)

where t_i is the target value for the i-th output neuron. This loss function implies δ_i^(L) = ∂L/∂z_i^(L) = z_i^(L) - t_i, which is calculable by interference of the network outputs with the target outputs on a balanced beam splitter. This approach to error calculation is illustrated in the right panel of Fig. 1(b), whereas the left panel shows the standard approach in which the errors are calculated offline (electronically).
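Putting the output-layer error of Eq. (7) together with the mini-batch-averaged update of Eq. (2) gives the only offline arithmetic the scheme requires; a minimal sketch (function names are ours):

```python
import numpy as np

def output_error(z_out, t):
    """MSE loss, Eq. (7): delta_i = z_i - t_i. Optically, this difference
    is produced by interfering output and target on a balanced beam splitter."""
    return z_out - t

def batch_weight_update(deltas, activations, lr):
    """Eq. (2) averaged over a mini-batch of tapped-off measurements:
    deltas and activations are lists of per-example vectors at one layer."""
    grad = np.mean([np.outer(d, a) for d, a in zip(deltas, activations)], axis=0)
    return -lr * grad
```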
III. EXAMINING APPROXIMATION ERRORS

To investigate our proposed backpropagation scheme, and in particular how our approximated derivatives affect network performance, we consider the canonical ML task of image classification. Our first set of numerical experiments is to classify images of handwritten digits from 0 to 9. We use the MNIST [27] dataset, which contains greyscale bitmaps of size 28 × 28 that are fed into the input layer of the ONN. The output layer contains 10 neurons whose target values are 0 or 1 depending on the digit encoded in the bitmap ("one-hot encoding"). In this section we use a network architecture with a single 128-neuron hidden layer, as shown in Fig. 3(a). Further details of the networks, training and calculation of the accuracy metric for all experiments presented in this work can be found in Appendix A.

Initially, we consider the activation function to be provided by SA with an optical depth of α_0 = 10. For the chosen network architecture, this provides (97.3 ± 0.1)% classification accuracy after training, with no difference in performance regardless of whether the true derivatives (Eq. (6)) or the optically-obtainable derivative approximations are used. From Fig. 3(b) we can see that during training the neurons are primarily distributed in the unsaturated region of the SA activation function. This is a consequence of the fact that the expressive capacity of neural networks arises from the nonlinearity of its neurons. Therefore, to train the network, the optically-obtained derivatives need to approximate the exact derivatives (up to a fixed scaling, as previously discussed) in only this nonlinear region.

It is interesting to investigate how training is affected by imprecision in the derivatives used. To this end, we evaluate the network performance by replacing the derivative g'(·) with random functions of varying similarity to the true derivative within the nonlinear region (the quantitative measure, S, of the similarity is defined in Appendix B); a sketch of this substitution is given below. From Fig. 3(d) we see that the performance appears robust to approximation errors, defined as 1 - S, of up to ~15%. We explain this potentially surprising observation by noting that gradient descent will converge even if the update vector deviates from the direction towards the exact minimum of the loss function, so long as this deviation is not too significant.

In the case of SA, i.e. when the approximate derivatives given by Eq. (5) are used, this error saturates at ~10% for increasing optical depth (see Fig. 3(e)), so no significant detrimental effect on the training accuracy can be expected. These results suggest that our scheme would still be effective in a noisy experimental setting and that the approach studied here may function well for a broad range of optical nonlinearities.

Fig. 3. Effects of imperfect approximation of the activation function derivative. a, Feed-forward neural network architecture using a single hidden layer of 128 neurons. b, Distribution of neuron inputs (E_{P,in}^{(1)} ≡ z^{(1)}), which is concentrated in the unsaturated region (i) of the SA activation function, g(·). As a result, the approximation error in the linear region (ii) is less impactful on the training. c, The transmission of a SA unit with α_0 = 10, along with the exact and (rescaled for easier comparison) optically approximated transmission derivatives. d, Performance loss associated with approximating activation function derivatives g'(·) with random functions, plotted as a function of the approximation error, for α_0 = 10 (see Appendix B for details). e, Average error of the derivative approximation (5) as a function of the optical depth of a SA nonlinearity.
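A sketch of the surrogate-derivative substitution follows. The generation procedure mirrors the description in Appendix B (a random array mirrored for symmetry, then smoothly interpolated); the use of SciPy's PchipInterpolator as the shape-preserving interpolant, and the knot count and support, are our assumptions.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def random_symmetric_derivative(n_knots=12, support=10.0, seed=0):
    """A smooth, symmetric random surrogate f(z) for the derivative g'(z)."""
    rng = np.random.default_rng(seed)
    half = rng.uniform(0.0, 1.0, n_knots)
    values = np.concatenate([half, half[::-1]])    # mirror for symmetry
    knots = np.linspace(-support, support, 2 * n_knots)
    return PchipInterpolator(knots, values)        # shape-preserving

# During training, f(z) simply replaces g'(z) in the backward pass (Eq. (3)).
f = random_symmetric_derivative(seed=42)
```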
IV. CASE STUDY: IMAGE CLASSIFICATION

Thus far we have only used a simple network architecture to examine our derivative approximation; however, we now consider how ONNs with SA nonlinearities compare to state-of-the-art ANNs. To do this, we use deeper network architectures for a range of image classification tasks. To obtain a comparison benchmark, we computationally train ANNs with equivalent architectures using standard best practices. Concretely, for ANNs we use ReLU (rectified linear unit) activation functions, defined as g_ReLU(z) = max(0, z), and the categorical cross-entropy loss function, defined as L = -Σ_i t_i log(p_i), where p_i = exp(z_i^(L)) / Σ_k exp(z_k^(L)) is the softmax probability distribution of the network output (see Appendix A for a discussion of the different choice of loss function for ANNs and ONNs).

To begin, we use a network with two 128-neuron hidden layers, as shown in Fig. 4(a)(i), and, once again, consider the MNIST dataset. Fig. 4(a)(ii) compares the simulated performance of the optical and benchmark networks. The ReLU-based classifier achieves an accuracy of (98.0 ± 0.2)%, which provides an approximate upper bound on the achievable performance of this network architecture for the chosen task [28]. An optical network with an optical depth of α_0 = 30 exactly matches this level of performance with a (98.0 ± 0.2)% classification accuracy. As an additional benchmark, we train the optical network using the exact derivative, Eq. (6), of the activation function, obtaining a similar accuracy of (98.1 ± 0.3)%. The convergence speed to near-optimum performance during training is unchanged across all of these networks.

Fig. 4(a)(iii) shows the trained performance of optical networks as a function of the optical depth, which essentially determines the degree of nonlinearity of the transmission function. As α_0 → 0, our network can only learn linear functions of the input, which restricts the classification accuracy to (85.7 ± 0.4)%. For larger optical depths the performance of the network improves, with the strong performance observed at α_0 = 1 increasing to near-optimal levels once α_0 ≥ 10, which is readily obtainable experimentally. Eventually, for α_0 ≥ 30, we start to see the performance of the approximated derivatives reduced, although high accuracy is still obtained. This can be attributed to the increasing approximation errors associated with high optical depths (see Fig. 3(e)), which, as previously discussed, accumulate in the deeper network architecture.

To probe the limits of the achievable performance using SA nonlinearities and optical backpropagation, we also consider the more challenging Kuzushiji-MNIST [29] (KMNIST) and Extended-MNIST [30] (EMNIST) datasets. For these applications we use a deep network architecture with convolutional layers (see Appendix A for details), as illustrated in Fig. 4(b)(i), which significantly increases the achievable classification accuracy to a level approaching the state-of-the-art. Whilst not the focus of this work, we emphasise that convolutional operations are readily achievable with optics. Current research into convolutional ONNs either directly leverages imaging systems [31] or decomposes the required convolution into optical matrix multiplication [32-34].
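For reference, the convolutional architecture of Fig. 4(b)(i) can be written down in a few lines; the sketch below follows the details given in Appendix A (5 × 5 kernels, stride 1, no padding, pooling with kernel size and stride of 2, a 128-neuron hidden layer). The PyTorch framing and the factory-function name are ours, not the authors' released code.

```python
import torch.nn as nn

def conv_classifier(activation, n_classes=10, mean_pool=False):
    """Fig. 4(b)(i): 28x28x1 -> 12x12x32 -> 4x4x64 -> 1024 -> 128 -> n_classes.
    Mean-pooling (a linear optical operation) replaces max-pooling for ONNs."""
    Pool = nn.AvgPool2d if mean_pool else nn.MaxPool2d
    return nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=5), activation, Pool(2),
        nn.Conv2d(32, 64, kernel_size=5), activation, Pool(2),
        nn.Flatten(),
        nn.Linear(4 * 4 * 64, 128), activation,
        nn.Linear(128, n_classes),
    )
```

For example, conv_classifier(nn.ReLU(), n_classes=47) would give the EMNIST benchmark; an SA activation would be supplied as a custom module implementing Eq. (4).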
Fig. 4. Performance on image classification. a, (i) The fully-connected network architecture. (ii) Learning curves for the SA (with either exact or approximated derivatives) and benchmark ReLU networks. (iii) The final classification accuracy achieved as a function of the optical depth, α_0, of the SA cell. b, (i) The convolutional network architecture. Sequential convolution layers of 32 and 64 channels convert a 28 × 28 pixel image into a 1024-dimensional feature vector, which is then classified (into N_C = 10 classes for MNIST and KMNIST, and N_C = 47 classes for EMNIST) by fully-connected layers. Pooling layers are not shown for simplicity. (ii) Classification accuracy of convolutional networks when using various activation functions. The same deep network architecture is applied to all datasets, but the SA networks use mean-pooling while the benchmark networks use max-pooling. The last row shows the performance of a simple linear classifier as a baseline.

g(·) \ Dataset  | MNIST          | KMNIST         | EMNIST
SA (approx)     | (99.3 ± 0.1)%  | (95.4 ± 0.3)%  | (87.9 ± 0.4)%
SA (exact)      | (99.4 ± 0.1)%  | (96.3 ± 0.1)%  | (88.1 ± 0.3)%
ReLU            | (99.3 ± 0.1)%  | (96.1 ± 0.3)%  | (88.6 ± 0.2)%
Tanh            | (99.2 ± 0.1)%  | (95.6 ± 0.2)%  | (87.5 ± 0.2)%
Sigmoid         | (99.0 ± 0.1)%  | (95.8 ± 0.2)%  | (87.7 ± 0.3)%
Linear Class.   | (92.3 ± 0.2)%  | (69.6 ± 0.7)%  | (67.8 ± 0.3)%

In addition to convolutional layers, convolutional neural networks also contain pooling layers, which locally aggregate neuron activations. The common implementation of these is max-pooling; however, this operation does not readily translate to an optical setting. Therefore, for ONNs we deploy mean-pooling, where the activation of neurons is locally averaged, which is a straightforward linear optical operation. In contrast, our benchmark ANNs utilise max-pooling.

Fig. 4(b)(ii) compares the obtained performance with SA nonlinearities (with α_0 = 10) to that achieved with benchmark ANNs that use various standard activation functions. We see an equivalent level of performance, despite the approximation in the backpropagation phase. This result suggests that all-optical backpropagation can be utilised to train sophisticated networks to state-of-the-art levels of performance.

V. DISCUSSION

This work presents an effective and surprisingly simple approach to achieving optical backpropagation through nonlinear units in a neural network - an outstanding challenge in the pursuit of truly all-optical networks. With our scheme, the information propagates through the network in both directions without interconversion between optical and electronic form. The role of digital electronics is reduced to the preparation of the network input, photodetection and updating the network parameters. In these elements of the network, the conversion speed is not critical, particularly for large batches of training data.

The scheme is compatible with a variety of ONN platforms; in the Supplemental Material, we discuss integrated and free-space implementations. We also anticipate that a broader class of nonlinear optical phenomena can be used to implement the activation function. For example, one could consider directly using saturation of intra-layer amplifiers for this role, circumventing the need for SA units entirely. A preliminary numerical experiment to this effect is discussed in the Supplemental Material.
Therefore, as well as presenting a path towards the end-to-end optical training of neural networks, this work sets out an important consideration for nonlinearities in the design of analog neural networks of any nature.

ACKNOWLEDGEMENTS

A.L.'s research is partially supported by the Russian Science Foundation (19-71-10092). X.G. acknowledges funding from the University of Electronic Science and Technology of China. A.L. thanks William Andregg for introducing him to ONNs.

AUTHOR CONTRIBUTIONS

The research was conceived by X.G. and A.L. X.G. proposed the idea of using SA to implement the activation function. T.D.B. and X.G. jointly performed the simulations. The manuscript was prepared by T.D.B., X.G. and A.L. All work was done under the supervision of A.L. and Z.M.W.

Appendix A: Network details

1. Image datasets

We consider three different datasets, all containing 28 × 28 pixel greyscale images: MNIST [27], Kuzushiji-MNIST (KMNIST) [29] and Extended-MNIST (EMNIST) [30]. MNIST corresponds to handwritten digits from 0 to 9, KMNIST contains 10 classes of handwritten Japanese cursive characters, and we use the EMNIST Balanced dataset, which contains 47 classes of handwritten digits and letters. MNIST and KMNIST have 70 000 images in total, split into 60 000 training and 10 000 test instances. EMNIST has 131 600 images, with 112 800 (18 800) training (test) instances. For all datasets, the training and testing sets have all classes equally represented.

2. Network architectures

The fully-connected network we train to classify MNIST (corresponding to the results in Fig. 4(a)) first unrolls each image into a 784-dimensional input vector, before two 128-neuron hidden layers and a 10-neuron output layer. The convolutional network depicted in Fig. 4(b)(i) has two convolutional layers of 32 and 64 channels, respectively. Each layer convolves the input with 5 × 5 filters (with a stride of 1 and no padding), followed by a nonlinear activation function and finally a pooling operation (with both kernel size and stride of 2). After the convolutional network, classification is carried out by a fully-connected network with a single 128-neuron hidden layer and an N_C-neuron output layer, where N_C is the number of classes in the target dataset. Multilayer ONNs are assumed to have the same optical depth of their saturable absorbers in all layers.

3. Network loss function

As stated in the main text, we train ONNs using the mean-squared-error (MSE) loss function, whereas the ANN baselines use categorical cross-entropy (CCE). This choice was made as the gradients of the MSE loss are readily calculable in an optical setting, whereas the softmax operation in CCE would require offline calculation; the contrast is sketched below. However, our ANNs use CCE as this is the standard choice for classification problems in the deep learning community. For completeness, we re-trained our ANN baselines for MNIST classification using MSE. The fully-connected classifier (Fig. 4(a)(i)) provided a classification accuracy of (98.0 ± 0.2)%, while the convolutional classifier (Fig. 4(b)(i)), using ReLU nonlinearities, scored (99.5 ± 0.1)%. In both cases, the performance of MSE is essentially equivalent to that of CCE.
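The contrast between the two losses is visible directly in their output-layer errors; a minimal sketch (ours) of the quantity δ^(L) for each:

```python
import numpy as np

def delta_mse(z_out, t):
    """MSE: delta = z - t, obtainable by optical interference with the target."""
    return z_out - t

def delta_cce(z_out, t):
    """CCE with softmax: delta = p - t, where p = softmax(z).
    The exponentials and normalisation require offline (digital) evaluation."""
    p = np.exp(z_out - z_out.max())
    p /= p.sum()
    return p - t
```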
4. Network training

All networks are trained with a mini-batch size of 64. We used the Adam optimiser with a learning rate of 5 × 10^-4, independent of the optical depth of the SA. For each network, the test images of the target dataset are split evenly into a 'validation' and a 'test' set. After every epoch, the performance of the network is evaluated on the held-out 'validation' images. The best ONN parameters found over training are then used to verify the performance on the 'test' set. Therefore, learning curves showing the performance during training (i.e. Fig. 4(a)(ii)) are plotted with respect to the 'validation' set, with all other reported results corresponding to the 'test' set. The fully-connected networks were trained on MNIST for 50 epochs. The convolutional networks are trained for 20 epochs when using ReLU, Tanh or Sigmoid nonlinearities, and 40 epochs when using SA nonlinearities.

Training performance is empirically observed to be sensitive to the initialisation of the weights, which we ascribe to the small derivatives away from the nonlinear region of the SA response curve. For low optical depths, α_0 ≤ 30, all layers are initialised as a normal distribution of width 0.1 centred around 0. For higher optical depths, the weights of the fully-connected ONN shown in Fig. 4(a) are initialised to a double-peaked distribution comprised of two normal distributions of width 0.15 centred at ±0.15. We do not constrain our weight matrices during training because, as discussed in the main text, conservation of energy can always be satisfied by rescaling the input power or output threshold for the first and last linear transformations and using intra-layer amplifiers in deeper architectures.

For all images, the input is rescaled to be between 0 and 1 (which practically would correspond to 0 ≤ E_{P,in}^{(0)} ≤ 1) when passing it to a network with computational nonlinearities (i.e. ReLU, Sigmoid or Tanh). Due to 'absorption' in networks with SA nonlinearities, we empirically observe that rescaling the input data to higher values results in faster convergence when training convolutional networks with multiple hidden layers. Therefore, the fully-connected networks in Fig. 4(a) use inputs between 0 and 1, and the convolutional networks in Fig. 4(b) use inputs normalised between 0 and 5 (15) for α_0 ≤ 10 (α_0 > 10).

Appendix B: Calculation of derivative approximation error

As discussed in the main text, we approximate the true derivatives g'(·) of the activation functions by random functions f(·) to test the effect of the approximation error on training. Here we discuss how these functions are generated and how the similarity measure is defined. The response of a saturable absorption nonlinearity can be considered in two regimes, nonlinear (unsaturated) and linear (saturated), which are labelled (i) and (ii) in Fig. 2, respectively. During the network training, the neuron input values (z_j^(l)) are primarily distributed in the nonlinear region, as seen in Fig. 3(b) and discussed in the main text. Therefore, we model the neuron input as a Gaussian distribution within this region,

    p(z) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{z^2}{2\sigma^2} \right),    (B1)

where 2σ is the width of region (i). We then define the similarity as the reweighted normalised scalar product between the accurate and approximate derivatives,

    S = \frac{\left| \int f(z)\, g'(z)\, p(z)\, dz \right|^2}{\int [f(z)]^2 p(z)\, dz \cdot \int [g'(z)]^2 p(z)\, dz}.    (B2)

According to the Cauchy-Schwarz inequality, S is bounded by 1 and therefore so is the average approximation error, 1 - S.
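Eq. (B2) is easy to evaluate numerically; a sketch using a simple quadrature over the Gaussian weight of Eq. (B1). The integration window and grid density below are our choices.

```python
import numpy as np

def similarity(f, g_prime, sigma, n=2001):
    """Eq. (B2): reweighted, normalised scalar product of f and g'."""
    z = np.linspace(-5 * sigma, 5 * sigma, n)
    p = np.exp(-z**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)  # Eq. (B1)
    num = np.trapz(f(z) * g_prime(z) * p, z) ** 2
    den = np.trapz(f(z)**2 * p, z) * np.trapz(g_prime(z)**2 * p, z)
    return num / den   # bounded by 1 (Cauchy-Schwarz)
```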
To obtain the results in Fig. 3(d-e), we generate 200 random functions for f, with different approximation errors. We first generate an array of pseudo-random numbers ranging from 0 to 1, concatenate it with the flipped array to make them symmetric like the derivative g'(·), and then use shape-preserving interpolation to obtain a smooth and symmetric random function. The network is then trained once with each of the generated f's.

Appendix C: Code availability

Source code has been made publicly available at https://github.com/tomdbar/all-optical-neural-networks.

[1] E. Cambria and B. White, Jumping NLP curves: A review of natural language processing research, IEEE Comput. Intell. Mag. 9, 48 (2014).
[2] W. Rawat and Z. Wang, Deep convolutional neural networks for image classification: A comprehensive review, Neural Comput. 29, 2352 (2017).
[3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Mastering the game of Go with deep neural networks and tree search, Nature 529, 484 (2016).
[4] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, Neural message passing for quantum chemistry, in Proceedings of the 34th International Conference on Machine Learning - Volume 70 (JMLR.org, 2017) pp. 1263-1272.
[5] G. Torlai, G. Mazzola, J. Carrasquilla, M. Troyer, R. Melko, and G. Carleo, Neural-network quantum state tomography, Nat. Phys. 14, 447 (2018).
[6] K. Hornik, M. Stinchcombe, and H. White, Multilayer feedforward networks are universal approximators, Neural Netw. 2, 359 (1989).
[7] G. Cybenko, Approximation by superpositions of a sigmoidal function, Math. Control Signals Syst. 2, 303 (1989).
[8] P. N. Tamura and J. C. Wyant, Two-dimensional matrix multiplication using coherent optical techniques, Opt. Eng. 18, 182198 (1979).
[9] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, et al., Deep learning with coherent nanophotonic circuits, Nat. Photonics 11, 441 (2017).
[10] L. De Marinis, M. Cococcioni, P. Castoldi, and N. Andriolli, Photonic neural networks: A survey, IEEE Access 7, 175827 (2019).
[11] Y. Abu-Mostafa and D. Psaltis, Optical neural computers, Sci. Am. 256, 88 (1987).
[12] S. Jutamulia and F. Yu, Overview of hybrid optical neural networks, Opt. Laser Technol. 28, 59 (1996).
[13] J. Bueno, S. Maktoobi, L. Froehly, I. Fischer, M. Jacquot, L. Larger, and D. Brunner, Reinforcement learning in a large-scale photonic recurrent neural network, Optica 5, 756 (2018).
[14] X. Lin, Y. Rivenson, N. T. Yardimci, M. Veli, Y. Luo, M. Jarrahi, and A. Ozcan, All-optical machine learning using diffractive deep neural networks, Science 361, 1004 (2018).
[15] Y. Zuo, B. Li, Y. Zhao, Y. Jiang, Y.-C. Chen, P. Chen, G.-B. Jo, J. Liu, and S. Du, All-optical neural network with nonlinear activation functions, Optica 6, 1132 (2019).
[16] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature 521, 436 (2015).
[17] T. W. Hughes, M. Minkov, Y. Shi, and S. Fan, Training of photonic neural networks through in situ backpropagation and gradient measurement, Optica 5, 864 (2018).
[18] I. A. D. Williamson, T. W. Hughes, M. Minkov, B. Bartlett, S. Pai, and S. Fan, Reprogrammable electro-optic nonlinear activation functions for optical neural networks, IEEE J. Sel. Top. Quantum Electron. 26, 1 (2020).
[19] Z. Cheng, H. K. Tsang, X. Wang, K. Xu, and J.-B. Xu, In-plane optical absorption and free carrier absorption in graphene-on-silicon waveguides, IEEE J. Sel. Top. Quantum Electron. 20, 43 (2013).
[20] K. Wagner and D. Psaltis, Multilayer optical learning networks, Appl. Opt. 26, 5061 (1987).
[21] D. Psaltis, D. Brady, and K. Wagner, Adaptive optical networks using photorefractive crystals, Appl. Opt. 27, 1752 (1988).
[22] A. A. Cruz-Cabrera, M. Yang, G. Cui, E. C. Behrman, J. E. Steck, and S. R. Skinner, Reinforcement and backpropagation training for an optical neural network using self-lensing effects, IEEE Trans. Neural Netw. 11, 1450 (2000).
[23] W. Yang, D. B. Conkey, B. Wu, D. Yin, A. R. Hawkins, and H. Schmidt, Atomic spectroscopy on a chip, Nat. Photonics 1, 331 (2007).
[24] R. Ritter, N. Gruhler, W. Pernice, H. Kübler, T. Pfau, and R. Löw, Atomic vapor spectroscopy in integrated photonic structures, Appl. Phys. Lett. 107, 041101 (2015).
[25] Q. Bao, H. Zhang, Y. Wang, Z. Ni, Y. Yan, Z. X. Shen, K. P. Loh, and D. Y. Tang, Atomic-layer graphene as a saturable absorber for ultrafast pulsed lasers, Adv. Funct. Mater. 19, 3077 (2009).
[26] M. J. Connelly, Semiconductor Optical Amplifiers (Springer Science & Business Media, 2007).
[27] Y. LeCun, C. Cortes, and C. Burges, MNIST handwritten digit database, ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2 (2010).
[28] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., Gradient-based learning applied to document recognition, Proc. IEEE 86, 2278 (1998).
[29] T. Clanuwat, M. Bober-Irizar, A. Kitamoto, A. Lamb, K. Yamamoto, and D. Ha, Deep learning for classical Japanese literature, arXiv:1812.01718 (2018).
[30] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, EMNIST: an extension of MNIST to handwritten letters, arXiv:1702.05373 (2017).
[31] J. Chang, V. Sitzmann, X. Dun, W. Heidrich, and G. Wetzstein, Hybrid optical-electronic convolutional neural networks with optimized diffractive optics for image classification, Sci. Rep. 8, 12324 (2018).
[32] H. Bagherian, S. Skirlo, Y. Shen, H. Meng, V. Ceperic, and M. Soljačić, On-chip optical convolutional neural networks, arXiv:1808.03303 (2018).
[33] S. Xu, J. Wang, R. Wang, J. Chen, and W. Zou, High-accuracy optical convolution unit architecture for convolutional neural networks by cascaded acousto-optical modulator arrays, Opt. Express 27, 19778 (2019).
[34] R. Hamerly, L. Bernstein, A. Sludds, M. Soljačić, and D. Englund, Large-scale optical neural networks based on photoelectric multiplication, Phys. Rev. X 9, 021032 (2019).

Supplemental Material: Backpropagation through nonlinear units for all-optical training of neural networks

Xianxin Guo, Thomas D. Barrett, Zhiming M. Wang, and A. I. Lvovsky
(Dated: October 9, 2020)

1. RELATED WORKS

Optics provides an attractive platform for the implementation of neural networks, and optical neural networks (ONNs) have been investigated for a long time. From the 1980s to the 1990s, most ONN research focused on the realisation of the Hopfield network, a simple type of recurrent neural network utilising binary neurons, which serves as an associative memory system [S1]. This network is designed to store multiple patterns in the fixed weights, and find the best match upon request with a new pattern input - even when the input only contains partial information of a stored pattern. In optoelectronic Hopfield networks, the neuron interconnection is realised with optics, for example via optical vector-matrix multiplication, with the threshold and feedback stages handled electronically [S2, S3]. In 1987, Abu-Mostafa and his colleagues at Caltech constructed an all-optical Hopfield network [S4]. They stored four patterns in a hologram, fabricated optical threshold units with nonlinear photorefractive material, and built an optical feedback loop. In that system the weights are encoded as the Bragg grating strengths of the hologram, and neuron interconnection is achieved by the Bragg selectivity. Although Hopfield networks were the first optically constructed and trained neural networks, their application is rather limited, and they represent a fundamentally different paradigm to our focus of optical feedforward networks. Compared to the Hopfield network, feedforward networks have richer structures to handle different tasks, and they are essential in today's deep learning technology.

In 1987, Wagner et al. [S5] proposed an all-optical multilayer feedforward network that utilised volume holograms for bidirectional neuron interconnection, Fabry-Perot etalons as the nonlinear unit, and optical backpropagation to train the network. Forward- and backward-propagating beams are orthogonally polarised and rotated in the hologram such that they interfere with the correct reference beams for the weighted interconnection in both directions. A bistable Fabry-Perot etalon simulates a sigmoid activation function for forward-propagating strong signals, and roughly approximates the derivative of the sigmoid for backward-propagating weak error fields. To mitigate the significant deviation between the backward etalon response and the exact sigmoid derivative, they suggested insertion of a birefringent plate or adoption of a dual-etalon scheme where the two etalons are optimised for forward and backward responses respectively. Later they demonstrated proof-of-principle holographic neuron interconnection in the forward direction and error-driven update of the hologram [S6]; however, the error was calculated digitally.
There has never been a full experimental demonstration of the proposed scheme, largely due to the inherent experimental complexity associated with the polarisation-switching bidirectional volume hologram and the intricate Fabry-Perot etalon design. In our work, whilst we also adopt a similar pump-probe scheme, we aim to simplify the optical backpropagation through the nonlinear unit. Our SA nonlinearity resembles the ReLU activation function, which has replaced sigmoid activations as the standard best practice in modern deep feedforward networks. The SA nonlinearity is also easy to realise with abundant material choices. Moreover, in our scheme the errors can be backpropagated through the same system without additional experimental complications.

In 1995, Skinner et al. [S7] proposed a special type of optically trainable feedforward network based on self-lensing media. The wavefront of the forward beam passing through a pattern mask is regarded as a continuum of input neurons, and the electric field coupling during free-space propagation realises the virtual neuron interconnection. The forward beam is then steered by additional "weight" beams in a layer of self-lensing media via cross-phase modulation, and propagated towards the detector. In this design, the weight beams can be trained by backpropagating the error beams through the same system. The experimental system was subsequently built and optically trained to classify logic gates [S8]. This is the first, and perhaps only, experimental demonstration of an optically-trained feedforward network. However, the network structure differs significantly from that of a regular computational one: there are no discrete neurons, the interconnection cannot be arbitrary since it is achieved via free-space propagation of light, and the nonlinearity is mediated by the "weights" through the phase.

In recent years, several groups rejoined the effort with advanced optical technologies and realised optical feedforward networks with integrated photonics [S9] and free-space diffraction [S10]. The use of atomic nonlinearities [S11] has also been demonstrated. Some of these ONNs were not designed to be trained optically: the training was to be implemented via a digital computer. In others, brute-force training is possible by updating parameters with the finite difference method - perturbing control parameters to directly compute the required gradients [S9] - yet this is extremely inefficient as compared to the backpropagation algorithm. A further challenge is that, while the backpropagation algorithm is designed to obtain gradients of the weight matrix for its subsequent updates, in ONNs these weights are accessed not directly, but through a set of actuators - for example, phase shifters in an integrated interferometer of Ref. [S9]. The mapping to these parameters can be computationally expensive. To tackle this problem, Hughes et al. [S12] recently proposed an in situ optical backpropagation and gradient measurement scheme.
They showed that gradients of the control parameters in these integrated photonic neural networks can be obtained from a series of forward and backward intensity measurements. Their method may also be extended to other ONN platforms since the derivation starts from the Maxwell equations. However, in their scheme, the nonlinear activation function is assumed to be applied digitally. The light has to be detected, electronically modulated, and re-injected between each pair of layers in both directions. Because our scheme considers backpropagation through nonlinear neuron activations, it is complementary to that of Ref. [S12] and could be applied in concert with it.

2. PHYSICAL IMPLEMENTATION OF THE OPTICAL TRAINING SCHEME

2.1. Bidirectional weight matrices

Bidirectional weighted interconnection of neurons can be experimentally realised with various methods. Here we describe two such methods: integrated photonics and free-space optics. A real-valued weight matrix can be factorised via singular value decomposition (SVD) into the form U Σ V†, where U and V are unitary matrices and Σ is a rectangular diagonal matrix. In optics, it is well known that any unitary matrix can be implemented with a set of Mach-Zehnder interferometers (MZIs) consisting of beam splitters and phase shifters [S13]. The diagonal matrix can be realised with optical attenuators; a numerical sketch of this factorisation is given below.

In integrated photonics, optical interference units (OIUs) with thermo-optical or electro-optical phase shifters together with integrated attenuators can be used to represent the weight matrix. A programmable OIU with 88 MZIs has been demonstrated [S14], and a two-layer network with four to five neurons at each layer has been realised. Increasing the network size is a challenging task since the required number of integrated components in OIUs scales quadratically with the neuron number. To update the weight matrix after obtaining the weight gradients, one still needs to map the new weights to phase shifter settings following the matrix transformation shown in [S13]. Alternatively, the in situ optical backpropagation scheme [S12] can be applied in conjunction with ours to obtain gradients of phase shifter permittivities optically. The update speed of the OIUs is determined by the phase shifter bandwidth, which is usually over 100 kHz for thermo-optical implementations [S15] and 10 GHz for electro-optical implementations with ferroelectric crystals [S16].
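The SVD factorisation referenced above is easily verified numerically; a sketch separating an arbitrary real weight matrix into two unitaries (realisable as MZI meshes [S13]) and a passive diagonal attenuation, with the residual scale acting as the required gain:

```python
import numpy as np

w = np.random.randn(4, 4)            # arbitrary real-valued weight matrix
U, s, Vh = np.linalg.svd(w)          # w = U @ diag(s) @ Vh
A = s.max()                          # gain absorbed by an amplifier stage
W_passive = U @ np.diag(s / A) @ Vh  # all singular values now <= 1
assert np.allclose(A * W_passive, w)
```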
In a free-space setting, weighted neuron interconnection can be realised with optical vector-matrix multiplication (VMM). Neuron values are encoded on the electric field of the propagating beam, and real-valued weight matrices can be encoded on liquid-crystal spatial light modulators (LC-SLMs) or digital micromirror devices (DMDs). Precise amplitude and phase control of light can be achieved by modulating the phase grating pattern of the LC-SLMs [S18]. Although DMDs are designed as binary amplitude modulators, multilevel control can be easily achieved by grouping multiple pixels as a unit block. Taking a block of 10 × 10 modulator pixels to represent a neuron/weight block, free-space ONNs with 200-400 neurons per layer can be built with currently available high-resolution LC-SLMs/DMDs. A coherent programmable VMM system can thus be constructed with cylindrical lenses performing 4F imaging and Fourier transforms [S17], as illustrated in Fig. S1. In this setting, only zero spatial frequency components at the output plane carry the correct VMM result, so the output beam has to pass through a narrow optical slit. To evaluate the power efficiency of the slit, we set the vector and matrix entries to one so that the output plane shows a sinc spectrum (assuming a square aperture of the system), from which we estimate that, with an average output accuracy of about 95%, the power efficiency of the slit is about 50%. Higher power efficiency can be obtained at the cost of lower accuracy.

FIG. S1: Free-space coherent optical VMM. The input vector a_i is prepared as a set of spatial modes distributed horizontally. Each of these modes initially diverges in the vertical (y) dimension until collimation by a cylindrical lens f_y1. The vector components are imaged in the horizontal (x) dimension by a pair of cylindrical lenses f_x1 and f_x2 onto the matrix mask plane. In this plane, the vector components are multiplied by the matrix elements w_ji, so the spatial configuration of the field after the matrix mask is given by w_ji a_i. A pair of cylindrical lenses f_y2, f_y3 realise 4F imaging of the matrix mask plane in the y dimension, and a cylindrical lens f_x3 realises a Fourier transform in the x dimension. A narrow slit along y is placed at the output plane to pass the near-zero spatial frequency components of the Fourier-transformed field, corresponding to the summation Σ_i w_ji a_i.

The DMD bandwidth is about 10 kHz, and the LC-SLM maximum bandwidth is sub-kHz [S19], hence the update speed of a VMM is slower than that of an OIU. An advantage of the free-space implementation, on the other hand, is that each weight in a VMM is independently controlled by a block of pixels on the LC-SLM or DMD. Therefore, the weight update can be implemented with weight gradients via a calibrated look-up table.

2.2. Saturable absorber

Saturable absorption is a common nonlinear optical phenomenon, and there are many different material choices for a saturable absorber in an ONN. An atomic vapour cell or a cold atomic cloud in a magneto-optical trap is a viable option in free space. Optical depths of α_0 ≳ 10 can be easily obtained. Although element-wise activation is needed in a neural network, one can accommodate multiple neurons/beams in a single atomic cloud or vapour cell. To prevent the beams from significant divergence inside the atomic medium, the Rayleigh length z_R = π w_0²/λ should be larger than the atomic sample thickness, which is typically on the order of a centimetre. Therefore, the beam waist w_0 in the atomic medium can be about 100 µm, where we take the resonant wavelength of the 87Rb D2 line transition. We can accommodate 100 neurons within a sample with a width of 2 cm; this geometry is checked numerically below. Atomic vapour cells can also be integrated on a silicon chip and coupled to integrated waveguides, as demonstrated in [S20, S21]. Optical depths of α_0 = 1 to α_0 = 2 have been achieved. Other saturable absorbers such as semiconductors or graphene layers, featuring low threshold and large modulation bandwidth [S22], may also be integrated into nanophotonic circuits [S23]. These optical depths allow the ONN to achieve strong performance as demonstrated in the main text.
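The beam-geometry argument can be checked with the quoted numbers; a sketch using the 87Rb D2 wavelength of 780 nm (the transition implied in the text) and a 200 µm beam pitch inferred from 100 neurons in a 2 cm cell:

```python
import numpy as np

wavelength = 780e-9                  # 87Rb D2 line (m)
w0 = 100e-6                          # beam waist in the atomic medium (m)
z_R = np.pi * w0**2 / wavelength     # Rayleigh length
print(f"Rayleigh length: {z_R * 100:.1f} cm")    # ~4 cm, exceeds a ~1 cm cell
print(f"neurons in a 2 cm cell at 200 um pitch: {int(0.02 / 200e-6)}")
```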
The optical depths quoted above allow the ONN to achieve strong performance, as demonstrated in the main text.

2.3. Intra-layer amplification

In our simulations we do not constrain the weights w^(l), but in a realistic passive physical system the weights are bounded, because the weighted interconnection must not violate energy conservation. To address this, in a physical implementation we place an amplifier of gain A^(l) in front of a passive weighted interconnection with a matrix W^(l), so that together they comprise the desired weight matrix w^(l) = A^(l) W^(l). In this section, we evaluate the necessary gain A^(l) of such an amplifier, specialising to the weight matrix w^(2) of the simulation in Fig. 4(a) of the main text.

FIG. S2: Lower (a) and upper (b) bounds on the amplifier gain required for network training with unconstrained weights.

We use two estimates to provide the lower and upper bounds of this required gain. For the lower bound, we note that energy conservation in a passive system implies that
$$\sum_j \big(W^{(l)}_{ij}\big)^2 \le 1, \qquad \sum_i \big(W^{(l)}_{ij}\big)^2 \le 1. \tag{S1}$$
In order to satisfy these conditions, the gain A must not be lower than $\max\big[\max_i \sum_j (w^{(l)}_{ij})^2,\ \max_j \sum_i (w^{(l)}_{ij})^2\big]$. This bound is plotted in Fig. S2(a). To estimate the upper bound, we take the square of the highest singular value Σ_max of the weight matrix w^(2): indeed, if A ≥ Σ_max, then no singular value of W^(2) exceeds 1, meaning that this matrix can be implemented as discussed in Sec. 2.1. We plot Σ²_max as a function of the training epoch number in Fig. S2(b). Semiconductor optical amplifiers (SOAs) can offer 30 dB amplification with a response time of hundreds of picoseconds, and can be integrated on waveguides [S24]. From the plot we see that at α₀ ∼ 30, one stage of power amplification with about 10 dB gain is needed; at lower optical depths, the required gain is generally smaller.

2.4. Optical power consumption

The optical power consumption of an ONN depends on the network architecture and implementation details. For concreteness, we now consider a fully-connected network with N = 1000 units per layer, with the SA optical nonlinearities implemented on the ⁸⁷Rb D₂ line. Recalling Fig. 3(b) from the main text, we note that during training the input power to each neuron is typically restricted to the unsaturated region (i) of the nonlinearity response. For the SA nonlinearities we consider, the saturation intensity is given by [S25]
$$I_{\rm sat} = \frac{\hbar\omega\Gamma}{2\sigma_0} = 16.6\ \mu\mathrm{W\,mm}^{-2}, \tag{S2}$$
where Γ = 2π × 6 MHz is the natural linewidth and σ₀ = 3λ²/(2π) is the resonant absorption cross-section. For beams with a waist of w₀ = 100 µm, this corresponds to a saturation power of P_sat ≈ 500 nW per neuron, and a total SA input power on the order of 500 µW. To saturate the SA, the optical pulse needs to be longer than the excited-state lifetime Γ⁻¹ = 26 ns. The energy cost of a single forward pass through the network is then on the order of a fraction of a nanojoule, and the backpropagation energy cost is negligible. Since a single interlayer transition involves a VMM with N² multiplications, the energy cost per floating-point operation can be estimated as less than a femtojoule.
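These numbers can be reproduced with a short back-of-the-envelope calculation. The sketch below assumes πw₀² as the effective beam area and a pulse of one excited-state lifetime per SA layer; both conventions are our assumptions for the estimate, not specifications from the text.

```python
import numpy as np

hbar = 1.054571817e-34      # reduced Planck constant [J s]
c = 2.99792458e8            # speed of light [m/s]
lam = 780.24e-9             # 87Rb D2 wavelength [m]
Gamma = 2 * np.pi * 6.07e6  # natural linewidth [rad/s]
w0 = 100e-6                 # beam waist [m]
N = 1000                    # neurons per layer

# Eq. (S2): I_sat = hbar * omega * Gamma / (2 * sigma_0).
omega = 2 * np.pi * c / lam
sigma0 = 3 * lam**2 / (2 * np.pi)            # resonant cross-section [m^2]
I_sat = hbar * omega * Gamma / (2 * sigma0)  # [W/m^2]; 1 W/m^2 = 1 uW/mm^2
print(f"I_sat = {I_sat:.1f} uW/mm^2")        # ~16.7, matching Eq. (S2) to rounding

# Saturation power per neuron and total SA input power.
P_sat = I_sat * np.pi * w0**2                # ~0.5 uW per neuron
P_total = N * P_sat                          # ~0.5 mW at the SA layer
tau = 1 / Gamma                              # excited-state lifetime, ~26 ns

E_pass = P_total * tau                       # energy per SA layer per pass
E_op = E_pass / N**2                         # energy per multiply in the N x N VMM
print(f"P_sat = {P_sat*1e9:.0f} nW, total = {P_total*1e6:.0f} uW")
print(f"E ~ {E_pass*1e12:.0f} pJ per layer, {E_op*1e18:.1f} aJ per operation")
```

With a few layers and pulses a few lifetimes long, the forward-pass energy indeed lands at a fraction of a nanojoule, and the per-operation cost stays far below a femtojoule.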
However, these estimates do not include peripheral energy costs of powering and sustaining the instruments and stabilising the system, so the actual power consumption can be expected to be significantly higher.

FIG. S3: Optical backpropagation through GS nonlinearity. (a) Fully-connected network architecture (784-128-128-10 neurons for the 28 × 28 MNIST images), which is the same as in Fig. 4(a) of the main text except for the nonlinearity. (b) Transmission and transmission derivatives (exact and approximate) of the GS unit with gain factor g₀ = 3. (c) Learning curves for the GS-based ONN and benchmark ReLU networks. (d) The final classification accuracy achieved as a function of the gain g₀.

3. OPTICAL BACKPROPAGATION WITH GAIN SATURATION

In optical amplifiers, gain saturation (GS) takes place when a sufficiently high input power depletes the excited state of the gain medium. In a two-level system, this process can be described similarly to saturable absorption, by simply replacing the optical depth term α₀ in Eq. (4) of the main text with a positive gain factor g₀. The transmission, together with its exact and optically-approximated derivatives, is plotted in Fig. S3(b) for g₀ = 3. The derivative curves have the inverted shapes of the SA derivative curves and resemble the derivative of the sigmoid function. We use this feature of the GS nonlinearity to implement optical backpropagation. To examine this, we replace the SA nonlinearity with the GS nonlinearity in the fully-connected network of Fig. 4(a) of the main text and repeat the optical training simulation. The MNIST image classification performance is shown in Fig. S3(c, d). High accuracy can be achieved with a gain factor as small as 1, and the best result scores (97.3 ± 0.1)% at g₀ = 3, slightly lower than that of the benchmark ReLU network and the SA-based ONN. Since the derivative approximation error of the GS nonlinearity is the same as that of the SA nonlinearity, the performance degradation is mainly attributed to the nonlinearity itself; higher performance may be achievable through careful hyperparameter tuning.
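Eq. (4) of the main text is not reproduced in this supplement, so as a stand-in the sketch below uses the textbook steady-state transmission of a two-level gain medium, with intensities normalised to I_sat; this form is our assumption, intended only to illustrate the saturating GS response and its exact input-output derivative.

```python
import numpy as np
from scipy.optimize import brentq

def gs_output(s_in, g0):
    """Normalised output intensity s_out of a saturable gain medium,
    assuming the steady-state two-level relation
        ln(s_out / s_in) + (s_out - s_in) = g0,
    i.e. saturable absorption with -alpha_0 replaced by +g0 (intensities
    in units of I_sat). This form stands in for Eq. (4) of the main text."""
    f = lambda s_out: np.log(s_out / s_in) + (s_out - s_in) - g0
    # The unsaturated gain exp(g0) brackets the solution from above.
    return brentq(f, s_in, s_in * np.exp(g0))

def gs_slope(s_in, s_out):
    """Exact d s_out / d s_in, by implicit differentiation of the relation."""
    return (s_out / s_in) * (1 + s_in) / (1 + s_out)

g0 = 3.0
for s_in in np.logspace(-2, 2, 5):
    s_out = gs_output(s_in, g0)
    # The transmission falls from exp(g0) ~ 20 toward 1, and the slope
    # (the quantity the pump-probe scheme approximates optically)
    # decreases from the small-signal gain to 1 as the medium saturates.
    print(f"s_in={s_in:7.2f}  T={s_out/s_in:6.2f}  slope={gs_slope(s_in, s_out):6.2f}")
```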
[S1] C. Denz, Optical Neural Networks (Springer Science and Business Media, 2013).
[S2] N. H. Farhat, D. Psaltis, A. Prata, and E. Paek, Optical implementation of the Hopfield model, Appl. Opt. 24, 1469 (1985).
[S3] T. S. Francis, T. Lu, X. Yang, and D. A. Gregory, Optical neural network with pocket-sized liquid-crystal televisions, Opt. Lett. 15, 863 (1990).
[S4] Y. S. Abu-Mostafa and D. Psaltis, Optical neural computers, Sci. Am. 256, 88 (1987).
[S5] K. Wagner and D. Psaltis, Multilayer optical learning networks, Appl. Opt. 26, 5061 (1987).
[S6] D. Psaltis, D. Brady, and K. Wagner, Adaptive optical networks using photorefractive crystals, Appl. Opt. 27, 1752 (1988).
[S7] S. R. Skinner, E. C. Behrman, A. A. Cruz-Cabrera, and J. E. Steck, Neural network implementation using self-lensing media, Appl. Opt. 34, 4129 (1995).
[S8] A. A. Cruz-Cabrera, M. Yang, G. Cui, E. C. Behrman, J. E. Steck, and S. R. Skinner, Reinforcement and backpropagation training for an optical neural network using self-lensing effects, IEEE Trans. Neural Netw. 11, 1450 (2000).
[S9] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, et al., Deep learning with coherent nanophotonic circuits, Nat. Photonics 11, 441 (2017).
[S10] X. Lin, Y. Rivenson, N. T. Yardimci, M. Veli, Y. Luo, M. Jarrahi, and A. Ozcan, All-optical machine learning using diffractive deep neural networks, Science 361, 1004 (2018).
[S11] Y. Zuo, B. Li, Y. Zhao, Y. Jiang, Y.-C. Chen, P. Chen, G.-B. Jo, J. Liu, and S. Du, All-optical neural network with nonlinear activation functions, Optica 6, 1132 (2019).
[S12] T. W. Hughes, M. Minkov, Y. Shi, and S. Fan, Training of photonic neural networks through in situ backpropagation and gradient measurement, Optica 5, 864 (2018).
[S13] M. Reck, A. Zeilinger, H. J. Bernstein, and P. Bertani, Experimental realization of any discrete unitary operator, Phys. Rev. Lett. 73, 58 (1994).
[S14] N. C. Harris, G. R. Steinbrecher, M. Prabhu, Y. Lahini, J. Mower, D. Bunandar, C. Chen, F. N. Wong, T. Baehr-Jones, M. Hochberg, and S. Lloyd, Quantum transport simulations in a programmable nanophotonic processor, Nat. Photonics 11, 447 (2017).
[S15] N. C. Harris, Y. Ma, J. Mower, T. Baehr-Jones, D. Englund, M. Hochberg, and C. Galland, Efficient, compact and low loss thermo-optic phase shifter in silicon, Opt. Express 22, 10487 (2014).
[S16] G. T. Reed, G. Mashanovich, F. Y. Gardes, and D. J. Thomson, Silicon optical modulators, Nat. Photonics 4, 518 (2010).
[S17] P. N. Tamura and J. C. Wyant, Two-dimensional matrix multiplication using coherent optical techniques, Opt. Eng. 18, 182198 (1979).
[S18] V. Arrizón, U. Ruiz, R. Carrada, and L. A. González, Pixelated phase computer holograms for the accurate encoding of scalar complex fields, J. Opt. Soc. Am. A 24, 3500 (2007).
[S19] H. M. P. Chen, J. P. Yang, H. T. Yen, Z. N. Hsu, Y. Huang, and S. T. Wu, Pursuing high quality phase-only liquid crystal on silicon (LCoS) devices, Appl. Sci. 8, 2323 (2018).
[S20] W. Yang, D. B. Conkey, B. Wu, D. Yin, A. R. Hawkins, and H. Schmidt, Atomic spectroscopy on a chip, Nat. Photonics 1, 331 (2007).
[S21] R. Ritter, N. Gruhler, W. Pernice, H. Kübler, T. Pfau, and R. Löw, Atomic vapor spectroscopy in integrated photonic structures, Appl. Phys. Lett. 107, 041101 (2015).
[S22] Q. Bao, H. Zhang, Y. Wang, Z. Ni, Y. Yan, Z. X. Shen, K. P. Loh, and D. Y. Tang, Atomic-layer graphene as a saturable absorber for ultrafast pulsed lasers, Adv. Funct. Mater. 19, 3077 (2009).
[S23] Z. Cheng, H. K. Tsang, X. Wang, K. Xu, and J. B. Xu, In-plane optical absorption and free carrier absorption in graphene-on-silicon waveguides, IEEE J. Sel. Top. Quantum Electron. 20, 43 (2013).
[S24] M. J. Connelly, Semiconductor Optical Amplifiers (Springer Science & Business Media, 2007).
[S25] D. A. Steck, Rubidium 87 D Line Data, http://steck.us/alkalidata/rubidium87numbers.pdf.
