Reprogrammable Electro-Optic Nonlinear Activation Functions for Optical Neural Networks

Ian A. D. Williamson,1,∗ Tyler W. Hughes,2 Momchil Minkov,1 Ben Bartlett,2 Sunil Pai,1 and Shanhui Fan1,†

1 Department of Electrical Engineering and Ginzton Laboratory, Stanford University, Stanford, CA 94305, USA
2 Department of Applied Physics and Ginzton Laboratory, Stanford University, Stanford, CA 94305, USA

We introduce an electro-optic hardware platform for nonlinear activation functions in optical neural networks. The optical-to-optical nonlinearity operates by converting a small portion of the input optical signal into an analog electric signal, which is used to intensity-modulate the original optical signal with no reduction in processing speed. Our scheme allows for complete nonlinear on-off contrast in transmission at relatively low optical power thresholds and eliminates the requirement of having additional optical sources between each layer of the network. Moreover, the activation function is reconfigurable via electrical bias, allowing it to be programmed or trained to synthesize a variety of nonlinear responses. Using numerical simulations, we demonstrate that this activation function significantly improves the expressiveness of optical neural networks, allowing them to perform well on two benchmark machine learning tasks: learning a multi-input exclusive-OR (XOR) logic function and classification of images of handwritten numbers from the MNIST dataset. The addition of the nonlinear activation function improves test accuracy on the MNIST task from 85% to 94%.

I. INTRODUCTION

In recent years, there has been significant interest in alternative computing platforms specialized for high performance and efficiency on machine learning tasks.
For example, graphical processing units (GPUs) have demonstrated peak performance with trillions of floating point operations per second (TFLOPS) when performing matrix multiplication, which is several orders of magnitude larger than general-purpose digital processors such as CPUs [1]. Moreover, analog computing has been explored for achieving high performance because it is not limited by the bottlenecks of sequential instruction execution and memory access [2–6].

Optical hardware platforms are particularly appealing for computing and signal processing due to their ultra-large signal bandwidths, low latencies, and reconfigurability [7–9]. They have also gathered significant interest in machine learning applications, such as artificial neural networks (ANNs). Nearly three decades ago, the first optical neural networks (ONNs) were proposed based on free-space optical lens and holography setups [10, 11]. More recently, ONNs have been implemented in chip-integrated photonic platforms [12] using programmable waveguide interferometer meshes which perform matrix-vector multiplications [13]. In theory, the performance of such systems is competitive with digital computing platforms because they may perform matrix-vector multiplications in constant time with respect to the matrix dimension. In contrast, matrix-vector multiplication has a quadratic time complexity on a digital processor. Other approaches to performing matrix-vector multiplications in chip-integrated ONNs, such as microring weight banks and photodiodes, have also been proposed [14].

∗ iwill@stanford.edu
† shanhui@stanford.edu

Nonlinear activation functions play a key role in ANNs by enabling them to learn complex mappings between their inputs and outputs.
Whereas digital processors have the expressiveness to trivially apply nonlinearities such as the widely-used sigmoid, ReLU, and tanh functions, the realization of nonlinearities in optical hardware platforms is more challenging. One reason for this is that optical nonlinearities are relatively weak, necessitating a combination of large interaction lengths and high signal powers, which impose lower bounds on the physical footprint and the energy consumption, respectively. Although it is possible to resonantly enhance optical nonlinearities, this comes with an unavoidable trade-off in reducing the operating bandwidth, thereby limiting the information processing capacity of an ONN. Additionally, maintaining uniform resonant responses across many elements of an optical circuit necessitates additional control circuitry for calibrating each element [15].

A more fundamental limitation of optical nonlinearities is that their responses tend to be fixed during device fabrication. This limited tunability of the nonlinear optical response prevents an ONN from being reprogrammed to realize different forms of nonlinear activation functions, which may be important for tailoring ONNs for different machine learning tasks. Similarly, a fixed nonlinear response may also limit the performance of very deep ONNs with many layers of activation functions since the optical signal power drops below the activation threshold, where nonlinearity is strongest, in later layers due to loss in previous layers. For example, with optical saturable absorption from 2D materials in waveguides, the activation threshold is on the order of 1-10 mW [16–18], meaning that the strength of the nonlinearity in each subsequent layer will be successively weaker as the transmitted power falls below the threshold.

In light of these challenges, the ONN demonstrated in Ref.
12 implemented its activation functions by detecting each optical signal, feeding them through a conventional digital computer to apply the nonlinearity, and then modulating new optical signals for the subsequent layer. Although this approach benefits from the flexibility of digital signal processing, conventional processors have a limited number of input and output channels, which makes it challenging to scale this approach to very large matrix dimensions, which correspond to a large number of optical inputs. Moreover, digitally applied nonlinearities add latency from the analog-to-digital conversion process and constrain the computational speed of the neural network to the same GHz-scale clock rates which ONNs seek to overcome. Thus, a hardware nonlinear optical activation, which doesn't require repeated bidirectional optical-electronic signal conversion, is of fundamental interest for making integrated ONNs a viable machine learning platform.

In this article, we propose an electro-optic architecture for synthesizing optical-to-optical nonlinearities which alleviates the issues discussed above. Our architecture features complete on-off contrast in signal transmission, a variety of nonlinear response curves, and a low activation threshold. Rather than using traditional optical nonlinearities, our scheme operates by measuring a small portion of the incoming optical signal power and using electro-optic modulators to modulate the original optical signal, without any reduction in operating bandwidth or computational speed. Additionally, our scheme allows for the possibility of performing additional nonlinear transformations on the signal using analog electrical components. Related electro-optical architectures for generating optical nonlinearities have been previously considered [19–21].
In this work, we focus on the application of our architecture as an element-wise activation in a feedforward ONN, but the synthesis of low-threshold optical nonlinearities could be of broader interest to optical computing and information processing.

The remainder of this paper is organized as follows. First, we review the basic operating principles of ANNs and their integrated optical implementations in waveguide interferometer meshes. We then introduce our electro-optical activation function architecture, showing that it can be reprogrammed to synthesize a variety of nonlinear responses. Next, we discuss the performance of an ONN using this architecture by analyzing the scaling of power consumption, latency, processing speed, and footprint. We then draw an analogy between our proposed activation function and the optical Kerr effect. Finally, using numerical simulations, we demonstrate that our architecture leads to improved performance on two different machine learning tasks: (1) learning an N-input exclusive OR (XOR) logic function; (2) classifying images of handwritten numbers from the MNIST dataset.

II. FEEDFORWARD OPTICAL NEURAL NETWORKS

In this section, we briefly review the basics of feedforward artificial neural networks (ANNs) and describe their implementation in a reconfigurable optical circuit, as proposed in Ref. 12. As outlined in Fig. 1(a), an ANN is a function which accepts an input vector, $x_0$, and returns an output vector, $x_L$. This is accomplished in a layer-by-layer fashion, with each layer consisting of a linear matrix-vector multiplication followed by the application of an element-wise nonlinear function, or activation, on the result. For a layer with index $i$, containing a weight matrix $\hat{W}_i$ and activation function $f_i(\cdot)$, its operation is described mathematically as

$$ x_i = f_i\left( \hat{W}_i \cdot x_{i-1} \right) \tag{1} $$

for $i$ from 1 to $L$.
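The layer-by-layer recursion of Eq. 1 can be sketched in a few lines of plain Python. This is only an illustrative sketch: the names `matvec` and `feedforward`, the small weight matrices, and the use of the modulus as a stand-in activation are all assumptions made for the example and are not taken from this work.

```python
from typing import Callable, List, Sequence

Vector = List[complex]
Matrix = Sequence[Sequence[complex]]


def matvec(W: Matrix, x: Sequence[complex]) -> Vector:
    """Linear step of one layer: the matrix-vector product W . x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def feedforward(x0: Sequence[complex],
                weights: Sequence[Matrix],
                activations: Sequence[Callable[[complex], complex]]) -> Vector:
    """Apply Eq. 1, x_i = f_i(W_i . x_{i-1}), for i = 1..L."""
    x = list(x0)
    for W, f in zip(weights, activations):
        z = matvec(W, x)          # linear part of layer i
        x = [f(zk) for zk in z]   # element-wise activation of layer i
    return x


# Toy two-layer example (hypothetical weights, modulus as placeholder activation).
W1 = [[1, 1], [1, -1]]
W2 = [[0.5, 0.5]]
y = feedforward([1.0, 2.0], [W1, W2], [abs, abs])
```

Here the first layer maps `[1, 2]` to `[3, -1]`, the placeholder activation gives `[3, 1]`, and the second layer averages the two elements.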
Before they are able to perform a given machine learning task, ANNs must be trained. The training process is typically accomplished by minimizing the prediction error of the ANN on a set of training examples, which come in the form of input and target output pairs. For a given ANN, a loss function is defined to quantify the difference between the target output and the output predicted by the network. During training, this loss function is minimized with respect to tunable degrees of freedom, namely the elements of the weight matrix $\hat{W}_i$ within each layer. In general, although less common, it is also possible to train the parameters of the activation functions [22].

Optical hardware implementations of ANNs have been proposed in various forms over the past few decades. In this work, we focus on a recent demonstration in which the linear operations are implemented using an integrated optical circuit [12]. In this scheme, the information being processed by the network, $x_i$, is encoded into the modal amplitudes of the waveguides feeding the device and the matrix-vector multiplications are accomplished using meshes of integrated optical interferometers. In this case, training the network requires finding the optimal settings for the integrated optical phase shifters controlling the interferometers, which may be found using an analytical model of the chip, or using in-situ backpropagation techniques [23].

In the next section, we present an approach for realizing the activation function, $f_i(\cdot)$, on-chip with a hybrid electro-optic circuit feeding an interferometer. In Fig. 1(b), we show how this activation scheme fits into a single layer of an ONN and show the specific form of the activation in Fig. 1(c). We also give the specific mathematical form of this activation and analyze its performance in practical operation.

Figure 1. (a) Block diagram of a feedforward neural network of $L$ layers.
Each layer consists of a $\hat{W}_i$ block representing a linear matrix which multiplies vector inputs $x_{i-1}$. The $f_i$ block in each layer represents an element-wise nonlinear activation function operating on vectors $z_i$ to produce outputs $x_i$. (b) Schematic of the optical interferometer mesh implementation of a single layer of the feedforward neural network. (c) Schematic of the proposed optical-to-optical activation function which achieves a nonlinear response by converting a small portion of the optical input, $z$, into an electrical signal, and then intensity modulating the remaining portion of the original optical signal as it passes through an interferometer.

III. NONLINEAR ACTIVATION FUNCTION ARCHITECTURE

In this section, we describe our proposed nonlinear activation function architecture for optical neural networks, which implements an optical-to-optical nonlinearity by converting a small portion of the optical input power into an electrical voltage. The remaining portion of the original optical signal is phase- and amplitude-modulated by this voltage as it passes through an interferometer. For an input signal with amplitude $z$, the resulting nonlinear optical activation function, $f(z)$, is a result of the responses of the interferometer under modulation as well as the components in the electrical signal pathway.

A schematic of the architecture is shown in Fig. 1(c), where black and blue lines represent optical waveguides and electrical signal pathways, respectively. The input signal first enters a directional coupler which routes a portion, $\alpha$, of the input optical power to a photodetector. The photodetector is the first element of an optical-to-electrical conversion circuit, which is a standard component of high-speed optical receivers for converting an optical intensity into a voltage.
In this work, we assume a normalization of the optical signal such that the total power in the input signal is given by $|z|^2$. The optical-to-electrical conversion process consists of the photodetector producing an electrical current, $I_\mathrm{pd} = R \cdot \alpha |z|^2$, where $R$ is the photodetector responsivity, and a transimpedance amplifying stage, characterized by a gain $G$, converting this current into a voltage $V_G = G \cdot R \cdot \alpha |z|^2$. The output voltage of the optical-to-electrical conversion circuit then passes through a nonlinear signal conditioner with a transfer function, $H(\cdot)$. This component allows for the application of additional nonlinear functions to transform the voltage signal. Finally, the conditioned voltage signal, $H(V_G)$, is combined with a static bias voltage, $V_b$, to induce a phase shift of

$$ \Delta\phi = \frac{\pi}{V_\pi} \left[ V_b + H\left( G R \alpha |z|^2 \right) \right] \tag{2} $$

for the optical signal routed through the lower port of the directional coupler. The parameter $V_\pi$ represents the voltage required to induce a phase shift of $\pi$ in the phase modulator. This phase shift, defined by Eq. 2, is a nonlinear self-phase modulation because it depends on the input signal intensity.

An optical delay line between the directional coupler and the Mach-Zehnder interferometer (MZI) is used to match the signal propagation delays in the optical and electrical pathways. This ensures that the nonlinear self-phase modulation defined by Eq. 2 is applied at the same time that the optical signal which generated it passes through the phase modulator. For the circuit shown in Fig. 1(c), the optical delay is $\tau_\mathrm{opt} = \tau_\mathrm{oe} + \tau_\mathrm{nl} + \tau_\mathrm{rc}$, accounting for the contributions from the group delay of the optical-to-electrical conversion stage ($\tau_\mathrm{oe}$), the delay associated with the nonlinear signal conditioner ($\tau_\mathrm{nl}$), and the RC time constant of the phase modulator ($\tau_\mathrm{rc}$).
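The electrical signal path leading to Eq. 2, together with the matched optical delay, can be written out as a short numerical sketch. The function names and the numerical values below ($G = 1000$ V/A, $V_\pi = 10$ V, $V_b = 10$ V, 1 mW of input power) are illustrative assumptions chosen only to exercise the formulas; the delay values follow those quoted for the circuit elements in this work.

```python
import math
from typing import Callable


def phase_shift(z: complex, alpha: float, responsivity: float, gain: float,
                v_pi: float, v_bias: float = 0.0,
                conditioner: Callable[[float], float] = lambda v: v) -> float:
    """Eq. 2: phase shift induced by the tapped and amplified optical power.

    z            -- input amplitude, normalized so that |z|^2 is power in watts
    alpha        -- fraction of optical power tapped to the photodetector
    responsivity -- photodetector responsivity R in A/W
    gain         -- transimpedance gain G in V/A
    conditioner  -- transfer function H(.) (identity when no conditioning)
    """
    v_g = gain * responsivity * alpha * abs(z) ** 2   # V_G = G * R * alpha * |z|^2
    return math.pi / v_pi * (v_bias + conditioner(v_g))


def matched_optical_delay(tau_oe: float, tau_nl: float, tau_rc: float) -> float:
    """Optical delay needed to match the electrical path: tau_oe + tau_nl + tau_rc."""
    return tau_oe + tau_nl + tau_rc


# Illustrative (assumed) operating point: 1 mW input, 10% tap, bias at V_pi.
dphi = phase_shift(math.sqrt(1e-3), alpha=0.1, responsivity=1.0,
                   gain=1000.0, v_pi=10.0, v_bias=10.0)
tau_opt = matched_optical_delay(100e-12, 0.0, 20e-12)
```

With these assumed values the tapped signal produces $V_G = 0.1$ V, so the phase shift is $1.01\pi$, only slightly above the bias contribution of $\pi$.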
The nonlinear self-phase modulation achieved by the electric circuit is converted into a nonlinear amplitude response by the MZI, which has a transmission depending on $\Delta\phi$ as

$$ t_\mathrm{MZI} = j \exp\left( -j \frac{\Delta\phi}{2} \right) \cos\left( \frac{\Delta\phi}{2} \right). \tag{3} $$

Depending on the configuration of the bias, $V_b$, a larger input optical signal amplitude causes either more or less power to be diverted away from the output port, resulting in a nonlinear self-intensity modulation. Combining the expression for the nonlinear self-phase modulation, given by Eq. 2, with the MZI transmission, given by Eq. 3, the mathematical form of the activation function can be written explicitly as

$$ f(z) = j \sqrt{1-\alpha}\, \exp\left( -j \frac{1}{2} \left[ \phi_b + \pi \frac{H\left( G R \alpha |z|^2 \right)}{V_\pi} \right] \right) \cdot \cos\left( \frac{1}{2} \left[ \phi_b + \pi \frac{H\left( G R \alpha |z|^2 \right)}{V_\pi} \right] \right) z, \tag{4} $$

where the contribution to the phase shift from the bias voltage is

$$ \phi_b = \pi \frac{V_b}{V_\pi}. \tag{5} $$

For the remainder of this work, we focus on the case where no nonlinear signal conditioning is applied to the electrical signal pathway, i.e. $H(V_G) = V_G$. However, even with this simplification the activation function still exhibits a highly nonlinear response. We also neglect saturating effects in the OE conversion stage which can occur in either the photodetector or the amplifier. However, in practice, the nonlinear optical-to-optical transfer function could take advantage of these saturating effects.

With the above simplifications, a more compact expression for the activation function response is

$$ f(z) = j \sqrt{1-\alpha}\, \exp\left( -j \left[ \frac{g_\phi |z|^2}{2} + \frac{\phi_b}{2} \right] \right) \cdot \cos\left( \frac{g_\phi |z|^2}{2} + \frac{\phi_b}{2} \right) z, \tag{6} $$

where the phase gain parameter is defined as

$$ g_\phi = \pi \frac{\alpha G R}{V_\pi}. \tag{7} $$

Equation 7 indicates that the amount of phase shift per unit input signal power can be increased via the gain and photodiode responsivity, or by converting a larger fraction of the optical power to the electrical domain. However, tapping out a larger fraction of the optical power also results in a larger linear loss, which is undesirable.

Figure 2. [Plot panels: left column (a), (c), (e), (g) shows the output $f(z) \cdot \sqrt{g_\phi/\pi}$; right column (b), (d), (f), (h) shows the transmission; rows correspond to $\phi_b = 1.00\pi$, $0.85\pi$, $0.00\pi$, and $0.50\pi$; the horizontal axis is the input $z \cdot \sqrt{g_\phi/\pi}$.] Activation function output amplitude (blue lines) and activation function transmission (green lines) as a function of input signal amplitude. The input and output are normalized to the phase gain parameter, $g_\phi$. Panel pairs (a),(b) and (c),(d) correspond to a ReLU-like response, with a suppressed transmission for inputs with small amplitude and high transmission for inputs with large amplitude. Panel pairs (e),(f) and (g),(h) correspond to a clipped response, with high transmission for inputs with small amplitude and reduced transmission for inputs with larger amplitude.

The electrical biasing of the activation phase shifter, represented by $V_b$, is an important degree of freedom for determining its nonlinear response. We consider a representative selection, consisting of four different responses, in Fig. 2. The left column of Fig. 2 plots the output signal amplitude as a function of the input signal amplitude, i.e. $|f(z)|$ in Eq. 6, while the right column plots the transmission coefficient, i.e. $|f(z)|^2 / |z|^2$, a quantity which is more commonly used in optics than machine learning. The first two rows of Fig. 2, corresponding to $\phi_b = 1.0\pi$ and $0.85\pi$, exhibit a response which is comparable to the ReLU activation function: transmission is low for small input values and high for large input values. For the bias of $\phi_b = 0.85\pi$, transmission at low input values is slightly increased with respect to the response where $\phi_b = 1.00\pi$. Unlike the ideal ReLU response, the activation at $\phi_b = 0.85\pi$ is not entirely monotonic because transmission first goes to zero before increasing.
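The compact form of the activation in Eq. 6 and the transmission coefficient $|f(z)|^2/|z|^2$ are straightforward to evaluate numerically. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and setting $g_\phi = \pi$ is a convenience assumed here so that $z$ coincides with the normalized input $z\sqrt{g_\phi/\pi}$ used on the axes of Fig. 2.

```python
import cmath
import math


def eo_activation(z: complex, g_phi: float, phi_b: float, alpha: float = 0.1) -> complex:
    """Eq. 6: electro-optic activation with H(V_G) = V_G."""
    half_phase = 0.5 * (g_phi * abs(z) ** 2 + phi_b)
    return (1j * math.sqrt(1.0 - alpha)
            * cmath.exp(-1j * half_phase) * math.cos(half_phase) * z)


def transmission(z: complex, g_phi: float, phi_b: float, alpha: float = 0.1) -> float:
    """Power transmission |f(z)|^2 / |z|^2 (z must be nonzero)."""
    return abs(eo_activation(z, g_phi, phi_b, alpha)) ** 2 / abs(z) ** 2


G_PHI = math.pi  # assumed normalization: z equals the normalized input of Fig. 2

# ReLU-like bias (phi_b = pi): weak inputs are blocked, z = 1 is fully passed.
t_relu_small = transmission(1e-3, G_PHI, math.pi)
t_relu_large = transmission(1.0, G_PHI, math.pi)

# Clipped bias (phi_b = 0): weak inputs see the maximum transmission 1 - alpha.
t_clip_small = transmission(1e-3, G_PHI, 0.0)
```

Sweeping `z` over a range of amplitudes for each of the four bias values would trace out curves of the kind described for Fig. 2, with the maximum transmission capped at $1 - \alpha$.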
On the other hand, the responses shown in the bottom two rows of Fig. 2, corresponding to $\phi_b = 0.0\pi$ and $0.50\pi$, are quite different. These configurations demonstrate a saturating response in which the output is suppressed for higher input values but enhanced for lower input values. For all of the responses shown in Fig. 2, we have assumed $\alpha = 0.1$ which limits the maximum transmission to $1 - \alpha = 0.9$.

A benefit of having electrical control over the activation response is that, in principle, its electrical bias can be connected to the same control circuitry which programs the linear interferometer meshes. In doing so, a single ONN hardware unit can then be reprogrammed to synthesize many different activation function responses. This opens up the possibility of heuristically selecting an activation function response, or directly optimizing the activation bias using a training algorithm. This realization of a flexible optical-to-optical nonlinearity can allow ONNs to be applied to much broader classes of machine learning tasks.

We note that Fig. 2 shows only the amplitude response of the activation function. In fact, all of these responses also introduce a nonlinear self-phase modulation to the output signal. If desired, this nonlinear self-phase modulation can be suppressed using a push-pull interferometer configuration in which the generated phase shift, $\Delta\phi$, is divided and applied with opposite sign to the top and bottom arms.

IV. PERFORMANCE AND SCALABILITY

In this section, we discuss the performance of an integrated ONN which uses meshes of integrated optical interferometers to perform matrix-vector multiplications and the electro-optic activation function, as shown in Fig. 1(b),(c).
Here, we focus on characterizing how the power consumption, computational latency, physical footprint, and computational speed of the ONN scale with respect to the number of network layers, $L$, and the dimension of the input vector, $N$, assuming square matrices. The system parameters used for this analysis are summarized in Table I and the figures of merit are summarized in Table II.

Table I. Summary of parameter values

Parameter                                           Value
Modulator and detector rate                         10 GHz
Photodetector responsivity (R)                      1 A/W
Optical-to-electrical circuit power consumption     100 mW
Optical-to-electrical circuit group delay (τ_oe)    100 ps
Phase modulator RC delay (τ_rc)                     20 ps
Mesh MZI length (D_MZI)                             100 µm
Mesh MZI height (H_MZI)                             60 µm
Waveguide effective index (n_eff)                   3.5

Figure 3. [Contour plot: horizontal axis, modulator $V_\pi$ (volts), 1 to 30; vertical axis, optical-to-electrical gain $G$ (dBΩ), 30 to 65; contours for $P_\mathrm{th}$ = 0.1 mW, 1.0 mW, and 10.0 mW.] Contours of constant activation threshold as a function of the optical-to-electrical gain and the modulator $V_\pi$ of the activation function shown in Fig. 1(c) with a photodetector responsivity $R$ = 1.0 A/W.

Table II. Summary of per-layer optical neural network performance using the electro-optic activation function

                      Scaling               Per-layer figures of merit
Quantity              (mesh, activation)    N = 4               N = 10              N = 100
Power consumption*    LN                    0.4 W               1 W                 10 W
Latency               LN, L                 125 ps              132 ps              237 ps
Footprint             LN^2, LN              2.5 mm^2            6.6 mm^2            120.0 mm^2
Speed                 LN^2                  1.6 × 10^11 MAC/s   1 × 10^12 MAC/s     1 × 10^14 MAC/s
Efficiency*           N^-1                  2.5 pJ/MAC          1 pJ/MAC            100 fJ/MAC

* Assuming no power consumption in the interferometer mesh phase shifters

A. Power consumption

The power consumption of the ONN, as shown in Fig. 1(b), consists of contributions from (1) the programmable phase shifters inside the interferometer mesh, (2) the optical source supplying the input vectors, $x_0$, and (3) the active components of the activation function such as the amplifier and photodetector. In principle, the contribution from (1) can be made negligible by using phase change materials or ultra-low power MEMS phase shifters. Therefore, in this section we focus only on contributions (2) and (3) which pertain to the activation function.

To quantify the power consumption, we first consider the minimum input optical power to a single activation that triggers a nonlinear response. We refer to this as the activation function threshold, which is mathematically defined as

$$ P_\mathrm{th} = \frac{\Delta\phi|_{\delta T = 0.5}}{g_\phi} = \frac{V_\pi}{\pi \alpha G R} \cdot \Delta\phi|_{\delta T = 0.5}, \tag{8} $$

where $\Delta\phi|_{\delta T = 0.5}$ is the phase shift necessary to generate a 50% change in the power transmission with respect to the transmission with null input for a given $\phi_b$. This threshold corresponds to $z\sqrt{g_\phi/\pi} = 0.73$ in Fig. 2(b), to $z\sqrt{g_\phi/\pi} = 0.85$ in Fig. 2(d), to $z\sqrt{g_\phi/\pi} = 0.73$ in Fig. 2(f), and to $z\sqrt{g_\phi/\pi} = 0.70$ in Fig. 2(h). In general, a lower activation threshold will result in a lower optical power required at the ONN input, $|x_0|^2$. According to Eq. 8, the activation threshold can be reduced via a small $V_\pi$ and a large optical-to-electrical conversion gain, $G R \sim 1.0$ V/mW. The relationship between $G$ and $V_\pi$ for activation thresholds of 0.1 mW, 1.0 mW, and 10.0 mW is shown in Fig. 3 for a fixed $R = 1$ A/W. Additionally, in Fig. 3 we conservatively assume $\phi_b = \pi$ which has the highest threshold of the activation function biases shown in Fig. 2.

If we take the lowest activation threshold of 0.1 mW in Fig. 3, the optical source to the ONN would then need to supply $N \cdot 0.1$ mW of optical power. The power consumption of integrated optical receiver amplifiers varies considerably, ranging from as low as 10 mW to as high as 150 mW [24–26], depending on a variety of factors which are beyond the scope of this article.
Therefore, a conservative estimate of the power consumption from the optical-to-electrical conversion circuits in all activations is $L \cdot N \cdot 100$ mW. For an ONN with $N = 100$, the power consumption per layer from the activation function would be 10 W and would require a total optical input power of $N \cdot P_\mathrm{th} = 100 \cdot 0.1$ mW $= 10$ mW. Thus, the total power consumption of the ONN is dominated by the activation function electronics.

B. Latency

For the feedforward neural network architecture shown in Fig. 1(a), the latency is defined by the elapsed time between supplying an input vector, $x_0$, and detecting its corresponding prediction vector, $x_L$. In an integrated ONN, as implemented in Fig. 1(b), this delay is simply the travel time for an optical pulse through all $L$ layers. Following Ref. 12, the propagation distance in a square interferometer mesh is $D_W = N \cdot D_\mathrm{MZI}$, where $D_\mathrm{MZI}$ is the length of each MZI within the mesh. In the nonlinear activation layer, the propagation length will be dominated by the delay line required to match the optical and electrical delays, and is given by

$$ D_f = \left( \tau_\mathrm{oe} + \tau_\mathrm{nl} + \tau_\mathrm{rc} \right) \cdot v_g, \tag{9} $$

where the group velocity $v_g = c_0 / n_\mathrm{eff}$ is the speed of optical pulses in the waveguide. Therefore,

$$ \text{latency} = \underbrace{L \cdot N \cdot D_\mathrm{MZI} \cdot v_g^{-1}}_{\text{Interferometer mesh}} + \underbrace{L \cdot \left( \tau_\mathrm{oe} + \tau_\mathrm{nl} + \tau_\mathrm{rc} \right)}_{\text{Activation function}}. \tag{10} $$

Equation 10 indicates that the latency contribution from the interferometer mesh scales with the product $LN$, which is the same scaling as predicted in Ref. 12. On the other hand, the activation function adds to the latency independently of $N$ because each activation circuit is applied in parallel to all $N$ vector elements.

For concreteness, we assume $D_\mathrm{MZI} = 100$ µm and $n_\mathrm{eff} = 3.5$. Following our assumption in the previous section of using no nonlinear electrical signal conditioner in the activation function, $\tau_\mathrm{nl} = 0$ ps.
Typical group delays for integrated transimpedance amplifiers used in optical receivers can range from $\tau_\mathrm{oe} \approx 10$ to 100 ps. Moreover, assuming an RC-limited phase modulator speed of 50 GHz yields $\tau_\mathrm{rc} \approx 20$ ps. Therefore, if we assume a conservative value of $\tau_\mathrm{oe} = 100$ ps, a network dimension of $N \approx 100$ would have a latency of 237 ps per layer, with equal contributions from the mesh and the activation function. For a ten layer network ($L = 10$) the total latency would be approximately 2.4 ns, still orders of magnitude lower than the latency typically associated with GPUs.

C. Physical footprint

The physical footprint of the ONN consists of the space taken up by both the linear interferometer mesh and the optical and electrical components of the activation function. Neglecting the electrical control lines for tuning each MZI, the total footprint of the ONN is

$$ A = \underbrace{L \cdot N^2 \cdot A_\mathrm{MZI}}_{\text{Interferometer mesh}} + \underbrace{L \cdot N \cdot A_f}_{\text{Activation function}}, \tag{11} $$

where $A_\mathrm{MZI} = D_\mathrm{MZI} \cdot H_\mathrm{MZI}$ is the area of a single MZI element in the mesh and $A_f = D_f \cdot H_f$ is the area of a single activation function. In the direction of propagation, $D_f$ is dominated by the waveguide optical delay line required to match the delay of the electrical signal pathway. Based on the previous discussion of the activation function's latency, $\tau_\mathrm{opt} = 120$ ps corresponds to a total waveguide length of $D_f \approx 1$ cm. For simplicity, we assume this delay is achieved using a straight waveguide, which results in a large footprint but with optical losses that can be very low. For example, in silicon waveguides losses below 0.5 dB/cm have been experimentally demonstrated [27]. In principle, incorporating waveguide bends or resonant optical elements could significantly reduce the activation function's footprint.
For example, coupled microring arrays have experimentally achieved group delays of 135 ps over a bandwidth of 10 GHz in a 0.03 mm × 0.25 mm footprint [28].

Transverse to the direction of propagation, the activation function footprint will be dominated by the electronic components of the optical-to-electrical conversion circuit. In principle, compact waveguide photodetectors and modulators can be utilized. However, the components of the transimpedance amplifier may be challenging to integrate in the area available between neighboring output waveguides of the interferometer mesh. One possibility towards achieving a fully integrated opto-electronic ONN would be to use so-called amplifier-free optical receivers [26], where ultra-low capacitance detectors provide high-speed opto-electronic conversion. Similarly to the experimental demonstration in Ref. 29, the amplifier-free receiver could be integrated directly with a high efficiency (i.e. effectively a low $V_\pi$) electro-optic modulator. Compact electro-absorption modulators could also be utilized. In addition to achieving a compact footprint, operating without an amplifier would also result in an order of magnitude reduction in both power consumption and latency, with the latter reducing the required length of the optical delay line and thus the footprint.

For the purposes of our analysis, we assume no integration of the electronic transimpedance amplifier and, therefore, that the on-chip components of the activation function fit within the height of each interferometer mesh row, $H_f \leq H_\mathrm{MZI} = 60$ µm. Under this assumption and following the scaling in Eq. 11, the total footprint of a single ONN layer of dimension $N = 10$ would be 11.0 mm × 0.6 mm.
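The per-layer latency (Eq. 10 with $L = 1$) and footprint (Eq. 11 with $L = 1$) can be checked numerically from the parameter values assumed in this analysis. The sketch below is illustrative only: the function names are hypothetical, the delay-line length uses the rounded value $D_f \approx 1$ cm quoted in the text, and $H_f$ is taken equal to $H_\mathrm{MZI}$.

```python
C0 = 299_792_458.0  # speed of light in vacuum, m/s


def layer_latency(n: int, d_mzi: float = 100e-6, n_eff: float = 3.5,
                  tau_oe: float = 100e-12, tau_nl: float = 0.0,
                  tau_rc: float = 20e-12) -> float:
    """Per-layer latency in seconds: mesh transit time plus activation delay."""
    v_g = C0 / n_eff                      # group velocity, v_g = c0 / n_eff
    mesh = n * d_mzi / v_g                # interferometer mesh term of Eq. 10
    activation = tau_oe + tau_nl + tau_rc # activation function term of Eq. 10
    return mesh + activation


def layer_footprint(n: int, d_mzi: float = 100e-6, h_mzi: float = 60e-6,
                    d_f: float = 1e-2, h_f: float = 60e-6) -> float:
    """Per-layer footprint in m^2: N^2 mesh MZIs plus N activation circuits."""
    return n ** 2 * d_mzi * h_mzi + n * d_f * h_f


for n in (4, 10, 100):
    latency_ps = layer_latency(n) * 1e12
    area_mm2 = layer_footprint(n) * 1e6
    print(f"N = {n:3d}: latency = {latency_ps:.0f} ps, footprint = {area_mm2:.1f} mm^2")
```

Under these assumptions the sketch reproduces the latency and footprint rows of Table II: 125 ps, 132 ps, and 237 ps, and 2.5 mm^2, 6.6 mm^2, and 120.0 mm^2 for $N$ = 4, 10, and 100.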
Interestingly, following the latency discussion in the previous section, a single ONN layer of dimension $N = 100$ would have a footprint of 20.0 mm × 6.0 mm, with equal contribution from the activation function and from the mesh.

D. Speed

The speed, or computational capacity, of the ONN, as shown in Fig. 1(a), is determined by the number of input vectors, $x_0$, that can be processed per unit time. Here, we argue that although our activation function is not fully optical, it results in no speed degradation compared to a linear ONN consisting of only interferometer meshes.

The reason for this is that a fully integrated ONN would also include high-speed modulators and detectors on-chip to perform fast modulation and detection of sequences of $x_0$ vectors and $x_L$ vectors, respectively. We therefore argue that the same high-speed detector and modulator elements could also be integrated between the linear network layers to provide the optical-electrical and electrical-optical transduction for the activation function. State of the art integrated transimpedance amplifiers can already operate at speeds comparable to the optical modulator and detector rates, which are on the order of 50 - 100 GHz [24, 30], and thus would not be a limiting factor in the speed of our architecture.

To perform a matrix-vector multiplication on a conventional CPU requires $N^2$ multiply-accumulate (MAC) operations, each consisting of a single multiplication and a single addition. Therefore, assuming a photodetector and modulator rate of 10 GHz means that an ONN can effectively perform $N^2 \cdot L \cdot 10^{10}$ MAC/sec. This means that one layer of an ONN with dimension $N = 10$ would effectively perform $10^{12}$ MAC/sec. Increasing the input dimension to $N = 100$ would then scale the performance of the ONN to $10^{14}$ MAC/sec per layer.
This is two orders of magnitude greater than the peak performance obtainable with modern GPUs, which typically have performance on the order of $10^{12}$ floating point operations/sec (FLOPS). Because the power consumption of the ONN scales as $LN$ (assuming passive phase shifters in the mesh) and the speed scales as $LN^2$, the energy per operation is minimized for large $N$ (Table II). Thus, for large ONNs the power consumption associated with the electro-optic conversion in the activation function can be amortized over the parallelized operation of the linear mesh.

We note that the activation function circuit shown in Fig. 1(c) can be modified to remove the matched optical delay line by using very long optical pulses. This modification may be advantageous for reducing the footprint of the activation and would result in $\tau_{\mathrm{opt}} \gg \tau_{\mathrm{ele}}$. However, this results in a reduction of the ONN speed, which would then be limited by the combined activation delay of all $L$ nonlinear layers in the network, $\sim (L \cdot \tau_{\mathrm{ele}})^{-1}$.

Figure 4. Nonlinear parameter $\Gamma_{\mathrm{EO}}$, in units of (W·m)$^{-1}$, for the electro-optic activation as a function of (a) amplifier gain, $G$ (dBΩ), for $\alpha$ = 0.50, 0.10, and 0.01 and (b) modulator $V_\pi L$ (V·mm) for $G$ = 10, 20, and 30 dBΩ. The nonlinear parameter associated with the optical Kerr effect, $\Gamma_{\mathrm{Kerr}}$, in a silicon waveguide of cross-sectional area $A = 0.05$ µm$^2$ corresponds to the black dotted line.

V. COMPARISON WITH THE KERR EFFECT

All-optical nonlinearities such as bistability and saturable absorption have been previously considered as potential activation functions in ONNs [10, 31]. An alternative implementation of the activation function in Fig. 1(c) could consist of a nonlinear MZI, with one
of its arms having a material with Kerr nonlinear optical response. The Kerr effect is a third-order optical nonlinearity which generates a change in the refractive index, and thus a nonlinear phase shift, which is proportional to the input pulse intensity. In this section we compare the electro-optic activation function introduced in the previous section [Fig. 1(c)] to such an alternative all-optical activation function using the Kerr effect, highlighting how the electro-optic activation can achieve a lower activation threshold. Unlike the electro-optic activation function, the Kerr effect is lossless and has no latency because it arises from a nonlinear material response, rather than a feedforward circuit.

A standard figure of merit for quantifying the strength of the Kerr effect in a waveguide is the amount of nonlinear phase shift generated per unit input power per unit waveguide length. This is given mathematically by the expression

$$\Gamma_{\mathrm{Kerr}} = \frac{2\pi}{\lambda_0} \frac{n_2}{A}, \tag{12}$$

where $n_2$ is the nonlinear refractive index of the material and $A$ is the effective mode area. $\Gamma_{\mathrm{Kerr}}$ ranges from 100 (W·m)$^{-1}$ in chalcogenide to 350 (W·m)$^{-1}$ in silicon [32]. An equivalent figure of merit for the electro-optic feedforward scheme can be mathematically defined as

$$\Gamma_{\mathrm{EO}} = \frac{\pi \alpha R G}{V_\pi L}, \tag{13}$$

where $V_\pi L$ is the phase modulator figure of merit. The figures of merit described in Eqs. 12-13 can be represented as an activation threshold (Eq. 8) via the relationship $P_{\mathrm{th}} = \Delta\phi|_{\delta T = 0.5} / (\Gamma L)$, for a given waveguide length, $L$, over which the electro-optic phase shift or nonlinear Kerr effect takes place.

A comparison of Eq. 12 and Eq. 13 indicates that while the strength of the Kerr effect is largely fixed by waveguide design and material choice, the electro-optic scheme has several degrees of freedom which allow it to potentially achieve a stronger nonlinear response.
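As a minimal sketch of how the two figures of merit in Eqs. 12-13 and the threshold relationship could be evaluated numerically, the helper functions below take all quantities in SI units. The function names are illustrative. The gain is passed directly in ohms (V/A), so no dBΩ conversion convention is assumed here, and the silicon $n_2$ and the wavelength used in the example call are illustrative values that do not come from the text.

```python
import math


def gamma_kerr(n2: float, area: float, wavelength: float) -> float:
    """Eq. 12: Kerr nonlinear parameter in (W*m)^-1.

    n2 in m^2/W, effective mode area in m^2, free-space wavelength in m.
    """
    return (2 * math.pi / wavelength) * (n2 / area)


def gamma_eo(alpha: float, responsivity: float, gain_ohm: float,
             vpi_l: float) -> float:
    """Eq. 13: electro-optic nonlinear parameter in (W*m)^-1.

    alpha is the tapped power fraction, responsivity is in A/W,
    gain_ohm is the transimpedance gain in V/A, vpi_l is in V*m.
    """
    return math.pi * alpha * responsivity * gain_ohm / vpi_l


def threshold_power(delta_phi: float, gamma: float, length: float) -> float:
    """Activation threshold P_th = delta_phi / (gamma * length), in W."""
    return delta_phi / (gamma * length)


if __name__ == "__main__":
    # Illustrative (assumed) silicon values: n2 ~ 4.5e-18 m^2/W at 1.55 um.
    g_kerr = gamma_kerr(n2=4.5e-18, area=0.05e-12, wavelength=1.55e-6)
    # alpha = 0.1, R = 1.0 A/W, V_pi*L = 20 V*mm as in Fig. 4(a); 100 ohm assumed.
    g_eo = gamma_eo(alpha=0.1, responsivity=1.0, gain_ohm=100.0, vpi_l=20e-3)
    print(f"Gamma_Kerr ~ {g_kerr:.0f} (W*m)^-1, Gamma_EO ~ {g_eo:.0f} (W*m)^-1")
```

Because $\Gamma_{\mathrm{EO}}$ is linear in $\alpha$, $R$, and $G$ and inversely proportional to $V_\pi L$, any of these parameters can be used to compensate for another.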
The first design parameter is the fraction of power, $\alpha$, tapped off to the photodetector, which can be increased to generate a larger voltage at the phase modulator. However, increasing $\alpha$ also increases the linear signal loss through the activation, which does not contribute to the nonlinear mapping between the input and output of the ONN. Therefore, $\alpha$ should be minimized, provided that the optical power routed to the photodetector remains above the noise equivalent power level.

On the other hand, the product $RG$ determines the conversion efficiency of the detected optical power into an electrical voltage. Fig. 4(a) compares the nonlinearity strength of the electro-optic activation (blue lines) to that of an implementation using the Kerr effect in silicon (black dashed line) for several values of $\alpha$, as a function of $G$. The responsivity is fixed at $R = 1.0$ A/W. We observe that tapping out 10% of the optical power requires a gain of 20 dBΩ to achieve a nonlinear phase shift, and hence a threshold, equivalent to that of a silicon waveguide with $A = 0.05$ µm$^2$ for the same amount of input optical power. Tapping out only 1% of the optical power requires an additional 10 dBΩ of gain to maintain this equivalence. We note that the gain range considered in Fig. 4(a) is well within the regime of what has been demonstrated in integrated transimpedance amplifiers for optical receivers [24–26]. In fact, many of these systems have demonstrated much higher gain.

In Fig. 4(a), the phase modulator $V_\pi L$ was fixed at 20 V·mm. However, because a lower $V_\pi L$ translates into an increased phase shift for a given applied voltage, this parameter can also be used to enhance the nonlinearity. Fig. 4(b) demonstrates the effect of changing the $V_\pi L$ for several values of $G$, again with a fixed responsivity $R = 1.0$ A/W.
This demonstrates that with a reasonable level of gain and phase modulator performance, the electro-optic activation function can trade off an increase in latency for a significantly lower optical activation threshold than the Kerr effect.

VI. MACHINE LEARNING TASKS

In this section, we apply the electro-optic activation function introduced above to two machine learning tasks. In Sec. VI A, we simulate training an ONN to implement an exclusive-OR (XOR) logical operation. The network is modeled using neuroptica [33], a custom ONN simulator written in Python, which trains the simulated networks only from physically measurable field quantities using the on-chip backpropagation algorithm introduced in Ref. 23. In Sec. VI B, we consider the more complex task of using an ONN to classify handwritten digits from the Modified NIST (MNIST) dataset, which we model using the neurophox [34, 35] package and tensorflow [36], which computes gradients using automatic differentiation. In both cases, we model the values in the network as complex-valued quantities and represent the interferometer meshes as unitary matrices parameterized by phase shifters.

A. Exclusive-OR Logic Function

An exclusive-OR (XOR) is a logic function which takes two inputs and produces a single output. The output is high if only one of the two inputs is high, and low for all other possible input combinations. In this example, we consider a multi-input XOR which takes $N$ input values, given by $x_1 \ldots x_N$, and produces a single output value, $y$. The input-output relationship of the multi-input XOR function is a generalization of the two-input XOR. For example, defining logical high and low values as 1 and 0, respectively, a four-input XOR has the output table indicated by the desired values in Fig. 5(b).
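To make the multi-input XOR concrete, the sketch below enumerates the ideal truth table for $N$ binary inputs. It assumes the common parity convention, in which the output is high when an odd number of inputs are high; the text does not spell out which generalization is used, so this convention and the function names are assumptions made for illustration.

```python
from itertools import product


def multi_input_xor(bits) -> int:
    """Parity-style XOR (assumed convention): 1 if an odd number of bits are 1."""
    return sum(bits) % 2


def truth_table(n: int):
    """All 2^n binary input combinations paired with their ideal XOR output."""
    return [(bits, multi_input_xor(bits)) for bits in product((0, 1), repeat=n)]


if __name__ == "__main__":
    for bits, y in truth_table(4):
        print("".join(map(str, bits)), "->", y)
```

For $N = 4$ this produces $2^4 = 16$ input combinations, and for $N = 2$ it reduces to the ordinary two-input XOR.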
We select this task for the ONN to learn because it requires a non-trivial level of nonlinearity, meaning that it could not be implemented in an ONN consisting of only linear interferometer meshes.

The architecture of the ONN used to learn the XOR is shown schematically in Fig. 5(a). The network consists of $L$ layers, with each layer constructed from an $N \times N$ unitary interferometer mesh followed by an array of $N$ parallel electro-optic activation functions, with each element corresponding to the circuit in Fig. 1(c). After the final layer, the lower $N - 1$ outputs are dropped to produce a single output value which corresponds to $y$. Unlike the ideal XOR input-output relationship described above, for the XOR task learned by the ONN we normalize the input vectors such that they always have an $L_2$ norm of 1. This constraint is equivalent to enforcing a constant input power to the network. Additionally, because the activation function causes the optical power level to be attenuated at each layer, we take the high output state to be a value of 0.2, as shown in Fig. 5(b). The low output remains at a value of 0.0. An alternative to using a smaller amplitude for the output high state would be to add additional ports with fixed power biases to increase the total input power to the network, similarly to the XOR demonstrated in Ref. 23.

Figure 5. (a) Architecture of an $L$-layer ONN used to implement an $N$-input XOR logic function. (b) Red dots indicate the learned input-output relationship of the XOR for $N = 4$ on a 2-layer ONN. Electro-optic activation functions are configured with gain $g = 1.75\pi$ and biasing phase $\phi_b = \pi$. (c) Mean squared error (MSE) versus training epoch. (d) Final MSE after 5000 epochs averaged over 20 independent training runs vs. activation function gain. Different lines correspond to the responses shown in Fig. 2, with $\phi_b = 1.00\pi$, $0.85\pi$, $0.00\pi$, and $0.50\pi$. Shaded regions correspond to the range (minimum and maximum) of the final MSE from the 20 training runs.

In Fig. 5(b) we show the four-input XOR input-output relationship which was learned by a two-layer ONN. The electro-optic activation functions were configured to have a gain of $g = 1.75\pi$ and biasing phase of $\phi_b = \pi$. This biasing phase configuration corresponds to the ReLU-like response shown in Fig. 2(a). The black markers indicate the desired output values while the red circles indicate the output learned by the two-layer ONN. Fig. 5(b) indicates excellent agreement between the learned output and the desired output. The evolution of the mean squared error (MSE) between the ONN output and the desired output during training confirms this agreement, as shown in Fig. 5(c), with a final MSE below $10^{-5}$.

To train the ONN, a total of $2^N = 16$ training examples were used, corresponding to all possible binary input combinations along the x-axis of Fig. 5(b). All 16 training examples were fed through the network in a batch to calculate the MSE loss function. The gradient of the loss function with respect to each phase shifter was computed by backpropagating the error signal through the network to calculate the loss sensitivity at each phase shifter [23]. The above steps were repeated until the MSE converged, as shown in Fig. 5(c). Only the phase shifter parameters were optimized by the training algorithm, while all parameters of the activation function were unchanged.

To demonstrate that the nonlinearity provided by the electro-optic activation function is essential for the ONN to successfully learn the XOR, in Fig. 5(d) we plot the final MSE after 5000 training epochs, averaged over 20 independent training runs, as a function of the activation function gain, $g_\phi$.
The shaded regions indicate the minimum and maximum range of the final MSE over the 20 training runs. The four lines shown in Fig. 5(d) correspond to the four activation function bias configurations shown in Fig. 2.

For the blue curve in Fig. 5(d), which corresponds to the ReLU-like activation, we observe a clear improvement in the final MSE with an increase in the nonlinearity strength. We also observe that for very high nonlinearity, above $g_\phi = 1.5\pi$, the range between the minimum and maximum final MSE broadens and the mean final MSE increases. However, the best-case (minimum) final MSE continues to decrease, as indicated by the lower border of the shaded blue region. This trend indicates that although increasing nonlinearity improves the ONN's ability to learn the XOR function, very high levels of nonlinearity may also prevent the training algorithm from converging.

A trend of decreasing MSE with increasing nonlinearity is also observed for the activation corresponding to the green curve in Fig. 5(d). However, the range of MSE values begins to broaden at a lower value of $g_\phi = 1.0\pi$. Such broadening may be a result of the changing slope in the activation function output, as shown in Fig. 2(e). For the activation functions corresponding to the red and orange curves in Fig. 5(d), the final MSE decreases somewhat with an increase in $g_\phi$, but generally remains much higher than for the other two activation function responses. We conclude that these two responses are not as well suited for learning the XOR function. Overall, these results demonstrate that the flexibility of our architecture to achieve specific forms of nonlinear activation functions is important for the successful operation of an ONN.

B.
Handwritten Digit Classification

The second task we consider for demonstrating the activation function is classifying images of handwritten digits from the MNIST dataset, which has become a standard benchmark problem for ANNs [37]. The dataset consists of 70,000 grayscale 28 × 28 pixel images of handwritten digits between 0 and 9. Several representative images from the dataset are shown in Fig. 6(a).

To reduce the number of input parameters, and hence the size of the neural network, we use a preprocessing step to convert the images into a Fourier-space representation. Specifically, we compute the 2D Fourier transform of the images, which is defined mathematically as $c(k_x, k_y) = \sum_{m,n} e^{j k_x m + j k_y n} g(m, n)$, where $g(m, n)$ is the grayscale value of the pixel at location $(m, n)$ within the image. The amplitudes of the Fourier coefficients $c(k_x, k_y)$ are shown below their corresponding images in Fig. 6(a). These coefficients are generally complex-valued, but because the real-space map $g(m, n)$ is real-valued, the condition $c(k_x, k_y) = c^*(-k_x, -k_y)$ applies.

We observe that the Fourier-space profiles are mostly concentrated around small $k_x$ and $k_y$, corresponding to the center region of the profiles in Fig. 6(a). This is due to the slowly varying spatial features in the images. We can therefore expect that most of the information is carried by the small-$k$ Fourier components, and with the goal of decreasing the input size, we can restrict the data to the $N$ coefficients with the smallest $k = \sqrt{k_x^2 + k_y^2}$. An additional advantage of this preprocessing step is that it reduces the computational resources required to perform the training process because the neural network dimension does not need to accommodate all $28^2 = 784$ pixel values as inputs.

Fourier preprocessing is particularly relevant for ONNs for two reasons.
First, the Fourier transform has a straightforward implementation in the optical domain using techniques from Fourier optics involving standard components such as lenses and spatial filters [38]. Second, this approach allows us to take advantage of the fact that ONNs are complex-valued functions. That is to say, the $N$ complex-valued coefficients $c(k_x, k_y)$ can be handled by an $N$-dimensional ONN, whereas handling the same input using a real-valued neural network requires twice the dimension. The ONN architecture used in our demonstration is shown schematically in Fig. 6(a). The $N$ Fourier coefficients closest to $k_x = k_y = 0$ are fed into an optical neural network consisting of $L$ layers, after which a drop-mask reduces the final output to 10 components. The intensities of the 10 outputs are recorded and normalized by their sum, which creates a probability distribution that may be compared with the one-hot encoding of the digits from 0 to 9. The loss function is defined as the cross-entropy between the normalized output intensities and the correct one-hot vector.

During each training epoch, a subset of 60,000 images from the dataset were fed through the network in batches of 500. The remaining 10,000 image-label pairs were used to form a test dataset. For a two-layer network with $N = 16$ Fourier components, Fig. 6(b) compares the classification accuracy over the training dataset (solid lines) and testing dataset (dashed lines) while Fig. 6(c) compares the cross-entropy loss during optimization. The blue curves correspond to an ONN with no activation function (i.e., a linear optical classifier) and the orange curves correspond to an ONN with the electro-optic activation function configured with $g_\phi = 0.05\pi$, $\phi_b = 1.00\pi$, and $\alpha = 0.1$. The gain setting in particular was selected heuristically. We observe that the nonlinear activation function results in a significant improvement to the ONN performance during and after training. The final validation accuracy for the ONN with the activation function is 93%, which amounts to an 8% difference as compared to the linear ONN, which achieved an accuracy of 85%.

Figure 6. (a) Schematic of an optical image recognition setup based on an ONN. Images of handwritten numbers from the MNIST database are preprocessed by converting from real-space to $k$-space and selecting the $N$ Fourier coefficients associated with the smallest magnitude $k$-vectors. (b) Test accuracy (solid lines) and training accuracy (dashed lines) during training for a two-layer ONN without activation functions (blue) and with activation functions (orange). $N = 16$ Fourier components were used as inputs to the ONN and each vector was normalized such that its $L_2$ norm is unity. The activation function parameters were $g_\phi = 0.05\pi$ and $\phi_b = 1.00\pi$. (c) Cross-entropy loss during training. (d) Confusion matrix, specified in percentage, for the trained ONN with the electro-optic activation function.

The confusion matrix computed over the testing dataset is shown in Fig. 6(d). We note that the prediction accuracy of 93% is high considering that only $N = 16$ complex Fourier components were used, and the network is parameterized by only $2 \times N^2 \times L = 1024$ free parameters. Moreover, this prediction accuracy is comparable with the 92.6% accuracy achieved in a fully-connected linear classifier with 4010 free parameters taking all of the $28^2 = 784$ real-space pixel values as inputs [37]. Finally, in Table III we show that the accuracy can be further improved by including a third layer in the ONN and by making the activation function gain a trainable parameter. This brings the testing accuracy to 94%.
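The Fourier preprocessing and the intensity readout described in this subsection can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the code used for the reported results: the function names, the tie-breaking among coefficients with equal $|k|$ (a stable sort over the flattened array), the zero-norm guard, and the small `eps` inside the logarithm are all choices made here for illustration. Only the ONN input and output stages are shown; the network itself is omitted.

```python
import numpy as np


def fourier_features(image, n_coeffs=16):
    """Return the n_coeffs lowest-|k| Fourier coefficients with unit L2 norm."""
    image = np.asarray(image, dtype=float)
    # The text defines c(kx, ky) with a positive exponent. For a real-valued
    # image this equals the complex conjugate of NumPy's forward FFT, which
    # uses a negative exponent.
    coeffs = np.conj(np.fft.fft2(image))
    kx = 2 * np.pi * np.fft.fftfreq(image.shape[0])
    ky = 2 * np.pi * np.fft.fftfreq(image.shape[1])
    k_mag = np.sqrt(kx[:, None] ** 2 + ky[None, :] ** 2)
    # Assumed tie-breaking: stable sort on |k| over the flattened grid.
    order = np.argsort(k_mag, axis=None, kind="stable")[:n_coeffs]
    x = coeffs.ravel()[order]
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x


def readout_probabilities(outputs):
    """Normalize the output intensities |z|^2 by their sum."""
    intensity = np.abs(np.asarray(outputs)) ** 2
    return intensity / intensity.sum()


def cross_entropy(probs, label, eps=1e-12):
    """Cross-entropy against a one-hot target reduces to -log(p[label])."""
    return -np.log(probs[label] + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((28, 28))        # stand-in for one MNIST image
    x = fourier_features(image, n_coeffs=16)
    print(x.shape, np.linalg.norm(x))   # 16 complex inputs with unit norm
    z = rng.normal(size=10) + 1j * rng.normal(size=10)  # stand-in ONN outputs
    p = readout_probabilities(z)
    print(p.sum(), cross_entropy(p, label=3))
```

Selecting 16 complex coefficients in place of 784 real pixel values is what allows the two-layer network in the text to be described by $2 \times 16^2 \times 2 = 1024$ free parameters.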
Based on the parameters from Table I and the scaling from Table II, the 3-layer handwritten digit classification system would consume 4.8 W while performing $7.7 \times 10^{12}$ MAC/sec. Its prediction latency would be 1.5 ns.

Table III. Accuracy on the MNIST testing dataset after optimization

# Layers    Without activation    With activation
                                  Untrained    Trained*
1           85.00%                89.80%       89.38%
2           85.83%                92.98%       92.60%
3           85.16%                92.62%       93.89%

* The phase gain, $g_\phi$, of each layer was optimized during training.

VII. CONCLUSION

In conclusion, we have introduced an architecture for synthesizing optical-to-optical nonlinearities and demonstrated its use as a nonlinear activation function in a feedforward ONN. Using numerical simulations, we have shown that such activation functions enable an ONN to be successfully applied to two machine learning benchmark problems: (1) learning a multi-input XOR logic function, and (2) classifying handwritten numbers from the MNIST dataset. Rather than using all-optical nonlinearities, our activation architecture uses intermediate signal pathways in the electrical domain which are accessed via photodetectors and phase modulators. Specifically, a small portion of the optical input power is tapped out and undergoes analog processing before modulating the remaining portion of the same optical signal. Whereas all-optical nonlinearities have largely fixed responses, a benefit of the electro-optic approach demonstrated here is that signal amplification in the electronic domain can overcome the need for high optical signal powers to achieve a significantly lower activation threshold. For example, we show that a phase modulator $V_\pi$ of 10 V and an optical-to-electrical conversion gain of 57 dBΩ, both of which are experimentally feasible, result in an optical activation threshold of 0.1 mW.
We note that this nonlinearity is compatible with the in situ training protocol proposed in Ref. 23, which is applicable to arbitrary activation functions.

Our activation function architecture can utilize the same integrated photodetector and modulator technologies as the input and output layers of a fully-integrated ONN. This means that an ONN using this activation suffers no reduction in processing speed, despite using analog electrical components. The only trade-off made by our design is an increase in latency due to the electro-optic conversion process. However, we find that an ONN with dimension $N = 100$ has a total prediction latency of 2.4 ns/layer, with approximately equal contributions from the propagation of optical pulses through the interferometer mesh and from the electro-optic activation function. Conservatively, we estimate the energy consumption of an ONN with this activation function to be 100 fJ/MAC, but this figure of merit could potentially be reduced by orders of magnitude using highly efficient modulators and amplifier-free optoelectronics [29].

Finally, we emphasize that in our activation function, the majority of the signal power remains in the optical domain. There is no need to have a new optical source at each nonlinear layer of the network, as is required in previously demonstrated electro-optic neuromorphic hardware [14, 21, 39] and reservoir computing architectures [40, 41]. Additionally, each activation function in our proposed scheme is a standalone analog circuit and therefore can be applied in parallel. While we have focused here on the application of our architecture as an activation function in a feedforward ONN, the synthesis of low-threshold optical nonlinearities using this circuit could be of broader interest for optical computing as well as microwave photonic signal processing applications.
ACKNOWLEDGMENTS

This work was supported by a US Air Force Office of Scientific Research (AFOSR) MURI project (Grant No. FA9550-17-1-0002). I.A.D.W. acknowledges helpful discussions with Avik Dutt.

[1] Vivek K. Pallipuram, Mohammad Bhuiyan, and Melissa C. Smith, “A comparative study of GPU programming models and architectures using neural networks,” The Journal of Supercomputing 61, 673–718 (2012).
[2] Jeffrey M. Shainline, Sonia M. Buckley, Richard P. Mirin, and Sae Woo Nam, “Superconducting Optoelectronic Circuits for Neuromorphic Computing,” Physical Review Applied 7, 034013 (2017).
[3] Bhavin J. Shastri, Alexander N. Tait, Thomas Ferreira de Lima, Mitchell A. Nahmias, Hsuan-Tung Peng, and Paul R. Prucnal, “Principles of Neuromorphic Photonics,” arXiv:1801.00016 [physics], 1–37 (2018).
[4] F. D. Coarer, M. Sciamanna, A. Katumba, M. Freiberger, J. Dambre, P. Bienstman, and D. Rontani, “All-Optical Reservoir Computing on a Photonic Chip Using Silicon-Based Ring Resonators,” IEEE Journal of Selected Topics in Quantum Electronics 24, 1–8 (2018).
[5] Julie Chang, Vincent Sitzmann, Xiong Dun, Wolfgang Heidrich, and Gordon Wetzstein, “Hybrid optical-electronic convolutional neural networks with optimized diffractive optics for image classification,” Scientific Reports 8, 12324 (2018).
[6] Shane Colburn, Yi Chu, Eli Shilzerman, and Arka Majumdar, “Optical frontend for a convolutional neural network,” Applied Optics 58, 3179–3186 (2019).
[7] José Capmany and Dalma Novak, “Microwave photonics combines two worlds,” Nature Photonics 1, 319 (2007).
[8] David Marpaung, Chris Roeloffzen, René Heideman, Arne Leinse, Salvador Sales, and José Capmany, “Integrated microwave photonics,” Laser & Photonics Reviews 7, 506–538 (2013).
[9] Paolo Ghelfi, Francesco Laghezza, Filippo Scotti, Giovanni Serafino, Amerigo Capria, Sergio Pinna, Daniel Onori, Claudio Porzi, Mirco Scaffardi, Antonio Malacarne, Valeria Vercesi, Emma Lazzeri, Fabrizio Berizzi, and Antonella Bogoni, “A fully photonics-based coherent radar system,” Nature 507, 341 (2014).
[10] Yaser S. Abu-Mostafa and Demetri Psaltis, “Optical Neural Computers,” Scientific American 256, 88–95 (1987).
[11] Demetri Psaltis, David Brady, Xiang-Guang Gu, and Steven Lin, “Holography in artificial neural networks,” Nature 343, 325–330 (1990).
[12] Yichen Shen, Nicholas C. Harris, Scott Skirlo, Mihika Prabhu, Tom Baehr-Jones, Michael Hochberg, Xin Sun, Shijie Zhao, Hugo Larochelle, Dirk Englund, and Marin Soljačić, “Deep learning with coherent nanophotonic circuits,” Nature Photonics 11, 441–447 (2017).
[13] David A. B. Miller, “Self-configuring universal linear optical component,” Photonics Research 1, 1 (2013).
[14] Alexander N. Tait, Thomas Ferreira de Lima, Ellen Zhou, Allie X. Wu, Mitchell A. Nahmias, Bhavin J. Shastri, and Paul R. Prucnal, “Neuromorphic photonic networks using silicon photonic weight banks,” Scientific Reports 7, 7430 (2017).
[15] Marina Radulaski, Ranojoy Bose, Tho Tran, Thomas Van Vaerenbergh, David Kielpinski, and Raymond G. Beausoleil, “Thermally Tunable Hybrid Photonic Architecture for Nonlinear Optical Circuits,” ACS Photonics 5, 4323–4329 (2018).
[16] Qiaoliang Bao, Han Zhang, Zhenhua Ni, Yu Wang, Lakshminarayana Polavarapu, Zexiang Shen, Qing-Hua Xu, Dingyuan Tang, and Kian Ping Loh, “Monolayer graphene as a saturable absorber in a mode-locked laser,” Nano Research 4, 297–307 (2011).
[17] Nam Hun Park, Hwanseong Jeong, Sun Young Choi, Mi Hye Kim, Fabian Rotermund, and Dong-Il Yeom, “Monolayer graphene saturable absorbers with strongly enhanced evanescent-field interaction for ultrafast fiber laser mode-locking,” Optics Express 23, 19806 (2015).
[18] Xiantao Jiang, Simon Gross, Michael J. Withford, Han Zhang, Dong-Il Yeom, Fabian Rotermund, and Alexander Fuerbach, “Low-dimensional nanomaterial saturable absorbers for ultrashort-pulsed waveguide lasers,” Optical Materials Express 8, 3055 (2018).
[19] A. L. Lentine and D. A. B. Miller, “Evolution of the SEED technology: Bistable logic gates to optoelectronic smart pixels,” IEEE Journal of Quantum Electronics 29, 655–669 (1993).
[20] Arka Majumdar and Armand Rundquist, “Cavity-enabled self-electro-optic bistability in silicon photonics,” Optics Letters 39, 3864 (2014).
[21] Alexander N. Tait, Thomas Ferreira de Lima, Mitchell A. Nahmias, Heidi B. Miller, Hsuan-Tung Peng, Bhavin J. Shastri, and Paul R. Prucnal, “Silicon Photonic Modulator Neuron,” Physical Review Applied 11, 064043 (2019).
[22] Edmondo Trentin, “Networks with trainable amplitude of activation functions,” Neural Networks 14, 471–493 (2001).
[23] Tyler W. Hughes, Momchil Minkov, Yu Shi, and Shanhui Fan, “Training of photonic neural networks through in situ backpropagation and gradient measurement,” Optica 5, 864–871 (2018).
[24] M. N. Ahmed, J. Chong, and D. S. Ha, “A 100 Gb/s transimpedance amplifier in 65 nm CMOS technology for optical communications,” in 2014 IEEE International Symposium on Circuits and Systems (ISCAS) (2014) pp. 1885–1888.
[25] K. T. Settaluri, C. Lalau-Keraly, E. Yablonovitch, and V. Stojanović, “First Principles Optimization of Opto-Electronic Communication Links,” IEEE Transactions on Circuits and Systems I: Regular Papers 64, 1270–1283 (2017).
[26] K. Nozaki, S. Matsuo, A. Shinya, and M. Notomi, “Amplifier-Free Bias-Free Receiver Based on Low-Capacitance Nanophotodetector,” IEEE Journal of Selected Topics in Quantum Electronics 24, 1–11 (2018).
[27] Shankar Kumar Selvaraja, Peter De Heyn, Gustaf Winroth, Patrick Ong, Guy Lepage, Celine Cailler, Arnaud Rigny, Konstantin K.
Bourdelle, Wim Bogaerts, Dries Van Thourhout, Joris Van Campenhout, and Philippe Absil, “Highly uniform and low-loss passive silicon photonics devices using a 300mm CMOS platform,” in Optical Fiber Communication Conference (2014), Paper Th2A.33 (Optical Society of America, 2014) p. Th2A.33.
[28] Jaime Cardenas, Mark A. Foster, Nicolás Sherwood-Droz, Carl B. Poitras, Hugo L. R. Lira, Beibei Zhang, Alexander L. Gaeta, Jacob B. Khurgin, Paul Morton, and Michal Lipson, “Wide-bandwidth continuously tunable optical delay line using silicon microring resonators,” Optics Express 18, 26525 (2010).
[29] Kengo Nozaki, Shinji Matsuo, Takuro Fujii, Koji Takeda, Akihiko Shinya, Eiichi Kuramochi, and Masaya Notomi, “Femtofarad optoelectronic integration demonstrating energy-saving signal conversion and nonlinear functions,” Nature Photonics (2019), 10.1038/s41566-019-0397-3.
[30] G. Yu, X. Zou, L. Zhang, Q. Zou, M. Zheng, and J. Zhong, “A low-noise high-gain transimpedance amplifier with high dynamic range in 0.13 µm CMOS,” in 2012 IEEE International Symposium on Radio-Frequency Integration Technology (RFIT) (2012) pp. 37–40.
[31] Yichen Shen, Nicholas C. Harris, Scott Skirlo, Mihika Prabhu, Tom Baehr-Jones, Michael Hochberg, Xin Sun, Shijie Zhao, Hugo Larochelle, Dirk Englund, and Marin Soljačić, “Supplementary information: Deep learning with coherent nanophotonic circuits,” Nature Photonics 11, 441–446 (2017).
[32] C. Koos, L. Jacome, C. Poulton, J. Leuthold, and W. Freude, “Nonlinear silicon-on-insulator waveguides for all-optical signal processing,” Optics Express 15, 5976–5990 (2007).
[33] “Neuroptica: An optical neural network simulator,” https://github.com/fancompute/neuroptica/.
[34] “Neurophox: A simulation framework for unitary neural networks and photonic devices,” https://github.com/solgaardlab/neurophox/.
[35] Sunil Pai, Ben Bartlett, Olav Solgaard, and David A. B.
Miller, “Matrix Optimization on Universal Unitary Photonic Devices,” Physical Review Applied 11, 064044 (2019).
[36] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, “TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems,” (2015), software available from tensorflow.org.
[37] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE 86, 2278–2324 (1998).
[38] Joseph W. Goodman, Introduction to Fourier Optics (Roberts and Company Publishers, 2005).
[39] H. Peng, M. A. Nahmias, T. F. de Lima, A. N. Tait, and B. J. Shastri, “Neuromorphic Photonic Integrated Circuits,” IEEE Journal of Selected Topics in Quantum Electronics 24, 1–15 (2018).
[40] L. Larger, M. C. Soriano, D. Brunner, L. Appeltant, J. M. Gutierrez, L. Pesquera, C. R. Mirasso, and I. Fischer, “Photonic information processing beyond Turing: An optoelectronic implementation of reservoir computing,” Optics Express 20, 3241 (2012).
[41] François Duport, Anteo Smerieri, Akram Akrout, Marc Haelterman, and Serge Massar, “Fully analogue photonic reservoir computer,” Scientific Reports 6, 22381 (2016).
