A Winograd-based Integrated Photonics Accelerator for Convolutional Neural Networks

JOURNAL OF SELECTED TOPICS IN QUANTUM ELECTRONICS

Armin Mehrabian, Member, IEEE, Mario Miscuglio, Member, OSA, Yousra Alkabani, Member, IEEE, Volker J. Sorger, Senior Member, IEEE, and Tarek El-Ghazawi, Fellow, IEEE

Abstract—Neural networks (NNs) have become the mainstream technology in the artificial intelligence (AI) renaissance over the past decade. Among the different types of neural networks, convolutional neural networks (CNNs) have been widely adopted, as they have achieved leading results in many fields such as computer vision and speech recognition. This success is due in part to the widespread availability of capable underlying hardware platforms. In parallel, hardware specialization can expose us to novel architectural solutions, which can outperform general-purpose computers for the tasks at hand. Although different applications demand different performance measures, they all share speed and energy efficiency as high priorities. Meanwhile, photonic processing has seen a resurgence due to its inherently high speed and low power consumption. Here, we investigate the potential of using photonics in CNNs by proposing a CNN accelerator design based on the Winograd filtering algorithm. Our evaluation results show that while a photonic accelerator can compete with current state-of-the-art electronic platforms in terms of both speed and power, it has the potential to improve energy efficiency by up to three orders of magnitude.

Index Terms—Convolutional Neural Networks, Photonics, Winograd

I. INTRODUCTION

The field of AI has undergone revolutionary progress over the past decade. Wide availability of data and cheaper-than-ever compute resources have contributed immensely to this growth.
At the same time, advances in the field of modern neural networks, known as deep learning (DL), have attracted the attention of academia and industry. This popularity is mainly owed to neural networks' success in a large gamut of AI applications, including but not limited to computer vision, speech recognition, and natural language processing. Among the different types of neural networks, CNNs are considered the most viable architecture for AI applications; they are remarkably versatile across most AI tasks. However, all of this comes at the price of high computational cost. In the meantime, the use of integrated photonics to implement neuron functionalities in neural networks has taken shape as an attainable near-future technology for limiting power consumption and increasing operating speed [1][2][3]. Photonics benefits from the coherent nature of electromagnetic waves, which interfere while propagating through a photonic integrated circuit (PIC). Central to many AI techniques and algorithms is the implementation of hardware solutions that mimic the multiply-and-accumulate (MAC) function. The main advantage of photonic neural networks over electronics is that the energy consumed to perform a series of multiplications and additions does not scale with MAC speed. Training an optical neural network necessitates active modulation of the optical signal in a hybrid optical-electronic configuration [4]. For this reason, these architectures face significant hurdles when compared to their electronic counterparts. To be competitive, they are expected to have low power consumption and high-speed electro-optic modulation [5][6][7]. Additionally, they need to be paired with electrical-to-optical (EO) and optical-to-electrical (OE) converters and I/O interfaces. However, once trained, photonic neural networks do not rely on any additional energy for active switching.
Therefore, architectures that perform tasks such as weighting can be realized as completely passive, and the computations happen without the consumption of any dynamic power [8][9][10]. In this panorama, all-optical neural networks (AONNs) represent a promising future. Current all-optical implementations in free space [11] and in integrated photonics [12][13][14] can outperform their electronic counterparts, promising great energy efficiency and speed enhancements for learning tasks. In this manuscript, we explore the potential of using high-speed, low-power photonics in a CNN accelerator by exploiting coherent all-optical matrix multiplication with wavelength division multiplexing (WDM), using microring resonator (MRR) weight banks. Our architecture is inspired by [15][16], where the Winograd filtering algorithm is adopted to perform convolution, speeding up execution time and reducing computational complexity. We investigate the performance of our proposed architecture in terms of speed and power. Since our proposed architecture is analog at its core, we also investigate the robustness of neural networks executed on our proposed design in terms of tolerance against noise. We summarize the main contributions of this work as:
• a first proposed photonic CNN architecture based on the Winograd filtering algorithm;
• an analytical framework to evaluate the speed performance of our proposed accelerator;
• an in-house simulator, based on a modified Google TensorFlow tool, to simulate the performance of our proposed photonic accelerator with power and noise awareness;
• a modified training process to enhance robustness to inevitable hardware noise sources during the inference stage.

© 2019 IEEE

II. CONVOLUTIONAL NEURAL NETWORKS (CNNS)

A CNN is a neural network comprised of one or more convolutional layers.
CNNs are mostly known for their great performance on image data; however, their application extends to many other data types with local features. At a very high level, each convolutional layer uses a collection of feature detectors, known as filters, that scan input data for the presence or absence of a particular set of features. Hence, in a CNN layer, inputs and outputs are referred to as feature maps (fmaps). By cascading multiple such convolutional layers, a hierarchy of feature detectors is formed. In this hierarchy, feature detectors closer to the input detect primitive features. As we move toward the final layers, the detected features become more abstract. Conventionally, each filter in a CNN is three-dimensional, with the first two dimensions being the height and width of the filter, and the last dimension, known as the channel dimension, spanning the input channels. The use of convolutional filters to scan input data had been practiced well before the rise of deep learning and CNNs. However, in traditional signal processing, such filters are hand-engineered by experts, which can be costly, designed only for specific purposes, and vulnerable to designer bias. In a modern CNN, these filters are learned through the training process. Figure 1 shows the overall architecture of a CNN layer.

Fig. 1: A single layer of a CNN. Each of the N filters (left) scans the input feature maps (middle) for features. This results in output feature maps with N channels, equal to the total number of filters.

III. PHOTONIC REALIZATION OF CNNS

In data communication and computation, photonics has the potential to offer practical solutions to overcome some of the limitations currently facing electronic systems. In a neuromorphic system, processing elements (PEs) are arranged in a distributed fashion with an ideally large number of incoming (fan-in) and outgoing (fan-out) connections.
Inspired by biological neural systems, some of these connections are required to connect neurons across the farthest parts of the brain. In addition, neuromorphic PEs are largely special-purpose processors, in contrast to general-purpose processors. Neuromorphic processing can benefit from photonics in three major ways. First, photonics can significantly reduce the energy consumed in interconnects among PEs by avoiding the energy dissipated in charging and discharging electrical wires. Second, current neuromorphic algorithms, and neural networks and CNNs in particular, rely heavily on the multiply-and-accumulate (MAC) operation, which can be realized with very low energy budgets in photonics. Third, photonics can increase communication and computation bandwidth by exploiting WDM. The adoption of WDM allows for a higher density of computation and communication between PEs by packing more channels and parallel computations into a neuromorphic processor.

A. Photonic Convolution Kernels and the MAC Operation

One major advantage of a photonic MAC operation is that it can be performed with almost zero energy [17]. However, if the signal is converted from optical to electrical, the conversion and successive electronic manipulations impose additional energy loss. To build a photonic convolutional filter, we use the microring resonator (MRR) network proposed in [18]. Figure 2 depicts a single MRR neuron. In this scheme, input WDM signals are weighted through tunable MRRs. The weighted inputs are then incoherently summed by a photodetector, which amounts to a MAC operation. Thus, by using N wavelengths, it is possible to establish up to N² independent connections. The maximum N with current technologies is estimated to be around 108 channels, resulting in a total of about 10k connections [19].
It should be noted that having closely spaced wavelengths as multiple laser sources, while tuning the rings to match both resonance and free spectral range (FSR), is a very challenging task. However, on the source side, a set of phase-locked, equally spaced laser frequency lines can be generated using tunable optical resonators in a chip-based frequency comb generator [20]. Moreover, on the MRR side, our system can leverage dense WDM (DWDM). This is achievable, thanks to the strong optical confinement of silicon waveguides, using tunable MRRs with more than 50 nm of FSR and quality factors Q close to 10⁴, which allows as many as 50 channels [21]. Assuming approximately 0.8 nm channel spacing, the resonance bandwidth can be broadened up to 0.4 nm while maintaining an estimated cross-talk level of −10 dB [22]. Most modern neural networks have one or more fully-connected layers, which create N² synaptic connections. On the other hand, 10k connections are barely sufficient to implement even miniature fully-connected neural networks on simple benchmark datasets such as MNIST, which has 784 neurons in the input layer alone. In contrast, CNNs benefit from sparse connections between local input regions and filters. A common CNN architecture usually has filters of shape 3 × 3 up to 11 × 11 that connect receptive fields and filters. From a functional point of view, smaller filters are favored over larger filters, as they are capable of detecting finer local patterns. Larger and global patterns are detected in the layers closer to the output of the CNN. These features are more abstract and are built on top of the previous low-level features.

Fig. 2: A broadcast-and-weight neuron. Inputs X_i modulate lasers of different wavelengths. The modulated beams are then bundled through WDM.

Fig. 3: Microring resonator (MRR) operation for performing point-wise multiplication.
We use the scheme in Figure 2 to perform the two Winograd transformations and one element-wise matrix multiplication (EWMM) on each wavelength. Figure 3 shows the details of an MRR weighting function that operates on a single wavelength λ₀. The MRR acts as a tunable analog filter centered at λ₀, in which the voltage applied to the electro-optic modulator (EOM) lets only a portion of the light travel through the waveguide. The modulation can be triggered by an analog electric field generated by a memristor. In this work we use a memristive device that can store the weights with 6 bits of resolution [23]. The transmission spectrum T of the ring has a natural resonant frequency at λ₀. When WDM light passes through the coupled waveguide, the component with wavelength λ₀ is coupled into the ring. By raising the bias voltage to V₁, the resonant frequency shifts to λ₁ due to the change in the effective refractive index of the ring. The difference between V₀ and V₁ controls the difference between λ₀ and λ₁, i.e., the transmission (Δᵢ). The variation of the transmission at λ₀ represents, in our scheme, the point-wise multiplication. The most commonly used MRR modulator has a silicon-based p-i-n junction that is side-coupled to a waveguide, as described in [24], or a p-n junction, as reported in [25]. Current silicon-based MRR modulators [26][27][28], as well as foundry-level implementations, exhibit speeds up to 50 GHz, with driving voltages of usually a few volts (1–2 V) and an efficiency (VπL) of a few tenths of V·cm. Experimental results that corroborate our estimation are reported in [29], where silicon-based electro-optic MRRs exhibit modulation over a working spectrum of 0.1 nm and a speed of 11 GHz, while having an insertion loss as low as 2 dB. This is by no means a limiting factor in the inference stage, considering that the network has been trained and the weights are set.
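To make the weighting mechanism concrete, the sketch below models the MRR as a simple Lorentzian notch filter and inverts the model to find the resonance shift that realizes a given weight. The line shape, bandwidth, and floor transmission are illustrative assumptions, not measured device parameters:

```python
import numpy as np

def mrr_transmission(lam, lam_res, fwhm=0.1, t_min=0.01):
    # Illustrative Lorentzian notch: transmission dips to t_min on resonance
    # and approaches 1 far off resonance (wavelengths and fwhm in nm).
    delta = (lam - lam_res) / (fwhm / 2.0)
    return 1.0 - (1.0 - t_min) / (1.0 + delta ** 2)

def detuning_for_weight(w, fwhm=0.1, t_min=0.01):
    # Invert the notch model: resonance shift (nm) that makes the
    # transmission at the signal wavelength equal to the weight w.
    assert t_min <= w < 1.0
    delta = np.sqrt((1.0 - t_min) / (1.0 - w) - 1.0)
    return delta * (fwhm / 2.0)

lam0 = 1550.0     # signal wavelength (nm)
x, w = 0.8, 0.4   # input encoded as optical power, desired weight
shift = detuning_for_weight(w)          # set by the bias voltage in practice
y = x * mrr_transmission(lam0, lam0 + shift)
assert abs(y - x * w) < 1e-9            # weighting acts as multiplication
```

The inversion step mirrors what the bias voltage (or memristor state) does physically: it detunes the ring so that the transmission at λ₀ equals the desired weight.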
Therefore, the latency of the network is given by the time-of-flight of the photons. Besides the uncertainty due to fabrication imperfections, which can be compensated, the main source of noise affecting an MRR modulator is electrical noise and, in this case, eventual non-idealities in setting the analog voltages with a memristive device, which can vary over time. Moreover, at high data rates (> 20 Gb/s), intra-channel cross-talk becomes relevant, and power penalties need to be considered [30][31]. Regarding the operating dynamic power, the optical power flowing in each physical channel of the photonic accelerator is bounded above by the optical power that would produce nonlinearities in the silicon waveguides, and below by the minimum power that the photodetector can distinguish from noise (SNR = 1). This sets the upper and lower operating range of the photodiode, which we refer to as the dynamic range of the photodiode. Foundry-level [32] integrated germanium photodiodes can reach up to 40 GHz with a responsivity of 0.6 A/W and a noise-equivalent power (NEP) of around 1 pW/√Hz, operating in reverse bias (−2 V). Research-level photodetectors working in the 100s of GHz range have also been demonstrated [33][34][35]. However, the dynamic range of the photodiode needs to be accurately set to avoid saturation and to account for the bit resolution [12]. For this scheme, according to the bit resolution, the estimated dynamic range is 20 dB. The speed of the optical part of the accelerator, without considering the I/O interface, is, according to [36], given by the total number of MRRs and their pitch. Photodetection and phase cross-talk are expected to be the main sources of error in the proposed scheme. Another issue in using MRRs is attributed to variations in device fabrication, which can result in a spectral shift of the resonance frequency. The resonance frequency of MRRs can be tuned in multiple ways.
Due to the high optical sensitivity of materials such as silicon to temperature, thermal tuning is the most widely adopted tuning technique for MRRs. This can be achieved by placing micro-heaters on top of each MRR [37][38].

B. Memristors as Analog Weight Storage

Neuromorphic systems inspired by the human brain rely upon two major principles, namely massively distributed processing and the proximity of local memory to the processing elements. These memory units demand some level of programmability (plasticity), with their programming speed requirements being in the MHz regime. At this time, almost all state-of-the-art neural networks perform the training and inference phases separately. This means that once the weights are trained and set, they do not need to change during the inference phase. In addition, weights in our proposed system are represented by an analog voltage bias on the MRRs. A potential weight storage for our system should therefore be analog and non-volatile, with a long retention time. Accordingly, memristive memory devices have attracted the attention of researchers due to their interesting characteristics, including but not limited to non-volatility, long state retention time, and ultra-low power consumption [39][40][23]. Over the past few years, the bit resolution of such memristive memory devices has risen monotonically [41][42][43][44]. Recently, the authors of [23] proposed, fabricated, and evaluated an analog multi-bit memristive memory with a bit resolution of up to 6.5 bits. Each memristive device takes up 20 µm × 20 µm in area and can retain its resistance state for up to 8 hours. In AlexNet, the 3rd convolutional layer has the largest number of convolutional filter weights, equal to 884,736. Assuming overhead circuitry increases the footprint to approximately 50 µm × 50 µm, the memristive memory required for the largest layer of AlexNet can be realized in less than 0.25 cm².

IV. FAST ALGORITHMS FOR THE CONVOLUTION OPERATION

As the name suggests, the convolution operation accounts for the bulk of all operations in a CNN during both the training and inference stages. However, the training and inference stages each demand a different type of performance. During training, the emphasis is more on throughput than on latency. This is mainly because the model being trained needs to observe a large "ensemble" of data, the batch, as fast as possible; time is therefore amortized over many inputs. During the inference stage, on the other hand, applications are mostly latency sensitive. For instance, a self-driving car application may only need to process a few input image scenes per second, but each must be processed at a very low latency. That said, a neuromorphic processor designed for inference is expected to satisfy stringent timing requirements. An important parameter shown to have a significant impact on the latency of CNNs is the size of their filters. It is generally known, from a functional point of view, that CNNs with smaller filters are preferred over CNNs with large filters [45][46][47]. Table I shows the breakdown of filter sizes for some state-of-the-art CNN architectures. This is mainly because small filters are better at finding local features without sacrificing resolution. More abstract and more global features can be detected in higher layers of a CNN, built on the previous local-layer features.

TABLE I: Kernel size breakdown in state-of-the-art CNNs. Filters of size 5 × 5 comprise only a minute fraction of the total.

CNN          | 1×1 (%) | 3×3 (%) | Small 1D filters (%) | 5×5 (%)
GoogLeNet    | 64.9    | 17.5    | 1.7                  | 15.9
Inception V3 | 43.2    | 17.9    | 35.7                 | 3.2
Inception V4 | 40.9    | 16.1    | 43                   | 0
MobileNet    | 93.3    | 6.7     | 0                    | 0
ResNet50     | 68.5    | 29.6    | 1.9                  | 0
VGG16        | 0       | 100     | 0                    | 0

As we discussed in section III, a physical implementation of photonic MRRs favors small filters due to the limited number of available wavelength bands. This synergy between the functional and photonic realizations of CNNs is the primary motivation behind this work. At the time of writing, there are three major ways to speed up the convolution operation. The first is the General Matrix Multiplication (GEMM) approach, in which the convolution is converted to a matrix multiplication using a Toeplitz matrix. The downside of this method is that the Toeplitz conversion expands the input by a factor of r × r, where r is the size of the filter. The second method uses the Fast Fourier Transform (FFT) to perform tiled convolution operations. From the Fourier theorem we know that cyclic convolution can be performed by transforming the input and filters into the Fourier domain. An element-wise multiplication (also known as a Hadamard product) then yields the equivalent of convolution, but in the Fourier domain, and an inverse FFT transforms the calculated convolution back into the original domain. FFT-based convolution had been the method of choice for the convolution operation [48][49][50] until the recent past. Lately, it has been shown that FFT-based convolution is better suited to larger filter sizes [15]. The third method uses the Winograd filtering algorithm, which we explain in detail in the following section.

A. Winograd Algorithm

In a 2D convolution, a single output component is calculated as

y_{n,k,p,q} = \sum_{c=1}^{C} \sum_{x=1}^{r} \sum_{y=1}^{r} x_{n,c,p+x,q+y} × w_{k,c,x,y}   (1)

The operation in equation 1 is repeated for all output convolution components. In a brute-force convolution, the total number of multiplications required to perform a full convolution is equal to

(m × r)²   (2)

where m is the size of the output feature map channel and r is the size of the filter.
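The FFT-based method can be sketched in a few lines: transforming the input and filter, multiplying element-wise, and inverse-transforming yields a cyclic convolution, whose non-wrapped entries match the linear convolution. This is a minimal NumPy illustration, not the tiled implementation of [48]–[50]:

```python
import numpy as np

# Fourier theorem in action: element-wise multiplication in the frequency
# domain equals cyclic convolution in the original domain.
rng = np.random.default_rng(0)
n, r = 8, 3
d = rng.standard_normal(n)   # input signal
w = rng.standard_normal(r)   # filter taps

# FFT path: zero-pad the filter to the signal length, multiply, invert.
y_fft = np.real(np.fft.ifft(np.fft.fft(d) * np.fft.fft(w, n)))

# Direct cyclic convolution for comparison.
y_ref = np.array([sum(d[(i - k) % n] * w[k] for k in range(r))
                  for i in range(n)])
assert np.allclose(y_fft, y_ref)

# Entries unaffected by wrap-around match the linear convolution.
assert np.allclose(y_fft[r - 1:], np.convolve(d, w)[r - 1:n])
```

For an n-point signal this costs O(n log n) instead of O(n·r), which is why the method pays off for larger filters.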
At the time of writing, Winograd convolution is the most efficient convolution algorithm in use for CNNs [15]. Winograd convolution is based on minimal filtering principles.

Fig. 4: High-level flow diagram of the Winograd filtering technique for the convolution operation. Unlike conventional convolution, which computes a single output at a time, the Winograd algorithm computes a tile of outputs, here of size m × m. To generate one output tile, Winograd fetches an input tile of size n × n. Both the input tile and the filter are transformed into the Winograd domain, where they are multiplied element-wise. Finally, the output of the element-wise multiplication is transformed back into the original domain, and channels are collapsed into a single value per tile element.

The algorithm states that in order to calculate m outputs with a finite impulse response (FIR) filter of size r, denoted F(m, r), the number of required multiplications is

n = m + r − 1   (3)

While equation 3 is derived for the 1D convolution operation, it can be nested with itself to obtain a 2D convolution. The number of multiplications needed for the same 2D convolution is therefore

(m + r − 1)²   (4)

From equations 2 and 4 we can infer that Winograd reduces the complexity by a factor of

(mr)² / (m + r − 1)²   (5)

It should be noted that in our proposed photonic accelerator, multiplication operations are carried out by MRRs. Any reduction in the total number of multiplication operations, and thus MRRs, saves us not only design footprint but also design complexity. Now, to understand how minimal Winograd filtering works, let us first consider the case of 1D convolution. Let matrix W be the matrix of weights, and matrix D the data matrix.
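Equations 2, 4, and 5 reduce to simple arithmetic; a quick sketch for common tile sizes with the r = 3 filters favored by Table I:

```python
# Multiplications per 2D output tile: direct convolution (equation 2)
# versus Winograd (equation 4); their ratio is equation 5.
for m, r in [(2, 3), (4, 3), (6, 3)]:
    direct = (m * r) ** 2
    winograd = (m + r - 1) ** 2
    print(f"F({m}x{m}, {r}x{r}): {direct} -> {winograd} multiplications "
          f"({direct / winograd:.2f}x reduction)")
```

For F(4 × 4, 3 × 3), the case used later in this paper, the reduction is 144 → 36 multiplications, i.e. a factor of 4.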
Winograd computes the F(2, 3) convolution as

[ d0 d1 d2 ]   [ w0 ]   [ m1 + m2 + m3 ]
[ d1 d2 d3 ] × [ w1 ] = [ m2 - m3 - m4 ]   (6)
               [ w2 ]

where the values m_i are intermediate values found by

m1 = (d0 - d2) · w0
m2 = (d1 + d2) · (w0 + w1 + w2) / 2
m3 = (d2 - d1) · (w0 - w1 + w2) / 2
m4 = (d1 - d3) · w2

The above equations show that with only 4 multiplications between inputs and weights, Winograd can compute an F(2, 3) convolution. All terms involving only the w_i can be pre-computed after the training stage. In other words, during inference, while the data values d_i change with the inputs, the w_i values remain the same. The 1D Winograd transform can be expressed in closed matrix form as

Y = A^T [(G × w) ⊙ (B^T × d)]   (7)

where A^T, B^T, and G are three heuristic transforms described by equations 8, 9, and 10, w is the weight vector, and d is the input vector:

A^T = [ 1  1   1   0 ]
      [ 0  1  -1  -1 ]   (8)

B^T = [ 1   0  -1   0 ]
      [ 0   1   1   0 ]
      [ 0  -1   1   0 ]
      [ 0   1   0  -1 ]   (9)

G = [  1    0    0  ]
    [ 1/2  1/2  1/2 ]
    [ 1/2 -1/2  1/2 ]
    [  0    0    1  ]   (10)

One conclusion from equation 6 is that to compute m outputs of a 1D convolution, only a window of (m + r − 1) input values is needed. In a modern CNN, the bulk of convolution operations are 2D convolutions. Equation 7 can easily be extrapolated to 2D convolution by nesting two 1D Winograd convolutions.
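The 1D transform of equations 7–10 can be checked numerically against a direct sliding dot product (the correlation form used in CNNs, with no filter flip). A minimal NumPy sketch:

```python
import numpy as np

# F(2,3) Winograd transforms (equations 8-10).
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)
BT = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]], dtype=float)
G = np.array([[1, 0, 0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0, 0, 1]], dtype=float)

d = np.array([1.0, 2.0, 3.0, 4.0])  # input window of m + r - 1 = 4 values
w = np.array([0.5, -1.0, 2.0])      # 3-tap filter

# Y = A^T [(G w) * (B^T d)]: only 4 multiplies touch both data and weights.
y = AT @ ((G @ w) * (BT @ d))

# Reference: two sliding dot products, as on the left-hand side of eq. 6.
y_ref = np.array([d[0:3] @ w, d[1:4] @ w])
assert np.allclose(y, y_ref)
```

Since G @ w depends only on the trained weights, it can be precomputed once per layer; only B^T d and the final A^T reduction involve runtime data.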
The resulting 2D Winograd transform is

Y = A^T [(G × w × G^T) ⊙ (B^T × d × B)] A   (11)

From [15], for the case of F(4 × 4, 3 × 3), the matrices A^T, B^T, and G have the forms

A^T = [ 1  1   1  1   1  0 ]
      [ 0  1  -1  2  -2  0 ]
      [ 0  1   1  4   4  0 ]
      [ 0  1  -1  8  -8  1 ]   (12)

B^T = [ 4   0  -5   0  1  0 ]
      [ 0  -4  -4   1  1  0 ]
      [ 0   4  -4  -1  1  0 ]
      [ 0  -2  -1   2  1  0 ]
      [ 0   2  -1  -2  1  0 ]
      [ 0   4   0  -5  0  1 ]   (13)

G = [  1/4     0     0   ]
    [ -1/6  -1/6  -1/6   ]
    [ -1/6   1/6  -1/6   ]
    [ 1/24  1/12   1/6   ]
    [ 1/24 -1/12   1/6   ]
    [   0     0     1    ]   (14)

The number of additions and multiplications required for the Winograd transforms themselves, not the element-wise multiplications, increases quadratically with the tile size. Thus, Winograd is expected to perform most efficiently for smaller filter sizes, and thus smaller input tiles.

Algorithm 1: Winograd for 2D convolution
for row = 0; row < H; row += m do
    for column = 0; column < H; column += m do
        for channel = 0; channel < c; channel += 1 do
            for filter = 0; filter < N; filter += 1 do
                load input tile;
                transform input tile;
                load transformed filter;
                perform EWMM
            end
        end
    end
    Output Winograd convolution result
end

V. ARCHITECTURE DESIGN

In this paper we propose a photonic CNN accelerator based on the Winograd algorithm and realized using the photonic neuron introduced in [19]. Figure 4 depicts the architecture of a single Winograd PE. Our proposed accelerator processes a single layer of a CNN at a time. This is mainly because different tiles of the output feature maps are computed sequentially and thus arrive at different times, whereas all inputs from the previous layer must be available and synchronized before processing of the next layer can begin. Our approach of processing one layer at a time enforces this synchronization. Furthermore, implementing multiple layers of a CNN would result in large area overheads.
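Equation 11 can be verified numerically. For brevity, the sketch below uses the smaller F(2 × 2, 3 × 3) case, nesting the F(2, 3) matrices of equations 8–10; the F(4 × 4, 3 × 3) matrices work identically with 6 × 6 input tiles:

```python
import numpy as np

# F(2,3) transforms (equations 8-10), nested to form F(2x2, 3x3).
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)
BT = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]], dtype=float)
G = np.array([[1, 0, 0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0, 0, 1]], dtype=float)

rng = np.random.default_rng(1)
d = rng.standard_normal((4, 4))   # input tile, n = m + r - 1 = 4
g = rng.standard_normal((3, 3))   # 3x3 filter

U = G @ g @ G.T                   # filter in the Winograd domain
V = BT @ d @ BT.T                 # input tile in the Winograd domain
Y = AT @ (U * V) @ AT.T           # equation 11: A^T [(GwG^T) o (B^T d B)] A

# Reference: direct 2x2 sliding-window correlation with the 3x3 filter.
Y_ref = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                  for i in range(2)])
assert np.allclose(Y, Y_ref)
```

In the accelerator, the element-wise product U * V is the only step performed by the photonic EWMM unit; the transforms wrap around it.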
At the input of our accelerator, an input tile of shape n × n × c, along with filters of size r × r × c, are transformed into the Winograd domain. The input and filter transforms are then multiplied element by element. The output of this multiplication is transformed back using an inverse Winograd transform. The signals at this stage are digitized using an array of ADCs and placed onto the output line buffers to be stored back in the off-chip memory. Figure 5 presents an overview of our proposed architecture. It runs on two clock domains. The first is a high-speed 5 GHz clock domain, which accommodates the low-latency components of the accelerator, including the photonic components. In section VI-A we explain our rationale for the 5 GHz high-speed clock frequency. The rest of the accelerator, including the input feature map buffers, filter buffers, filter Winograd DSP module, and filter-path DAC, runs on a slower clock domain, because there is no time sensitivity on the filter path or on data transfers to and from the off-chip memory. At the heart of our accelerator is an element-wise matrix multiplication (EWMM) unit, which we implement in photonics using photonic neurons. We store the input feature maps and filters in an off-chip memory. Both the input feature maps and the filters must go through the Winograd transformation, which consists of the matrix multiplications described in equations 13 and 14. It should be noted that while input feature maps change for different input tiles, filters are fixed for each layer. For that reason, we implement the input feature map transformations in photonics and the filter Winograd transformations in an electronic DSP. This way, we do not pay the overhead associated with a photonic implementation, including the conversion of the electronic filters to photonics. Later, the transformed filters and input feature map tiles are converted into analog signals to modulate the laser beams.
However, as the filters are fixed over the processing time of a layer, the analog filter signals need to be maintained for that time. Thus, we propose to use a non-volatile analog memristive memory bank, which maintains these voltages in their analog form with long retention times. In Winograd convolution, a tile of size n × n is processed in each iteration. In order to process an entire feature map, the transformed filter tile needs to move across the input feature map. In this paradigm, two successive input tiles share (r − 1) × n elements. This introduces data reuse opportunities that avoid multiple queries of the same data block. Our goal here is to exploit this opportunity at the front end of our accelerator. Our design is inspired by the work in [16], where the authors utilize line buffers to avoid redundant queries to the off-chip memory. Figure 6 shows an example line buffer design to load and hold a 3 × 3 input feature map tile. Input tiles are fetched from off-chip memory and loaded into the line buffer. Buffered tiles are then passed to the digital-to-analog converters (DACs) over parallel channels. In parallel with the input data stream, the transformed filter weights are converted into analog signals to program the analog memristive memory. We then use the voltages generated from the stored analog signals to modulate the laser source for the filters.

Fig. 5: High-level architecture of our proposed photonic accelerator. Input feature maps and filters are initially stored in an off-chip memory. Input tiles of size n × n are loaded into the input line buffer one at a time. Kernel weights do not change once the CNN is trained; thus, we perform the filters' Winograd transform in electronics, and the cost is amortized over many input tiles. The Winograd transform for input feature map tiles is computed in photonics. The photonic element-wise matrix multiplication (EWMM) unit performs the core Winograd element-wise multiplications. Outputs are digitized and placed onto the output line buffer. Finally, the processed layer outputs are stored in the off-chip memory.

Each output signal generated by a DAC is then used to modulate a laser beam of a particular wavelength λ_i. It is worth noting that for each set of filters modulated by the laser source, the input line goes through multiple iterations corresponding to different input tiles. Once both the input tile laser beams and the filter laser beams are ready, the EWMM unit multiplies each element of the Winograd input feature map tile with its corresponding Winograd filter value. The output of the EWMM unit must then be transformed back from the Winograd domain into the original domain by the inverse Winograd transform. The result contains output feature map tiles for multiple channels c. Lastly, the output feature map values are digitized and stored back into the off-chip memory. A key principle in HPC is to minimize I/O and other communication latencies relative to the computation time, in order to avoid unit under-utilization. From Algorithm 1 we can see that the two innermost loops iterate over the different channels of the input feature map tile and the different filters. Moreover, the operations within these two loops are independent of one another. This provides parallelization opportunities at the cost of additional hardware. In other words, the amount of parallelization and speedup we can achieve scales linearly with the number of pipeline replications in our system. This linear scaling plateaus as soon as the computation bandwidth approaches the data transfer bandwidth. Our envisioned design uses an arbitrary number of 100 parallel paths. Our evaluation results in the next section justify this selection.
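The (r − 1) × n overlap between successive tiles, which the line buffer exploits, can be demonstrated directly; the 8 × 8 feature map and F(2 × 2, 3 × 3)-style tiling below are illustrative choices:

```python
import numpy as np

# With output tile m and filter r, Winograd fetches n x n input tiles
# (n = m + r - 1) at stride m, so horizontally adjacent tiles share
# (r - 1) columns of n rows each -- the reuse a line buffer exploits.
H, m, r = 8, 2, 3
n = m + r - 1
fmap = np.arange(H * H).reshape(H, H)   # toy 8x8 feature map

tiles = [fmap[i:i + n, j:j + n]
         for i in range(0, H - n + 1, m)
         for j in range(0, H - n + 1, m)]

left, right = tiles[0], tiles[1]        # horizontally adjacent tiles
assert np.array_equal(left[:, m:], right[:, :n - m])
print(f"{len(tiles)} tiles; {left[:, m:].size} elements shared per adjacent pair")
```

Without buffering, every one of those shared elements would be fetched twice from off-chip memory; the line buffer turns the second fetch into an on-chip read.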
VI. EVALUATION

In this section we evaluate the performance of our accelerator for the 3 × 3 filters of the VGG16 network against recent FPGA [51][16][52][53][54] and GPU [16] implementations.

Fig. 6: An example line-buffer design to load and hold an input tile of size 3 × 3.

A. Speed

Here we develop a model to estimate the execution time of our accelerator. First we model the time required to convolve one input tile with one filter, which we call T_tile_filter. We then generalize the model to the case where the process is parallelized over the available resources. For one input feature map tile and one filter, both the input branch and the filter branch of Figure 5 are fully pipelined. Therefore, the execution time of a layer is determined by the longer of the two paths. For each iteration of the filter path, the input data path goes through multiple iterations, because a single filter operates on many input data tiles. Consequently, the input data path sets the upper bound on the delay. Our execution-time model comprises two major components: the input/output time (T_IO) and the computation time (T_comp). We define T_IO as

T_IO = max(T_load, T_offload)    (15)

where T_load is the time it takes to transfer data from the off-chip memory to the input of the laser sources. Moreover, we can implement a total of P input DACs to speed up the data transfer. Since our input matrices are of shape 4 × 4, we use an array of 16 DACs in this work. Similarly, T_offload is the time to store the computed outputs from the inverse Winograd transform back to the off-chip memory. The goal is to match the rate of the ADC at the output with that of the input DAC, to avoid any speed mismatch and thus congestion in the pipeline.
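As a numerical illustration of Eq. (15), assume one DAC per tile element (16 DACs for a 4 × 4 tile) and an assumed 5 GS/s converter rate; with matched input and output converter rates, T_load and T_offload coincide:

```python
# Sketch of the IO-time model in Eq. (15). With one DAC per matrix element
# (a 4x4 tile -> 16 DACs), loading a tile takes one conversion period.
# The 5 GS/s converter rate here is an assumption for illustration.
n_elements = 4 * 4
n_dacs = 16
rate = 5e9                               # samples/s per converter (assumed)
t_load = (n_elements / n_dacs) / rate    # s to load one input tile
t_offload = t_load                       # matched ADC rate at the output
t_io = max(t_load, t_offload)            # Eq. (15)
assert t_io == 200e-12                   # 200 ps per tile
```

Any mismatch between the two converter rates would make one of the terms dominate the max, stalling the pipeline on that side.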
At the time of this writing, both on-chip DACs and ADCs are capable of sampling rates above 18 GS/s at bit resolutions of at least 8 bits [55][24]. Furthermore, with recent advances in memory technology, current memories can transfer data at IO bandwidths exceeding 512 Gb/s [56]. This high memory bandwidth allows us to buffer data and filters from off-chip memory at high transfer rates and feed them to our input line buffers. However, our line buffers require memories with high clock frequency and short access time; currently reported memory technologies have access times as short as 200 ps. At the photonic core of our accelerator, the computation time T_compute is

T_compute = T_laser + T_Winograd + T_EWMM + T_iWinograd    (16)

where T_Winograd is the time to compute the Winograd transform, T_EWMM is the time to perform the element-wise matrix multiplication, and T_iWinograd is the time to compute the inverse Winograd transform. Once the laser is set up, input signals only incur a time delay equivalent to the flight time of the light before they are fed into the ADC. The clock frequency of the pipeline is then bounded by

f_clock ≤ 1 / min(T_load, T_offload, T_compute)    (17)

Accordingly, we picked a clock frequency of 5 GHz, which satisfies Equation 17. From Equation 17, T_tile_filter is simply

T_tile_filter = 1 / (5 GHz) = 200 ps    (18)

For an F(4 × 4, 3 × 3) Winograd convolution, each T_tile_filter returns an output block of size 9, equivalent to 9 convolution operations. At a clock frequency of 5 GHz, our proposed accelerator therefore performs at 9 × 5 G = 45 GOP/s. Figure 7 shows the average convolution-speed comparison of our proposed accelerator against the state-of-the-art FPGA and GPU implementations.

B. Power

To estimate the dynamic power consumption of our proposed system, we built an in-house estimator by augmenting the standard Google TensorFlow tool.
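Before moving to the power models, the speed figures above can be tied together in a few lines (a restatement of Eqs. (17)–(18) with the paper's numbers, not new modeling):

```python
# Numerical summary of Eqs. (17)-(18): a 5 GHz pipeline clock and
# 9 convolution outputs per Winograd tile.
f_clock = 5e9                    # Hz, chosen to satisfy Eq. (17)
t_tile_filter = 1 / f_clock      # s, Eq. (18)
ops_per_tile = 9                 # convolution outputs per tile
throughput = ops_per_tile * f_clock
assert t_tile_filter == 200e-12  # 200 ps
assert throughput == 45e9        # 45 GOP/s
```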
While primarily used for the training and inference stages of neural networks, at its core TensorFlow is a symbolic mathematical graph-processing platform. TensorFlow enables users to express arbitrary computations as a dataflow graph, which is extremely useful in the context of neural networks. However, out-of-the-box TensorFlow is completely agnostic to the physical realization of the neural networks being implemented. Thus, we augmented the TensorFlow high-level API with mathematical models of electro-optical components. Figure 8 depicts the native TensorFlow toolkit hierarchy against our augmented version.

Fig. 7: Comparison of convolution operation speed for FPGA, GPU, and our photonic implementation. The last column, labeled (p), represents the speed of the photonic core in the absence of electronics.

TABLE II: Mapping of primitive math operations to their hardware realization.
Math Operation        | Photonic Representation
Addition              | Photodiode
Multiplication        | MRR
Connection            | Waveguide
Non-linear Activation | Electro-absorption Modulator

Fig. 8: High-level TensorFlow toolkit hierarchy vs. augmented TensorFlow.

In our estimator, each primitive mathematical operation is given two physical models: a power model and a noise model. While the noise model can impact functionality, and thus accuracy, in a neural network, the power model only measures consumed power. Table II shows some of these mathematical operations mapped to their physical realizations. Photodiode power can be derived from its responsivity equation:

R = I_ph / P_in = (λ q / (h c)) η  [A/W]    (19)

where I_ph is the photocurrent, P_in is the optical input signal power, q is the electron charge, λ is the wavelength, h is Planck's constant, c is the speed of light, and η is the quantum efficiency. Note that the signal is encoded in the optical input power P_in. In this work we use the AIM Photonics PDK values in [57].
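Eq. (19) can be evaluated directly; the wavelength and quantum efficiency below are illustrative assumptions, not values taken from the AIM Photonics PDK:

```python
# Photodiode responsivity from Eq. (19): R = (lambda * q / (h * c)) * eta.
q = 1.602176634e-19   # C, electron charge
h = 6.62607015e-34    # J*s, Planck constant
c = 2.99792458e8      # m/s, speed of light
lam = 1550e-9         # m, typical telecom wavelength (assumed)
eta = 0.8             # quantum efficiency (assumed)

R = lam * q / (h * c) * eta   # A/W
I_ph = R * 1e-3               # photocurrent for 1 mW of optical input
assert 0.9 < R < 1.1          # ~1 A/W at 1550 nm with eta = 0.8
```

Because the signal is encoded in P_in, the photocurrent I_ph = R · P_in is the quantity the accumulation (addition) step actually sums.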
Similarly, we modeled both the thermal noise and the shot noise of the photodiode:

I_sn = sqrt(2 q (I_ph + I_D) Δf)    (20)

I_tn = sqrt(4 k_B T Δf / R_SH)    (21)

where I_D is the dark current of the photodetector, Δf is the noise measurement bandwidth, k_B is the Boltzmann constant, T is the temperature in Kelvin, and R_SH is the total equivalent shunt resistance of the photodiode. For MRRs we accounted for the per-unit-length propagation loss.

Fig. 9: Comparison of convolution operation power for FPGA, GPU, and our photonic implementation. The last column, labeled (p), represents the power consumption of the photonic core in the absence of electronics.

Figure 9 depicts the power-comparison results. Finally, we plotted the energy-efficiency figure of merit, defined as the ratio of speed to power, in Figure 10.

VII. TRAINING, INFERENCE, AND NOISE

We initially trained our neural network offline on a conventional digital computer. Later, during the inference stage, we loaded the trained weights into our in-house simulator, which is equipped with noise-source modeling. Our hypothesis was that inference on a noisy neural network would result in some loss in accuracy, mostly because the network used during training is noiseless, with 32-bit floating-point resolution, while during inference the weights suddenly face a noisy network.

Fig. 10: Comparison of energy efficiency for FPGA, GPU, and our photonic implementation. The last column, labeled (p), represents the energy efficiency of the photonic core in the absence of electronics. The results show that using photonics as an accelerator has the potential of improving energy efficiency by up to more than three orders of magnitude.

Fig. 11: Visualization of an augmented convolutional layer using power and noise models for the VGG16 network.

In other words, the network performing inference experiences unseen noise behavior that results in accuracy loss.
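The photodiode noise terms of Eqs. (20) and (21), which our simulator injects at inference time, can be sketched as follows (all parameter values below are illustrative assumptions, not PDK values):

```python
import math

# Shot and thermal noise currents of the photodiode, Eqs. (20)-(21).
q   = 1.602176634e-19  # C, electron charge
k_B = 1.380649e-23     # J/K, Boltzmann constant
I_ph = 1e-3            # A, photocurrent (assumed)
I_D  = 1e-9            # A, dark current (assumed)
df   = 5e9             # Hz, noise bandwidth ~ the 5 GHz pipeline clock
T    = 300.0           # K, temperature
R_SH = 10e3            # ohm, equivalent shunt resistance (assumed)

I_sn = math.sqrt(2 * q * (I_ph + I_D) * df)   # shot noise, Eq. (20)
I_tn = math.sqrt(4 * k_B * T * df / R_SH)     # thermal noise, Eq. (21)
```

With these (assumed) values the shot-noise current is on the order of a microampere and dominates the thermal term, which is why the neuron-output noise discussed next is the primary noise source we study.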
We tested our hypothesis by sweeping a range of inference noise levels and observing the effect on accuracy. To that end, we identified two major noise sources: the neuron output noise and the weight noise. The neuron output noise represents the noise introduced at the output of each neuron by the photodiode and the nonlinear activation function. The first plot in Figure 12 shows how accuracy is impacted by noise during inference when the network was trained free of any noise source. Our next hypothesis was that allowing a certain amount of noise during training would make the model more robust to noise during the inference stage. To that end, we trained the network with the output noise source on. We added only the output noise, leaving the weight noise off, because weights must be calculated with maximum precision during training. In fact, we observed that even a minute amount of noise added to the weights during training could degrade the accuracy of the network to its baseline level of about 10%. We swept the training noise in logarithmic steps from 0.01% to 1%. Figure 12 depicts the effect of adding an output noise equivalent to 0.1% and 0.5% of the maximum signal swing at the output of the neurons. In our experiments we observed that adding about 0.1% noise during training may result in a slight 2% accuracy loss for low levels of inference noise; however, the model becomes more robust to higher levels of inference noise. This shows that injecting noise during training can fine-tune the network for a noisy physical realization, as shown in Figure 12 (middle). Lastly, we noticed that increasing the training noise beyond the initial 0.1% resulted in very significant inference accuracy losses, shown in Figure 12 (bottom).
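A minimal sketch of this noise-injection idea, assuming zero-mean Gaussian noise scaled to a percentage of the signal swing at each neuron's output (a simplified stand-in for our simulator, not its actual code):

```python
import numpy as np

def noisy_output(x, noise_pct, rng):
    """Add zero-mean Gaussian noise with a standard deviation equal to
    noise_pct percent of the signal swing, modeling photodiode and
    activation noise at a neuron's output."""
    if noise_pct == 0:
        return x
    swing = np.max(x) - np.min(x)
    return x + rng.normal(0.0, noise_pct / 100.0 * swing, size=x.shape)

rng = np.random.default_rng(0)
act = np.tanh(rng.normal(size=(32, 16)))   # some neuron outputs
clean = noisy_output(act, 0.0, rng)
noisy = noisy_output(act, 0.1, rng)        # 0.1% of signal swing
assert np.array_equal(clean, act)
assert not np.array_equal(noisy, act)
```

Applying `noisy_output` to activations during training (but never to the weights) mirrors the regime that made the network robust in Figure 12 (middle).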
VIII. CONCLUSION

In this paper we presented a photonic CNN accelerator based on the Winograd filtering convolution algorithm. Winograd filtering reduces the total number of multiplications, and thus the hardware, needed to perform the convolution operation. We evaluated the speed of our accelerator by developing an analytical framework. Our results show that a photonic accelerator can compete with state-of-the-art Winograd-based FPGA and GPU implementations. Such a photonic accelerator has the potential of improving energy efficiency by up to three orders of magnitude. However, the overall speed is bound by the limitations of IO and of the conversions in the DAC and ADC. To evaluate power performance, we augmented the native hardware-agnostic Google TensorFlow tool with power models of our hardware components. As with speed, electronic IO and converters are the major power consumers in our proposed design. However, the photonic core, without the electronic interface, can operate while consuming up to two orders of magnitude less power. In addition, we modeled noise in our TensorFlow-based simulator to investigate the effect of hardware noise sources, such as photodiode and MRR noise, on the functionality (accuracy) of our CNN. We found that training the CNN with a small noise component, 0.1% of the signal swing in our experiment, can make the CNN more robust to inference-time noise introduced by noisy photodiodes and MRRs.

REFERENCES

[1] P. R. Prucnal and B. J. Shastri, Neuromorphic Photonics. CRC Press, May 2017.
[2] I. Chakraborty, G. Saha, A. Sengupta, and K. Roy, "Toward fast neural computing using all-photonic phase change spiking neurons," Scientific Reports, vol. 8, no. 1, p. 12980, 2018.
[3] J. Feldmann, N. Youngblood, C. Wright, H. Bhaskaran, and W. Pernice, "All-optical spiking neurosynaptic networks with self-learning capabilities," Nature, vol. 569, no. 7755, p. 208, 2019.
[4] J. K. George, A. Mehrabian, R. Amin, J. Meng, T. F. d. Lima, A. N. Tait, B. J. Shastri, T. El-Ghazawi, P. R. Prucnal, and V. J. Sorger, "Neuromorphic photonics with electro-absorption modulators," Optics Express, vol. 27, no. 4, pp. 5181–5191, Feb. 2019. [Online]. Available: https://www.osapublishing.org/oe/abstract.cfm?uri=oe-27-4-5181
[5] C. Wang, M. Zhang, X. Chen, M. Bertrand, A. Shams-Ansari, S. Chandrasekhar, P. Winzer, and M. Lončar, "Integrated lithium niobate electro-optic modulators operating at CMOS-compatible voltages," Nature, vol. 562, no. 7725, p. 101, Oct. 2018. [Online]. Available: https://www.nature.com/articles/s41586-018-0551-y

Fig. 12: Evaluation of the effect of physical photodiode and MRR noise on inference accuracy. This effect can be partially compensated through the introduction of an artificial noise source during the training stage. In the absence of a training noise source (top), inference accuracy quickly deteriorates as we sweep the photodiode and MRR noise. By introducing an equivalent of 0.1% Gaussian noise, the network becomes more robust to inference noise. Further increases in the training noise level (bottom) hinder the network from proper training.

[6] A. Liu, R. Jones, L. Liao, D. Samara-Rubio, D. Rubin, O. Cohen, R. Nicolaescu, and M. Paniccia, "A high-speed silicon optical modulator based on a metal-oxide-semiconductor capacitor," Nature, vol. 427, no. 6975, p. 615, Feb. 2004. [Online]. Available: https://www.nature.com/articles/nature02310
[7] R. Amin, R. Maiti, C. Carfano, Z. Ma, M. H. Tahersima, Y. Lilach, D. Ratnayake, H. Dalir, and V. J. Sorger, "0.52 V mm ITO-based Mach-Zehnder modulator in silicon photonics," APL Photonics, vol. 3, no. 12, p. 126104, Dec. 2018. [Online]. Available: https://aip.scitation.org/doi/10.1063/1.5052635
[8] H. Bagherian, S. Skirlo, Y. Shen, H. Meng, V. Ceperic, and M. Soljacic, "On-chip optical convolutional neural networks," arXiv preprint arXiv:1808.03303, 2018.
[9] A. Mehrabian, Y. Al-Kabani, V. J. Sorger, and T. El-Ghazawi, "PCNNA: A photonic convolutional neural network accelerator," in 2018 31st IEEE International System-on-Chip Conference (SOCC). IEEE, 2018, pp. 169–173.
[10] W. Liu, W. Liu, Y. Ye, Q. Lou, Y. Xie, and L. Jiang, "HolyLight: A nanophotonic accelerator for deep learning in data centers," in 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2019, pp. 1483–1488.
[11] "Optalysys." [Online]. Available: https://www.optalysys.com/
[12] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljacic, "Deep learning with coherent nanophotonic circuits," Nature Photonics, vol. 11, no. 7, pp. 441–446, Jul. 2017. [Online]. Available: http://www.nature.com/articles/nphoton.2017.93
[13] T. W. Hughes, M. Minkov, Y. Shi, and S. Fan, "Training of photonic neural networks through in situ backpropagation and gradient measurement," Optica, vol. 5, no. 7, pp. 864–871, Jul. 2018. [Online]. Available: https://www.osapublishing.org/optica/abstract.cfm?uri=optica-5-7-864
[14] M. Miscuglio, A. Mehrabian, Z. Hu, S. I. Azzam, J. George, A. V. Kildishev, M. Pelton, and V. J. Sorger, "All-optical nonlinear activation function for photonic neural networks [Invited]," Optical Materials Express, vol. 8, no. 12, p. 3851, Dec. 2018. [Online]. Available: https://www.osapublishing.org/abstract.cfm?URI=ome-8-12-3851
[15] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4013–4021.
[16] L. Lu, Y. Liang, Q. Xiao, and S. Yan, "Evaluating fast algorithms for convolutional neural networks on FPGAs," in 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2017, pp. 101–108.
[17] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund et al., "Deep learning with coherent nanophotonic circuits," Nature Photonics, vol. 11, no. 7, p. 441, 2017.
[18] A. N. Tait, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, "Broadcast and weight: an integrated network for scalable photonic spike processing," Journal of Lightwave Technology, vol. 32, no. 21, pp. 3427–3439, 2014.
[19] A. N. Tait, A. X. Wu, T. F. de Lima, E. Zhou, B. J. Shastri, M. A. Nahmias, and P. R. Prucnal, "Microring weight banks," IEEE Journal of Selected Topics in Quantum Electronics, vol. 22, no. 6, pp. 312–325, 2016.
[20] H. Hu, F. Da Ros, M. Pu, F. Ye, K. Ingerslev, E. P. da Silva, M. Nooruzzaman, Y. Amma, Y. Sasaki, T. Mizuno et al., "Single-source chip-based frequency comb enabling extreme parallel data transmission," Nature Photonics, vol. 12, no. 8, p. 469, 2018.
[21] Q. Xu, D. Fattal, and R. G. Beausoleil, "Silicon microring resonators with 1.5-µm radius," Optics Express, vol. 16, no. 6, pp. 4309–4315, 2008.
[22] L. Chen, K. Preston, S. Manipatruni, and M. Lipson, "Integrated GHz silicon photonic interconnect with micrometer-scale modulators and detectors," Optics Express, vol. 17, no. 17, pp. 15248–15256, 2009.
[23] S. Stathopoulos, A. Khiat, M. Trapatseli, S. Cortese, A. Serb, I. Valov, and T. Prodromakis, "Multibit memory operation of metal-oxide bi-layer memristors," Scientific Reports, vol. 7, no. 1, p. 17532, 2017.
[24] B. Xu, Y. Zhou, and Y. Chiu, "A 23-mW 24-GS/s 6-bit voltage-time hybrid time-interleaved ADC in 28-nm CMOS," IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 1091–1100, 2017.
[25] M. Ziebell, D. Marris-Morini, G. Rasigade, P. Crozat, J.-M. Fedeli, P. Grosse, E. Cassan, and L. Vivien, "Ten Gbit/s ring resonator silicon modulator based on interdigitated PN junctions," Optics Express, vol. 19, no. 15, pp. 14690–14695, Jul. 2011. [Online]. Available: https://www.osapublishing.org/oe/abstract.cfm?uri=oe-19-15-14690
[26] F. Y. Gardes, A. Brimont, P. Sanchis, G. Rasigade, D. Marris-Morini, L. O'Faolain, F. Dong, J. M. Fedeli, P. Dumon, L. Vivien, T. F. Krauss, G. T. Reed, and J. Marti, "High-speed modulation of a compact silicon ring resonator based on a reverse-biased pn diode," Optics Express, vol. 17, no. 24, pp. 21986–21991, Nov. 2009. [Online]. Available: https://www.osapublishing.org/oe/abstract.cfm?uri=oe-17-24-21986
[27] "OSA | Ten Gbit/s ring resonator silicon modulator based on interdigitated PN junctions." [Online]. Available: https://www.osapublishing.org/oe/abstract.cfm?uri=oe-19-15-14690
[28] T. Baba, S. Akiyama, M. Imai, N. Hirayama, H. Takahashi, Y. Noguchi, T. Horikawa, and T. Usuki, "50-Gb/s ring-resonator-based silicon modulator," Optics Express, vol. 21, no. 10, pp. 11869–11876, May 2013. [Online]. Available: https://www.osapublishing.org/oe/abstract.cfm?uri=oe-21-10-11869
[29] P. Dong, S. Liao, D. Feng, H. Liang, D. Zheng, R. Shafiiha, C.-C. Kung, W. Qian, G. Li, X. Zheng, A. V. Krishnamoorthy, and M. Asghari, "Low Vpp, ultralow-energy, compact, high-speed silicon electro-optic modulator," Optics Express, vol. 17, no. 25, p. 22484, Dec. 2009. [Online]. Available: https://www.osapublishing.org/oe/abstract.cfm?uri=oe-17-25-22484
[30] H. Jayatilleka, K. Murray, M. Caverley, N. A. F. Jaeger, L. Chrostowski, and S. Shekhar, "Crosstalk in SOI microring resonator-based filters," Journal of Lightwave Technology, vol. 34, no. 12, pp. 2886–2896, Jun. 2016. [Online]. Available: http://ieeexplore.ieee.org/document/7272050/
[31] M. Bahadori, S. Rumley, H. Jayatilleka, K. Murray, N. A. F. Jaeger, L. Chrostowski, S. Shekhar, and K. Bergman, "Crosstalk penalty in microring-based silicon photonic interconnect systems," Journal of Lightwave Technology, vol. 34, no. 17, pp. 4043–4052, Sep. 2016. [Online]. Available: http://ieeexplore.ieee.org/document/7506337/
[32] "imec-ePIXfab SiPhotonics Passives." [Online]. Available: http://www.europractice-ic.com/SiPhotonics_technology_IHP_passives.php
[33] L. Vivien, A. Polzer, D. Marris-Morini, J. Osmond, J. M. Hartmann, P. Crozat, E. Cassan, C. Kopp, H. Zimmermann, and J. M. Fedeli, "Zero-bias 40 Gbit/s germanium waveguide photodetector on silicon," Optics Express, vol. 20, no. 2, pp. 1096–1101, Jan. 2012. [Online]. Available: https://www.osapublishing.org/oe/abstract.cfm?uri=oe-20-2-1096
[34] Y. Salamin, P. Ma, B. Baeuerle, A. Emboras, Y. Fedoryshyn, W. Heni, B. Cheng, A. Josten, and J. Leuthold, "100 GHz plasmonic photodetector," ACS Photonics, vol. 5, no. 8, pp. 3291–3297, Aug. 2018. [Online]. Available: http://pubs.acs.org/doi/10.1021/acsphotonics.8b00525
[35] P. Ma, Y. Salamin, B. Baeuerle, A. Emboras, Y. Fedoryshyn, W. Heni, B. Cheng, A. Josten, and J. Leuthold, "100 GHz photoconductive plasmonic germanium detector," in Conference on Lasers and Electro-Optics (2018), paper SM2I.3. Optical Society of America, May 2018. [Online]. Available: https://www.osapublishing.org/abstract.cfm?uri=CLEO_SI-2018-SM2I.3
[36] A. N. Tait, T. F. de Lima, E. Zhou, A. X. Wu, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, "Neuromorphic photonic networks using silicon photonic weight banks," Scientific Reports, vol. 7, no. 1, p. 7430, Aug. 2017. [Online]. Available: https://doi.org/10.1038/s41598-017-07754-z
[37] W. Bogaerts, P. De Heyn, T. Van Vaerenbergh, K. De Vos, S. Kumar Selvaraja, T. Claes, P. Dumon, P. Bienstman, D. Van Thourhout, and R. Baets, "Silicon microring resonators," Laser & Photonics Reviews, vol. 6, no. 1, pp. 47–73, 2012.
[38] X. Xue, Y. Xuan, C. Wang, P.-H. Wang, Y. Liu, B. Niu, D. E. Leaird, M. Qi, and A. M. Weiner, "Thermal tuning of Kerr frequency combs in silicon nitride microring resonators," Optics Express, vol. 24, no. 1, pp. 687–698, 2016.
[39] C. Yoshida, K. Tsunoda, H. Noshiro, and Y. Sugiyama, "High speed resistive switching in Pt/TiO2/TiN film for nonvolatile memory application," Applied Physics Letters, vol. 91, no. 22, p. 223510, 2007.
[40] J. Borghetti, G. S. Snider, P. J. Kuekes, J. J. Yang, D. R. Stewart, and R. S. Williams, "'Memristive' switches enable 'stateful' logic operations via material implication," Nature, vol. 464, no. 7290, p. 873, 2010.
[41] I. Baek, M. Lee, S. Seo, M. Lee, D. Seo, D.-S. Suh, J. Park, S. Park, H. Kim, I. Yoo et al., "Highly scalable nonvolatile resistive memory using simple binary oxide driven by asymmetric unipolar voltage pulses," in IEDM Technical Digest. IEEE International Electron Devices Meeting, 2004. IEEE, 2004, pp. 587–590.
[42] E. J. Merced-Grafals, N. Dávila, N. Ge, R. S. Williams, and J. P. Strachan, "Repeatable, accurate, and high speed multi-level programming of memristor 1T1R arrays for power efficient analog computing applications," Nanotechnology, vol. 27, no. 36, p. 365202, 2016.
[43] A. Prakash, D. Deleruyelle, J. Song, M. Bocquet, and H. Hwang, "Resistance controllability and variability improvement in a TaOx-based resistive memory for multilevel storage application," Applied Physics Letters, vol. 106, no. 23, p. 233104, 2015.
[44] S. R. Lee, Y.-B. Kim, M. Chang, K. M. Kim, C. B. Lee, J. H. Hur, G.-S. Park, D. Lee, M.-J. Lee, C. J. Kim et al., "Multi-level switching of triple-layered TaOx RRAM with excellent reliability for storage class memory," in 2012 Symposium on VLSI Technology (VLSIT). IEEE, 2012, pp. 71–72.
[45] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint, 2014.
[46] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[47] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[48] M. Mathieu, M. Henaff, and Y. LeCun, "Fast training of convolutional networks through FFTs," arXiv preprint, 2013.
[49] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun, "Fast convolutional nets with fbfft: A GPU performance evaluation," arXiv preprint, 2014.
[50] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient primitives for deep learning," arXiv preprint arXiv:1410.0759, 2014.
[51] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2015, pp. 161–170.
[52] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2016, pp. 26–35.
[53] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, and Y. Cao, "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2016, pp. 16–25.
[54] C. Zhang, G. Sun, Z. Fang, P. Zhou, P. Pan, and J. Cong, "Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.
[55] A. Nazemi, K. Hu, B. Catli, D. Cui, U. Singh, T. He, Z. Huang, B. Zhang, A. Momtaz, and J. Cao, "3.4 A 36Gb/s PAM4 transmitter using an 8b 18GS/s DAC in 28nm CMOS," in 2015 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers. IEEE, 2015, pp. 1–3.
[56] D. U. Lee, K. W. Kim, K. W. Kim, K. S. Lee, S. J. Byeon, J. H. Kim, J. H. Cho, J. Lee, and J. H. Chun, "A 1.2 V 8 Gb 8-channel 128 GB/s high-bandwidth memory (HBM) stacked DRAM with effective I/O test circuits," IEEE Journal of Solid-State Circuits, vol. 50, no. 1, pp. 191–203, 2015.
[57] "Silicon photonics process design kit (APSUNY PDKv3.0)." [Online]. Available: http://www.aimphotonics.com/pdk

Armin Mehrabian is a PhD candidate in Electrical Engineering at the George Washington University. His research interests include high-performance computing (HPC), neuromorphic computing, and artificial intelligence (AI), from both software and hardware points of view. He received his BS degree in Electrical Engineering from Shahid Beheshti University of Tehran, Iran, focusing on analog electronics, and his MS degree in Computer Engineering from the George Washington University (GWU), DC, USA, focusing on VLSI and digital electronics design. His current research involves leveraging nanophotonics for HPC architecture designs.

Mario Miscuglio is a post-doctoral researcher in the Electrical Engineering department at the George Washington University. He received his Masters in Electric and Computer Engineering from the Polytechnic of Turin, working as a researcher at Harvard/MIT.
He completed his PhD in Optoelectronics from the University of Genova (IIT), working as a research fellow at the Molecular Foundry at LBNL. His interests extend across science and engineering, including photonic neuromorphic computing, nano-optics, and plasmonics.

Yousra Alkabani received the BSc and MSc degrees in computer and systems engineering from Ain Shams University, Cairo, Egypt, in 2003 and 2006, respectively. She received the PhD degree in computer science from Rice University, Houston, TX, USA, in December 2010. She has been an assistant professor of computer and systems engineering at Ain Shams University since May 2011 and a visiting assistant professor of computer science and engineering at the American University in Cairo since 2013. Her research interests include hardware security, low-power design, and embedded systems. She is a member of the IEEE.

Volker J. Sorger is an Associate Professor in the Department of Electrical and Computer Engineering and the leader of the Orthogonal Physics Enabled Nanophotonics (OPEN) lab at the George Washington University. He received his PhD from the University of California, Berkeley. His research areas include opto-electronic devices, plasmonics and nanophotonics, and photonic analog information processing and neuromorphic computing. Among his breakthroughs are the first demonstration of a semiconductor plasmon laser, attojoule-efficient modulators, and PMAC/s-fast photonic neural networks and near real-time analog signal processors. Dr. Sorger has received multiple awards, among them the Presidential Early Career Award for Scientists and Engineers (PECASE), the AFOSR Young Investigator Award (YIP), the Hegarty Innovation Prize, and the National Academy of Sciences award of the year. Dr. Sorger is the editor-in-chief of Nanophotonics, the OSA Division Chair for Photonics and Opto-electronics, and serves on the board of meetings at OSA & SPIE and on the scholarship committee.
He is a senior member of IEEE, OSA & SPIE.

Tarek El-Ghazawi is a Professor in the Department of Electrical and Computer Engineering at The George Washington University, where he leads the university-wide Strategic Program in High-Performance Computing. He is the founding director of The GW Institute for Massively Parallel Applications and Computing Technologies (IMPACT). His research interests include high-performance computing, parallel computer architectures, high-performance I/O, reconfigurable computing, experimental performance evaluations, computer vision, and remote sensing. He has published over 200 refereed research papers and book chapters in these areas, and his research has been supported by DoD/DARPA, NASA, NSF, and industry, including IBM and SGI. He is the first author of the book UPC: Distributed Shared Memory Programming, which has the first formal specification of the UPC language used in high-performance computing. Dr. El-Ghazawi is a member of the ACM and the Phi Kappa Phi National Honor Society; he was also a U.S. Fulbright Scholar, a recipient of the Alexander Schwarzkopf Prize for Technological Innovations, and a recipient of the Alexander von Humboldt research award from the Humboldt Foundation in Germany. He is a fellow of the IEEE.
