Spatio-Temporal Backpropagation for Training High-performance Spiking Neural Networks


Authors: Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, Luping Shi

Yujie Wu†, Lei Deng†, Guoqi Li, Jun Zhu and Luping Shi

Abstract—Compared with artificial neural networks (ANNs), spiking neural networks (SNNs) are promising for exploring brain-like behaviors, since spikes can encode more spatio-temporal information. Although existing schemes, including pre-training from an ANN or direct training based on backpropagation (BP), make supervised training of SNNs possible, these methods exploit only the networks' spatial-domain information, which leads to a performance bottleneck and requires many complicated training techniques. Another fundamental issue is that spike activity is naturally non-differentiable, which causes great difficulty in training SNNs. To this end, we build an iterative LIF model that is friendlier for gradient-descent training. By simultaneously considering the layer-by-layer spatial domain (SD) and the timing-dependent temporal domain (TD) in the training phase, together with an approximated derivative for the spike activity, we propose a spatio-temporal backpropagation (STBP) training framework that requires no complicated training skills. We design the corresponding fully connected and convolutional architectures and evaluate our framework on the static MNIST and a custom object detection dataset, as well as the dynamic N-MNIST. Results show that our approach achieves the best accuracy compared with existing state-of-the-art algorithms on spiking networks. This work provides a new perspective on exploring high-performance SNNs for future brain-like computing paradigms with rich spatio-temporal dynamics.

I. INTRODUCTION

Deep neural networks (DNNs) have achieved outstanding performance in diverse areas [1]–[5], while the brain appears to use another network architecture, spiking neural networks, to realize various complicated cognitive functions [6]–[8].
Compared with existing DNNs, SNNs have two main superiorities: 1) the spike pattern flowing through an SNN fundamentally encodes more spatio-temporal information, while most DNNs, especially the widely used feedforward DNNs, lack timing dynamics; and 2) the event-driven paradigm of SNNs makes them more hardware friendly, and it has been adopted by many neuromorphic platforms [9]–[14]. However, training SNNs remains challenging because of their complicated dynamics and the non-differentiable nature of the spike activity. In summary, there exist three kinds of training methods for SNNs: 1) unsupervised learning; 2) indirect supervised learning; 3) direct supervised learning. The first originates from biological synaptic plasticity for weight modification, such as spike-timing-dependent plasticity (STDP) [15]–[17]. Because it considers only local neuronal activities, it is difficult to achieve high performance. The second first trains an ANN and then transforms it into an SNN with the same network structure, where the spiking rate of the SNN neurons acts as the analog activity of the ANN neurons [18]–[21]. This is not a bio-plausible way to explore the learning nature of SNNs. The most promising route to high-performance training is the recent direct supervised learning based on gradient descent with error backpropagation. However, such methods consider only the layer-by-layer spatial domain and ignore the dynamics in the temporal domain [22], [23].

† The authors contributed equally. Yujie Wu, Lei Deng, Guoqi Li and Luping Shi are with the Center for Brain-Inspired Computing Research (CBICR), Department of Precision Instrument, Tsinghua University, 100084 Beijing, China (email: lpshi@tsinghua.edu.cn). Jun Zhu is with the State Key Lab of Intelligence Technology and Systems, Tsinghua National Lab for Information Science and Technology, Tsinghua University, 100084 Beijing, China (email: dcszj@mail.tsinghua.edu.cn).
Therefore, many complicated training skills are required to improve performance [19], [23], [24], such as fixed-amount-proportional reset, lateral inhibition, error normalization, and weight/threshold regularization. Thus, a more general dynamic model and learning framework for SNNs are highly desirable.

In this paper, we propose a direct supervised learning framework for SNNs that combines both the SD and the TD in the training phase. First, we build an iterative LIF model that retains the SNN dynamics but is friendly for gradient-descent training. Then we consider both the spatial direction and the temporal direction during the error backpropagation procedure, i.e., spatio-temporal backpropagation (STBP), which significantly improves the network accuracy. Furthermore, we introduce an approximated derivative to address the non-differentiable issue of the spike activity. We test our SNN framework using fully connected and convolutional architectures on the static MNIST and a custom object detection dataset, as well as the dynamic N-MNIST. Many complicated training skills that are generally required by existing schemes can be avoided, because the proposed method makes full use of the STD information that captures the nature of SNNs. Experimental results show that the proposed method achieves the best accuracy on both static and dynamic datasets compared with existing state-of-the-art algorithms. The influence of the TD dynamics and of different methods for the derivative approximation is systematically analyzed. This work opens a way to explore high-performance SNNs for future brain-like computing paradigms with rich STD dynamics.

II. METHOD AND MATERIAL

A. Iterative Leaky Integrate-and-Fire Model in Spiking Neural Networks

Compared with existing deep neural networks, spiking neural networks fundamentally encode more spatio-temporal

Fig. 1: Illustration of the spatio-temporal characteristic of SNNs.
Besides the layer-by-layer spatial dataflow of ANNs, SNNs are famous for their rich temporal dynamics and non-volatile potential integration. However, the existing training algorithms consider either only the spatial domain (the supervised ones via backpropagation) or only the temporal domain (the unsupervised ones via timing-based plasticity), which causes a performance bottleneck. Therefore, building a learning framework that makes full use of the spatio-temporal domain (STD) is fundamentally required for high-performance SNNs, which forms the main motivation of this work.

information due to two facts: i) SNNs can have deep architectures like DNNs, and ii) each neuron has its own neuronal dynamics. The former grants SNNs rich spatial-domain information, while the latter offers SNNs the power of encoding temporal-domain information. However, there is currently no unified framework that allows effective training of SNNs, in the way backpropagation (BP) is implemented in DNNs, by considering the spatio-temporal dynamics. This has hindered the extensive use of SNNs in various applications. In this work, we present a framework based on an iterative leaky integrate-and-fire (LIF) model that enables us to apply spatio-temporal backpropagation to train spiking neural networks.

LIF is the most widely applied model for describing the neuronal dynamics of SNNs, and it is governed by

    \tau \frac{du(t)}{dt} = -u(t) + I(t)    (1)

where u(t) is the neuronal membrane potential at time t, \tau is a time constant, and I(t) denotes the pre-synaptic input, which is determined by the pre-neuronal activities or external injections and the synaptic weights. When the membrane potential u exceeds a given threshold V_{th}, the neuron fires a spike and resets its potential to u_{reset}.
As shown in Figure 1, the forward dataflow of an SNN propagates layer by layer in the SD like a DNN, while the self-feedback injection at each neuron node generates non-volatile integration in the TD. In this way, the whole SNN runs with complex STD dynamics and codes spatio-temporal information into the spike pattern. The existing training algorithms consider either only the SD (the supervised ones via backpropagation) or only the TD (the unsupervised ones via timing-based plasticity), which causes a performance bottleneck. Therefore, building a learning framework that makes full use of the STD is fundamentally required for high-performance SNNs, which forms the main motivation of this work.

However, directly using the analytic solution of the LIF model in (1) makes it inconvenient to train SNNs by backpropagation, because the whole network presents complex dynamics in both the SD and the TD. To address this issue, the following event-driven iterative update rule

    u(t) = u(t_{i-1}) e^{\frac{t_{i-1} - t}{\tau}} + I(t)    (2)

approximates the neuronal potential u(t) in (1) based on the last spiking moment t_{i-1} and the pre-synaptic input I(t). The membrane potential decays exponentially until the neuron receives pre-synaptic inputs, and a new update round starts once the neuron fires a spike. That is, the neuronal state is co-determined by the spatial accumulation of I(t) and the leaky temporal memory of u(t_{i-1}).

As we know, the efficiency of error backpropagation for training DNNs benefits greatly from the iterative representation of gradient descent, which yields the chain rule for layer-by-layer error propagation in the SD backward pass.
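To make the update rule concrete, the following NumPy sketch simulates Eq. (2) for a single neuron on a discrete time grid. The input sequence, \tau and threshold here are illustrative values chosen for this example, not settings from the paper's experiments:

```python
import numpy as np

def lif_simulate(inputs, tau=2.0, v_th=1.5, u_reset=0.0):
    """Event-driven LIF update of Eq. (2): the potential decays
    exponentially from the last update moment, accumulates the
    pre-synaptic input I(t), and is reset after a spike."""
    u, t_last = u_reset, 0
    spike_times = []
    for t, current in enumerate(inputs, start=1):
        u = u * np.exp((t_last - t) / tau) + current  # decay, then integrate
        t_last = t
        if u >= v_th:                                 # threshold crossing
            spike_times.append(t)
            u = u_reset                               # reset after firing
    return spike_times

spikes = lif_simulate([0.8, 0.8, 0.8, 0.0, 0.0, 1.6])  # two threshold crossings
```

Sub-threshold inputs accumulate across steps through the leaky memory term, while a single strong input can trigger a spike immediately, which is exactly the co-determination by spatial accumulation and temporal memory described above.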
This motivates us to propose an iterative, LIF-based SNN in which the iterations occur in both the SD and the TD, as follows:

    x_i^{t+1,n} = \sum_{j=1}^{l(n-1)} w_{ij}^n o_j^{t+1,n-1}    (3)

    u_i^{t+1,n} = u_i^{t,n} f(o_i^{t,n}) + x_i^{t+1,n} + b_i^n    (4)

    o_i^{t+1,n} = g(u_i^{t+1,n})    (5)

where

    f(x) = \tau e^{-\frac{x}{\tau}}    (6)

    g(x) = \begin{cases} 1, & x \geq V_{th} \\ 0, & x < V_{th} \end{cases}    (7)

In the above formulas, the upper index t denotes the moment at time t, and n and l(n) denote the n-th layer and the number of neurons in the n-th layer, respectively. w_{ij} is the synaptic weight from the j-th neuron in the pre-synaptic layer to the i-th neuron in the post-synaptic layer, and o_j \in \{0, 1\} is the output of the j-th neuron, where o_j = 1 denotes a spike and o_j = 0 denotes that nothing occurs. x_i is a simplified representation of the pre-synaptic input of the i-th neuron, similar to I in the original LIF model. u_i is the membrane potential of the i-th neuron, and b_i is a bias parameter related to the threshold V_{th}. Formulas (4)-(5) are also inspired by the LSTM model [25]–[27]: a forget gate f(\cdot) controls the TD memory, and an output gate g(\cdot) fires a spike when activated. Specifically, for a small positive time constant \tau, f(\cdot) can be approximated as

    f(o_i^{t,n}) \approx \begin{cases} \tau, & o_i^{t,n} = 0 \\ 0, & o_i^{t,n} = 1 \end{cases}    (8)

since \tau e^{-\frac{1}{\tau}} \approx 0. In this way, the original LIF model is transformed into an iterative version in which the recursive relationships in both the SD and the TD are clearly described, which is friendly for the subsequent gradient-descent training in the STD.

B.
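A forward pass through one layer of the iterative model, Eqs. (3)-(7), can be sketched as follows in NumPy. The weight, \tau and V_{th} values in the usage line are illustrative, not the paper's trained parameters:

```python
import numpy as np

def stbp_forward(spike_in, W, b, tau=0.3, v_th=1.5):
    """One layer of the iterative LIF model.
    spike_in: (T, l_prev) binary input spikes, W: (l_n, l_prev), b: (l_n,).
    Returns the (T, l_n) output spike train."""
    forget = lambda o: tau * np.exp(-o / tau)   # forget gate f, Eq. (6)
    u = np.zeros(W.shape[0])
    o = np.zeros(W.shape[0])
    out = np.zeros((spike_in.shape[0], W.shape[0]))
    for t in range(spike_in.shape[0]):
        x = W @ spike_in[t]                     # pre-synaptic input, Eq. (3)
        u = u * forget(o) + x + b               # leaky memory update, Eq. (4)
        o = (u >= v_th).astype(float)           # output gate g, Eqs. (5), (7)
        out[t] = o
    return out

# A single neuron driven by a constant spike train settles into regular firing.
out = stbp_forward(np.ones((6, 1)), W=np.array([[1.2]]), b=np.zeros(1))
```

Note how a spike at step t makes f(o) \approx 0 at step t+1, which wipes the potential memory and plays the role of the reset in the original LIF model.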
Spatio-Temporal Backpropagation Training

To present the STBP training methodology, we define the loss function L as the mean square error over all samples within a given time window T:

    L = \frac{1}{2S} \sum_{s=1}^{S} \left\| y_s - \frac{1}{T} \sum_{t=1}^{T} o_s^{t,N} \right\|_2^2    (9)

where y_s and o_s denote the label vector of the s-th training sample and the output vector of the last layer N, respectively. Combining equations (3)-(9), L is a function of W and b; thus the derivatives of L with respect to W and b are required for the gradient-descent-based STBP algorithm. Assume that we have obtained the derivatives \partial L / \partial o_i and \partial L / \partial u_i at each layer n and time t, which is the essential step towards the final \partial L / \partial W and \partial L / \partial b. Figure 2 describes the error propagation (on which the derivation depends) in both the SD and the TD at the single-neuron level (Figure 2a) and the network level (Figure 2b). At the single-neuron level, the propagation is decomposed into a vertical path in the SD and a horizontal path in the TD. The dataflow of error propagation in the SD is similar to typical BP for DNNs, i.e., each neuron accumulates the weighted error signals from the upper layer and iteratively updates the parameters in different layers; the dataflow in the TD, however, shares the same neuronal states, which makes it quite complicated to obtain the analytical solution directly. To solve this problem, we use the proposed iterative LIF model to unfold the state space in both the SD and TD directions, so that the states in the TD at different time steps can be distinguished, which enables the chain rule for iterative propagation. A similar idea can be found in the BPTT algorithm for training RNNs [28]. We now derive the complete gradients for the following four cases. First, we denote

    \delta_i^{t,n} = \frac{\partial L}{\partial o_i^{t,n}}    (10)

Case 1: t = T at the output layer n = N.
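The rate-coded loss of Eq. (9) compares the label vector with the firing rate of the output layer averaged over the window T; a small NumPy sketch:

```python
import numpy as np

def stbp_loss(y, out_spikes):
    """Mean square error of Eq. (9).
    y: (S, C) label vectors; out_spikes: (S, T, C) output spike trains."""
    rates = out_spikes.mean(axis=1)            # (1/T) * sum_t o^{t,N}
    return 0.5 * np.mean(np.sum((y - rates) ** 2, axis=1))
```

With a one-hot label, the loss is zero only when the labelled neuron fires at every step and the others stay silent; partial firing gives a quadratic penalty on the rate difference.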
In this case, the derivative \partial L / \partial o_i^{T,N} can be obtained directly, since it depends only on the loss function in Eq. (9) at the output layer:

    \frac{\partial L}{\partial o_i^{T,N}} = -\frac{1}{TS} \left( y_i - \frac{1}{T} \sum_{k=1}^{T} o_i^{k,N} \right)    (11)

The derivative with respect to u_i^{T,N} follows from o_i^{T,N}:

    \frac{\partial L}{\partial u_i^{T,N}} = \frac{\partial L}{\partial o_i^{T,N}} \frac{\partial o_i^{T,N}}{\partial u_i^{T,N}} = \delta_i^{T,N} \frac{\partial g}{\partial u_i^{T,N}}    (12)

Case 2: t = T at the layers n < N. In this case, the derivative \partial L / \partial o_i^{T,n} depends iteratively on the error propagation in the SD at time T, as in the typical BP algorithm. We have

    \frac{\partial L}{\partial o_i^{T,n}} = \sum_{j=1}^{l(n+1)} \delta_j^{T,n+1} \frac{\partial o_j^{T,n+1}}{\partial o_i^{T,n}} = \sum_{j=1}^{l(n+1)} \delta_j^{T,n+1} \frac{\partial g}{\partial u_j^{T,n+1}} w_{ji}    (13)

Similarly, the derivative \partial L / \partial u_i^{T,n} yields

    \frac{\partial L}{\partial u_i^{T,n}} = \frac{\partial L}{\partial o_i^{T,n}} \frac{\partial o_i^{T,n}}{\partial u_i^{T,n}} = \delta_i^{T,n} \frac{\partial g}{\partial u_i^{T,n}}    (14)

Case 3: t < T at the output layer n = N. In this case, the derivative \partial L / \partial o_i^{t,N} depends on the error propagation in the TD direction. With the proposed iterative LIF model of Eqs. (3)-(5) unfolding the state space in the TD, we acquire the required derivative by the chain rule in the TD:

    \frac{\partial L}{\partial o_i^{t,N}} = \delta_i^{t+1,N} \frac{\partial o_i^{t+1,N}}{\partial o_i^{t,N}} + \frac{\partial L}{\partial o_i^{T,N}}    (15)

    = \delta_i^{t+1,N} \frac{\partial g}{\partial u_i^{t+1,N}} u_i^{t,N} \frac{\partial f}{\partial o_i^{t,N}} + \frac{\partial L}{\partial o_i^{T,N}}    (16)

    \frac{\partial L}{\partial u_i^{t,N}} = \frac{\partial L}{\partial o_i^{t,N}} \frac{\partial o_i^{t,N}}{\partial u_i^{t,N}} = \delta_i^{t,N} \frac{\partial g}{\partial u_i^{t,N}}    (17)

where \partial L / \partial o_i^{T,N} = -\frac{1}{TS} ( y_i - \frac{1}{T} \sum_{k=1}^{T} o_i^{k,N} ) as in Eq. (11).

Fig. 2: Error propagation in the STD. (a) At the single-neuron level, the vertical path and the horizontal path represent the error propagation in the SD and the TD, respectively. (b) Similar propagation occurs at the network level, where the error in the SD requires the same multiply-accumulate operation as the feedforward computation.

Case 4: t < T at the layers n < N. In this case, the derivative \partial L / \partial o_i^{t,n} depends on the error propagation in both the SD and the TD.
On the one hand, each neuron accumulates the weighted error signals from the upper layer in the SD, as in Case 2; on the other hand, each neuron also receives the error propagated through its self-feedback dynamics in the TD, by iteratively unfolding the state space via the chain rule, as in Case 3. Thus we have

    \frac{\partial L}{\partial o_i^{t,n}} = \sum_{j=1}^{l(n+1)} \delta_j^{t,n+1} \frac{\partial o_j^{t,n+1}}{\partial o_i^{t,n}} + \frac{\partial L}{\partial o_i^{t+1,n}} \frac{\partial o_i^{t+1,n}}{\partial o_i^{t,n}}    (18)

    = \sum_{j=1}^{l(n+1)} \delta_j^{t,n+1} \frac{\partial g}{\partial u_j^{t,n+1}} w_{ji} + \delta_i^{t+1,n} \frac{\partial g}{\partial u_i^{t+1,n}} u_i^{t,n} \frac{\partial f}{\partial o_i^{t,n}}    (19)

    \frac{\partial L}{\partial u_i^{t,n}} = \frac{\partial L}{\partial o_i^{t,n}} \frac{\partial o_i^{t,n}}{\partial u_i^{t,n}} + \frac{\partial L}{\partial u_i^{t+1,n}} \frac{\partial u_i^{t+1,n}}{\partial u_i^{t,n}}    (20)

    = \delta_i^{t,n} \frac{\partial g}{\partial u_i^{t,n}} + \frac{\partial L}{\partial u_i^{t+1,n}} f(o_i^{t,n})    (21)

Based on these four cases, the error propagation procedure (which depends on the above derivatives) is shown in Figure 2. At the single-neuron level (Figure 2a), the propagation is decomposed into the vertical path of the SD and the horizontal path of the TD. At the network level (Figure 2b), the dataflow of error propagation in the SD is similar to typical BP for DNNs, i.e., each neuron accumulates the weighted error signals from the upper layer and iteratively updates the parameters in different layers; in the TD, the neuronal states are unfolded iteratively along the timing direction, which enables chain-rule propagation. Finally, we obtain the derivatives with respect to W and b:

    \frac{\partial L}{\partial b^n} = \sum_{t=1}^{T} \frac{\partial L}{\partial u^{t,n}} \frac{\partial u^{t,n}}{\partial b^n} = \sum_{t=1}^{T} \frac{\partial L}{\partial u^{t,n}}    (22)

    \frac{\partial L}{\partial W^n} = \sum_{t=1}^{T} \frac{\partial L}{\partial u^{t,n}} \frac{\partial u^{t,n}}{\partial W^n} = \sum_{t=1}^{T} \frac{\partial L}{\partial u^{t,n}} (o^{t,n-1})^T    (23)

where \partial L / \partial u^{t,n} can be obtained from Eqs. (11)-(21). Given the gradients of W and b from the STBP, we can use gradient-descent optimization algorithms to train SNNs effectively for high performance.

C.
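Once \partial L / \partial u^{t,n} has been computed for every step by the four cases above, the parameter gradients of Eqs. (22)-(23) are plain sums over the time window. A NumPy sketch, using hypothetical per-step error signals for illustration:

```python
import numpy as np

def stbp_param_grads(dL_du, spikes_prev):
    """Eqs. (22)-(23): accumulate the per-step errors over the window.
    dL_du: (T, l_n) values of dL/du^{t,n}; spikes_prev: (T, l_prev)
    pre-synaptic spike train o^{t,n-1}."""
    grad_b = dL_du.sum(axis=0)                           # Eq. (22)
    grad_W = np.einsum('ti,tj->ij', dL_du, spikes_prev)  # Eq. (23): sum of outer products
    return grad_W, grad_b

# Hypothetical error signals for a 2-neuron layer over T = 2 steps.
gW, gb = stbp_param_grads(np.array([[1.0, 0.0], [0.0, 2.0]]),
                          np.array([[1.0, 1.0], [0.0, 1.0]]))
```

Because u^{t,n} sees W and b at every time step, each step contributes one outer product of the membrane error with the pre-synaptic spike pattern, just as BPTT accumulates gradients over unrolled steps.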
Derivative Approximation of the Non-differentiable Spike Activity

In the previous sections we have shown how to obtain the gradient information via STBP, but the issue of non-differentiability at each spiking time remains to be addressed. The derivative of the output gate g(u) is required for the STBP training in Eqs. (11)-(22). Theoretically, the derivative of g(u) is the non-differentiable Dirac function \delta(u), which greatly challenges the effective learning of SNNs [23]: it is zero everywhere except for an infinite value at zero, causing gradient vanishing or explosion that disables the error propagation. One existing method viewed the discontinuities of the potential at spiking times as noise and claimed that this benefits model robustness [23], [29], but it did not directly address the non-differentiability of the spike activity. To this end, we introduce four curves to approximate the derivative of the spike activity, denoted by h_1, h_2, h_3 and h_4 in Figure 3b:

    h_1(u) = \frac{1}{a_1} \, \mathrm{sign}\!\left( |u - V_{th}| < \frac{a_1}{2} \right)    (24)

    h_2(u) = \left( \frac{\sqrt{a_2}}{2} - \frac{a_2}{4} |u - V_{th}| \right) \mathrm{sign}\!\left( \frac{2}{\sqrt{a_2}} - |u - V_{th}| \right)    (25)

    h_3(u) = \frac{1}{a_3} \frac{e^{\frac{V_{th} - u}{a_3}}}{\left(1 + e^{\frac{V_{th} - u}{a_3}}\right)^2}    (26)

    h_4(u) = \frac{1}{\sqrt{2 \pi a_4}} e^{-\frac{(u - V_{th})^2}{2 a_4}}    (27)

where a_i (i = 1, 2, 3, 4) determines the curve shape and steepness. In fact, h_1, h_2, h_3 and h_4 are the derivatives of the rectangular function, a polynomial function, the sigmoid function and the Gaussian cumulative distribution function, respectively. To be consistent with the Dirac function \delta(u), the coefficients a_i ensure that the integral of each function is 1. It can be shown that all of the above candidates satisfy

    \lim_{a_i \to 0^+} h_i(u) = \frac{dg}{du}, \quad i = 1, 2, 3, 4    (28)

Thus, \partial g / \partial u in Eqs. (11)-(22) for STBP can be approximated by

    \frac{dg}{du} \approx h_i(u)    (29)
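For concreteness, the four candidate curves of Eqs. (24)-(27) can be written directly in NumPy. This sketch uses V_{th} = 1.5 (the MNIST setting of Table I) and the default a_i = 1.0, and checks numerically that each curve has unit area, as required for consistency with the Dirac delta:

```python
import numpy as np

V_TH = 1.5  # MNIST threshold from Table I

def h1(u, a=1.0):  # rectangular, Eq. (24)
    return (np.abs(u - V_TH) < a / 2) / a

def h2(u, a=1.0):  # triangular ("polynomial"), Eq. (25)
    d = np.abs(u - V_TH)
    return np.where(d < 2 / np.sqrt(a), np.sqrt(a) / 2 - (a / 4) * d, 0.0)

def h3(u, a=1.0):  # derivative of the sigmoid, Eq. (26)
    e = np.exp((V_TH - u) / a)
    return e / (a * (1 + e) ** 2)

def h4(u, a=1.0):  # Gaussian, Eq. (27)
    return np.exp(-((u - V_TH) ** 2) / (2 * a)) / np.sqrt(2 * np.pi * a)

# Riemann-sum check that each approximation integrates to 1, like the Dirac delta.
u = np.linspace(V_TH - 20, V_TH + 20, 200001)
areas = [(h(u) * (u[1] - u[0])).sum() for h in (h1, h2, h3, h4)]
```

In a training loop, any h_i(u^{t,n}) would simply replace \partial g / \partial u wherever it appears in Eqs. (12)-(21).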
In Section III-C, we analyze the influence of different curves and different values of a_i on SNN performance.

III. RESULTS

A. Parameter Initialization

The initialization of parameters such as the weights and thresholds is crucial for stabilizing the firing activities of the whole network. We should ensure a timely response to pre-synaptic stimuli while avoiding too many spikes, which would reduce neuronal selectivity. The multiply-accumulate operations over the pre-spikes and weights, and the threshold comparison, are the two key steps of the forward-pass computation. This indicates that the relative magnitude of the weights and thresholds determines the effectiveness of parameter initialization. In this paper, we fix the threshold of each neuron to a constant for simplicity and adjust only the weights to control the activity balance. First, we initialize all weight parameters by sampling from the standard uniform distribution

    W \sim U[-1, 1]    (30)

Then we normalize these parameters by

    w_{ij}^n = \frac{w_{ij}^n}{\sqrt{\sum_{j=1}^{l(n-1)} (w_{ij}^n)^2}}, \quad i = 1, \ldots, l(n)    (31)

The other parameter settings are presented in Table I. Furthermore, throughout all simulations in this work, no complex skill as in [19], [23] is required, such as fixed-amount-proportional reset, error normalization, or weight/threshold regularization.

B. Dataset Experiments

We test our SNN model and the STBP training method on several datasets: the static MNIST, a custom object detection dataset, and the dynamic N-MNIST. The input of the first layer should be a spike train, which requires converting samples from the static datasets into spike events. To this end, Bernoulli sampling from the original pixel intensity to the spike rate is used in this paper.

1) Spatio-temporal fully connected neural network: Static Dataset.
The MNIST dataset of handwritten digits [30] (Figure 4b) and a custom dataset for object detection [14] (Figure 4a) are chosen to test our method. MNIST comprises a training set of 60,000 labelled handwritten digits and a testing set of 10,000 labelled digits, generated from the postal codes 0-9. Each digit sample is a 28 × 28 grayscale image. The object detection dataset is a two-category image dataset created by our lab for pedestrian detection. It includes 1,509 training samples and 631 testing samples of 28 × 28 grayscale images. Each image sample is labelled 0 or 1 according to whether it contains a pedestrian, as illustrated in Figure 4a. The upper and lower sub-figures in Figure 4c show the spike patterns of 25 input neurons converted from the central 5 × 5 pixel patch of a sample from the object detection dataset and from MNIST, respectively. Figure 4d illustrates the spike pattern of the output layer within 15 ms, before and after STBP training, for the stimulus of digit 9. At the beginning, the neurons in the output layer fire randomly; after training, the 10th neuron, which codes digit 9, fires most intensively, indicating a correct inference. Table II compares our method with several other advanced results that use a similar MLP architecture on MNIST. Although we do not use any complex skill, the proposed STBP training method outperforms all reported results, achieving the best testing accuracy of 98.89%. Table III compares our model with a typical MLP on the object detection dataset. The contrast model is a typical artificial neural network (ANN), i.e., not an SNN, and in the following we use "non-spiking network" to distinguish them. Our model achieves better performance than the non-spiking MLP.
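The weight initialization of Eqs. (30)-(31) and the Bernoulli conversion of static images into spike trains, used throughout these experiments, can be sketched together as follows. The layer sizes and window length mirror the 784-400-10 MLP and T = 30 ms setting, but the random seed and the toy image are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(l_prev, l_n):
    """Eqs. (30)-(31): draw W ~ U[-1, 1], then scale each row so the
    squared weights into every neuron sum to 1, balancing the input
    drive against the fixed firing threshold."""
    W = rng.uniform(-1.0, 1.0, size=(l_n, l_prev))
    return W / np.sqrt((W ** 2).sum(axis=1, keepdims=True))

def bernoulli_encode(image, T):
    """Convert pixel intensities in [0, 1] into a (T, n_pixels) binary
    spike train: each pixel fires at each step with probability equal
    to its intensity."""
    p = np.asarray(image, dtype=float).reshape(-1)
    return (rng.random((T, p.size)) < p).astype(np.uint8)

W = init_weights(784, 400)                   # first layer of the 784-400-10 MLP
spikes = bernoulli_encode(np.ones(4), T=30)  # a saturated pixel fires every step
```

The row normalization keeps the expected pre-synaptic drive of every neuron on a comparable scale relative to the fixed threshold, which is the activity-balance goal stated above.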
Note that the overall firing rate of the input spike trains from the object detection dataset is higher than that from MNIST, so we increase the threshold to 2.0 in those simulation experiments.

Dynamic Dataset. Compared with static datasets, a dynamic dataset such as N-MNIST [32] contains richer temporal features and is therefore more suitable for exploiting the potential of SNNs. We use N-MNIST as an example to evaluate the capability of our STBP method on a dynamic dataset. N-MNIST converts the static MNIST dataset into a dynamic version of spike trains using a dynamic vision sensor (DVS) [33]. For each original MNIST sample, the work in [32] moves the DVS along the three sides of an isosceles triangle in turn (Figure 5b) and collects the generated spike train, which

Fig. 3: Derivative approximation of the non-differentiable spike activity. (a) Step activation function of the spike activity and its original derivative, a typical Dirac function \delta(u) with an infinite value at u = 0 and zero elsewhere; this non-differentiable property disables the error propagation. (b) Several typical curves for approximating the derivative of the spike activity.

TABLE I: Parameters set in our experiments

    Parameter                   Description                                                Value
    T                           Time window                                                30 ms
    V_th                        Threshold (MNIST / object detection dataset / N-MNIST)     1.5 / 2.0 / 0.2
    \tau                        Decay factor (MNIST / object detection dataset / N-MNIST)  0.1 ms / 0.15 ms / 0.2 ms
    a_1, a_2, a_3, a_4          Derivative approximation parameters (Figure 3)             1.0
    dt                          Simulation time step                                       1 ms
    r                           Learning rate (SGD)                                        0.5
    \beta_1, \beta_2, \lambda   Adam parameters                                            0.9, 0.999, 1 - 10^{-8}

Fig. 4: Static dataset experiments. (a) A custom dataset for object detection. This dataset is a two-category image set built by our lab for pedestrian detection. An image sample is labelled 0 or 1 according to whether it contains a pedestrian.
The images in the yellow boxes are labelled 1, and the rest are labelled 0. (b) MNIST dataset. (c) Raster plots of the spike patterns of 25 input neurons converted from the central 5 × 5 pixel patch of a sample from the object detection dataset (top) and MNIST (bottom). (d) Raster plots comparing the output spike patterns before and after STBP training for a digit 9 from MNIST.

is triggered by the intensity change at each pixel. Figure 5a records the saccade results on digit 0. Each sub-graph records the spike train within 10 ms, and each 100 ms represents one saccade period. Because the intensity of each pixel can change in two directions (brighter or darker), the DVS captures two corresponding kinds of spike events, denoted as on-events and off-events (Figure 5c). Since N-MNIST allows relative shifts of the images during the saccade process, it produces a 34 × 34 pixel range. From the spatio-temporal representation in Figure 5c, we can see

TABLE II: Comparison with state-of-the-art spiking networks with similar architectures on MNIST.

    Model                              Network structure   Training skills                                Accuracy
    Spiking RBM (STDP) [31]            784-500-40          None                                           93.16%
    Spiking RBM (pre-training*) [20]   784-500-500-10      None                                           97.48%
    Spiking MLP (pre-training*) [19]   784-1200-1200-10    Weight normalization                           98.64%
    Spiking MLP (BP) [22]              784-200-200-10      None                                           97.66%
    Spiking MLP (STDP) [15]            784-6400            None                                           95.00%
    Spiking MLP (BP) [23]              784-800-10          Error normalization/parameter regularization   98.71%
    Spiking MLP (STBP)                 784-800-10          None                                           98.89%

We mainly compare with methods that have a similar network architecture; * means that the model is based on a pre-trained ANN model.

TABLE III: Comparison with a typical MLP on the object detection dataset.

    Model                   Network structure   Accuracy   Mean Interval*
    Non-spiking MLP (BP)    784-400-10          98.31%     [97.62%, 98.57%]
    Spiking MLP (STBP)      784-400-10          98.34%     [97.94%, 98.57%]

* results over epochs [201, 210].
that the on-events and off-events are so different that we use two channels to distinguish them. Therefore, the network structure is 34 × 34 × 2-400-400-10.

Fig. 5: Dynamic dataset of N-MNIST. (a) Each sub-picture shows a 10 ms-wide spike train during the saccades. (b) The spike train is generated by moving the dynamic vision sensor (DVS) in turn along directions 1, 2 and 3. (c) Spatio-temporal representation of the spike train of digit 0 [32], where the upper and lower plots denote the on-events and off-events, respectively.

Table IV compares our STBP method with state-of-the-art results on the N-MNIST dataset. The upper five results are based on ANNs, and the lower four results, including our method, use SNNs. The ANN methods usually adopt a frame-based approach, which collects the spike events within a time interval (50 ms ~ 300 ms) to form an image frame, and then uses conventional image classification algorithms to train the networks. Since the transformed images are often blurred, the frame-based preprocessing harms model performance and abandons the hardware-friendly event-driven paradigm. As can be seen from Table IV, the ANN models generally perform worse than the SNN models. In contrast, SNNs can naturally handle event-stream patterns, and by making better use of the spatio-temporal features of event streams, our proposed STBP method achieves the best accuracy, 98.78%, compared with all reported ANN and SNN methods. A major advantage of our method is that it does not use any complex training skills, which is beneficial for future hardware implementation.

2) Spatio-temporal convolutional neural network: Extending our framework to convolutional neural network structures allows the network to go deeper and grants it more powerful SD information. Here we use our framework to build a spatio-temporal convolutional neural network.
Compared with our spatio-temporal fully connected network, the main difference is the processing of the input image, where convolution replaces the weighted summation. Specifically, in a convolution layer, each neuron receives the convolved input and updates its state according to the LIF model. In a pooling layer, because the binary coding of SNNs is inappropriate for standard max pooling, we use average pooling instead. Our spiking CNN model is also tested on the MNIST dataset as well as the object detection dataset. For MNIST, our network contains two convolution layers with 5 × 5 kernels and two average pooling layers, arranged alternately, followed by one hidden layer. As with traditional CNNs, we use elastic distortion [36] to preprocess the dataset. Table V records the state-of-the-art spiking convolutional neural networks on MNIST. Our proposed spiking CNN model obtains 99.42% accuracy, outperforming the other reported spiking networks with a slightly lighter structure. Furthermore, we configure the same network structure on the custom object detection dataset to evaluate the proposed model's performance. The testing accuracy is reported after training 200 epochs. Table VI indicates that our spiking CNN model achieves performance competitive with the non-spiking

TABLE IV: Comparison with state-of-the-art networks on N-MNIST.
    Model                              Network structure      Training skills                                Accuracy
    Non-spiking CNN (BP) [24]          -                      None                                           95.30%
    Non-spiking CNN (BP) [34]          -                      None                                           98.30%
    Non-spiking MLP (BP) [23]          34 × 34 × 2-800-10     None                                           97.80%
    LSTM (BPTT) [24]                   -                      Batch normalization                            97.05%
    Phased-LSTM (BPTT) [24]            -                      None                                           97.38%
    Spiking CNN (pre-training*) [34]   -                      None                                           95.72%
    Spiking MLP (BP) [23]              34 × 34 × 2-800-10     Error normalization/parameter regularization   98.74%
    Spiking MLP (BP) [35]              34 × 34 × 2-10000-10   None                                           92.87%
    Spiking MLP (STBP)                 34 × 34 × 2-800-10     None                                           98.78%

We show only the MLP-based network structures; for the other structures refer to the above references. * means that the model is based on a pre-trained ANN model.

TABLE V: Comparison with other spiking CNNs on MNIST.

    Model                              Network structure                    Accuracy
    Spiking CNN (pre-training*) [13]   28 × 28 × 1-12C5-P2-64C5-P2-10       99.12%
    Spiking CNN (BP) [23]              28 × 28 × 1-20C5-P2-50C5-P2-200-10   99.31%
    Spiking CNN (STBP)                 28 × 28 × 1-15C5-P2-40C5-P2-300-10   99.42%

We mainly compare with methods that have a similar network architecture; * means that the model is based on a pre-trained ANN model.

TABLE VI: Comparison with a typical CNN on the object detection dataset.

    Model                  Network structure        Accuracy   Mean Interval*
    Non-spiking CNN (BP)   28 × 28 × 1-6C3-300-10   98.57%     [98.57%, 98.57%]
    Spiking CNN (STBP)     28 × 28 × 1-6C3-300-10   98.59%     [98.26%, 98.89%]

* results over epochs [201, 210].

CNN.

C. Performance Analysis

1) The Impact of Derivative Approximation Curves: In Section II-C, we introduced different curves for approximating the ideal derivative of the spike activity. Here we analyze the influence of the different approximation curves on the testing accuracy. The experiments are conducted on the MNIST dataset with the network structure 784-400-10. The testing accuracy is reported after training 200 epochs. First, we compare the impact of different curve shapes on model performance.
In our simulation we use the aforementioned h1, h2, h3 and h4 shown in Figure 3b. Figure 6a illustrates the results for approximations of different shapes. We observe that the different nonlinear curves h1, h2, h3 and h4 present only small variations in performance. Furthermore, we use the rectangular approximation as an example to explore the impact of its width on the results. We set a1 = 0.1, 1.0, 2.5, 5.0, 7.5, 10, and the corresponding results are plotted in Figure 6b, where different colors denote different a1 values. Both too large and too small an a1 value degrade performance; in our simulation, a1 = 2.5 achieves the highest testing accuracy, which implies that the width and steepness of the rectangle influence model performance. Combining Figures 6a and 6b, we conclude that the key point in approximating the derivative of the spike activity is to capture its nonlinear nature, while the specific shape is not so critical.

2) The Impact of Temporal Domain: A major contribution of this work is introducing the temporal domain into the existing spatial-domain-based BP training method, which makes full use of the spatio-temporal dynamics of SNNs and enables high-performance training. Now we quantitatively analyze the impact of the TD item. The experiment configuration is the same as in the previous section (784-400-10), and we again report testing results after training for 200 epochs. Here the existing BP in the SD only is termed SDBP.

Table VII records the simulation results. The testing accuracy of SDBP is lower than that of STBP on both datasets, which shows that the temporal information is beneficial to model performance. Specifically, compared to STBP, SDBP loses 1.21% accuracy on the object tracking dataset, which is 5 times larger than the loss on MNIST. The results also imply that the performance of SDBP is not sufficiently stable.
In addition to interference from the dataset itself, this variation may stem from the instability of SNN training. In fact, the training of SNNs relies heavily on parameter initialization, which remains a great challenge for SNN applications. In many reported works, researchers leverage special skills or mechanisms to improve the training performance, such as lateral inhibition, regularization and normalization. In contrast, by using our STBP training method, much higher performance can be achieved on the same network: the testing accuracy of STBP reaches 98.48% on MNIST and 98.32% on the object detection dataset. Note that STBP achieves this high accuracy without using any complex training skills. This stability and robustness indicate that the dynamics in the TD fundamentally hold great potential for SNN computing, and this work indeed provides a new perspective.

Fig. 6: Comparisons of different derivative approximation curves. (a) The impact of different approximations. (b) The impact of different widths of the rectangular approximation.

TABLE VII: Comparison of the SDBP model and the STBP model on different datasets.

Model | Dataset | Network structure | Training skills | Accuracy | Mean Interval*
Spiking MLP (SDBP) | Object tracking | 784-400-10 | None | 97.11% | [96.04%, 97.78%]
Spiking MLP (SDBP) | MNIST | 784-400-10 | None | 98.29% | [98.23%, 98.39%]
Spiking MLP (STBP) | Object tracking | 784-400-10 | None | 98.32% | [97.94%, 98.57%]
Spiking MLP (STBP) | MNIST | 784-400-10 | None | 98.48% | [98.42%, 98.51%]

* Results over epochs [201, 210].

IV. CONCLUSION

In this work, a unified framework that allows supervised training of spiking neural networks, in the same way that backpropagation is implemented in deep neural networks (DNNs), has been built by exploiting the spatio-temporal information in the networks.
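As a concrete illustration of this framework, an LIF neuron can be written as an explicit recurrence over discrete time steps. The sketch below is a minimal rendering under stated assumptions, not the paper's exact update: the decay factor, reset form, threshold value and all variable names are placeholders chosen for illustration.

```python
import numpy as np

def lif_layer_step(x_t, u_prev, o_prev, w, decay=0.3, v_th=0.5):
    """One time step of an iterative LIF layer (illustrative form).

    u_prev, o_prev: membrane potentials and spikes from the previous step.
    The (1 - o_prev) factor resets neurons that just fired, `decay` models
    the leak, and x_t @ w is the spatial (layer-to-layer) input."""
    u = decay * u_prev * (1.0 - o_prev) + x_t @ w
    o = (u >= v_th).astype(float)  # non-differentiable step; STBP replaces
    return u, o                    # its derivative with an approximation

# Drive a small layer with random input spikes for T time steps.
rng = np.random.default_rng(0)
T, n_in, n_out = 8, 4, 3
w = rng.normal(scale=0.5, size=(n_in, n_out))
u, o = np.zeros(n_out), np.zeros(n_out)
for _ in range(T):
    x_t = (rng.random(n_in) < 0.5).astype(float)  # hypothetical input spikes
    u, o = lif_layer_step(x_t, u, o, w)
print(o)  # binary spike vector at the final time step
```

Unrolling this recurrence over the time steps and applying the chain rule along both the layer index (SD) and the time index (TD) is, in essence, what the STBP procedure does.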
Our major contributions are summarized as follows:

1) We have presented a framework based on an iterative leaky integrate-and-fire model, which enables us to implement spatio-temporal backpropagation on SNNs. Unlike previous methods that primarily focus on spatial-domain features, our framework combines and exploits the features of SNNs in both the spatial domain and the temporal domain.

2) We have designed the STBP training algorithm and implemented it on both MLP and CNN architectures, and verified it on both static and dynamic datasets. Results have shown that our model is superior to state-of-the-art SNNs on relatively small-scale spiking MLPs and CNNs, and outperforms DNNs of the same network size on the dynamic N-MNIST dataset. An attractive advantage of our algorithm is that it does not need the extra training techniques generally required by existing schemes, and is therefore easier to implement in large-scale networks. The results have also revealed that using spatio-temporal complexity to solve problems could better fulfill the potential of SNNs.

3) We have introduced an approximated derivative to address the non-differentiability of the spike activity. Controlled experiments indicate that the steepness and width of the approximation curve affect the model's performance, and that the key point of the approximation is to capture the nonlinear nature of the derivative, while its specific shape is not critical.

Because the brain combines complexity in the temporal and spatial domains to handle input information, we would also argue that implementing STBP on SNNs is more biologically plausible than applying BP to DNNs. The fact that STBP does not rely on many training skills also makes it more hardware-friendly and useful for the design of neuromorphic chips with online learning ability. Regarding future research topics, we believe two issues are necessary and important.
One is to apply our framework to more problems with timing characteristics, such as dynamic data processing, video stream identification and speech recognition. The other is how to accelerate the supervised training of large-scale SNNs on GPUs/CPUs or neuromorphic chips. The former aims to further exploit the rich spatio-temporal features of SNNs to deal with dynamic problems, and the latter may greatly promote the application of large-scale SNNs in real-life scenarios.

REFERENCES

[1] P. Chaudhari and H. Agarwal, Progressive Review Towards Deep Learning Techniques. Springer Singapore, 2017.
[2] L. Deng and D. Yu, "Deep learning: Methods and applications," Foundations and Trends in Signal Processing, vol. 7, no. 3, pp. 197–387, 2014.
[3] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, and J. Long, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint, pp. 675–678, 2014.
[4] G. Hinton, L. Deng, D. Yu, and G. E. Dahl, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[5] K. He, X. Zhang, S. Ren, and J. Sun, Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. Springer International Publishing, 2014.
[6] X. Zhang, Z. Xu, C. Henriquez, and S. Ferrari, "Spike-based indirect training of a spiking neural network-controlled virtual insect," in Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on. IEEE, 2013, pp. 6798–6805.
[7] J. N. Allen, H. S. Abdel-Aty-Zohdy, and R. L. Ewing, "Cognitive processing using spiking neural networks," in IEEE 2009 National Aerospace and Electronics Conference, 2009, pp. 56–64.
[8] N. Kasabov and E. Capecci, "Spiking neural network methodology for modelling, classification and understanding of EEG spatio-temporal data measuring cognitive processes," Information Sciences, vol. 294, no. C, pp. 565–575, 2015.
[9] B. V. Benjamin, P. Gao, E. McQuinn, S. Choudhary, A. R. Chandrasekaran, J. M. Bussat, R. Alvarez-Icaza, J. V. Arthur, P. A. Merolla, and K. Boahen, "Neurogrid: A mixed-analog-digital multichip system for large-scale neural simulations," Proceedings of the IEEE, vol. 102, no. 5, pp. 699–716, 2014.
[10] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, and Y. Nakamura, "Artificial brains. A million spiking-neuron integrated circuit with a scalable communication network and interface," Science, vol. 345, no. 6197, pp. 668–673, 2014.
[11] S. B. Furber, F. Galluppi, S. Temple, and L. A. Plana, "The SpiNNaker project," Proceedings of the IEEE, vol. 102, no. 5, pp. 652–665, 2014.
[12] T. Hwu, J. Isbell, N. Oros, and J. Krichmar, "A self-driving robot using deep convolutional neural networks on neuromorphic hardware," arXiv preprint, 2016.
[13] S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, and D. R. Barch, "Convolutional networks for fast, energy-efficient neuromorphic computing," Proceedings of the National Academy of Sciences of the United States of America, vol. 113, no. 41, p. 11441, 2016.
[14] S. S. Zhang and L. P. Shi, "Creating more intelligent robots through brain-inspired computing," Science (suppl.), vol. 354, 2016.
[15] P. U. Diehl and M. Cook, "Unsupervised learning of digit recognition using spike-timing-dependent plasticity," Frontiers in Computational Neuroscience, vol. 9, p. 99, 2015.
[16] D. Querlioz, O. Bichler, P. Dollfus, and C. Gamrat, "Immunity to device variations in a spiking neural network with memristive nanodevices," IEEE Transactions on Nanotechnology, vol. 12, no. 3, pp. 288–295, 2013.
[17] S. R. Kheradpisheh, M. Ganjtabesh, and T. Masquelier, "Bio-inspired unsupervised learning of visual features leads to robust invariant object recognition," Neurocomputing, vol. 205, no. C, pp. 382–392, 2016.
[18] J. A. Perez-Carrasco, B. Zhao, C. Serrano, B. Acha, T. Serrano-Gotarredona, S. Chen, and B. Linares-Barranco, "Mapping from frame-driven to frame-free event-driven vision systems by low-rate rate-coding and coincidence processing. Application to feedforward ConvNets," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2706–2719, 2013.
[19] P. U. Diehl, D. Neil, J. Binas, and M. Cook, "Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing," in International Joint Conference on Neural Networks, 2015, pp. 1–8.
[20] P. O'Connor, D. Neil, S. C. Liu, T. Delbruck, and M. Pfeiffer, "Real-time classification and sensor fusion with a spiking deep belief network," Frontiers in Neuroscience, vol. 7, p. 178, 2013.
[21] E. Hunsberger and C. Eliasmith, "Spiking deep networks with LIF neurons," arXiv preprint, 2015.
[22] P. O'Connor and M. Welling, "Deep spiking networks," arXiv preprint, 2016.
[23] J. H. Lee, T. Delbruck, and M. Pfeiffer, "Training deep spiking neural networks using backpropagation," Frontiers in Neuroscience, vol. 10, 2016.
[24] D. Neil, M. Pfeiffer, and S. C. Liu, "Phased LSTM: Accelerating recurrent network training for long or event-based sequences," arXiv preprint, 2016.
[25] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," Neural Computation, vol. 12, no. 10, p. 2451, 1999.
[26] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[27] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Gated feedback recurrent neural networks," Computer Science, pp. 2067–2075, 2015.
[28] P. J. Werbos, "Backpropagation through time: What it does and how to do it," Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[29] Y. Bengio, T. Mesnard, A. Fischer, S. Zhang, and Y. Wu, "An objective function for STDP," Computer Science, 2015.
[30] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[31] E. Neftci, S. Das, B. Pedroni, K. Kreutz-Delgado, and G. Cauwenberghs, "Event-driven contrastive divergence for spiking neuromorphic systems," Frontiers in Neuroscience, vol. 7, p. 272, 2013.
[32] G. Orchard, A. Jayawant, G. K. Cohen, and N. Thakor, "Converting static image datasets to spiking neuromorphic datasets using saccades," Frontiers in Neuroscience, vol. 9, 2015.
[33] P. Lichtsteiner, C. Posch, and T. Delbruck, "A 128×128 120 dB 15 µs latency asynchronous temporal contrast vision sensor," IEEE Journal of Solid-State Circuits, vol. 43, no. 2, pp. 566–576, 2007.
[34] D. Neil and S. C. Liu, "Effective sensor fusion with event-based sensors and deep network architectures," in IEEE Int. Symposium on Circuits and Systems, 2016.
[35] G. K. Cohen, G. Orchard, S. H. Leng, J. Tapson, R. B. Benosman, and A. van Schaik, "Skimming digits: Neuromorphic classification of spike-encoded images," Frontiers in Neuroscience, vol. 10, no. 184, 2016.
[36] P. Y. Simard, D. Steinkraus, and J. C. Platt, "Best practices for convolutional neural networks applied to visual document analysis," in International Conference on Document Analysis and Recognition, 2003, p. 958.
