Unsupervised AER Object Recognition Based on Multiscale Spatio-Temporal Features and Spiking Neurons

Qianhui Liu, Gang Pan, Haibo Ruan, Dong Xing, Qi Xu, and Huajin Tang

Abstract—This paper proposes an unsupervised address event representation (AER) object recognition approach. The proposed approach consists of a novel multiscale spatio-temporal feature (MuST) representation of input AER events and a spiking neural network (SNN) using spike-timing-dependent plasticity (STDP) for object recognition with MuST. MuST extracts the features contained in both the spatial and temporal information of the AER event flow, and meanwhile forms an informative and compact feature spike representation. We show not only how MuST exploits spikes to convey information more effectively, but also how it benefits recognition using an SNN. The recognition process is performed in an unsupervised manner, which does not need to specify the desired status of every single neuron of the SNN, and thus can be flexibly applied in real-world recognition tasks. The experiments are performed on five AER datasets including a new one named GESTURE-DVS. Extensive experimental results show the effectiveness and advantages of the proposed approach.

Index Terms—address event representation (AER), spatio-temporal features, spiking neural network, unsupervised learning.

I. INTRODUCTION

NEUROMORPHIC engineering takes inspiration from biology in order to construct brain-like intelligent systems and has been applied in many fields such as pattern recognition, neuroscience, and computer vision [1], [2]. Address event representation (AER) sensors are neuromorphic devices imitating the mechanism of the human retina. Traditional cameras usually record the visual input as images at a fixed frame rate, which suffers from severe data redundancy due to the strong spatio-temporal correlation of the scene.
This problem can be solved to a large extent with AER vision sensors, which naturally respond to moving objects and ignore static redundant information. Each pixel in the AER sensor individually monitors the relative change of light intensity in its receptive field. If the change exceeds a predefined threshold, an event is emitted by that pixel. Each event carries the information of a timestamp (the time when the event was emitted), an address (the position of the corresponding pixel in the sensor) and a polarity (the direction of the light change, i.e., dark-to-light or light-to-dark). The final output of the sensor is a stream of events collected from each pixel, encapsulating only the dynamic information of the visual input. Compared with traditional cameras, AER sensors have the advantage of maintaining an asynchronous, high-temporal-resolution and sparse representation of the scene. Commonly used AER sensors include the asynchronous time-based image sensor (ATIS) [3], the dynamic vision sensor (DVS) [4], [5], and the dynamic and active pixel vision sensor (DAVIS) [6].

The output of an AER vision sensor is event-based; however, there remain open challenges in how to extract features from events and how to design an appropriate recognition mechanism. Peng et al. [7] proposed a feature extraction model for AER events called Bag of Events (BOE) based on the joint probability distribution of events.

(Q. Liu, H. Ruan, D. Xing, Q. Xu, and H. Tang are with the College of Computer Science, Zhejiang University, Hangzhou 310027, China (e-mail: qianhuiliu@zju.edu.cn; hbruan@zju.edu.cn; dongxing@zju.edu.cn; xuqi123@zju.edu.cn; huajin.tang@gmail.com). G. Pan is with the State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, China (e-mail: gpan@zju.edu.cn). Corresponding author: Gang Pan.)
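As a minimal illustration of the event anatomy described above (timestamp, address, polarity), an AER stream can be modeled as a time-ordered list of tuples; the field names here are our own, not a sensor's actual output format:

```python
from typing import NamedTuple, List

class Event(NamedTuple):
    """A single AER event: timestamp, pixel address, and polarity."""
    t: int   # timestamp of emission (e.g., microseconds)
    x: int   # pixel column in the sensor array
    y: int   # pixel row in the sensor array
    p: int   # polarity: +1 (dark-to-light) or -1 (light-to-dark)

# A toy event stream: the sensor output is just a time-ordered sequence
# of such events; pixels that saw no change emit nothing.
stream: List[Event] = [
    Event(t=100, x=3, y=7, p=+1),
    Event(t=105, x=4, y=7, p=-1),
    Event(t=230, x=3, y=8, p=+1),
]

# Sparsity: only the pixels that actually changed appear in the stream.
active_pixels = {(e.x, e.y) for e in stream}
```

This sparsity, together with the fine-grained timestamps, is what distinguishes the representation from fixed-rate frames.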
In addition, there are some existing works inspired by the cortical mechanisms of human vision, with a hierarchical organization that can provide features of increasing complexity and invariance to size and position [8]. Chen et al. [9] proposed an algorithm to extract size- and position-invariant line features for recognition of objects, especially human postures, in real-time video sequences from address-event temporal-difference image sensors. Zhao et al. [10] presented an event-driven convolution-based network for feature extraction that takes data from temporal-contrast AER events, and also introduced a forgetting mechanism in feature extraction to retain the timing information of events in the features. Lagorce et al. [11] proposed the HOTS model, which relies on a hierarchical time-oriented approach to extract spatio-temporal features called time-surfaces from the asynchronously acquired dynamics of a visual scene. The time-surfaces use the relative timings of events to give contextual information. Orchard et al. [12] proposed the HFirst model, in which a spiking hierarchical model with four layers was introduced for feature extraction by utilizing the timing information inherently present in AER data.

In addition, biological study of the visual ventral pathway indicates that vision sensing and object recognition in the brain are performed in the form of spikes [13]. Several coding hypotheses [14], [15] have been proposed from different aspects to explain how these spikes represent information in the brain. Neurons in the visual cortex have been observed to precisely respond to stimuli on a millisecond timescale [16]. This supports the hypothesis of temporal coding, which considers that information about the stimulus is contained in the specific precise spike timing of the neuron. To implement temporal coding, we need to specify the coding function that maps the features of AER events to precise spike timings.
Fig. 1. The flow chart of the proposed AER object recognition. The event flow from the AER sensor is sent concurrently to motion symbol detection (MSD) [10] and an event queue. MSD adaptively partitions the events waiting in the event queue into segments, and streams the events segment by segment to neurons in the S1 layer for spatio-temporal feature extraction. Neurons have their own scale of receptive field and respond best to a certain orientation. The neuron responses reflect the strength of features, which cover both the spatial features of different scales and orientations and the temporal information. Neurons of the same receptive scale and orientation are organized into one feature map (denoted by blue squares), and the max responses in adjacent non-overlapping 2 × 2 neuron regions of each feature map reach the C1 layer. The C1 features are coded to spikes, and multiscale features having the same orientation and position in the C1 maps flow into the same encoding neuron. The encoding neurons emit spikes to trigger the learning neurons, and the relative timing of spikes triggers spike-timing-dependent plasticity (STDP) on the excitatory synapses during training. Each learning neuron inhibits the others through inhibitory synapses (denoted by dashed lines), ensuring that different neurons learn different patterns. After training, each learning neuron is assigned a class label based on its sensitivity to patterns of different classes. The final recognition decision is determined by averaging the firing rates of the learning neurons per class and choosing the class with the highest average firing rate.
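The readout step at the end of the caption above, averaging per-class firing rates and taking the best class, can be sketched as follows (all function and variable names are illustrative, not from the paper):

```python
from collections import defaultdict

def predict(firing_rates, neuron_labels):
    """Average the firing rates of learning neurons per assigned class
    and return the class with the highest average.

    firing_rates:  dict neuron_id -> spike count for the presented pattern
    neuron_labels: dict neuron_id -> class label assigned after training
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for nid, rate in firing_rates.items():
        label = neuron_labels[nid]
        totals[label] += rate
        counts[label] += 1
    return max(totals, key=lambda c: totals[c] / counts[c])

# Toy example: two neurons labelled "club", one labelled "heart".
rates = {0: 5.0, 1: 1.0, 2: 2.0}
labels = {0: "club", 1: "club", 2: "heart"}
# club average = 3.0, heart average = 2.0, so "club" wins.
```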
How to select a coding function that can better convey the information contained in features into spikes and contribute to object recognition becomes a key question. We also design the coding mechanism from a spatial perspective, since the spatial information of feature spikes also takes effect in object recognition. Inspired by previous works, we introduce an encoding scheme for AER events that extracts the spatio-temporal features of raw events and forms a feature spike representation. Considering that biological neurons are inherently capable of processing temporal information, we present a cortex-like hierarchical feature extraction based on leaky integrate-and-fire (LIF) spiking neurons with spatial sensitivity. The responses of these neurons are accumulated along the time axis and reflect the strength of the current spatio-temporal features. We also propose a coding mechanism to obtain the spatio-temporal feature spikes, which consists of a natural-logarithmic temporal coding function and multiscale spatial fusion. Through the proposed coding function, we obtain feature spikes with an even temporal distribution. We will show that these spikes are more informative and contribute to recognition using an SNN. Meanwhile, the spatio-temporal features of multiple scales are highly correlated and are fused into spike-trains to form a multiscale spatio-temporal feature representation, which we call MuST.

Since MuST is in the form of spikes, it is natural to employ a spiking neural network (SNN) to learn the spike patterns. Compared with traditional classifiers, SNNs are more natural for interpreting the information processing mechanisms of the brain [17], and more powerful at processing both spatial and temporal information [18]. In addition, SNNs have the advantage of low power consumption; for example, current implementations of SNNs on neuromorphic hardware use only a few nJ or even pJ for transmitting a spike [19].
Most existing works for AER object recognition, such as [10] and [20], have chosen supervised SNN classifiers for recognition. These supervised classifiers need to specify the desired status of firing or not, or even the firing times, of neurons. However, setting the desired status of every single neuron is intricate and tedious in real-world recognition tasks. We consider the unsupervised SNN learning rule spike-timing-dependent plasticity (STDP) [21]. STDP works by considering the relative timing of presynaptic and postsynaptic spikes. According to this rule, if the presynaptic neuron fires earlier (later) than the postsynaptic neuron, the synaptic weight is strengthened (weakened). Through STDP learning, each postsynaptic neuron naturally becomes sensitive to one or some similar presynaptic spike patterns. Existing works have shown the powerful ability of STDP to learn spike patterns. Diehl et al. [19] proposed an SNN for image recognition that employs STDP learning to process Poisson-distributed spike-trains with firing rates proportional to the intensity of the image pixels. Iyer et al. [22] applied Diehl's model [19] to native AER data; experiments on the N-MNIST dataset [23] show that the method provides an effective unsupervised application on AER event streams. Zheng et al. [24] presented a spiking neural system that uses STDP-based HMAX to extract the spatio-temporal information from spike patterns of the convolved image. Panda et al. [25] presented a regenerative model that learns hierarchical feature maps layer-by-layer in a deep convolutional network using STDP.

Our major contributions can be summarized as follows:
• We propose an unsupervised recognition approach for AER objects, which performs the task using MuST for encoding the AER events and the STDP learning rule of an SNN for object recognition with MuST.
This approach does not require a teaching signal or setting the desired status of neurons in advance, and thus can be flexibly applied in real-world recognition tasks.
• We present MuST, which not only exploits the information contained in the input AER events, but also forms a new representation that is suitable for the recognition mechanism. MuST extracts the spatio-temporal features of AER events based on the LIF neuron model, and forms a feature spike representation which consumes less computational resources while still maintaining comparable performance.
• Extensive experimental analysis shows that our recognition approach, processed in an unsupervised way, can achieve comparable performance to existing supervised solutions.

The rest of this paper is organized as follows. Section II overviews the flow of information processing in this approach. Sections III-IV describe the details of the recognition approach. The experimental results are explained in Section V. In Section VI, we come to our conclusion.

II. OVERVIEW OF THE PROPOSED APPROACH

The proposed AER object recognition consists of three parts, namely Event Flow Segmentation, Multiscale Spatio-Temporal Feature (MuST) and Recognition with STDP, as shown in Fig. 1. We overview the flow of information processing in this approach as follows.

Event Flow Segmentation: Our object recognition approach is driven by raw events from the AER sensor. However, it is still a daunting task to explore how to use each single event as a source of meaningful information [7]. In addition, due to the high temporal resolution of the sensor, the time intervals between two successive events can be very small (100 ns or less). For the efficiency of computation and energy use, existing works [7], [10] heuristically partition events into multiple segments and then perform the feature extraction and recognition based on these segments.
We maintain an event queue to store the input events waiting to be sent to the next layer, and meanwhile apply motion symbol detection (MSD) [10] to adaptively partition the events according to their statistical characteristics, which is more flexible than partition methods based on fixed time slices or fixed event counts. Events from the AER sensor are sent concurrently to the event queue and MSD. MSD consists of a leaky integrate-and-fire (LIF) neuron and a peak detection unit. The neuron receives the stimuli of events and updates its total potential. Peak detection is applied to locate temporal peaks of the neuron's total potential. A peak is detected when many events have occurred intensively, which indicates that enough information has been gathered. Therefore, once a peak is detected, the events in the event queue emitted before the peak time are sent as a segment to the next part.

Multiscale Spatio-Temporal Feature (MuST): The events are sent to the S1 layer, which consists of neurons having their own scale of receptive field and responding best to a certain orientation. S1 neurons accumulate responses which reflect the strength of spatial features. The timing information of events is also recorded in the responses of S1 neurons because of the spontaneous leakage of the neurons. Each neuron is associated with one pixel in the sensor, and neurons of the same receptive scale and orientation are organized into one feature map. Each feature map in S1 is divided into adjacent non-overlapping 2 × 2 neuron regions, and the max neuron responses in each region reach the C1 layer. The neuron responses (features) in the C1 layer are coded into the form of spikes for recognition.
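A rough sketch of the MSD-style segmentation described above: a leaky potential is bumped by each event and decays between events, and a segment is flushed once the activity peak has clearly passed. This is a simplified stand-in for the paper's peak detection, and the parameter values are illustrative only:

```python
import math

def segment_events(timestamps, tau=10.0, inc=1.0, drop=0.5):
    """Partition time-ordered event timestamps into segments.

    The leaky potential gains `inc` per event and decays with time
    constant `tau`. When the potential has fallen to `drop` times its
    running peak, we declare that the burst of activity has passed and
    flush the queued events as one segment.
    """
    segments, queue = [], []
    v, peak, prev_t = 0.0, 0.0, None
    for t in timestamps:
        if prev_t is not None:
            v *= math.exp(-(t - prev_t) / tau)   # exponential leak between events
        if peak > 0 and v < drop * peak:         # activity peak has clearly passed
            segments.append(queue)
            queue, peak = [], 0.0
        v += inc                                  # stimulus from the current event
        peak = max(peak, v)
        queue.append(t)
        prev_t = t
    if queue:
        segments.append(queue)
    return segments

# A dense burst followed by a long gap is split into two segments:
# segment_events([0, 1, 2, 3, 50, 51, 52]) -> [[0, 1, 2, 3], [50, 51, 52]]
```

Unlike fixed time slices, the boundaries here adapt to the event statistics: a fast-moving object producing a dense burst is kept in one segment regardless of its duration.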
The strength of a feature is mapped, in a logarithmic manner, to the timing of a spike by temporal coding, and multiscale features having the same orientation and position in the C1 feature maps are fused into a spike-train flowing to one encoding neuron, forming the MuST representation of the AER events for recognition.

Recognition with STDP: The encoding neurons emit spikes to excite the learning neurons of the SNN. According to STDP, the relative timing of the spikes of a presynaptic encoding neuron and a postsynaptic learning neuron triggers the synaptic weight adjustment during training. The spikes from one learning neuron also inhibit the other learning neurons. This lateral inhibition prevents neurons from learning the same MuST pattern. After training, each learning neuron is assigned a class label based on its sensitivity to patterns of different classes. The final recognition decision for an input pattern is determined by averaging the firing rates of the learning neurons per class and choosing the class with the highest average firing rate.

III. MULTISCALE SPATIO-TEMPORAL FEATURE

Current theory of the cortical mechanism points to a hierarchical and mainly feedforward organization [10]. In the primary visual cortex (V1), two classes of functional cells, simple cells and complex cells, are found [26]. Simple cells respond best to stimuli at a particular orientation, position and phase within their relatively small receptive fields. Complex cells tend to have larger receptive fields and exhibit some tolerance to the exact position within their receptive fields. Further, plasticity and learning certainly occur at the level of the inferotemporal (IT) cortex and the prefrontal cortex (PFC), the top-most layers of the hierarchy [27].
Inspired by the visual processing in the cortex, we introduce the following mechanisms in our recognition approach. 1) We model object recognition as a hierarchy of an S1 layer, a C1 layer, an encoding layer and a learning layer. 2) MuST feature extraction consists of the S1 and C1 layers, composed of simple cells and complex cells respectively. Simple cells combine the input with a bell-shaped tuning function to increase feature selectivity, and complex cells perform a max-pooling operation to increase feature invariance. We use LIF neurons to model the simple and complex cells. The LIF model has been widely used to simulate biological neurons and is inherently good at processing temporal information. The responses of the neurons reflect the strength of the spatio-temporal features with selectivity and invariance. We also propose a coding mechanism from the temporal and spatial perspectives, aiming to form a feature spike representation that better exploits the information in the raw events for recognition. 3) The STDP rule models the learning at the top of the hierarchy and learns sophisticated features of objects, which will be described in detail in the next section.

In this section, we propose the multiscale spatio-temporal feature representation of the raw AER events.

A. Spatio-Temporal Feature Extraction

We conduct the feature extraction using a bio-inspired hierarchical network composed of LIF neurons with a certain receptive scale and orientation, which takes into account both the temporal and spatial information encapsulated in AER events. This network contains two layers, named the S1 layer and the C1 layer, mimicking the simple and complex cells in the primary visual cortex V1 respectively. An event-driven convolution is introduced in the neurons of the S1 layer, and a max-pooling operation is used in the C1 layer.
1) S1 layer: Each event in the segment is sent to the S1 layer, in which the input event is convolved with a group of Gabor filters [10]. The Gabor filter can be described by the following equations:

G(Δx, Δy; σ, λ, θ) = exp(−(X² + γ²Y²) / (2σ²)) cos(2πX / λ)  (1)
X = Δx cos θ + Δy sin θ  (2)
Y = −Δx sin θ + Δy cos θ  (3)

where Δx and Δy are the spatial offsets between the pixel position (x, y) and the event address (e_x, e_y), and γ is the aspect ratio. The wavelength λ and effective width σ are parameters determined by the scale s. Each filter models a neuron cell that has a certain scale s of receptive field and responds best to a certain orientation θ. Each neuron is associated with one pixel in the sensor, and neurons of the same receptive scale and orientation are organized into one feature map. The responses of the neurons in the feature maps are initialized to zero, and then updated by accumulating each element of the filters into the maps at the position specified by the address of each event. The response of the neuron at position (x, y) and time t in the map of scale s and orientation θ can be described as:

r(x, y, t; s, θ) = Σ_{e ∈ E(t)} 1{x ∈ X(e_x)} · 1{y ∈ Y(e_y)} · exp(−(t − e_t) / τ_leak) · G(x − e_x, y − e_y; σ(s), λ(s), θ)  (4)

where E(t) denotes the set of events emitted before time t in the current segment, 1{·} is the indicator function, X(e_x) = [e_x − s, e_x + s] and Y(e_y) = [e_y − s, e_y + s] denote the receptive field of the neuron, and τ_leak denotes the decay time constant. Since the parameters σ and λ in G are determined by s, we herein write σ(s) and λ(s).

Fig. 2. Illustration of the coding mechanism. Four blue squares denote the C1 feature maps of four different scales (s = 3, 5, 7, 9) having the orientation of 45°. Four responses having the same position in these four feature maps are chosen for illustration. These responses are converted to spikes by the logarithmic coding function and then fused into a spike-train. The lighter a pixel looks in the feature map, the higher its response value, and the earlier its corresponding spike timing.

This computation process can also be explained in another way. When the address of the current event e is in the receptive field of a neuron, the response r(x, y, t; s, θ) of the neuron is increased by G(Δx, Δy; s, θ). Otherwise, the neuron response keeps decaying exponentially. The decay dynamics of the response are:

τ_leak · dr(x, y, t; s, θ)/dt = −r(x, y, t; s, θ)  (5)

With the exponential decay, the impact of earlier events on the current responses is reduced, and the precise timing information of each event is captured in the responses.

2) C1 layer: Each feature map in the S1 layer is divided into adjacent non-overlapping 2 × 2 cell regions, namely S1 units. The responses of the C1 cells are obtained by max pooling over the responses in the S1 units. The pooling operation causes competition among the S1 cells inside a unit, and high-response features (considered as representative features) reach the C1 maps. After the pooling operation, the number of cells in the C1 maps is 1/4 of that in the S1 maps. This pooling operation decreases the number of required neurons in later layers and makes the features locally invariant to size and position.

B. Coding to Spike-Trains

The spatio-temporal features in the C1 maps will be coded to spike-trains. A spike carries the information of its timestamp and address. We propose a coding mechanism that converts the strength of a feature to a spike timing by a natural-logarithm temporal coding function, and maps the position of a feature to the address of a spike by multiscale fusion. This procedure is illustrated in Fig. 2, and the details are described as follows.
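Before turning to the coding details, the event-driven S1 convolution and C1 max pooling described above can be sketched for a single scale and orientation; this is a simplified illustration, and the parameter values are placeholders rather than the paper's settings:

```python
import math

def gabor(dx, dy, sigma=2.0, lam=4.0, theta=0.0, gamma=1.0):
    """Gabor filter value at spatial offset (dx, dy), as in Eqs. (1)-(3)."""
    X = dx * math.cos(theta) + dy * math.sin(theta)
    Y = -dx * math.sin(theta) + dy * math.cos(theta)
    return (math.exp(-(X**2 + gamma**2 * Y**2) / (2 * sigma**2))
            * math.cos(2 * math.pi * X / lam))

def s1_response(events, width, height, s=3, tau_leak=50.0, **gabor_kw):
    """Event-driven S1 map in the spirit of Eq. (4), evaluated at the time
    of the last event. `events` is a list of (t, ex, ey); each event adds
    its Gabor weight to every neuron whose receptive field [e - s, e + s]
    covers it, attenuated by exp(-(t_now - e_t) / tau_leak)."""
    t_now = events[-1][0]
    r = [[0.0] * width for _ in range(height)]
    for (et, ex, ey) in events:
        decay = math.exp(-(t_now - et) / tau_leak)
        for y in range(max(0, ey - s), min(height, ey + s + 1)):
            for x in range(max(0, ex - s), min(width, ex + s + 1)):
                r[y][x] += decay * gabor(x - ex, y - ey, **gabor_kw)
    return r

def c1_maxpool(r):
    """C1 layer: max over non-overlapping 2x2 regions of an S1 map."""
    return [[max(r[2*i][2*j], r[2*i][2*j+1], r[2*i+1][2*j], r[2*i+1][2*j+1])
             for j in range(len(r[0]) // 2)]
            for i in range(len(r) // 2)]
```

A full implementation would maintain one such map per (scale, orientation) pair and update the responses incrementally as events arrive, rather than rescanning the event list.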
The feature responses in the C1 maps are used to generate spike timings by a latency coding scheme [14], [15]. Features with the maximum response values, which are considered to activate spikes more easily, correspond to the minimum latency and fire early; features with smaller values fire later or not at all. We focus on finding an appropriate coding function in order to fully utilize the information contained in the features for the subsequent recognition. We randomly choose 1000 samples from the MNIST-DVS dataset and show the distribution of C1 responses in Fig. 3. It can be seen that the distribution of features is heavily skewed. Linear coding functions are used by many existing works [10], [28] to convert these features to spikes for simplicity, but such functions cannot change the distribution of the data, so the temporal distribution of the feature spikes remains skewed. This skewed distribution of feature spikes leads to two problems. 1) Higher-response features have less impact on the recognition process, because the distribution of feature spikes affects that process: the spikes of higher-response features are more sparsely distributed, so the receptive neurons (learning neurons in our approach) can hardly accumulate responses high enough to emit spikes (because of the leakage of the neurons). Therefore, the information in these high-response features cannot be completely transmitted to the receptive neurons and cannot be fully utilized by the recognition process. 2) Considering that the information of features in an SNN is contained in the timings of the feature spikes, two features are considered similar if the timings of their spikes are close. Therefore, it is difficult to distinguish two features whose spikes are densely distributed in a short time window. To solve these problems, feature responses in our approach are inversely mapped to spike timings in a logarithmic manner.
For one specific feature response r in the C1 layer, the corresponding spike timing t_spike can be computed as follows:

t_spike = C(r) = u − v ln(r)  (6)

where u and v are normalizing factors ensuring that the spikes fire within the predefined time window t_w, and C denotes the coding function of the response r. The settings of u and v are as follows: u = t_w ln(r_max) / (ln(r_max) − ln(r_min)) and v = t_w / (ln(r_max) − ln(r_min)), where r_max is the maximum feature response in the training set and r_min is a user-defined minimum threshold, below which responses are ignored. Section V will show the effects of this natural-logarithm coding function.

Fig. 3. The distribution of C1 feature responses on the MNIST-DVS dataset. Each bin of the histogram has a non-overlapping span of 0.1. The height of each bin indicates the average proportion of the C1 responses in the corresponding span. We consider features with responses smaller than 0.2 as noise and ignore them.

Fig. 4. One reconstructed image and its MuST pattern. Left: reconstructed image of the digit "5" from the MNIST-DVS dataset. A black pixel denotes that there is no event at this position, and a white pixel denotes that there is at least one event at this position. Right: the corresponding MuST pattern.

We then attach the address information of the features to Equation (6) and obtain Equation (7). The spikes converted from feature responses r at position (x, y) in the feature maps are written as:

t_spike = C(r | x, y, S, Θ) = u − v ln(r)
s.t. r ∈ {r | r_x = x, r_y = y, r_s ∈ S, r_θ ∈ Θ}  (7)

where r_s and r_θ denote the scale and orientation of r, r_x and r_y denote the position of r in the feature map, S is a set of values of the scale s, and Θ is a set of values of the orientation θ.
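The logarithmic latency coding of Equation (6) can be sketched directly from the definitions of u and v; note that with these factors, r_max maps to time 0 and r_min maps to t_w, so stronger features fire earlier. The numeric values below are illustrative:

```python
import math

def make_coder(r_max, r_min, t_w):
    """Build the latency coder t_spike = u - v*ln(r) of Eq. (6).

    u and v are chosen so that r_max maps to time 0 and r_min maps
    to t_w; responses below r_min are ignored (no spike).
    """
    denom = math.log(r_max) - math.log(r_min)
    u = t_w * math.log(r_max) / denom
    v = t_w / denom
    def code(r):
        if r < r_min:
            return None                 # too weak: the feature never fires
        return u - v * math.log(r)
    return code

code = make_coder(r_max=3.5, r_min=0.2, t_w=100.0)
# code(3.5) == 0.0 (strongest feature fires immediately),
# code(0.2) == 100.0 (weakest admissible feature fires last),
# and intermediate responses fall strictly in between.
```

Because the logarithm compresses the heavy tail of the response distribution, the resulting spike timings are spread more evenly over the window than a linear mapping would produce.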
Unlike artificial neurons, each of which represents information as a real value, a spiking neuron can convey multiple signals in the form of a spike-train, which is more flexible and more informative. Considering this characteristic of spiking neurons, certain features can be fused to make more efficient use of neurons and form a compact representation. Inspired by [27], where features of neighboring scales are combined together, in our implementation feature spikes of multiple scales having the same position and orientation are fused into a spike-train sharing the same spike address. That is, each encoding neuron is in charge of the conversion of multiscale C1 features. The spike-train converted from feature responses r having position (x, y) and orientation θ is comprised of the set of t_spike in Equation (7), where Θ = {θ}. The experiments in Section V will provide the analyses and effects of this multiscale fusion method. Through this encoding scheme, each input segment has its own MuST representation. Fig. 4 shows a reconstructed image of an event segment in the MNIST-DVS dataset [4] and its corresponding MuST representation.

IV. RECOGNITION WITH SPIKE-TIMING-DEPENDENT PLASTICITY

In this part, a network of spiking neurons (SNN) is developed to perform object recognition with MuST. The SNN simulates the fundamental mechanism of the human brain and is good at processing spatio-temporal information. STDP is used here as the unsupervised learning rule of the SNN. Every neuron naturally becomes sensitive to one or some similar input spike patterns through STDP, rather than approaching a desired status as in supervised learning. Due to the flexibility of STDP, it is well suited to our real-world recognition tasks. We describe the network design and the unsupervised learning method as follows.

A. Network Design

The input stimuli of this network are the MuST spike-trains emitted by the encoding neurons. Encoding neurons are fully connected to the learning neurons. These synaptic connections are excitatory and are adjusted in the training procedure. Each learning neuron inhibits all the others through inhibitory synapses with a short delay t_d, and the weights of the inhibitory synapses are set to a predefined value w_inh. This connectivity implements lateral inhibition: once a learning neuron fires a spike, the inhibitory synapses transmit the stimuli to inhibit the other learning neurons. The network design enables each neuron to represent one prototypical pattern or an average of some similar patterns, and prevents a large number of neurons from representing only a few patterns.

During training, the weights of all excitatory synapses are first initialized with random values and are then updated using STDP. When the training is finished, we assign a class to each neuron based on its highest response to the different classes over one presentation of the training set. Only in this class-assignment step are the labels used; for the training of the network, we do not use any label information. During the testing phase, the predicted class for an input pattern is determined by averaging the firing rates of the neurons per class and then choosing the class with the highest average firing rate.

B. STDP Learning Rule

STDP is a biological process that adjusts the weights of connections between neurons. Considering that both the encoding and learning neurons emit multiple spikes, we employ the triplet STDP model [29], which is based on interactions of the relative timing of three spikes (triplets). Besides, triplet STDP has shown a computational advantage over standard STDP, since it is sensitive to input patterns consisting of higher-order spatio-temporal spike pattern correlations [30]. The LIF model is chosen to describe the neural dynamics [19].
The membrane voltage V of a neuron is described as:

τ dV/dt = V_rest − V + g_e(E_exc − V) + g_i(E_inh − V)  (8)

where τ is the membrane time constant of the postsynaptic neuron, V_rest the resting membrane potential, E_exc and E_inh the equilibrium potentials of the excitatory and inhibitory synapses, and g_e and g_i the conductance variables of the excitatory and inhibitory synapses, respectively. The conductance is increased by the synaptic weight w at the time a presynaptic spike arrives; otherwise the conductance keeps decaying exponentially. If the synapse is excitatory, the decay dynamics of the conductance g_e are:

τ_ge dg_e/dt = −g_e  (9)

where τ_ge is the time constant of an excitatory postsynaptic potential. If the synapse is inhibitory, g_i is updated using the same equation but with the time constant of the inhibitory postsynaptic potential, τ_gi.

Fig. 5. Two conditions of the triplet STDP rule. Left: synaptic depression is induced using one postsynaptic trace when a presynaptic spike arrives. Right: synaptic potentiation is induced using the post- and presynaptic traces when a postsynaptic spike arrives.

When a neuron's membrane potential is higher than its threshold V_thr, the neuron fires a spike and its membrane potential is reset to V_reset. An adaptive membrane threshold [19], [31] is employed to prevent a single learning neuron from dominating the response pattern. When the neuron fires a spike, the threshold V_thr is increased by V_plus. Otherwise, the threshold V_thr is described as:

τ_thr dV_thr/dt = V_t − V_thr  (10)

where V_t denotes the predefined membrane threshold. By incorporating this method, the more spikes a neuron fires, the higher its membrane threshold becomes.

The weight dynamics are computed using synaptic traces, which model the recent spike history.
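A discrete-time sketch of the conductance-based LIF dynamics of Equations (8)–(10), using forward-Euler updates; all numeric constants here are illustrative placeholders, not the paper's settings, and the inhibitory conductance term is omitted for brevity:

```python
class LIFNeuron:
    """Conductance-based LIF neuron with an adaptive threshold (Eqs. 8-10)."""

    def __init__(self, tau=100.0, tau_ge=5.0, tau_thr=1e4,
                 v_rest=-65.0, v_reset=-70.0, v_t=-52.0,
                 e_exc=0.0, v_plus=0.05):
        self.tau, self.tau_ge, self.tau_thr = tau, tau_ge, tau_thr
        self.v_rest, self.v_reset, self.v_t = v_rest, v_reset, v_t
        self.e_exc, self.v_plus = e_exc, v_plus
        self.v, self.g_e, self.v_thr = v_rest, 0.0, v_t

    def receive(self, w):
        """A presynaptic spike arrives: bump the excitatory conductance by w."""
        self.g_e += w

    def step(self, dt=0.5):
        """Advance the dynamics by dt; return True if the neuron fired."""
        # Eq. (8) without the inhibitory term, forward Euler:
        dv = (self.v_rest - self.v + self.g_e * (self.e_exc - self.v)) / self.tau
        self.v += dv * dt
        self.g_e -= self.g_e / self.tau_ge * dt                     # Eq. (9)
        self.v_thr += (self.v_t - self.v_thr) / self.tau_thr * dt   # Eq. (10)
        if self.v > self.v_thr:
            self.v = self.v_reset
            self.v_thr += self.v_plus   # adaptive bump: frequent firers get harder to fire
            return True
        return False
```

Driving the neuron with a strong input spike makes it fire within a few steps, after which its threshold sits slightly above the baseline V_t.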
Each synapse keeps track of one presynaptic trace $a_{pre}$ and two postsynaptic traces $a_{post}$ and $a_{post2}$. For simplicity, we use the nearest-spike interaction. As shown in Fig. 5, every time a presynaptic spike arrives at the synapse, $a_{pre}$ is set to 1; otherwise $a_{pre}$ decays exponentially. The decay dynamics of the trace $a_{pre}$ are:

$$\tau_{a_{pre}} \frac{da_{pre}}{dt} = -a_{pre} \qquad (11)$$

where $\tau_{a_{pre}}$ is the time constant of trace $a_{pre}$. The postsynaptic traces $a_{post}$ and $a_{post2}$ work the same way as the presynaptic trace, but their updates are triggered by a postsynaptic spike and they decay with the time constants $\tau_{a_{post}}$ and $\tau_{a_{post2}}$, respectively. When a presynaptic spike arrives at the synapse, the weight is updated based on the postsynaptic trace:

$$\Delta w = A_{-} a_{post} \qquad (12)$$

where $A_{-}$ is the learning rate for the presynaptic spike. When a postsynaptic spike arrives at the synapse, the weight change $\Delta w$ is:

$$\Delta w = A_{+} a_{pre} a_{post2} \qquad (13)$$

where $A_{+}$ is the learning rate. Since the weights are not restricted to a range, weight normalization [32], which keeps the sum $L$ of the synaptic weights connected to each learning neuron unchanged, is used to ensure an equal use of the neurons:

$$\hat{w}_{ij} = \frac{w_{ij}}{\sum_{k=1}^{n_e} w_{kj}} L \qquad (14)$$

Fig. 6. Some reconstructed images from the used datasets. (a): POKER-DVS dataset. (b): AER Posture dataset (rows from top to bottom represent BEND, SITSTAND and WALK, respectively). (c): GESTURE-DVS dataset. (d): MNIST-DVS dataset. (e): NMNIST dataset.

where $w_{ij}$ is the synaptic weight from encoding neuron $i$ to learning neuron $j$, $\hat{w}_{ij}$ is the normalized $w_{ij}$, and $n_e$ is the number of encoding neurons.

V. EXPERIMENTAL RESULTS

In this section, we evaluate the performance of our proposed approach on AER datasets and compare it with other AER recognition methods.

A.
Datasets

Five different datasets are used in this paper to analyze the performance: the POKER-DVS dataset [7], [33], MNIST-DVS dataset [4], NMNIST dataset [23], AER Posture dataset [10], and GESTURE-DVS dataset. Fig. 6 shows some samples of these five datasets.

1) POKER-DVS dataset: It contains 100 samples divided from an event stream of poker card symbols with a spatial resolution of 32 × 32. It consists of four symbols: club, diamond, heart, and spade.

2) MNIST-DVS dataset: It is obtained with a DVS sensor by recording 10,000 original handwritten MNIST images moving with slow motion. Due to the motion during recording, the digit appearances in this dataset have far greater variation than in the MNIST dataset, which makes the recognition task more challenging. The full length of each recording is about 2000 ms and the spatial resolution is 28 × 28.

3) NMNIST dataset: It is obtained by moving an ATIS camera in front of the original MNIST images. It consists of 60,000 training and 10,000 testing samples. The spatial resolution is 34 × 34.

4) AER Posture dataset: It contains 191 BEND, 175 SITSTAND, and 118 WALK actions with a spatial resolution of 32 × 32.

5) GESTURE-DVS dataset: We collected this dataset to further verify the robustness of our approach. We first made a fist above the scope of the DVS sensor and then swung the hand down to deliver a gesture, recording the events triggered by the hand moving down. The dataset contains three gestures: rock (a closed fist), paper (a flat hand), and scissors (a fist with the index and middle fingers extended, forming a V). Each gesture is delivered 40 times in total, and each recording has a duration of 50 ms. The events are captured by the DVS sensor at a resolution of 128 × 128 pixels and scaled to 32 × 32 pixels in the data preprocessing.

B.
Benchmark Methods

We compare our approach with three other recently proposed AER recognition methods. The first, proposed by Zhao et al. [10], extracts features through a convolution-based network and performs recognition with a tempotron classifier; the tempotron is a supervised SNN learning rule that specifies the desired firing status of each neuron. The second, named BOE, was proposed by Peng et al. [7]; it uses a probability-based feature extraction method and a support vector machine with a linear kernel as the classifier. The third, named HFirst, was proposed by Orchard et al. [12]; it employs a spiking hierarchical feature extraction model and a classifier based on spike times. We obtained the source code of these benchmark methods from their authors.

C. Experiment Settings

The experiments are run on a workstation with two Xeon E5 2.1 GHz CPUs and 128 GB RAM. We use MATLAB to simulate Event Flow Segmentation and MuST, and the BRIAN simulator [34] to implement the SNN for recognition. We randomly partition each dataset into training and testing parts. The result is obtained over multiple runs with different training and testing partitions, and we report the final results as mean accuracy and standard deviation. For a fair comparison, the results listed in TABLE I are obtained under the same experimental settings. The results of the benchmark methods are taken from the original papers [7], [10], or (if not reported there) from experiments using the code of [7], [10], [12] with our optimization.

TABLE I
RECOGNITION PERFORMANCE ON FIVE DATASETS.
Model        | POKER-DVS | MNIST-DVS 100 ms | MNIST-DVS 200 ms | MNIST-DVS full length | NMNIST  | AER Posture | GESTURE-DVS
Zhao's [10]  | 93.00%    | 76.86%           | 82.61%           | 88.14%                | 85.60%  | 99.48%      | 90.50%
BOE [7]      | 93.00%    | 74.60%           | 78.74%           | 72.04%                | 70.43%  | 98.66%      | 88.97%
HFirst [12]  | 94.00%    | 55.77%           | 61.96%           | 78.13%                | 71.15%  | 94.48%      | 84.75%
Our Work     | 99.00%    | 79.25%           | 83.30%           | 89.96%                | 89.70%  | 99.58%      | 95.75%

TABLE II
THE PARAMETERS OF GABOR FILTERS.

scale s           | 3   | 5   | 7   | 9
effective width σ | 1.2 | 2.0 | 2.8 | 3.6
wavelength λ      | 1.5 | 2.5 | 3.5 | 4.6
orientation θ     | 0°, 45°, 90°, 135°
aspect ratio γ    | 0.3

The constant parameter settings in our approach are summarized here. We choose four orientations (0°, 45°, 90°, 135°) and a range of sizes from 3 × 3 to 9 × 9 pixels with strides of two pixels for the Gabor filters; the detailed settings are listed in TABLE II. These parameter settings have proved solid for visual feature capturing and are inherited in many works [10], [12], [28]. The time constant of the feature response τ_leak is set according to the time length of the symbols in each dataset: 10 ms for POKER-DVS, 100 ms for MNIST-DVS, 30 ms for NMNIST, 100 ms for AER Posture, and 50 ms for GESTURE-DVS. The time window t_w and threshold r_min in the coding function are set to 500 ms and 0.2, respectively. The parameters of the neuron model in the recognition layer are: V_rest = −65 mV, E_exc = 0 mV, E_inh = −100 mV, τ = 100 ms, V_t = −63.5 mV, V_plus = 0.07 mV, τ_thr = 1e7 ms. The parameters in STDP are: τ_apre = 20 ms, τ_apost = 30 ms, τ_apost2 = 40 ms, A_+ = 0.1, A_− = 0.001. The other parameters in the recognition layer are set as follows: the inhibitory weight w_inh is set to 2.4 and the delay time t_d to 0.3 ms.
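The Gabor filter bank of TABLE II can be sketched as follows, using the standard real-valued Gabor form common to the works cited above ([10], [12], [27]); the exact expression used by the paper is not reproduced in this excerpt, so this form is an illustrative assumption.

```python
import numpy as np

# TABLE II: four scales with matched sigma/lambda, four orientations, gamma = 0.3.
SCALES = {3: (1.2, 1.5), 5: (2.0, 2.5), 7: (2.8, 3.5), 9: (3.6, 4.6)}
THETAS = [0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
GAMMA = 0.3

def gabor(size, sigma, lam, theta, gamma=GAMMA):
    """size x size real-valued Gabor kernel (standard cosine-carrier form)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    X = x * np.cos(theta) + y * np.sin(theta)    # rotate to orientation theta
    Y = -x * np.sin(theta) + y * np.cos(theta)
    return (np.exp(-(X**2 + (gamma * Y)**2) / (2 * sigma**2))
            * np.cos(2 * np.pi * X / lam))

# 4 scales x 4 orientations = 16 filters.
bank = [gabor(s, sig, lam, th)
        for s, (sig, lam) in SCALES.items() for th in THETAS]
```

Convolving the event-driven feature maps with each filter in `bank` yields the 16 response maps from which MuST spikes are generated.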
According to the number of samples, the number of learning neurons for the POKER-DVS, MNIST-DVS, NMNIST, AER Posture, and GESTURE-DVS datasets is set to 60, 700, 1200, 600, and 60, respectively. Due to the different spatial resolutions, the parameter L in weight normalization is set to 37.5 for the MNIST-DVS dataset, 54.0 for the NMNIST dataset, and 47.0 for the POKER-DVS, AER Posture, and GESTURE-DVS datasets.

D. Performance on Different AER Datasets

1) On POKER-DVS dataset: For each category of this dataset, 90% of the samples are randomly selected for training and the rest are used for testing. We obtain the average performance by repeating the experiments 100 times. Our approach achieves a recognition accuracy of 99.00% on average, with a standard deviation of 3.84%. TABLE I shows that our approach outperforms Zhao's method [10], BOE [7], and HFirst [12] by margins of 6.00%, 6.00%, and 5.00%, respectively.

2) On MNIST-DVS dataset: This dataset has 10,000 symbols, 90% of which are randomly selected for training; the remaining ones are used for testing. The performance is averaged over 10 runs. The experiments are conducted on recordings with the first 100 ms, the first 200 ms, and the full length (about 2000 ms), respectively. Fig. 7 shows, for the 100 ms recordings, the correct recognition rate of each digit along the diagonal and the confusions elsewhere. Digit 1 gets the highest accuracy of 96.67% because of its simple stroke. Confusions occur mostly between digits 7 and 9. As can be noticed in Fig. 6, the difference between these two digits is an extra horizontal stroke in 9, connected to the stroke above; hence, the learning neurons representing 9 are likely to fire when the input pattern is 7. Overall, our approach achieves recognition accuracies of 79.25%, 83.30%, and 89.96% on the recordings of 100 ms, 200 ms, and full length, and the performance improves with longer recordings.
Further, TABLE I shows that our approach consistently outperforms the other methods on recordings of every time length.

3) On NMNIST dataset: This dataset is inherited from MNIST and has been partitioned into 60,000 training samples and 10,000 testing samples by default. MNIST-DVS and NMNIST are both derived from the original frame-based MNIST dataset; however, while MNIST-DVS is recorded by moving the MNIST images with slow motion, NMNIST is captured by moving the AER sensor, so the obtained event streams of the two datasets are not the same. Our approach achieves a recognition accuracy of 89.70%. TABLE I shows that the recognition performance of our approach is higher than that of Zhao's method [10], BOE [7], and HFirst [12]. In addition, compared with Iyer & Basu's unsupervised model [22] on NMNIST, which achieves an accuracy of 80.63%, our approach gives a higher accuracy of 89.70%.

Fig. 7. Average confusion matrix of the testing results over 10 runs of MNIST-DVS 100 ms dataset. Diagonal accuracies for digits 0-9: 91.66%, 96.67%, 79.87%, 76.54%, 79.68%, 82.05%, 86.88%, 73.04%, 64.84%, 61.72%.

4) On AER Posture dataset: In this dataset, we randomly select 80% of the human actions for training and the others for testing. The experiments are repeated 10 times to obtain the average performance. The results are listed in TABLE I. The recognition accuracy obtained by our approach is 99.58%, which is comparable to Zhao's [10] and higher than BOE [7] and HFirst [12].

5) On GESTURE-DVS dataset: This dataset has 3 categories, each with 40 samples. For each category, 90% of the samples are randomly selected for training and the others for testing. We perform the experiments 100 times and average the performance. In this dataset, the position of the hand in each sample is not constant, and some portion of the player's forearm is sometimes recorded; this randomness increases the difficulty of the recognition task. Our approach achieves a recognition accuracy of 95.75%, with a standard deviation of 5.49%.
TABLE I shows that our approach outperforms Zhao's method [10], BOE [7], and HFirst [12] by margins of 5.25%, 6.78%, and 11.00%, respectively.

E. Analyses of the MuST

In this section, we carry out experiments to analyze the effects of MuST from two aspects: the temporal coding function and the spatial fusion method. The experiments are conducted on the POKER-DVS dataset, the AER Posture dataset, 1,000 samples of the MNIST-DVS 100 ms dataset, and the GESTURE-DVS dataset. For each dataset, the experiment settings are the same as in the previous section.

TABLE III
ACCURACY WITH LINEAR CODING FUNCTION AND NATURAL LOGARITHM CODING FUNCTION.

Method        | POKER  | Posture | MNIST  | GESTURE
Linear Coding | 95.25% | 96.69%  | 73.30% | 95.25%
Log Coding    | 99.00% | 99.58%  | 76.90% | 95.75%

1) Effects of the temporal coding function: We compare the performance of the approach using the conventional linear coding function [10], [28] and the proposed natural logarithm coding function. The linear coding function is set as t_spike = −ar + b, where a = t_w / r_max and b = t_w. As shown in TABLE III¹, the proposed logarithm coding function achieves higher performance than the linear one on three datasets. As reported in Fig. 8, with the linear coding function the feature spikes emitted early have a quite sparse temporal distribution; for example, the linear coding function generates only approximately 8% of the spikes before 400 ms. Such sparsely distributed feature spikes can hardly accumulate enough potential in the learning neurons to make them emit spikes, so the information in these feature spikes cannot be transmitted to the learning neurons. As the temporal distribution of spikes becomes denser, the potential of the learning neurons becomes higher and they gradually emit more spikes. According to STDP learning, the synaptic weight is updated when there is a presynaptic or a postsynaptic spike; therefore, the learning and the recognition are mostly affected by the feature spikes emitted later. As can be seen in Fig. 8, the proposed logarithm coding function evens the temporal distribution of spikes, so that the features can be used equally to a large extent.

¹ Due to the space limit, the names of the datasets are abbreviated accordingly.

Fig. 8. Spike timing distributions with the linear coding function and the natural logarithm coding function on the MNIST-DVS 100 ms dataset. Each bin has a nonoverlapping temporal span of 20 ms, and the time window t_w is 500 ms. The height of each bin indicates the average proportion of the spike timings in the corresponding time span. The natural logarithm coding function evens the distribution of the timings of feature spikes. The information entropies H_linear and H_log are 2.21 and 3.99, respectively.

The analysis can also be given from another aspect. Considering that the information in an SNN is represented by the spike timings, we evaluate the information carried by the feature spikes generated by the two coding functions using the information entropy of the spikes; a higher information entropy means the corresponding feature representation contains more information about the features. The information entropy is calculated as:

$$H = -\sum_i p_i \log_2 p_i \qquad (15)$$

where $p_i$ denotes the portion of spikes located within the $i$-th temporal bin. The information entropy $H_{log}$ of the feature spikes generated by the natural logarithm coding function is 3.99, which is higher than the $H_{linear}$ of 2.21 obtained with the linear coding function. This suggests that the obtained MuST feature representation is more informative and that the proposed natural logarithm coding function conveys more information about the features into the spikes, which benefits the recognition using the SNN.
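The two coding functions and the entropy measure of Eq. (15) can be sketched as follows. The linear function is the one given above; the exact natural-logarithm form (Eq. (7)) is not reproduced in this excerpt, so `log_coding` below is an assumed illustrative variant that maps r_max to time 0 and r_min to t_w.

```python
import numpy as np

T_W, R_MIN = 500.0, 0.2  # time window (ms) and response threshold from Section V-C

def linear_coding(r, r_max):
    """Conventional linear coding [10], [28]: t_spike = -a*r + b,
    with a = t_w / r_max and b = t_w (stronger response -> earlier spike)."""
    return -(T_W / r_max) * r + T_W

def log_coding(r, r_max):
    """Assumed natural-logarithm coding (illustrative, not the paper's Eq. (7)):
    stretches weak responses toward earlier spike times."""
    return T_W * (1.0 - np.log(r / R_MIN) / np.log(r_max / R_MIN))

def spike_timing_entropy(spike_times, t_w=T_W, bin_ms=20.0):
    """Eq. (15): entropy (bits) of the spike-timing distribution over fixed bins."""
    counts, _ = np.histogram(spike_times, bins=int(t_w / bin_ms), range=(0.0, t_w))
    p = counts / counts.sum()
    p = p[p > 0]  # 0 * log(0) is treated as 0
    return float(-(p * np.log2(p)).sum())
```

A perfectly even distribution over the 25 bins of Fig. 8 would give the maximum entropy log2(25) ≈ 4.64 bits, against which H_log = 3.99 and H_linear = 2.21 can be compared.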
2) Effects of the spatial fusion method: In our approach, the spatial features of AER events are extracted along two aspects, scales and orientations. There are four spatial fusion options for the features: multiscale fusion, multi-orientation fusion, no fusion, and full fusion. We compare our approach with variants using full fusion, multi-orientation fusion, and no fusion in place of multiscale fusion.

TABLE IV
ACCURACY AND REQUIRED PARAMETERS WITH FOUR FUSION METHODS.

Dataset     | Fusion Method            | Accuracy | Params
POKER-DVS   | Multiscale Fusion        | 99.00%   | 0.43M
            | Multi-Orientation Fusion | 94.50%   | 0.43M
            | No Fusion                | 96.63%   | 1.72M
            | Full Fusion              | 85.50%   | 0.11M
AER Posture | Multiscale Fusion        | 99.58%   | 4.67M
            | Multi-Orientation Fusion | 99.00%   | 4.67M
            | No Fusion                | 92.17%   | 17.57M
            | Full Fusion              | 90.56%   | 1.44M
MNIST-DVS   | Multiscale Fusion        | 76.90%   | 4.34M
            | Multi-Orientation Fusion | 57.97%   | 4.34M
            | No Fusion                | 75.62%   | 15.87M
            | Full Fusion              | 54.64%   | 1.46M
GESTURE-DVS | Multiscale Fusion        | 95.75%   | 0.43M
            | Multi-Orientation Fusion | 90.83%   | 0.43M
            | No Fusion                | 73.25%   | 1.72M
            | Full Fusion              | 80.58%   | 0.11M

Multiscale fusion fuses the features of multiple scales having the same orientation θ and position (x, y) in the feature maps into one spike-train for an encoding neuron, comprised of the set of t_spike in Equation (7) where S = {3, 5, 7, 9} and Θ = {θ}. Multi-orientation fusion fuses the features of multiple orientations having the same scale s and position (x, y) into a spike-train, comprised of the t_spike in Equation (7) where S = {s} and Θ = {0°, 45°, 90°, 135°}. No fusion does not fuse any feature spikes; the feature spike having scale s, orientation θ, and position (x, y) can be expressed using Equation (7) where S = {s} and Θ = {θ}. Full fusion fuses all feature spikes having the same position (x, y) into one spike-train, comprised of the t_spike in Equation (7) where S = {3, 5, 7, 9} and Θ = {0°, 45°, 90°, 135°}.
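A minimal sketch of multiscale fusion: the spike times of all scales S = {3, 5, 7, 9} at one (θ, x, y) are merged into a single spike-train feeding one encoding neuron. The dictionary layout keyed by (s, θ, x, y) is a hypothetical container of our own, not the paper's data structure.

```python
def multiscale_fusion(spikes, theta, x, y, scales=(3, 5, 7, 9)):
    """Merge the feature spike times of all scales at one (theta, x, y)
    into one sorted spike-train (S = {3, 5, 7, 9}, Theta = {theta}).

    spikes: dict mapping (s, theta, x, y) -> list of spike times in ms.
    """
    train = []
    for s in scales:
        train.extend(spikes.get((s, theta, x, y), []))
    return sorted(train)
```

Multi-orientation fusion would instead iterate over the orientation set at a fixed scale, and full fusion over both sets, which is how the four variants of TABLE IV differ.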
TABLE IV reports the recognition accuracy and the required number of parameters of these four methods. We give the analyses via three comparisons.

First, both multi-orientation fusion and multiscale fusion fuse the features along their corresponding aspect, and they require the same number of parameters since the numbers of scales and orientations are equal in our settings. Yet multi-orientation fusion yields a lower performance, as shown in TABLE IV. An important factor affecting the result of these two fusion methods is the correlation among the data sources. A high correlation between features implies that they contain similar information, while a low correlation means the features have richer diversity. Ideally, highly correlated features are fused into one neuron, while weakly correlated features are separated onto different neurons, so that the learning neurons can distinguish the various patterns of the fused spikes more easily.

We use the correlation coefficient (CC) to measure the correlation and randomly choose 1,000 samples of the MNIST-DVS 100 ms dataset for illustration. For the i-th sample, the CC between scales is obtained by averaging the Pearson CCs of pairwise scale maps having the same orientation:

$$CC_i^s = \frac{\sum_{\theta=1}^{n_\theta} \sum_{s=1}^{n_s} \sum_{s'=s+1}^{n_s} \rho\big(r(s,\theta),\, r(s',\theta)\big)}{M} \qquad (16)$$

Fig. 9. The normalized histograms of the correlation coefficient (CC), and their fitted probability density functions. Each CC value is derived from a pair of response series of different scales or of different orientations, and the distribution consists of the CCs for every possible pair. Each bin has a nonoverlapping span of 0.02, and the height of each bin indicates the density of the CC values in the corresponding span. CC between orientations is on average lower than CC between scales.
where r(s, θ) represents the vector of C1 responses at scale s and orientation θ, ρ(A, B) denotes the Pearson correlation coefficient between vectors A and B, and M = n_θ · C(n_s, 2) denotes the number of pairs of feature vectors. The CC between orientations is obtained in the same way but with pairwise orientation maps having the same scale. As Fig. 9 shows, the CC between orientations is on average lower than the CC between scales: only about 3% of the CC values between scales are less than 0.5, but approximately 74% of the CC values between orientations are. This demonstrates that features of different orientations have lower correlation than those of different scales. Multi-orientation fusion thus brings diverse information together into one neuron and separates similar information onto different neurons, making the fused spike patterns hard for the recognition network to learn, which results in a lower performance.

Second, we notice that the method without fusion maintains relatively high accuracies on three datasets but requires a larger number of parameters in the recognition part. Without fusion, each encoding neuron represents one specific spatio-temporal feature, so, as shown in TABLE IV, this method requires larger computational resources. Nevertheless, multiscale fusion achieves a competitive result with more efficient resource usage, which is well suited for resource-constrained neuromorphic devices.

Third, full fusion fuses all the spatio-temporal features of a position and obtains the worst result on three datasets. Its fusion degree is higher than that of the other three methods; although it requires the least computational resources, it faces a severe limitation of feature expression and therefore has a poor recognition accuracy.

VI. CONCLUSION

In this paper, we propose an unsupervised recognition approach for AER objects.
The proposed approach presents a MuST representation for encoding AER events and employs STDP for object recognition with MuST. MuST exploits the spatio-temporal information encapsulated in the AER events and forms a feature representation that benefits the subsequent recognition. Experimental results show the effects of MuST from both temporal and spatial perspectives. MuST, with its even temporal distribution, has been shown to be informative and to improve the recognition performance. MuST also fuses highly correlated features, forming a compact spike representation that consumes less computational resource while maintaining comparable performance. The recognition process employs an SNN trained by triplet STDP, which requires neither a teaching signal nor setting the desired status of neurons. Compared with other state-of-the-art supervised benchmark methods, our approach yields comparable or even better performance on five AER datasets, including a new dataset named GESTURE-DVS that further verifies the robustness of our approach.

REFERENCES

[1] G. Indiveri and S.-C. Liu, "Memory and information processing in neuromorphic systems," Proceedings of the IEEE, vol. 103, no. 8, pp. 1379-1397, 2015.
[2] D. Monroe, "Neuromorphic computing gets ready for the (really) big time," Communications of the ACM, vol. 57, no. 6, pp. 13-15, 2014.
[3] C. Posch, D. Matolin, and R. Wohlgenannt, "A QVGA 143 dB dynamic range frame-free PWM image sensor with lossless pixel-level video compression and time-domain CDS," IEEE Journal of Solid-State Circuits, vol. 46, no. 1, pp. 259-275, 2011.
[4] P. Lichtsteiner, C. Posch, and T. Delbruck, "A 128 × 128 120 dB 15 µs latency asynchronous temporal contrast vision sensor," IEEE Journal of Solid-State Circuits, vol. 43, no. 2, pp. 566-576, 2008.
[5] J. A. Leñero-Bardallo, T. Serrano-Gotarredona, and B.
Linares-Barranco, "A 3.6 µs latency asynchronous frame-free event-driven dynamic-vision-sensor," IEEE Journal of Solid-State Circuits, vol. 46, no. 6, pp. 1443-1455, 2011.
[6] C. Brandli, R. Berner, M. Yang, S.-C. Liu, and T. Delbruck, "A 240 × 180 130 dB 3 µs latency global shutter spatiotemporal vision sensor," IEEE Journal of Solid-State Circuits, vol. 49, no. 10, pp. 2333-2341, 2014.
[7] X. Peng, B. Zhao, R. Yan, H. Tang, and Z. Yi, "Bag of events: An efficient probability-based feature extraction method for AER image sensors," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 4, pp. 791-803, 2017.
[8] T. Serre, A. Oliva, and T. Poggio, "A feedforward architecture accounts for rapid categorization," Proceedings of the National Academy of Sciences, vol. 104, no. 15, pp. 6424-6429, 2007.
[9] S. Chen, P. Akselrod, B. Zhao, J. A. P. Carrasco, B. Linares-Barranco, and E. Culurciello, "Efficient feedforward categorization of objects and human postures with address-event image sensors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 2, pp. 302-314, 2012.
[10] B. Zhao, R. Ding, S. Chen, B. Linares-Barranco, and H. Tang, "Feedforward categorization on AER motion events using cortex-like features in a spiking neural network," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 9, pp. 1963-1978, 2015.
[11] X. Lagorce, G. Orchard, F. Galluppi, B. E. Shi, and R. B. Benosman, "HOTS: A hierarchy of event-based time-surfaces for pattern recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 7, pp. 1346-1359, 2017.
[12] G. Orchard, C. Meyer, R. Etienne-Cummings, C. Posch, N. Thakor, and R. Benosman, "HFirst: A temporal approach to object recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 10, pp. 2028-2040, 2015.
[13] R.-M. Memmesheimer, R. Rubin, B. P. Ölveczky, and H.
Sompolinsky, "Learning precisely timed spikes," Neuron, vol. 82, no. 4, pp. 925-938, 2014.
[14] S. Panzeri, N. Brunel, N. K. Logothetis, and C. Kayser, "Sensory neural codes using multiplexed temporal scales," Trends in Neurosciences, vol. 33, no. 3, pp. 111-120, 2010.
[15] J. Hu, H. Tang, K. C. Tan, and H. Li, "How the brain formulates memory: A spatio-temporal model research frontier," IEEE Computational Intelligence Magazine, vol. 11, no. 2, pp. 56-68, 2016.
[16] D. A. Butts, C. Weng, J. Jin, C.-I. Yeh, N. A. Lesica, J.-M. Alonso, and G. B. Stanley, "Temporal precision in the neural code and the timescales of natural vision," Nature, vol. 449, no. 7158, p. 92, 2007.
[17] Q. Yu, H. Tang, K. C. Tan, and H. Li, "Rapid feedforward computation by temporal encoding and learning with spiking neurons," IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 10, pp. 1539-1552, 2013.
[18] T. Zhang, Y. Zeng, D. Zhao, and M. Shi, "A plasticity-centric approach to train the non-differential spiking neural networks," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[19] P. U. Diehl and M. Cook, "Unsupervised learning of digit recognition using spike-timing-dependent plasticity," Frontiers in Computational Neuroscience, vol. 9, p. 99, 2015.
[20] Y. Ma, R. Xiao, and H. Tang, "An event-driven computational system with spiking neurons for object recognition," in International Conference on Neural Information Processing. Springer, 2017, pp. 453-461.
[21] G.-q. Bi and M.-m. Poo, "Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type," Journal of Neuroscience, vol. 18, no. 24, pp. 10464-10472, 1998.
[22] L. R. Iyer and A. Basu, "Unsupervised learning of event-based image recordings using spike-timing-dependent plasticity," in 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017, pp. 1840-1846.
[23] G. Orchard, A. Jayawant, G. K. Cohen, and N. Thakor, "Converting static image datasets to spiking neuromorphic datasets using saccades," Frontiers in Neuroscience, vol. 9, p. 437, 2015.
[24] Y. Zheng, S. Li, R. Yan, H. Tang, and K. C. Tan, "Sparse temporal encoding of visual features for robust object recognition by spiking neurons," IEEE Transactions on Neural Networks and Learning Systems, no. 99, pp. 1-11, 2018.
[25] P. Panda and K. Roy, "Unsupervised regenerative learning of hierarchical features in spiking deep networks for object recognition," in 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016, pp. 299-306.
[26] D. H. Hubel and T. N. Wiesel, "Receptive fields, binocular interaction and functional architecture in the cat's visual cortex," The Journal of Physiology, vol. 160, no. 1, pp. 106-154, 1962.
[27] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, "Robust object recognition with cortex-like mechanisms," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 3, pp. 411-426, 2007.
[28] D. Liu and S. Yue, "Fast unsupervised learning for visual pattern recognition using spike timing dependent plasticity," Neurocomputing, vol. 249, pp. 212-224, 2017.
[29] J.-P. Pfister and W. Gerstner, "Triplets of spikes in a model of spike timing-dependent plasticity," Journal of Neuroscience, vol. 26, no. 38, pp. 9673-9682, 2006.
[30] J. Gjorgjieva, C. Clopath, J. Audet, and J.-P. Pfister, "A triplet spike-timing-dependent plasticity model generalizes the Bienenstock-Cooper-Munro rule to higher-order spatiotemporal correlations," Proceedings of the National Academy of Sciences, vol. 108, no. 48, pp. 19383-19388, 2011.
[31] W. Zhang and D. J. Linden, "The other side of the engram: Experience-driven changes in neuronal intrinsic excitability," Nature Reviews Neuroscience, vol. 4, no. 11, p. 885, 2003.
[32] G. J. Goodhill and H. G.
Barrow, "The role of weight normalization in competitive learning," Neural Computation, vol. 6, no. 2, pp. 255-269, 1994.
[33] J. A. Pérez-Carrasco, B. Zhao, C. Serrano, B. Acha, T. Serrano-Gotarredona, S. Chen, and B. Linares-Barranco, "Mapping from frame-driven to frame-free event-driven vision systems by low-rate rate coding and coincidence processing: Application to feedforward ConvNets," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2706-2719, 2013.
[34] D. F. Goodman and R. Brette, "The Brian simulator," Frontiers in Neuroscience, vol. 3, p. 26, 2009.