In this work we present a novel internal-clock-based space-time neural network for motion speed recognition. The developed system consists of a spike train encoder, a Spiking Neural Network (SNN) with internal clocking behaviours, a pattern transformation block and a Network Dynamic Dependent Plasticity (NDDP) learning block. The core principle is that the SNN automatically tunes its network pattern frequency (internal clock frequency) to recognize human motions in the speed domain. Using both cartoon and real-world videos as training benchmarks, we demonstrate that the system can recognize not only motions with considerable speed differences (e.g. run, walk, jump, wonder (think) and standstill) but also motions with subtle speed gaps, such as run and fast walk. The inference accuracy reaches up to 83.3% on cartoon videos and 75% on real-world videos. Moreover, the system requires only six videos in the learning stage and at most 42 training trials. Hardware performance estimation indicates a training time of 0.84-4.35 s and a power consumption of 33.26-201 mW (based on an ARM Cortex-M4 processor). The system therefore offers the combined learning advantages of a small training dataset, quick learning and low power consumption, showing great potential for edge or scalable AI-based applications.
Nowadays Artificial Neural Networks (ANNs) [1] have achieved huge successes and have become one of the key factors driving the next industrial revolution, acting as a game-changer in fields such as face recognition [2], autonomous driving and natural language processing [3]. Despite this rapid progress, ANNs suffer from several major constraints: they require large amounts of training data, have low fault tolerance and lack cognitive computing functions [4]. This is fundamentally different from how our brains process information [5], and these issues remain unsolved. A smaller group of researchers therefore follows another path to overcome this dilemma: Spiking Neural Networks (SNNs) have come of age [6], using temporal-spatial processing and event-driven mechanisms [7][8]. The core principle of SNNs is to replicate fascinating brain computing behaviours [9][10]: ultra-low power consumption, self-learning and strong fault tolerance. Unfortunately, a considerable gap still exists between ANNs and SNNs at the application level. To the best of our knowledge, the main issues are as follows:
• Mainstream SNN training algorithms such as spike-timing-dependent plasticity (STDP) are widely used in the neuromorphic computing field. For example, ODIN [11] implements a 10-neuron SNN with the SDSP learning algorithm and demonstrates 84.5% classification accuracy on the MNIST dataset; [12][13][14] report similar results using SNNs with STDP-based learning. However, STDP is a local training algorithm, which strongly limits its applicability. A large number of groups also investigate SNN-based backpropagation or gradient-descent algorithms that mirror the ANN training framework [15][16], but these algorithms seem feeble and do not fit the natural computing features of SNNs.
• Simulation of brain computing can be carried out either at a highly bio-plausible level with the Hodgkin-Huxley neuron model [17] or at a more abstract mathematical level with the leaky integrate-and-fire neuron model [18]. Similarly, at the network level, modelling a small neural network can exhibit plasticity, adaptation and compensation [19][20][21], while formulating a large-scale network (100,000 neurons) exploits cognitive computing features [22][23]. It remains unclear at which level a neuromorphic system should learn from the brain. The obvious reason is that the brain is not yet fully understood [24]; more importantly, neuromorphic engineers often do not recognize this point when developing systems, so the resulting systems do not properly reflect SNN computing features.
• Current neuromorphic computing research is largely focused on hardware architecture design, such as Neurogrid [25], TrueNorth [26] and neural processors [27]. These all make significant contributions to the field and demonstrate the capability to simulate either a million neurons or complicated ion-channel mechanisms in real time. One potential risk of this bottom-up approach is that emerging algorithms may not fit well into the developed hardware, resulting in no killer applications. The algorithm, software, hardware and application should all be taken into account when designing a neuromorphic computing system.
Therefore, considering the factors above and inspired by the biological cerebellum Passenger-of-Timing (POT) mechanism [28][29], we propose a novel SNN-based learning system for speed recognition. As shown in Figure 1, the system consists of a spike train encoder, an internal clock based SNN, a pattern transformation block and a Network Dynamic Dependent Plasticity (NDDP) learning block. The main principle is that motion speed can be differentiated via the timing information of a trained SNN's internal clock. Applying both cartoon and real-world videos, the results demonstrate that, under a constrained hardware-resource environment, the proposed system can recognize not only motions with considerable speed differences (e.g. run, walk, jump, wonder and standstill) but also motions with subtle speed gaps such as slow run and fast walk. The key contributions are as follows:
• Application level: the proposed system can be applied in IoT fields for speed recognition thanks to its ultra-low power consumption (33.26 mW), short latency (0.84 s) and limited hardware-resource usage (it can be implemented on a typical ARM Cortex-M4 controller). This enables learning capabilities on edge or end devices.
The internal clock based SNN learning system has three stages for training and learning: 1) information translation: input motion videos are transformed into spike trains via the spike train encoder; 2) training: given pre-defined learning signals, the SNN modifies its global dynamic pattern frequency (internal clock frequency) via the NDDP learning rule to minimize the error (cost function); 3) inference: the trained SNN differentiates input motions based on their mean firing rates. The individual blocks are described in detail in Figure 2.
A temporal-spatial spike train encoder aims to reduce redundant information in both the time and space domains, so that only event-related information is fed into the SNN:

$$S = \sum_{f=1}^{F} \sum_{i=1}^{N} A_i^f \tag{1}$$

where $S$ is the total information (bits) given to the neural network, $F$ is the number of input video frames, $N$ is the number of network neurons (pixel number), $\Delta p$ is the spatial resolution that converts several pixel values into a single one, and $\Delta t$ is the differential timing between a current frame and the reference frame. $A_i^f$ is the activity of pixel $i$ at frame $f$: $A = 1$ indicates spiking, otherwise $A = 0$ (the function $[u]^+$ equals 1 when $u \geq 0$, otherwise 0). As Figure 2(a-b) displays, the reference video motion is converted into spike trains: video frames are encoded into corresponding neuron spike trains as inputs. Figure 2(b) shows a detailed example of converting a run-motion video into spike trains in contour-plot format.
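To make the encoder concrete, here is a minimal NumPy sketch under one reading of the definitions above: blocks of $\Delta p \times \Delta p$ pixels are pooled into a single neuron input, and a neuron spikes when its pooled value changes relative to the frame $\Delta t$ steps earlier. The function name, pooling scheme and change threshold are assumptions, not taken from the paper.

```python
import numpy as np

def encode_spike_trains(frames, delta_p=4, delta_t=1, theta=1.0):
    """Temporal-spatial spike encoder (sketch, assumptions noted above).

    frames : (F, H, W) grayscale video.
    delta_p: spatial resolution -- delta_p x delta_p pixel blocks are
             pooled into one neuron input.
    delta_t: differential timing between current and reference frame.
    theta  : small change threshold (assumption) -- A_i^f = 1 when the
             pooled value moved by at least theta, else A_i^f = 0.
    """
    frames = np.asarray(frames, dtype=float)
    F, H, W = frames.shape
    h, w = H // delta_p, W // delta_p
    # pool delta_p x delta_p blocks into single values (space reduction)
    pooled = frames[:, :h * delta_p, :w * delta_p] \
        .reshape(F, h, delta_p, w, delta_p).mean(axis=(2, 4))
    spikes = np.zeros((F, h * w), dtype=np.uint8)
    for f in range(delta_t, F):  # time reduction: only frame-to-frame events
        diff = np.abs(pooled[f] - pooled[f - delta_t]).ravel()
        spikes[f] = (diff >= theta).astype(np.uint8)
    return spikes

video = np.random.randint(0, 256, size=(30, 120, 120))  # toy 30-frame clip
S = encode_spike_trains(video)
print("total information S =", S.sum(), "bits")  # S = sum of all A_i^f
```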
Based on previous work [29], we develop a new spiking neural network with two types of inputs: synaptic inputs (excitatory and recurrent inhibitory inputs from other neurons) and motion spike trains. The model is a tailor-modified leaky integrate-and-fire model:

$$\tau \frac{du_i(t)}{dt} = -u_i(t) + \sum_{j} w_{ij} A_j(t) + I_i(t), \qquad A_i(t) = [u_i(t)]^+ \tag{2}$$

where $u(t)$ and $A_i$ are the neuron membrane potentials and activity states, $I$ is an external afferent input signal, and $w_{ij}$ represents the synaptic weight from neuron $j$ to neuron $i$; the function $[u]^+$ equals 1 when $u \geq 0$, otherwise 0. The final SNN outputs are the result of a Boolean AND operation between the neuron spikes $u(t)$ and the motion spikes $S_t^i$, where $S_t^i$ is the activity (1 or 0) of motion spike index $i$ at time $t$. This builds a correlation between the internal SNN dynamics and the external-world dynamics (Figure 3a). The neuron model also performs long temporal integration of neuron activities, described by the summation with decay time constant $\tau$.
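A minimal discrete-time sketch of one network update under the reconstructed model of equation (2): leaky integration of recurrent and external afferent input, activity states from $[u]^+$, and a Boolean AND between internal spikes and motion spikes. The Euler discretization and parameter values are assumptions.

```python
import numpy as np

def step_snn(u, W, I, motion_spikes, tau=20.0, dt=1.0):
    """One Euler step of the (reconstructed) modified LIF network.

    u             : membrane potentials, shape (N,)
    W             : recurrent weights w_ij (excitatory > 0, inhibitory < 0)
    I             : external afferent input, shape (N,)
    motion_spikes : encoder output S_t (0/1) for this time step, shape (N,)
    """
    A = (u >= 0).astype(float)             # activity state: [u]^+ = 1 iff u >= 0
    u = u + (dt / tau) * (-u + W @ A + I)  # leaky integration, decay constant tau
    internal = (u >= 0).astype(np.uint8)
    # final SNN output: Boolean AND of internal spikes and motion spikes,
    # correlating internal dynamics with external-world dynamics
    return u, internal & motion_spikes.astype(np.uint8)
```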
The network's global dynamic pattern frequency (internal clock frequency) is quantified with the similarity index of equation (3):

$$C(t_1, t_2) = \frac{\sum_i z_i(t_1)\, z_i(t_2)}{\lVert \mathbf{z}(t_1) \rVert \, \lVert \mathbf{z}(t_2) \rVert} \tag{3}$$

where $C(t_1, t_2)$ equals 1 if the activity patterns $z_i(t_1)$ and $z_i(t_2)$ are identical and 0 if they are orthogonal (no overlap); $t_1$ and $t_2$ are simulation time indices running from 0 to the last simulation step. As Figure 2(c) shows, the internal clock frequency is obtained by evaluating the frequency of repetitive patterns, measured with this similarity index.
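The exact functional form of equation (3) did not survive extraction, but a normalized correlation (cosine similarity) has precisely the stated properties: 1 for identical patterns, 0 for orthogonal ones. The sketch below uses it, together with a simple scan for the repetition period that sets the internal clock frequency.

```python
import numpy as np

def similarity(z1, z2):
    """Similarity index C(t1, t2): cosine similarity between two
    activity patterns -- 1 if identical, 0 if orthogonal (no overlap)."""
    n1, n2 = np.linalg.norm(z1), np.linalg.norm(z2)
    return 0.0 if n1 == 0 or n2 == 0 else float(np.dot(z1, z2) / (n1 * n2))

def internal_clock_period(patterns, thr=0.99):
    """Smallest lag at which the network pattern recurs (C ~ 1);
    the internal clock frequency is the reciprocal of this lag."""
    T = len(patterns)
    for lag in range(1, T):
        if all(similarity(patterns[t], patterns[t + lag]) >= thr
               for t in range(T - lag)):
            return lag
    return None  # no repetitive pattern found within the window
```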
The system learning process is divided into two stages. 1) STAGE 1, ranking: each video motion with index $m$ has a motion frequency $f$ (e.g. walking frequency) and a rank weight $w$. Motion videos are ranked from high to low frequency (calculated from the video information), and a rank variable is assigned to distinguish the different motion types.
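A small illustration of this ranking step; the motion names and frequencies below are made-up values, since the measured frequencies are not listed in the text.

```python
# Rank motions from high to low frequency and assign a rank label.
motion_freqs = {"run": 3.0, "fast walk": 2.2, "walk": 1.5, "wonder": 0.5}  # Hz (illustrative)
ranked = sorted(motion_freqs.items(), key=lambda kv: kv[1], reverse=True)
for rank, (motion, f) in enumerate(ranked, start=1):
    print(f"rank {rank}: {motion} ({f} Hz)")
```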
2) STAGE 2, frequency-band classification (whose purpose is to significantly reduce training time): the SNN is configured into four different internal clock frequencies sequentially: $f_{clk} = f$ (slow: patterns do not overlap, Figure 2(b-c)); $f_{clk} = 2f$ (middle: network patterns overlap twice, Figure 2(b-c)); $f_{clk} = 3f$ (fast: network patterns overlap more than twice, Figure 2(b-c) left); and $f_{clk} = 4f$ (ultra-fast: network patterns overlap most of the time, Figure 2(b-c)).
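The sketch below shows one way such a band assignment could be coded; the band edges (multiples of a base frequency $f$) are an assumption, since the paper assigns bands by counting pattern overlaps in Figure 2(b-c).

```python
def clock_band(motion_freq, base_f):
    """Map a motion frequency onto the four internal clock settings
    (f, 2f, 3f, 4f). Band edges are assumed, not from the paper."""
    for k, name in ((1, "slow"), (2, "middle"), (3, "fast")):
        if motion_freq <= k * base_f:
            return k * base_f, name
    return 4 * base_f, "ultra-fast"

print(clock_band(2.4, base_f=1.0))  # -> (3.0, 'fast')
```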
The internal clock frequency of equation (5) is modified through three network parameters: $M$, the neuron modular size (i.e. how many neurons share the same synaptic connections); $\tau$, the neuron-model decay time constant; and $W$, the network excitatory synapse weight. $W$ is then fine-tuned by the NDDP rule with training rate $\delta$, where $E_c$ and $E_p$ denote the training errors of the current and previous trials: at each training trial, the SNN synaptic weight $W$ is adjusted until the training error equals 0, with a global upper limit of $W = 2.5$.
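Since the exact NDDP update rule did not survive extraction, the loop below assumes a simple error-driven adjustment: $W$ is nudged by the training rate $\delta$ each trial, with the direction chosen by comparing the current error $E_c$ against the previous error $E_p$, and capped at the stated limit of 2.5. The `evaluate_error` callback is hypothetical.

```python
def nddp_train(evaluate_error, W=1.0, delta=0.05, w_max=2.5, max_trials=50):
    """Sketch of the NDDP fine-tuning loop (assumed update rule)."""
    E_p = float("inf")              # previous-trial error E_p
    for trial in range(max_trials):
        E_c = evaluate_error(W)     # current-trial error E_c
        if E_c == 0:
            break                   # train until the error reaches zero
        # nudge W up while the error keeps improving, down otherwise
        W += delta if E_c <= E_p else -delta
        W = min(max(W, 0.0), w_max) # global weight capped at 2.5
        E_p = E_c
    return W

# toy usage: error shrinks as W approaches a (hypothetical) target of 1.8
print(nddp_train(lambda W: round(abs(1.8 - W), 2)))
```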
Three benchmarks are tested to prove the system's functionality: 1) recognition of motions with considerable speed differences: run, walk, joyful, jump, slow walk and wonder; 2) recognition of motions with subtle speed gaps, such as slow run and fast walk; 3) recognition of real-world motion videos based on knowledge learned from cartoon videos.
Regarding the experimental setup, the neuron number is N = 900 and the stimulation time is T = 500 ms. The number ratio of excitatory and inhibitory synapses (weight 0) follows a binomial distribution with P = 0.5. The cartoon motion videos are in 'RGB24' format with a resolution of 596 × 336 pixels, a frame rate of 30 fps and 8 bits per pixel. The spike train encoder parameters are set to Δp = 4 and Δt = 1. At the inference stage, as Figure 4 depicts, 18 videos are randomly selected from the three motion types. Based on the trained results, motions with mean firing rates above 45.1 Hz (red dashed line in Figure 4) are classified as fast movements, motions with firing rates between 30.9 Hz and 45.1 Hz as medium movements, and motions with firing rates below 30.9 Hz (blue dashed line in Figure 4) as slow movements. Only two motions, a17 (fast) and a18 (fast), are misclassified as medium motions; the overall accuracy is 88.9%.
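The inference rule above is a simple thresholding of the mean firing rate; the sketch below encodes the two stated thresholds. The function name and example rates are illustrative.

```python
def classify_speed(mean_rate_hz, fast_thr=45.1, slow_thr=30.9):
    """Above 45.1 Hz -> fast; 30.9-45.1 Hz -> medium; below -> slow."""
    if mean_rate_hz > fast_thr:
        return "fast"
    return "medium" if mean_rate_hz >= slow_thr else "slow"

print(classify_speed(52.0), classify_speed(38.0), classify_speed(24.0))
# -> fast medium slow
```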
To further demonstrate the system's capabilities, six videos (slow run and fast walk) with tiny frequency gaps are chosen for this experiment. In this case the teaching signals are defined in the box below:
Since real-world motions such as walk and run share identical repetitive spike-train patterns with cartoon videos, we ran inference on real-world motion videos (slow run and fast walk) using the cartoon-video-trained SNNs. The results are displayed in Figure 7: 4 fast-walk videos (G1-G4) and 4 slow-run videos (G5-G8) are employed in this experiment. The system with the slow clock frequency makes 3 errors (red arrows at video indices 2, 5, 8), for an accuracy of 62.5%; the system with the middle clock frequency makes 2 errors (red arrows at video indices 5, 8), for an accuracy of 75%; and the system with the fast clock frequency makes 3 errors (red arrows at video indices 2, 5, 7), for an accuracy of 62.5%. The results are summarized in Table 1.
We estimated the hardware implementation results of the algorithm on our previously designed embedded ASIC hardware [30][31]. The latency of a single training trial is 0.08 s, so the total training times for the SNNs with slow, middle and fast clock frequencies are 0.84 s, 5.08 s and 4.35 s, respectively, and the corresponding power consumptions are 33.26 mW, 201 mW and 172.2 mW. An event-driven implementation technique has not been applied here, so the total power can be further optimized in future work.
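As a quick sanity check, the stated latency and power figures imply per-run training energies of roughly 28 mJ to 1 J (energy = power × time):

```python
# (training time s, power W) for slow, middle and fast clock frequencies
cases = {"slow": (0.84, 33.26e-3), "middle": (5.08, 201e-3), "fast": (4.35, 172.2e-3)}
for name, (t, p) in cases.items():
    print(f"{name}: ~{p * t * 1e3:.0f} mJ per training run")
# slow: ~28 mJ, middle: ~1021 mJ, fast: ~749 mJ
```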
In this work we developed a novel internal clock based SNN learning system for speed recognition. The system's key advances are as follows:
• Small training dataset requirement: the developed system employs 6 motion videos for training and runs inference on 18 motion videos, a training-to-inference dataset ratio of 1:3. The key reason is that the designed SNN captures the common speed properties of motion videos and transforms them into a spiking-burst pattern domain for processing.
• Quick learning performance: SNNs with slow clock frequencies require fewer than 10 training trials, while SNNs with middle and fast clock frequencies require up to 50 training trials. This is because we modify the network's global spiking patterns rather than individual neurons. Based on our previous hardware implementation work [32][31], we estimate the latency on a typical ARM Cortex-M4 processor to be less than 6 seconds for 50 training trials. The details are summarized in Table 1.
• Certain cognitive behaviours: using the cartoon-video-trained SNN, the system can also differentiate real-world run and walk videos with reasonable accuracy. This shows that the system exhibits basic cognitive learning behaviours in the spiking-pattern domain.
• An SNN with specific behaviours: inspired by the work in [33], the developed SNN has tailor-designed internal clock timing behaviours [34] at the initial stage. This strongly benefits one-shot/few-shot learning performance.
One of the most promising application areas for the developed algorithm is the edge/IoT field: since the hardware implementation has only a few seconds of latency and 33-201 mW power consumption, typical IoT embedded processors can easily run the algorithm and enable learning behaviours at the end-device level.
The currently developed SNN focuses entirely on timing representation via internal clocking behaviours. However, in some special cases, large dynamic events in the spatial domain can also exert vital effects on inference results, such as video index 3 (walk with a big umbrella). Compared with other work [35], the developed NDDP results in variable training times and uncertain outcomes. The maximum movement speed that the developed SNN can recognize also requires further exploration; this is closely related to large-scale datasets and the developed NDDP rule. In the next stage, we will investigate introducing spatial-domain information representation mechanisms [36], adopting standard video datasets [37] for training, and further algorithm optimizations.