S2Act: Simple Spiking Actor
Ugur Akcal∗1,3,5†, Seung Hyun Kim∗2†, Mikihisa Yuasa1, Hamid Osooli1, Jiarui Sun1, Ribhav Sahu3, Mattia Gazzola2,6,7, Huy T. Tran1, and Girish Chowdhary3,4,5†

Abstract— Spiking neural networks (SNNs) and biologically-inspired learning mechanisms are attractive in mobile robotics, where the size and performance of onboard neural network policies are constrained by power and computational budgets. Existing SNN approaches, such as population coding, reward modulation, and hybrid artificial neural network (ANN)–SNN architectures, have shown promising results; however, they face challenges in complex, highly stochastic environments due to SNN sensitivity to hyperparameters and inconsistent gradient signals. To address these challenges, we propose the simple spiking actor (S2Act), a computationally lightweight framework that deploys an RL policy using an SNN in three steps: (1) architect an actor-critic model based on an approximated network of rate-based spiking neurons, (2) train the network with gradients using compatible activation functions, and (3) transfer the trained weights into the physical parameters of rate-based leaky integrate-and-fire (LIF) neurons for inference and deployment. By globally shaping LIF neuron parameters such that their rate-based responses approximate ReLU activations, S2Act effectively mitigates the vanishing gradient problem, while pre-constraining LIF response curves reduces reliance on complex SNN-specific hyperparameter tuning. We demonstrate our method in two multi-agent stochastic environments (capture-the-flag and parking) that capture the complexity of multi-robot interactions, and deploy our trained policies on physical TurtleBot platforms using Intel's Loihi neuromorphic hardware. Our experimental results show that S2Act outperforms relevant baselines in task performance and real-time inference in nearly all considered scenarios, highlighting its potential for rapid prototyping and efficient real-world deployment of SNN-based RL policies.

This work was supported in part by Navy #N00014-19-1-2373, ONR #N00014-20-1-2249, NSF Expeditions #IIS-2123781, and NSF EFRI #2515342.
University of Illinois at Urbana-Champaign: 1 Aerospace Engineering, 2 Mechanical Science and Engineering, 3 Computer Science, 4 Agricultural and Biological Engineering, 5 Coordinated Science Laboratory, 6 Carl R. Woese Institute for Genomic Biology, 7 National Center for Supercomputing Applications.
∗ These authors contributed equally to this work.
† Corresponding authors: {skim449,makcal2,girishc}@illinois.edu
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Fig. 1. S2Act. An ANN with rate-based LIF activation is used for training, with learned weights then converted to a spiking network for deployment on neuromorphic hardware.

I. INTRODUCTION

Spiking neural network (SNN)-based reinforcement learning (RL) is a growing area of research at the intersection of neuroscience, machine learning, and robotics. SNNs represent a broad class of biologically-inspired approaches for developing intelligent systems, alternative to traditional artificial neural networks (ANNs), that leverage continuous dynamics and event-driven mechanisms similar to biological neurons. An important benefit of this event-driven framework is substantial energy savings in processing neural network models when used with neuromorphic hardware [1]–[4].
Beyond energy efficiency, the sparse, asynchronous, event-driven nature of spike-based representations can also benefit multi-agent systems by reducing communication overhead and enabling decentralized, low-latency coordination. Integrating SNNs with RL has therefore attracted considerable attention for mobile and autonomous robotics [5], [6], to support more efficient on-board inference [7], [8], control [9], [10], and decision-making [11], [12]. Realizing these benefits in complex, stochastic, real-world settings, however, remains a significant challenge. Existing approaches to SNN-based RL range from biologically-plausible learning rules to theory-driven methodologies; regardless of the paradigm, however, current approaches share several common challenges. For example, existing methods often require extensive tuning of neuronal hyperparameters due to the use of complex nonlinear neuron models [13], rely on large or complex network structures for auxiliary losses [14], [15], or suffer from vanishing gradients due to spiking dynamics that are incompatible with backpropagation [16], [17]. The resulting instability and sensitivity of these methods are then often amplified in highly stochastic environments, such as the multi-agent and adversarial settings found in real-world mobile robot deployments [5], [18]. Motivated by these challenges, we propose a simple yet effective ANN-to-SNN solution that matches or exceeds the performance of existing methods, while being computationally efficient enough to enable rapid prototyping and swift sim-to-real deployment on mobile robots with neuromorphic hardware. Our approach is illustrated in Figure 1: our key idea is to configure a global set of neuronal parameters such that the resulting LIF response curves approximate the behavior of rectified linear unit (ReLU) activations [19], following principles from activation function design in deep learning [20].
This design choice removes the need to tune neuronal hyperparameters and facilitates gradient-based optimization while preserving the core advantages of SNNs. This, in turn, enables stable and efficient offline training and further supports better sim-to-real deployment in real-world environments. The main contributions of this work are threefold: (1) we introduce a novel ANN-to-SNN conversion technique that mitigates the vanishing gradient problem in rate-based spiking models by shaping LIF neuron dynamics to approximate ReLU activations, enabling stable and gradient-friendly training; (2) we propose the Simple Spiking Actor (S2Act), a simple and computationally lightweight SNN actor for complex, stochastic environments; and (3) we validate S2Act in both simulated and real-world multi-robot settings, presenting the first on-chip demonstration of an SNN-based RL policy in a multi-agent adversarial task. The rest of this paper is organized as follows: Section II discusses related work using SNNs in RL. Section III details our proposed method. Section IV outlines our experimental setup, and Section V discusses our experimental results. Finally, Section VI provides concluding thoughts, elaborates on the limitations of our method, and identifies potential directions for future work.

II. PREVIOUS WORK

Recent research has explored a variety of techniques for employing SNNs within RL frameworks. Many approaches emphasize biological plausibility by adopting learning rules inspired by neuroscience, such as R-STDP [21]–[24]. While these methods align with prevailing hypotheses about how biological neurons learn [25], they often require extensive hyperparameter tuning and, in general settings, underperform compared to backpropagation [26]. As a result, the training overhead limits rapid prototyping and swift sim-to-real deployment.
A second line of work combines SNNs with ANNs to leverage mature training pipelines while retaining spike-based computation. For example, Chevtcenko and Ludermir [27] proposed an R-STDP-tuned SNN integrated with a pre-trained binary convolutional neural network (CNN) for RL tasks. Overall, these approaches demonstrate the conceptual compatibility of SNNs with ANNs, from which our method draws inspiration. At the same time, many prior studies place less emphasis on neuromorphic hardware constraints, which often leads to unstable solutions for deployment, leaving room to further improve implementations for real-world on-chip operation. A third class of methods uses population coding [28]–[30]. Tang et al. [9] proposed the population-coded spiking actor network (PopSAN), which integrates population coding with RL and includes many energy-efficient robotic control applications. While inspirational and effective, population coding in SNNs comes with practical trade-offs, including layered architectural complexity from population redundancy [31], increased resource demands [32], and additional training difficulties [33]. In particular, population-coded architectures such as PopSAN typically require a large number of training parameters, since each state and action dimension is represented by a separate population of neurons. In practice, these considerations may lead to larger and more entangled SNN architectures that require longer training times and are challenging to compile for neuromorphic chips [6]. Collectively, prior SNN-based RL approaches have shown strong performance on relatively simple tasks but can face limitations in more complex and highly stochastic environments [5], [18]. Building on and motivated by these prior works, we aim to achieve more stable learning and reliable SNN policy performance, with a particular emphasis on handling highly stochastic environments, such as multi-agent adversarial ones.

III. METHODOLOGY

Figure 2 illustrates the S2Act training-to-deployment pipeline, which follows the common paradigm of training offline and deploying online. A rate-based actor-critic network is first trained using proximal policy optimization (PPO) [34] (see Section III-A) in a simulated environment. A key aspect of S2Act is that both the actor and critic use ReLU-like rate-based approximations of LIF neurons (soft-ReLLIF), whose parameters (i.e., neuron-to-neuron connection weights) are updated during training. After training, the actor network is then converted to an equivalent spiking network (see Section III-D), whose weights are used for deployment on neuromorphic hardware.

A. ANN-to-SNN Conversion

The core idea underpinning our methodology is to design spiking neuron dynamics that closely emulate the behavior of ReLU activations, while carefully regularizing the dynamics such that they remain effectively expressible on neuromorphic hardware. In practice, gradient methods for SNNs often struggle because neuromorphic systems rely on a narrow optimal range of spiking activity for reliable information representation. This limitation stems from the nature of rate-based encoding: the resolution of neural activity (i.e., the rate of spiking events) depends on both inference time and activity magnitude. The resulting narrow operating regime leaves little margin for shifts in the weight distribution during training, which makes it more difficult to achieve reliable models under sensitive or noisy data collection. By redesigning spiking neuron dynamics to maximally support the optimal range of neuron expression, we achieve a direct conversion paradigm with minimal discrepancy in sim-to-real settings. The ANN-to-SNN conversion paradigm, i.e., training with differentiable activation units and substituting spiking neurons at inference, additionally avoids the complexities of surrogate design, spike encoding, and hyperparameter tuning.
This approach therefore maintains and utilizes the extensive capacity and scalable infrastructure of modern ANNs and backpropagation-based training, while keeping SNN dynamics optimal for neuromorphic deployment. In this way, we bridge the gap between the efficiency of SNNs and the demands of training in complex, stochastic environments.

Fig. 2. S2Act training-to-deployment pipeline. (a) We employ an ANN-to-SNN conversion strategy, enabling neuromorphic deployment of computationally lightweight SNN RL agents. In the rate-based LIF network training phase, the simulated environment provides discrete visual observations O_t. These are passed to an actor-critic model, comprising a policy (actor) network and a value (critic) network, both modeled with soft-ReLLIF units (green circles). We use PPO to train these networks. (b) Once trained, the soft-ReLLIF neurons in the actor network are replaced with spiking LIF neurons (blue circles) for deployment on neuromorphic hardware, such as Intel's Loihi.

Fig. 3. Rate-based LIF neuron variants. Rate-based LIF units (cyan line) exhibit unbounded gradients at the critical input value determined by their neuronal parameters. Although a smoothing term can be used to address these unbounded gradients, the resulting soft rate-based LIF unit (purple line with star markers) will still be prone to vanishing gradients. A common solution to this problem in networks with bounded activation functions is to employ ReLU activation. Therefore, we adjust the neuronal parameters of the soft rate-based LIF units to approximate the ReLU activation. Consequently, we utilize soft-ReLLIF (green line) activations.
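The "same weights" conversion in Figure 2 can be sketched as follows. This is an illustrative toy, not the authors' implementation: the two-neuron layer, the made-up weights, and the generic softplus/ReLU stand-ins for the training-time and deployment-time activations are all assumptions made for this sketch.

```python
import math

GAMMA = 0.1  # smoothing width of the training-time activation (illustrative)

def train_act(x):
    # Differentiable rate approximation used during training
    # (a softplus stand-in for the soft-ReLLIF unit of Section III).
    z = x / GAMMA
    if z > 30.0:
        return x  # softplus(z) ~= z for large z; avoids overflow
    return GAMMA * math.log1p(math.exp(z))

def deploy_act(x):
    # Hard rate response substituted at deployment (spiking-side stand-in).
    return max(0.0, x)

def dense(x, w, b, act):
    # One fully connected layer: y_j = act(sum_i w[j][i] * x[i] + b[j]).
    return [act(sum(wi * xi for wi, xi in zip(row, x)) + bj)
            for row, bj in zip(w, b)]

# The learned parameters are reused unchanged; only the activation
# function is swapped when moving from training to deployment.
w = [[0.5, -0.2], [0.3, 0.8]]   # made-up "trained" weights
b = [0.1, -0.4]
x = [1.0, 2.0]

train_out = dense(x, w, b, train_act)
deploy_out = dense(x, w, b, deploy_act)
print(train_out, deploy_out)
```

Because the smooth unit closely tracks the hard response away from the threshold region, the swap preserves the layer's input-output behavior; in S2Act the training-side unit is the soft-ReLLIF activation of Section III-B and the deployed unit is a spiking LIF neuron on the chip.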
Our implementation involves few components beyond the soft-ReLLIF activation function: we regularize the activity range to support optimal rate-based value representation while controlling inference latency. Unlike ANNs, where floating-point magnitude does not cost energy, activity regularization of the spike rate considerably affects both energy consumption and inference time, especially at the network scale. To enforce this operating regime, we globally configure LIF neuron parameters such that their response curves approximate ReLU functions, while remaining compatible with gradient-based optimization, and rescale their activity to fit within the range of the planned deployment hardware. This neuron-shaping approach addresses vanishing gradients and reduces the number of neuron parameters to tune, thereby easing stable training and deployment of deep SNN-based RL policies.

B. SNN Implementation

The analytical firing-rate solution ρ[I_s(t)] of the sub-threshold dynamics of LIF neurons, expressed by Equation (1), is the core of the ANN-to-SNN conversion framework we implement.

\[
\rho[I_s(t)] =
\begin{cases}
0, & I_s(t) \le I_{th} \\
\left(\tau_{ref} + \tau_{spike}\right)^{-1}, & I_s(t) > I_{th}
\end{cases}
\]
\[
I_{th} = \frac{(V_{th} - V_0)\, C_m}{\tau_m},
\qquad
\tau_{spike} = -\tau_m \log\!\left(1 - \frac{(V_{th} - V_{reset})\, C_m / \tau_m}{(V_0 - V_{reset})\, C_m / \tau_m + I_s(t)}\right)
\tag{1}
\]

Equation (1) gives the LIF firing rate ρ as the time-averaged spike rate for a constant synaptic current I_s(t). At time t, I_s(t) denotes the total synaptic current received by the neuron, C_m represents the membrane capacitance, τ_m is the membrane time constant, and V_0 is the resting potential of the neuron. The membrane potential V(t) functions as a scalar state variable that resets to a value V_reset whenever it reaches a threshold V_th.
The refractory period, denoted τ_ref, represents the duration a neuron needs before it can receive input currents again following a spike, while τ_spike specifies the time required for the neuron's potential to rise from V_reset to V_th after a spike at time t = t′, under the condition V_th < I_s(t) = c ∈ ℝ, t′ < t ≤ t′ + τ_spike. For a comprehensive discussion of the LIF neuron model and the analytical solution for its firing rate, we recommend consulting [35].

Fig. 4. Simulated and real-world environments for evaluations. (a) Visualization of our simulated 2 vs. 2 CtF game. Gray squares are obstacles, triangles are agents, and circles are flags. The region highlighted by solid red lines is the border region for a red agent whose policy is the patrol policy. The game has stochastic combat: when two agents are adjacent, whether one is eliminated depends on the territory, the number of nearby enemies, and the number of nearby allies. (b) The real-world CtF arena measures 12' × 12' and replicates the 10 × 10 grid-world used during policy training. Blue, red, and gray tiles denote the blue team's territory, the red team's territory, and obstacles, respectively. (c) A Qualisys motion-capture system with ten Miqus M5 cameras and Qualisys Track Manager software is used to track robot positions in real time. (d) Visualization of the parking environment. The green rectangle is the ego vehicle, and the blue square is the target parking spot.

The rate-based LIF activation function characterized by Equation (1) is not compatible with backpropagation, since ∂ρ/∂I_s is discontinuous and unbounded at I_s = (V_th − V_0) C_m / τ_m.
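A direct transcription of Equation (1) makes this failure mode easy to see numerically. The sketch below (pure Python; the parameters C_m = 1, V_0 = V_reset = 0, V_th = 1, τ_ref = 0.002 are illustrative, not the Section III-D deployment settings) evaluates the rate and its finite-difference slope near the critical current:

```python
import math

def lif_rate(i_s, c_m=1.0, tau_m=1.0, tau_ref=0.002,
             v_th=1.0, v_0=0.0, v_reset=0.0):
    """Analytical LIF firing rate rho[I_s] for a constant synaptic
    current i_s, per Equation (1)."""
    i_th = (v_th - v_0) * c_m / tau_m
    if i_s <= i_th:
        return 0.0  # sub-threshold: the neuron never reaches V_th
    tau_spike = -tau_m * math.log(
        1.0 - ((v_th - v_reset) * c_m / tau_m)
        / ((v_0 - v_reset) * c_m / tau_m + i_s)
    )
    return 1.0 / (tau_ref + tau_spike)

# Central finite differences of the rate, approaching I_th = 1 from above:
eps = 1e-4
grads = []
for i_s in (1.5, 1.1, 1.01, 1.001):
    grad = (lif_rate(i_s + eps) - lif_rate(i_s - eps)) / (2 * eps)
    grads.append(grad)
    print(f"I_s = {i_s:.3f}  rate = {lif_rate(i_s):.4f}  slope ~= {grad:.1f}")
# The slope estimates grow without bound as I_s -> I_th, which is what
# breaks naive backpropagation through this activation.
```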
Nevertheless, a modification of Equation (1) following the approach of [36] allows for a smoothed, rate-based approximation of the LIF model, ρ′[I_s(t)], as expressed by Equation (2).

\[
\rho'[I_s(t)] = \left(\tau_{ref} + \tau_m \log\!\left(1 + \frac{V_{th}}{\Theta[I_s(t) - V_{th}]}\right)\right)^{-1},
\qquad
\Theta(x) = \gamma \log\!\left(1 + e^{x/\gamma}\right)
\tag{2}
\]

However, an activation function driven by Equation (2) becomes vulnerable to vanishing gradients as its output spiking activity converges to the maximum firing rate. To overcome complications arising from output activity saturation, we draw inspiration from [20] and adjust the neuronal parameters in Equation (2) so that each rate-based LIF unit in the ANNs used for training approximates a ReLU activation function, as depicted in Figure 3.

C. Policy Optimization

We train an actor network (i.e., policy network) π_θ composed of soft-ReLLIF neurons with trainable parameters θ, following a typical RL training procedure (see Figure 2a). We specifically use PPO, which maximizes the clipped objective given in Equation (3).

\[
\theta_{new} = \arg\max_{\theta} \frac{1}{|D_{\theta}|\, T} \sum_{d \in D_{\theta}} \sum_{t=1}^{T} \min(f_t, g_t)
\]
\[
f_t = f\!\left(\pi_{\theta_{old}}, \pi_{\theta}, O_t, a_t\right) = \frac{\pi_{\theta}(a_t \mid O_t)}{\pi_{\theta_{old}}(a_t \mid O_t)}\, \hat{A}^{\pi_{\theta_{old}}}(O_t, a_t)
\]
\[
g_t = g\!\left(\pi_{\theta_{old}}, \pi_{\theta}, O_t, a_t\right) = \operatorname{clip}\!\left(\frac{\pi_{\theta}(a_t \mid O_t)}{\pi_{\theta_{old}}(a_t \mid O_t)},\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}^{\pi_{\theta_{old}}}(O_t, a_t)
\tag{3}
\]

Once training is completed, we substitute the actor neurons with spiking LIF neurons for inference.

D. Architecture Details

We keep the size of S2Act minimal to simplify implementation on neuromorphic chips, which are often not well-optimized for spatially complex layer structures (e.g., CNNs). The actor network comprises two dense layers of 64 neurons each, with inputs represented as a vector of the positions of the environment components. For the neuronal parameters, we set C_m = 1, τ_m = 1, V_0 = 0, V_reset = 0, τ_ref = 0, and γ = 0.003. We used PyTorch to build and train the SNNs.
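With these Section III-D parameters (C_m = 1, τ_m = 1, V_0 = V_reset = 0, τ_ref = 0, γ = 0.003), Equation (2) reduces to ρ′(I_s) = 1 / log(1 + V_th / Θ(I_s − V_th)). The sketch below (V_th = 1 is an assumption here; its value is not stated in this excerpt) shows that the resulting unit behaves like a shifted ReLU above threshold:

```python
import math

GAMMA = 0.003   # smoothing width gamma from Section III-D
V_TH = 1.0      # assumed threshold; not specified in this excerpt
TAU_M = 1.0
TAU_REF = 0.0

def theta(x, gamma=GAMMA):
    # Theta(x) = gamma * log(1 + exp(x / gamma)), computed stably.
    z = x / gamma
    if z > 30.0:
        return x  # softplus(z) ~= z for large z; avoids overflow
    return gamma * math.log1p(math.exp(z))

def soft_rellif(i_s):
    """Smoothed rate-based LIF activation, Equation (2)."""
    t = theta(i_s - V_TH)
    if t <= 0.0:
        return 0.0  # guard against exact underflow far below threshold
    return 1.0 / (TAU_REF + TAU_M * math.log1p(V_TH / t))

for i_s in (0.0, 2.0, 6.0, 11.0):
    print(f"I_s={i_s:5.1f}  soft-ReLLIF={soft_rellif(i_s):7.3f}  "
          f"ReLU(I_s - V_TH)={max(0.0, i_s - V_TH):5.1f}")
```

For currents well above threshold, 1/log(1 + V_th/(I_s − V_th)) ≈ (I_s − V_th)/V_th, so the response is approximately linear in the input, as in the green curve of Figure 3; below threshold the softplus term drives the rate smoothly toward zero.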
IV. EXPERIMENTAL SETUP

We demonstrate S2Act in a simulated capture-the-flag (CtF) game, a real-world multi-robot implementation of that CtF game, and a simulated parking environment, as illustrated in Figure 4. We compare against three baselines representing different classes of SNN-based RL methods: a population coding method (PopSAN) [9], a hybrid method (Hybrid SNN) [27], and a recurrent network method (RSNN) [23]. We conducted extensive parameter sweeps for each baseline and chose the hyperparameters that produced the best performance for each method.

A. CtF Environment

Capture the Flag (CtF) is a multi-agent adversarial environment and a useful testbed for stress-testing RL training: it combines long horizons with sparse, delayed rewards, and it requires policies to respond to unpredictable opponent behavior. Training in this setting is challenging because the episodic reward signal is often noisy and stochastic, conditions under which SNN-based RL frequently struggles to achieve the stability needed to learn strong policies. We consider a simulated CtF game with m allied agents (blue) versus n enemy agents (red), based on [37] (Figure 4a). The environment has discrete state and action spaces, and its complexity and stochasticity arise from agent interactions, territorial advantages, and adversarial dynamics. We adjust task difficulty by varying the enemy team size n and by choosing the heuristic policy that the enemy agents follow: random-walk or patrol. In random-walk, enemies select actions uniformly at random, whereas patrol biases movement toward the border region to defend the flag. A team wins either by capturing the opponent's flag or by capturing all opponents.
Here, we report results for two settings: (1) 1v1 CtF against a random-walk enemy, which is relatively easy to learn, and (2) 2v2 CtF against defensive patrol enemies, where training often converges to conservative behavior, and stable learning can yield more aggressive strategies that successfully capture the flag. Additional environment details will be provided in a public GitHub repository [38].

B. Multi-Robot CtF Sim-to-Real with Loihi

We also prepared a real-world demonstration of the CtF environment with robots using neuromorphic hardware (see Figure 4b). Our experimental setup includes multiple TurtleBot3 Burger robots in an indoor arena, where a ground station runs an on-chip S2Act centralized policy for the blue team and autonomous heuristic policies for the red team. Communication among system components was established over a local wireless network using a router and ROS 2 Humble [39], and a custom P-controller was used to drive the robots with feedback from a Qualisys motion-capture system. We deployed an S2Act centralized policy (trained in simulation) on a Loihi neuromorphic chip [4]. Loihi is a brain-inspired neuromorphic processor designed to run asynchronous spiking networks with low energy consumption. We used the Kapoho Bay model for its modularity and accessibility, which allow us to assess performance and energy in mobile deployment settings. We interfaced Loihi with NengoLoihi [40] and NxSDK [41] to convert the trained policy network into spiking neurons and synapses. This ANN-to-SNN conversion re-compiles nonlinear activation functions as neuron nodes and defines synaptic connections scaled by the learned weights. Additional hardware tuning was required to rescale the activity range to the optimal expression range; details will be provided in a GitHub repository upon acceptance.

C. Parking Environment

Finally, we include a simulated parking environment [42] that evaluates SNN-based RL under physical dynamics, continuous state-action control, and adversarial disturbance (see Figure 4d). In this task, an ego vehicle (yellow) must navigate to a designated parking area while another vehicle (green) moves towards the same area. Stochasticity comes from the other vehicle's randomly sampled acceleration and from the parking position. An episode terminates when the ego vehicle successfully parks, collides with the other vehicle, or crashes into a wall. Additional environment details are provided in [43].

V. RESULTS AND DISCUSSIONS

Figure 5 shows training curves for our method, S2Act, and the baselines in our simulated environments. We see that S2Act consistently achieves the highest converged mean episodic reward and does so the fastest, with PopSAN emerging as the closest competitor. We also observe that PopSAN's training becomes unstable after 100,000 time steps in the parking environment. Both the RSNN and Hybrid SNN methods struggle to solve the tasks. We also evaluated the performance of each fully trained agent over 1,000 episodes and report the mean values of various performance metrics in Table I. Our metrics include the estimated energy consumption per inference (EpI), the estimated mean inference time (MInT), and the mean training time (MTrT) for each policy. EpI for Loihi was measured with a USB power meter to capture the collective power draw (including microprocessors and cooling), and EpI for CPU and GPU was estimated using the pyJoules package [44]. In the table, bold values mark the best EpI and success rates in each scenario.
For the CtF scenarios, we include statistics for the following success and failure cases: (1) Flag capture, where a controlled agent captures the enemy flag (success); (2) Defeated, where all controlled agents are captured by the enemy (failure); (3) Lost flag, where an enemy agent captures the friendly flag (failure); (4) Time-up, where the episode reaches the maximum step limit (failure). For the parking environment, we include statistics for the following cases: (1) Success, where the car parks in the target spot; (2) Crash, where the car collides with the other vehicle. We tested both simulated and on-chip S2Act policies alongside our baselines. The simulated S2Act agent achieved the highest success rate of 78.1% and the fastest MTrT (2.4 hours, compared to 8.3 hours for its closest competitor, PopSAN) by a significant margin in the 1v1 CtF random-enemy scenario. The on-chip S2Act agent came second with a success rate of 72.6%. Similarly, both the simulated and on-chip S2Act policies delivered the best performance and fastest MTrT (5 hours versus 15.7 hours for PopSAN) in the 2v2 CtF defensive-enemies scenario. It is worth noting that, in both scenarios, most failure cases occurred when the enemy had favorable initial conditions, making it easier either to defend its flag or to capture the S2Act agents. Furthermore, the low flag-capture rates of RSNN and Hybrid SNN are due to their inability to solve the task; episodes typically end in time-up as agents adopt conservative strategies rather than taking risks to capture the flag. In the parking environment, the simulated S2Act also yielded the highest success rate and showed a significant improvement in MTrT relative to most baselines. The on-chip S2Act policy performed slightly worse than PopSAN but outperformed the RSNN and Hybrid SNN baselines, which failed to solve the task.
We deployed the same on-chip S2Act policy in the real-world 2v2 CtF setup described in Section IV-B: TurtleBot3 Burger robots in an indoor arena, with the blue team controlled by the Loihi-based policy and the red team by heuristic policies, using motion capture and a ground station over a local network (see Figure 6a-c). In repeated runs, the blue team successfully exhibited coordinated CtF behaviors, approaching the enemy flag, avoiding or engaging red agents, and capturing the flag when conditions allowed (see Figure 6d-e), without any fine-tuning in the physical environment. We initially observed performance degradation due to out-of-range neural activity and a shift in the effective current range induced by thermal drift. However, only minimal rescaling of neural activity was required to recover the same qualitative behavior observed in simulation, indicating robust sim-to-real transfer and low overhead for deployment on neuromorphic hardware. Note that even with rescaling, we still observed a slight but noticeable reduction in the performance of the on-chip S2Act policy. This gap is likely attributable to quantization effects on Loihi [45], arising from limited bit precision for representing synaptic weights and neural activations. Although the policy's qualitative behavior remains similar, residual discretization and timing effects can still reduce control fidelity. More broadly, the rate-based SNN implementation offers an inherent trade-off: achieving higher-precision encoding incurs latency, since each inference step consists of a temporal integration window to estimate output rates. Consequently, rate-based encoding is fundamentally limited by the resolution of firing-rate estimates over finite windows: improving accuracy typically requires higher firing rates and/or longer windows, both of which increase latency or energy. This limitation may be acceptable for higher-level planning and decision-making that is not time-critical, whereas fast continuous control will likely require alternative low-latency representations or decoding strategies.

Fig. 5. Evaluation results. Average training curves of S2Act and other baselines in two CtF scenarios (a, b) and the parking environment (c). Solid lines represent the average mean episodic reward over three seeds, with shaded areas representing one-standard-deviation confidence intervals.

Fig. 6. Episode trajectory. (a) Multi-robot demonstration of the on-chip deployment in the CtF task. (b) Each TurtleBot3 Burger robot is fitted with four sparsely placed marker balls to enable robust pose tracking. Communication among the components is handled via a local wireless network running ROS 2 Humble [39]. The evaluations are conducted in a 2v2 scenario, where the blue agents engage with defensive red-team behaviors. (c) A ground station communicates the on-chip S2Act centralized policy's commands to the blue team, while red-team policies run individually on the ground station. (d-e) We overlay the trajectories executed by robots controlled by the S2Act policy network. The policy first moves agents to the territorial boundary nearest the target flag, after which the agents coordinate to capture the flag. A supplementary video is provided.

VI. CONCLUSIONS

This paper presents S2Act, a computationally efficient RL architecture based on the ANN-to-SNN conversion paradigm. To the best of our knowledge, S2Act is the first on-chip SNN policy that uses an ANN-to-SNN conversion approach in an RL setting to tackle a real-world multi-agent adversarial scenario. Our experiments demonstrate that S2Act achieves superior training and inference performance in simulated CtF scenarios and parking tasks, outperforming other SNN competitors while using only two hidden layers of soft-ReLLIF neurons. Experiments further show that S2Act exhibits superior performance despite being significantly smaller in the number of trainable parameters and substantially more sample-efficient than its competitors, with minimal performance degradation due to ANN-to-SNN approximation errors. These results underscore S2Act's robustness and adaptability in handling highly stochastic and adversarial environments, pushing the boundaries of current SNN capabilities.

TABLE I
PERFORMANCE MEASURES. INFERENCE STATISTICS AND RATE OF SUCCESSFUL TERMINATION.

1v1 CtF (Simulated)
Method      Hardware  EpI [J/inf]  MInT [ms]  MTrT [hours]  Flag Capture (%)  Defeated (%)  Lost Flag (%)  Time-up (%)
S2Act       GPU       0.079        6.7        2.4           78.1              20.8          0.4            0.7
S2Act       Loihi     0.033        27         -             72.6              26.4          0.4            0.6
PopSAN      GPU       1.042        141.6      8.3           67.7              24.2          0.3            7.8
RSNN        CPU       0.468        57.8       12.1          4.2               17.3          0.1            78.4
Hybrid SNN  CPU       0.066        9.2        27.3          0.3               17.1          0.8            81.8

2v2 CtF (Simulated)
Method      Hardware  EpI [J/inf]  MInT [ms]  MTrT [hours]  Flag Capture (%)  Defeated (%)  Lost Flag (%)  Time-up (%)
S2Act       GPU       0.093        7.2        5.0           34.0              65.0          0.0            1.0
S2Act       Loihi     0.037        29         -             29.4              61.6          0.0            9.0
PopSAN      GPU       1.415        188.1      15.7          21.2              62.5          0.0            16.3
RSNN        CPU       0.562        71.2       20.6          0.0               12.3          0.0            87.7
Hybrid SNN  CPU       0.073        9.8        34.8          0.0               10.9          0.0            89.1

Parking (Simulated)
Method      Hardware  EpI [J/inf]  MInT [ms]  MTrT [hours]  Success Rate (%)  Crash Rate (%)
S2Act       GPU       0.012        1.4        4.9           43                11.0
S2Act       Loihi     0.033        26         -             39.1              14.4
PopSAN      GPU       0.201        24.1       16.6          40.4              11.9
RSNN        CPU       0.320        41.4       19.4          0.0               22.7
Hybrid SNN  CPU       0.03         2.3        34.9          0.0               21.4

Hardware-in-the-loop simulations using Intel's neuromorphic USB form factor, Kapoho Bay, demonstrate that on-chip S2Act policies maintain competitive, and in some cases superior, performance relative to their simulated counterparts, despite slight reductions due to hardware constraints. Despite the promising results demonstrated by S2Act, several limitations warrant consideration. First, the reliance on rate-based training and subsequent ANN-to-SNN conversion introduces approximation errors that may degrade performance during on-chip inference, particularly under the quantization constraints of neuromorphic hardware like Loihi. While these discrepancies were modest in our experiments, they could become more pronounced in more complex tasks or larger networks. Additionally, S2Act currently assumes full observability of the environment and fixed neuronal parameters, limiting its adaptability to partially observable or dynamic contexts where real-time learning or plasticity may be beneficial. Our findings highlight S2Act's potential for efficient, neuromorphic RL and real-world deployment of SNNs in complex, dynamic tasks.
Future research may explore neuromorphic hardware-aware SNN architecture design to further optimize on-chip performance, as well as robust policy learning in adversarial and partially observable environments, broadening the practical applicability of SNNs in intelligent autonomous systems.