Power, Energy and Speed of Embedded and Server Multi-Cores applied to Distributed Simulation of Spiking Neural Networks: ARM in NVIDIA Tegra vs Intel Xeon quad-cores


This short note regards a comparison of instantaneous power, total energy consumption, execution time and energetic cost per synaptic event of a spiking neural network simulator (DPSNN-STDP) distributed on MPI processes when executed either on an embedded platform (based on a dual-socket quad-core ARM platform) or on a server platform (an Intel-based dual-socket quad-core platform). We also compare these measures with those reported by leading custom and semi-custom designs: TrueNorth and SpiNNaker. In summary, we observed that: 1) we spent 2.2 micro-Joule per simulated synaptic event on the “embedded platform”, approx. 4.4 times less than on the “server platform”; 2) the instantaneous power consumption of the “embedded platform” was 14.4 times better than that of the “server” one; 3) the server platform was a factor of 3.3 faster. The “embedded platform” is made of NVIDIA Jetson TK1 boards, interconnected by Ethernet, each mounting a Tegra K1 chip including a quad-core ARM Cortex-A15 at 2.3 GHz. The “server platform” is based on dual-socket quad-core Intel Xeon CPUs (E5620 at 2.4 GHz). The measures were obtained with the DPSNN-STDP simulator (Distributed Simulator of Polychronous Spiking Neural Network with synaptic Spike-Timing-Dependent Plasticity) developed by INFN, which has already demonstrated efficient scalability and execution speed-up on hundreds of similar “server” cores and MPI processes, applied to neural nets composed of several billions of synapses.


💡 Research Summary

The paper presents a systematic comparison of power consumption, total energy usage, execution time, and energy per synaptic event for a distributed spiking neural network simulator (DPSNN‑STDP) running on two commercially available multi‑core platforms: an embedded system based on NVIDIA Jetson TK1 boards (each with a Tegra K1 SoC containing a quad‑core ARM Cortex‑A15 at 2.3 GHz) and a server system built from dual‑socket Intel Xeon E5620 CPUs (quad‑core per socket, 2.4 GHz). Both platforms execute the same MPI‑based simulation of a large‑scale network (several million neurons, on the order of ten billion synapses) under identical software stacks (Linux, OpenMPI 1.8). Power measurements were obtained with external power meters combined with software logging, providing instantaneous power (Pinst) and cumulative energy (Etotal) readings at one‑second resolution.
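Since the power logs arrive as one-per-second instantaneous readings, the cumulative energy Etotal is simply the time integral of Pinst. A minimal sketch of that integration (the function name and the trapezoidal scheme are illustrative assumptions, not details from the paper):

```python
def energy_from_power_log(samples_w, dt_s=1.0):
    """Integrate instantaneous power samples (watts) into total energy (joules)
    using the trapezoidal rule; dt_s is the sampling interval in seconds."""
    if len(samples_w) < 2:
        return 0.0
    total = 0.0
    for a, b in zip(samples_w, samples_w[1:]):
        total += 0.5 * (a + b) * dt_s
    return total

# Example: a constant 30 W draw sampled 11 times at 1 Hz spans 10 s -> 300 J
print(energy_from_power_log([30.0] * 11))  # 300.0
```

At a constant draw the trapezoidal rule is exact; with real meter logs it only approximates the energy between samples, which is why one-second resolution matters for short runs.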

Key experimental parameters: the network model spans 5 million neurons and roughly 10 billion synapses, simulated for 10 seconds of biological time. Each synaptic event triggers spike‑timing‑dependent plasticity (STDP) updates, amounting to about 30 floating‑point operations per event. The embedded configuration consists of two Jetson TK1 boards interconnected via gigabit Ethernet, while the server configuration comprises eight Xeon cores (dual‑socket) with a higher memory bandwidth and larger cache hierarchy.
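As a rough sanity check, the per-event figures above imply a whole-system energy cost per floating-point operation; note this folds in memory traffic, communication, and idle power, not just the arithmetic, and the calculation below is an illustration rather than a number from the paper:

```python
# Whole-system energy per FLOP implied by the summary's figures:
# 2.2 uJ per synaptic event on the embedded platform, ~30 FLOPs per event.
uj_per_event = 2.2
flops_per_event = 30
nj_per_flop = uj_per_event * 1e3 / flops_per_event  # convert uJ -> nJ, divide per FLOP
print(f"~{nj_per_flop:.0f} nJ per FLOP")  # ~73 nJ
```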

Results show a clear trade‑off between energy efficiency and raw performance. The embedded platform consumes an average instantaneous power of ~30 W, totaling ~900 J for the full simulation, and completes the run in 410 seconds. The server platform draws ~430 W on average, uses ~3600 J, and finishes in 124 seconds, approximately 3.3× faster than the embedded system. Energy per synaptic event is 2.2 µJ on the embedded board versus 9.7 µJ on the Xeon server, meaning the embedded solution is 4.4× more energy‑efficient.
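The three headline ratios follow directly from these raw numbers; a quick check (the raw figures are the rounded averages quoted above, so the derived ratios can differ slightly from the abstract's):

```python
# Raw figures from the summary above (approximate averages).
embedded = {"power_w": 30.0, "time_s": 410.0, "uj_per_event": 2.2}
server = {"power_w": 430.0, "time_s": 124.0, "uj_per_event": 9.7}

speedup = embedded["time_s"] / server["time_s"]        # server wall-clock advantage
power_ratio = server["power_w"] / embedded["power_w"]  # embedded power advantage
energy_ratio = server["uj_per_event"] / embedded["uj_per_event"]

print(f"server speedup:     {speedup:.1f}x")       # 3.3x
print(f"power ratio:        {power_ratio:.1f}x")   # 14.3x (abstract rounds to 14.4x)
print(f"energy/event ratio: {energy_ratio:.1f}x")  # 4.4x
```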

When benchmarked against custom neuromorphic hardware, the results are placed in context: IBM’s TrueNorth ASIC achieves ~0.08 µJ per event, and the SpiNNaker ARM‑based many‑core system reports ~0.9 µJ per event. Thus, while the off‑the‑shelf embedded platform still lags behind purpose‑built designs, it offers a compelling balance of low cost, development flexibility, and markedly lower power draw compared with conventional servers.

The authors discuss methodological considerations, noting that only the CPU cores of the Tegra K1 were exercised; the integrated GPU, if leveraged, could accelerate the compute‑intensive STDP kernels and potentially narrow the performance gap. They also note that newer ARM architectures provide higher clock rates and substantially wider memory bandwidth, suggesting that future embedded platforms could approach server‑class throughput while retaining superior energy efficiency.

Communication overhead is another focal point: both systems rely on Ethernet for MPI traffic, and the authors propose that low‑power high‑throughput interconnects (e.g., InfiniBand or specialized NICs) could reduce the energy share of data movement, especially in larger clusters.

In conclusion, the study demonstrates that embedded multi‑core systems excel in energy per synaptic event and instantaneous power, whereas server‑grade Xeon systems deliver faster wall‑clock times at the cost of higher power and energy consumption. The choice between them should be guided by application constraints—real‑time requirements, energy budgets, and cost considerations. The paper also outlines pathways for future work, including GPU offloading, adoption of next‑generation ARM cores, and refined MPI communication strategies, all aimed at narrowing the performance‑efficiency divide for large‑scale spiking neural network simulations.

