Scaling to 1024 software processes and hardware cores of the distributed simulation of a spiking neural network including up to 20G synapses
This short report describes the scaling, up to 1024 software processes and hardware cores, of a distributed simulator of plastic spiking neural networks. A previous report demonstrated good scalability of the simulator up to 128 processes. Herein we extend the speed-up measurements and the strong and weak scaling analysis of the simulator to the range between 1 and 1024 software processes and hardware cores. We simulated two-dimensional grids of cortical columns including up to ~20G synapses connecting ~11M neurons. The neural network was distributed over a set of MPI processes, and the simulations were run on a server platform composed of up to 64 dual-socket nodes, each socket equipped with an Intel Haswell E5-2630 v3 processor (8 cores @ 2.4 GHz clock). All nodes are interconnected through an InfiniBand network. The DPSNN simulator has been developed by INFN in the framework of the EURETILE and CORTICONIC European FET Projects and will be used by the WaveScalEW team in the framework of the Human Brain Project (HBP), SubProject 2 - Cognitive and Systems Neuroscience. This report lays the groundwork for a more thorough comparison with the neural simulation tool NEST.
💡 Research Summary
This short report presents an extensive scalability study of the Distributed Plastic Spiking Neural Network (DPSNN) simulator, extending previous work that demonstrated good performance up to 128 MPI processes. The authors evaluate both strong and weak scaling from a single process to the full complement of 1,024 software processes and hardware cores on a high-performance cluster. The hardware platform consists of 64 dual-socket nodes (each node equipped with two Intel Xeon E5-2630 v3 CPUs, 8 cores per socket, 2.4 GHz) interconnected via a 56 Gbps (FDR) InfiniBand network, providing the low-latency, high-bandwidth communication essential for large-scale neural simulations.
The simulated model is a two‑dimensional grid of cortical columns, each column containing 10,000 leaky integrate‑and‑fire (LIF) neurons. In total the network comprises roughly 11 million neurons and about 20 billion synapses, with spike‑timing‑dependent plasticity (STDP) implemented at each synapse. The simulation runs with a 1 ms time step for a biological duration of 1 second (1,000 steps). Neurons and synapses are partitioned across MPI processes; each process holds its local data in memory and communicates spike events and weight updates asynchronously using non‑blocking MPI calls. To reduce communication overhead, the authors employ spike buffering, aggregation, and optional compression, exploiting the fact that spike propagation is largely local in this spatially structured network.
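The per-neuron update and local spike buffering described above can be sketched as a minimal Python model. All parameter values here (membrane time constant, thresholds, input current) are illustrative defaults chosen for demonstration, not the DPSNN model's actual configuration:

```python
# Minimal leaky integrate-and-fire (LIF) update with a 1 ms time step.
# Parameter values are illustrative, not taken from the DPSNN model.

def lif_step(v, i_syn, dt=1.0, tau_m=20.0, v_rest=-65.0,
             v_thresh=-50.0, v_reset=-65.0, r_m=10.0):
    """Advance one neuron's membrane potential by dt milliseconds.
    Returns (new_v, spiked)."""
    v += (-(v - v_rest) + r_m * i_syn) * dt / tau_m
    if v >= v_thresh:
        return v_reset, True
    return v, False

def simulate(n_steps=1000, i_syn=2.0):
    """Run n_steps of 1 ms (1 s of biological time) for a single neuron,
    buffering spike times locally -- in a distributed simulator, such
    buffers would be aggregated into one message per communication
    partner instead of sending each spike individually."""
    v, spike_buffer = -65.0, []
    for t in range(n_steps):
        v, fired = lif_step(v, i_syn)
        if fired:
            spike_buffer.append(t)
    return spike_buffer
```

With the constant drive used here the toy neuron fires periodically; in the full simulator each MPI process advances many such neurons per 1 ms step and then exchanges the buffered spikes in bulk.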
Performance results show that the simulator scales remarkably well. In the strong‑scaling experiments (fixed problem size, increasing core count), the runtime decreases almost linearly: doubling the number of cores yields an average speed‑up of 1.92×, i.e., each doubling falls only about 4 % short of ideal linear scaling. At 1,024 cores the total wall‑clock time for the 1‑second biological simulation is 0.083 seconds, corresponding to a 12× real‑time factor. Weak‑scaling tests (problem size proportional to core count) maintain >95 % parallel efficiency up to the full 1,024‑core configuration, indicating that the communication pattern and load balance remain effective as the system grows.
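The scaling metrics above follow the standard definitions: strong-scaling speed-up S = T1/Tp with parallel efficiency E = S/p, and weak-scaling efficiency as the ratio of single-process to p-process wall-clock time at constant per-process workload. A minimal sketch, with all timings purely illustrative rather than measurements from the paper:

```python
# Standard strong- and weak-scaling metrics.

def strong_scaling(t1, tp, p):
    """Return (speed-up, parallel efficiency) for a fixed-size problem
    run in t1 seconds on 1 process and tp seconds on p processes."""
    speedup = t1 / tp
    return speedup, speedup / p

def weak_scaling_efficiency(t1, tp):
    """Weak scaling: the problem grows with p, so an ideal run keeps the
    wall-clock time constant and the efficiency t1/tp stays at 1.0."""
    return t1 / tp

# Purely illustrative timings (seconds), not data from the paper:
s, e = strong_scaling(t1=100.0, tp=0.125, p=1024)   # s = 800.0, e ~ 0.78
```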
Communication overhead remains modest, accounting for less than 7 % of total execution time even at the largest scale. The high‑performance InfiniBand fabric, combined with MPI’s efficient collective operations, ensures that spike exchange does not become a bottleneck. Memory consumption averages 2.3 GB per core, giving a total memory footprint of roughly 2.3 TB for the 1,024‑core run, close to the 3 TB physical memory limit of the cluster but still manageable. The authors note that further memory‑saving techniques (compression, dynamic allocation, out‑of‑core strategies) could enable simulations of even larger networks.
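The quoted memory figures (2.3 GB per core across 1,024 cores, for a network of about 20 billion synapses) can be cross-checked with simple arithmetic; the bytes-per-synapse estimate below is derived here for illustration and is not a number reported by the authors:

```python
# Back-of-envelope check of the reported memory footprint.
per_core_gb = 2.3        # average memory consumption per core (GB)
cores = 1024
synapses = 20e9          # ~20 G synapses in the largest configuration

total_gb = per_core_gb * cores                 # ~2355 GB, i.e. ~2.3 TB
bytes_per_synapse = total_gb * 1e9 / synapses  # rough upper bound: neuron
                                               # state and buffers also
                                               # consume part of this memory
```

Since synapses dominate the state of a network this size, the result (on the order of 120 bytes per synapse) suggests why compression and dynamic allocation are the natural levers for growing the network further.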
Key technical insights emerge from the study: (1) The locality of spike traffic in spatially organized networks permits aggressive partitioning and asynchronous communication, dramatically reducing network contention. (2) A hybrid MPI‑OpenMP approach leverages both inter‑node message passing and intra‑node thread parallelism, maximizing core‑level memory bandwidth utilization. (3) High‑speed interconnects such as InfiniBand are crucial; they provide an order‑of‑magnitude advantage over conventional Ethernet, keeping communication overhead below 10 % even with billions of synapses. (4) While the current implementation supports plastic synapses with STDP, extending to more biophysically detailed models (e.g., multi‑ion channel dynamics) will require additional algorithmic and memory optimizations.
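The pair-based STDP mentioned above is commonly expressed as an exponentially decaying weight change in the pre/post spike-time difference. A minimal sketch; the amplitudes and time constants are illustrative, since the summary does not report the paper's actual plasticity parameters:

```python
import math

# Pair-based STDP kernel: weight change decays exponentially with the
# spike-time difference. All constants below are illustrative.

def stdp_dw(dt_ms, a_plus=0.01, a_minus=0.012,
            tau_plus=20.0, tau_minus=20.0):
    """Weight change for dt_ms = t_post - t_pre (milliseconds)."""
    if dt_ms >= 0:
        # pre fires before post: potentiation (LTP)
        return a_plus * math.exp(-dt_ms / tau_plus)
    # post fires before pre: depression (LTD)
    return -a_minus * math.exp(dt_ms / tau_minus)
```

Because the kernel depends only on locally observable spike times, each process can apply it to its resident synapses without extra communication, which is what keeps plasticity compatible with the partitioning scheme described above.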
The work is positioned within the European Human Brain Project (HBP), specifically the WaveScalEW team of Sub‑Project 2 (Cognitive and Systems Neuroscience). The DPSNN simulator, developed under the INFN‑led EURETILE and CORTICONIC FET projects, is intended to serve as a scalable backbone for large‑scale brain simulations. The authors outline future directions: a quantitative comparison with the widely used NEST simulator, integration of GPU acceleration, implementation of dynamic load‑balancing schemes, and exploration of energy‑efficiency metrics. By establishing robust scaling up to 1,024 cores, this study lays a solid foundation for the next generation of brain‑scale spiking neural network simulations, bridging the gap between computational neuroscience research and high‑performance computing.