Architectural improvements and 28 nm FPGA implementation of the APEnet+ 3D Torus network for hybrid HPC systems


Modern Graphics Processing Units (GPUs) are now considered accelerators for general-purpose computation. Tight interaction between the GPU and the interconnection network is the strategy for expressing the full potential of a multi-GPU system on capability-computing workloads in large HPC clusters; that is why an efficient and scalable interconnect is a key technology for finally delivering GPUs to scientific HPC. In this paper we present the latest architectural and performance improvements of the APEnet+ network fabric, an FPGA-based PCIe board with six fully bidirectional off-board links providing 34 Gbps of raw bandwidth per direction, and x8 Gen2 bandwidth towards the host PC. The board implements a Remote Direct Memory Access (RDMA) protocol that leverages the peer-to-peer (P2P) capabilities of Fermi- and Kepler-class NVIDIA GPUs to obtain true zero-copy, low-latency GPU-to-GPU transfers. Finally, we report on the development activities for 2013, focusing on the adoption of latest-generation 28 nm FPGAs and the preliminary tests performed on this new platform.


💡 Research Summary

The paper presents a comprehensive update on the APEnet+ interconnect, a custom FPGA‑based PCIe expansion card designed for hybrid CPU‑GPU high‑performance computing clusters. APEnet+ provides six fully bidirectional off‑board links (34 Gbps raw per direction) and an x8 Gen2 PCIe interface, enabling a three‑dimensional torus topology. The authors focus on three major architectural refinements introduced in 2013 and on the migration to 28 nm FPGA technology.

First, the PCIe subsystem suffered from low effective bandwidth because only a single DMA engine could issue requests, leaving the link idle for long periods while waiting for completions. Adding a second, concurrent DMA engine and a prefetchable command queue lets multiple outstanding transactions overlap, reducing total transaction time by up to 40%. Second, virtual‑to‑physical address translation, previously performed by an embedded Nios II processor, had become a bottleneck on the receive path. The team implemented a hardware Translation Lookaside Buffer (TLB) inside the FPGA that caches a limited number of page entries; on a page hit the Nios II is bypassed, delivering up to a 60% bandwidth increase on synthetic benchmarks. Third, the off‑board transceivers were pushed toward their frequency limits. Although the Altera (now Intel) transceivers were conservatively run at 7 Gbps for reliability, signal‑integrity work allowed operation at higher speeds, achieving a per‑channel efficiency of 0.784 and an aggregate raw bandwidth of roughly 2.6 GB/s (with a memory footprint of ≈40 KB per channel).
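
The hardware TLB described above caches recently used page translations so that only misses fall back to the slow Nios II firmware walk. The following is a minimal software sketch of that idea, assuming a 4 KB page granularity and LRU replacement; the class name, capacity, and slow-path callback are illustrative, not the actual FPGA design.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page granularity (bytes)

class TinyTLB:
    """Toy model of a fixed-size TLB with LRU replacement.

    On a hit, the cached physical page is returned directly (fast path,
    analogous to bypassing the Nios II); on a miss, the caller-supplied
    slow-path translator is invoked and the result cached.
    """
    def __init__(self, capacity, slow_translate):
        self.capacity = capacity
        self.slow_translate = slow_translate  # virtual page -> physical page
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpage)       # refresh LRU position
            ppage = self.entries[vpage]
        else:
            self.misses += 1
            ppage = self.slow_translate(vpage)    # slow firmware-like path
            self.entries[vpage] = ppage
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least-recently used
        return ppage * PAGE_SIZE + offset
```

In this model, repeated transfers touching the same pages pay the slow-path cost only once per page, which is the mechanism behind the reported receive-path bandwidth gain.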

Performance measurements demonstrate that the refined APEnet+ outperforms commercial InfiniBand for small‑message GPU‑to‑GPU transfers. With peer‑to‑peer (P2P) enabled, round‑trip latency is about 8.2 µs for messages up to 128 KB, compared with 16.8 µs without P2P and 17.4 µs for InfiniBand on the same platform. Bandwidth tests show that CPU‑memory reads/writes and GPU‑to‑CPU writes reach the link limit of ~2.2 GB/s, while GPU‑memory reads are limited by the GPU’s internal memory subsystem. These results confirm that APEnet+ excels in low‑latency, GPU‑direct communication scenarios.
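
The latency figures quoted above translate into roughly a 2x improvement for P2P-enabled APEnet+ over both the staged path and InfiniBand; a quick check of that arithmetic:

```python
# Round-trip latencies for small-message GPU-to-GPU transfers,
# as reported in the summary (microseconds).
lat_p2p = 8.2      # APEnet+ with GPUDirect P2P enabled
lat_no_p2p = 16.8  # APEnet+ without P2P (staging through host memory)
lat_ib = 17.4      # InfiniBand on the same platform

print(f"P2P vs. no-P2P speedup:      {lat_no_p2p / lat_p2p:.2f}x")
print(f"P2P vs. InfiniBand speedup:  {lat_ib / lat_p2p:.2f}x")
```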

The paper also introduces a fault‑awareness mechanism called LO|FA|MO (LOcal FAult MOnitor). Each node hosts a lightweight set of watchdog status registers (host status, APEnet+ status, and neighbor status). A periodic software watchdog refreshes these registers; if a node fails to refresh its register, neighboring APEnet+ modules detect the fault and propagate diagnostic messages across the 3‑D torus. Global fault awareness is reached within a time dominated by the watchdog period (e.g., 0.9 s for a 500 ms period), without impacting data‑transfer latency.
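
The heartbeat logic underlying this scheme can be sketched as follows. This is a hedged toy model, not the real register layout: a host-side watchdog bumps a counter each period, and a neighbor deems the node faulty if the counter has not advanced since its last check.

```python
WATCHDOG_PERIOD_S = 0.5  # example watchdog period from the paper (seconds)

class NodeStatus:
    """Toy model of a LO|FA|MO-style watchdog register.

    Field and method names are illustrative; the real design keeps
    host, APEnet+, and neighbor status in dedicated registers.
    """
    def __init__(self):
        self.host_counter = 0  # refreshed by the host software watchdog
        self.last_seen = -1    # neighbor's snapshot from its previous check

    def refresh(self):
        """Called periodically by the host-side watchdog."""
        self.host_counter += 1

    def neighbor_check(self):
        """Called by a neighboring node once per period.

        Returns True if the node refreshed its counter since the last
        check (alive), False otherwise (suspected fault).
        """
        alive = self.host_counter != self.last_seen
        self.last_seen = self.host_counter
        return alive
```

A missed refresh is noticed on the very next neighbor check, so global awareness is bounded by a small multiple of the watchdog period, consistent with the 0.9 s figure quoted for a 500 ms period.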

Looking ahead, the authors describe the design of a next‑generation board based on the Altera Stratix V GX 28 nm FPGA. The new design migrates the PCIe interface to Gen3 (8 Gbps per lane, 128/130‑bit encoding, <1 % protocol overhead) yielding a theoretical host bandwidth of ~7.9 GB/s. To sustain this rate, the internal data path is widened to 256 bits operating at 250 MHz and is built around the AXI4 protocol, paving the way for future integration of embedded ARM hard IP cores available in higher‑end Stratix devices. Off‑board links are upgraded to 56 Gbps using QSFP+ connectors; preliminary tests on a Stratix V development board achieved 11.3 Gbps per lane (45.2 Gbps per channel) with 40 Gbps‑rated cables, indicating a clear path toward the target speed.
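
The ~7.9 GB/s and 45.2 Gbps figures follow directly from the stated link parameters. A quick sanity check of both, assuming an x8 Gen3 host link and the standard 4-lane QSFP+ channel:

```python
# PCIe Gen3 x8 host bandwidth: 8 Gbps/lane with 128b/130b encoding.
lanes = 8
gen3_lane_gbps = 8.0
encoding = 128 / 130  # 128b/130b line-code efficiency (~1.5% overhead)

host_gbps = lanes * gen3_lane_gbps * encoding
print(f"Theoretical x8 Gen3 host bandwidth: {host_gbps / 8:.2f} GB/s")

# Off-board channel: 4 transceiver lanes per QSFP+ connector,
# each measured at 11.3 Gbps on the Stratix V dev board.
qsfp_lanes = 4
tested_lane_gbps = 11.3
print(f"Measured channel rate: {qsfp_lanes * tested_lane_gbps:.1f} Gbps")
```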

Finally, the paper reports on real‑world deployments of APEnet+ within the QUonG hybrid cluster (a 4 × 4 × 1 torus) and on several scientific applications, including spiking neural network simulations, breadth‑first search on graphs, 3‑D Heisenberg spin‑glass models, and high‑energy‑physics GPU‑stream processing. The integration of fault‑monitoring, low‑latency GPU‑direct RDMA, and the planned high‑speed Gen3/56 Gbps interfaces positions APEnet+ as a cost‑effective, scalable alternative to commercial interconnects for future exascale GPU‑centric HPC systems.

