An FPGA-based Torus Communication Network


We describe the design and FPGA implementation of a 3D torus network (TNW) that provides nearest-neighbor communications between commodity multi-core processors. The aim of this project is to build tightly interconnected, scalable parallel systems for scientific computing. The design includes the VHDL code to implement, on recent FPGA devices, a network processor that the CPU accesses through a PCIe interface and that controls the external PHYs of the physical links. A Linux driver and a library implementing custom communication APIs are also provided. The TNW has been successfully integrated into two recent parallel-machine projects, QPACE and AuroraScience. We describe some details of the porting of the TNW to the AuroraScience system and report performance results.


💡 Research Summary

The paper presents the design, FPGA implementation, and performance evaluation of a three‑dimensional torus network (TNW) intended to provide low‑latency, high‑bandwidth nearest‑neighbor communication for scientific parallel applications. The authors target workloads such as Lattice‑QCD and Lattice‑Boltzmann simulations, which require fine‑grained, regular data exchanges between adjacent processes. To meet these requirements while leveraging commodity multi‑core CPUs, they develop a custom network processor (NWP) that resides on a modern FPGA and connects to the host CPU via a PCI‑Express (PCIe) interface.

Each node’s FPGA hosts six external PHY transceivers, forming the six directions of a 3-D torus. The physical links operate at 10 Gbit/s; after accounting for 8b/10b encoding and a lightweight custom protocol, the effective per-direction bandwidth is 0.91 GB/s. Data is transmitted in packets consisting of a 32-bit header, a 128-byte payload, and a 32-bit CRC. The link logic guarantees strict ordering and reliability through positive (ACK) and negative (NACK) feedback, keeping each transmitted packet in an internal buffer until its acknowledgment is received.

The NWP provides injection and reception FIFOs for each of the six links and supports eight virtual channels per link, allowing multiple independent streams (e.g., different threads or cores) to share the same physical link without interference. Virtual‑channel tags travel with every packet, enabling the receiver to demultiplex streams correctly.

Two communication schemes are described. In the “Pput” scheme the CPU directly writes data into memory‑mapped injection buffers; on x86 systems this is realized with write‑combining (WC) buffers that coalesce 16‑byte stores into 64‑byte burst PCIe transactions. This approach yields low latency for short messages but requires the NWP to reorder potentially interleaved fragments caused by WC flushes. In the “Nget” scheme the NWP initiates DMA reads from the CPU’s memory, simplifying back‑pressure handling and allowing non‑blocking sends, at the cost of additional control complexity. The AuroraScience system currently uses the Pput scheme, while an experimental Nget implementation is also evaluated.

System software consists of a Linux kernel driver and a user‑space communication library. The driver maps the injection buffers as WC memory and allocates contiguous reception buffers for each virtual channel. The library exposes a simple API (tnwSend, tnwCredit, tnwPoll, etc.) that matches the Single‑Program‑Multiple‑Data (SPMD) model common in scientific codes; a typical pattern is to issue a credit on the “‑x” direction, then send data on the “+x” direction, and finally poll for completion.

Performance measurements on an Altera Stratix IV GX‑230 FPGA running at 250 MHz show that the Pput implementation achieves up to 0.83 GB/s per link (≈90 % of the theoretical maximum) with an average latency of 1.67 µs for 128‑byte messages. The Nget prototype reaches 0.76 GB/s and 2.16 µs latency. The internal NWP‑to‑NWP latency is about 0.6 µs (0.24 µs contributed by the PHYs). Adding the PCIe transaction overhead (~0.2 µs) yields a total software overhead of roughly 0.7 µs for Pput and 0.8 µs for Nget.

As an application test, a 2‑dimensional Lattice‑Boltzmann fluid dynamics code was ported to AuroraScience. On a 16‑node configuration the code achieved 36–39 % of the per‑node peak double‑precision performance (≈160 GFlops), demonstrating that the TNW can sustain realistic scientific workloads.

The authors conclude that the FPGA‑based torus network provides an efficient, scalable interconnect for tightly coupled parallel machines. The design has been successfully integrated into the QPACE and AuroraScience projects, and plans are underway to release the hardware description and software as open source. Future work includes a fully optimized Nget implementation, exploration of higher‑speed serial links, and extensions to support more general communication patterns beyond nearest‑neighbor exchanges.

