NaNet: a flexible and configurable low-latency NIC for real-time trigger systems based on GPUs
NaNet is an FPGA-based PCIe x8 Gen2 NIC supporting 1/10 GbE links and the custom 34 Gbps APElink channel. The design has GPUDirect RDMA capabilities and features a network stack protocol offloading module, making it suitable for building low-latency, real-time GPU-based computing systems. We provide a detailed description of the NaNet hardware modular architecture. Benchmarks for latency and bandwidth for GbE and APElink channels are presented, followed by a performance analysis on the case study of the GPU-based low level trigger for the RICH detector in the NA62 CERN experiment, using either the NaNet GbE or APElink channels. Finally, we give an outline of future project activities.
💡 Research Summary
The paper presents NaNet, an FPGA‑based PCIe Gen2 x8 network interface card designed for ultra‑low‑latency, real‑time GPU‑accelerated trigger systems. NaNet integrates three distinct physical link technologies: standard 1 GbE, a future‑ready 10 GbE, and the custom 34 Gbps APElink (implemented with three QSFP+ lanes). Its core architecture inherits the Distributed Network Processor (DNP) from the APEnet+ 3‑D NIC, providing hardware support for Remote Direct Memory Access (RDMA) and, crucially, GPUDirect RDMA. This feature enables direct peer‑to‑peer transfers between the NIC and NVIDIA Fermi/Kepler GPUs without CPU involvement, dramatically reducing data‑movement latency.
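Why peer‑to‑peer matters can be seen with a toy latency budget: the staged path bounces data through host memory (NIC → host RAM → GPU), while GPUDirect RDMA removes one copy. The per‑hop cost below is an illustrative placeholder, not a figure from the paper:

```python
def transfer_latency_us(payload_bytes, link_bw_gbps, hops):
    """Toy model: wire time for the payload plus a fixed per-hop copy cost.

    `hops` counts memory-to-memory moves: a staged path
    (NIC -> host RAM -> GPU) has 2, a GPUDirect peer-to-peer
    path (NIC -> GPU) has 1. PER_HOP_COPY_US is an assumed
    illustrative overhead, not a measured value.
    """
    PER_HOP_COPY_US = 3.0
    wire_us = payload_bytes * 8 / (link_bw_gbps * 1e3)  # bits / (kbit/us)
    return wire_us + hops * PER_HOP_COPY_US

staged = transfer_latency_us(1472, 1.0, hops=2)  # full GbE UDP payload via host RAM
direct = transfer_latency_us(1472, 1.0, hops=1)  # same payload, peer-to-peer
print(f"staged: {staged:.1f} us, GPUDirect: {direct:.1f} us")
```

Whatever the absolute numbers, the model shows the structural point: the saving is a fixed per‑packet copy cost, which matters most for the small, frequent transfers typical of trigger traffic.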
The design is modular. The Physical Link Coding block uses Altera’s Triple Speed Ethernet MAC for GbE, while the APElink interface employs a proprietary word‑stuffing protocol. A UDP Offloader extracts payloads from incoming Ethernet frames in hardware, feeding a 32‑bit wide data stream that is repackaged into 128‑bit APEnet+ packets by the NaNet Controller. The Router multiplexes multiple logical channels, and the Network Interface (PCIe x8 Gen2) delivers packets to host or GPU memory. An embedded Nios II microcontroller handles configuration, virtual‑to‑physical address translation, and DMA initiation. Although the Nios II introduces a modest overhead (≈1.6 µs per packet), the authors plan to replace it with a dedicated Translation Lookaside Buffer (TLB) and address‑generation logic to eliminate this jitter.
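The repackaging step performed by the NaNet Controller can be sketched as grouping the UDP Offloader's 32‑bit word stream into 128‑bit APEnet+ data words. The field layout and header contents below are hypothetical stand‑ins, not the actual APEnet+ packet format:

```python
import struct

WORD_BYTES = 4          # UDP Offloader output: 32-bit words
PACKET_WORD_BYTES = 16  # APEnet+ data word: 128 bits

def repackage(payload: bytes):
    """Group a 32-bit-word stream into 128-bit words, zero-padding the tail.

    Returns (header, data_words); the header dict is a hypothetical
    stand-in for the real APEnet+ header/footer fields.
    """
    if len(payload) % WORD_BYTES:
        raise ValueError("offloader stream must be whole 32-bit words")
    pad = (-len(payload)) % PACKET_WORD_BYTES
    padded = payload + b"\x00" * pad
    words = [padded[i:i + PACKET_WORD_BYTES]
             for i in range(0, len(padded), PACKET_WORD_BYTES)]
    header = {"len_words": len(words), "payload_bytes": len(payload)}
    return header, words

hdr, words = repackage(struct.pack("<6I", *range(6)))  # six 32-bit words in
print(hdr["len_words"], len(words[0]))                 # -> 2 16
```

Six 32‑bit words (24 bytes) become two zero‑padded 128‑bit words, mirroring the width conversion the hardware performs in its datapath.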
Performance evaluation was carried out on a Supermicro 6016GT‑TF server equipped with dual Intel Xeon X5570 CPUs, an NVIDIA M2070 GPU, and the NaNet‑1 board. Using a loopback test, the authors measured bandwidth and latency for both GbE and APElink links. GbE achieved throughput near the link's theoretical maximum (1 Gbps, i.e. ≈125 MB/s) across a wide range of buffer sizes, with end‑to‑end latency (including GPU kernel execution for ring reconstruction) staying below 10 µs and exhibiting minimal jitter (<0.5 µs). APElink sustained ≈20 Gbps of data flow; the observed bandwidth plateau is attributed to the current RX path implementation rather than the physical link itself. APElink latency (≈7 µs) includes the Nios II address‑translation overhead but remains well within the stringent timing budget of real‑time triggers.
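These headline figures can be sanity‑checked with back‑of‑the‑envelope arithmetic; the link rates come from the text, and 1 GbE = 125 MB/s raw capacity holds by definition:

```python
def time_to_move_ms(buffer_mb, rate_gbps):
    """Wire time in ms to move `buffer_mb` megabytes at `rate_gbps`."""
    # (MB * 8) Mbit / (Gbps * 1000) Mbit/s, expressed in ms
    return buffer_mb * 8 / rate_gbps

gbe_raw_mb_s = 1e9 / 8 / 1e6  # 1 GbE raw capacity in MB/s
print(f"1 GbE raw capacity: {gbe_raw_mb_s:.0f} MB/s")                     # -> 125
print(f"1 MB at sustained APElink (20 Gbps): "
      f"{time_to_move_ms(1, 20):.2f} ms")                                 # -> 0.40
print(f"1 MB at physical APElink (34 Gbps): "
      f"{time_to_move_ms(1, 34):.3f} ms")
```

The gap between the 0.40 ms sustained and the shorter physical‑rate transfer time quantifies the headroom the authors expect to recover by reworking the RX path.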
The authors applied NaNet to a concrete use case: the Level‑0 (L0) trigger of the NA62 experiment's RICH detector. NA62's L0 stage must reduce a 10 MHz event rate to a manageable ≈1 MHz for the downstream software trigger levels. Traditional L0 hardware performs coarse cuts, while higher‑level triggers run on commodity PCs. By feeding raw RICH data directly into a GPU via NaNet, sophisticated pattern‑recognition algorithms can be executed at the L0 stage, improving selection purity without violating the ≤1 ms latency constraint. Benchmarks showed that both GbE and APElink channels meet the required latency and bandwidth, confirming the feasibility of a GPU‑centric L0 trigger.
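The scale of the L0 problem follows directly from the rate and the latency budget: at 10 MHz with decisions allowed to take up to 1 ms, the system must keep on the order of ten thousand events in flight at any moment:

```python
event_rate_hz = 10e6     # RICH L0 input rate from the text
latency_budget_s = 1e-3  # maximum allowed L0 decision latency

# Little's law: events buffered while decisions are pending
events_in_flight = event_rate_hz * latency_budget_s
print(int(events_in_flight))  # -> 10000
```

This is why per‑event jitter matters as much as mean latency: a tail of slow decisions inflates the required buffering across all ten thousand outstanding events.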
Future work includes the development of NaNet‑10, which will add native 10 GbE support using a Terasic Dual XAUI‑to‑SFP+ mezzanine and migrate to a Stratix V FPGA to exploit Gen3 PCIe (8 GB/s per direction). Additionally, NaNet‑3 is being designed for the KM3NeT neutrino telescope, featuring deterministic‑latency links based on 8B/10B encoding and Time Division Multiplexing (TDM). These extensions aim to broaden NaNet's applicability beyond high‑energy physics to any domain where deterministic, microsecond‑scale data movement to GPUs is essential, such as high‑frequency trading, real‑time video analytics, and edge‑cloud AI inference.
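The 8B/10B line coding behind NaNet‑3's deterministic‑latency links trades 20% of the raw line rate for DC balance and clock recovery; the effective payload bandwidth follows directly (the lane rate below is illustrative, not a NaNet‑3 specification):

```python
def payload_gbps(line_rate_gbps, data_bits=8, code_bits=10):
    """Effective data rate after 8B/10B line coding.

    Each 8 data bits are transmitted as a 10-bit code group,
    so only data_bits/code_bits of the line rate carries payload.
    """
    return line_rate_gbps * data_bits / code_bits

# Illustrative lane rate, chosen only to show the 20% coding overhead:
print(payload_gbps(2.5))  # -> 2.0
```

The fixed, data‑independent coding overhead is exactly what makes the latency of such links deterministic, unlike protocols with variable framing or retransmission.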
In summary, NaNet demonstrates that a carefully architected FPGA NIC with GPUDirect RDMA and network‑stack offloading can deliver the ultra‑low latency, high‑throughput, and deterministic performance required for modern real‑time GPU‑based trigger systems, while remaining flexible enough to accommodate emerging link standards and diverse application domains.