A GPU based real-time software correlation system for the Murchison Widefield Array prototype
Modern graphics processing units (GPUs) are inexpensive commodity hardware that offer Tflop/s theoretical computing capacity. GPUs are well suited to many compute-intensive tasks including digital signal processing. We describe the implementation and performance of a GPU-based digital correlator for radio astronomy. The correlator is implemented using the NVIDIA CUDA development environment. We evaluate three design options on two generations of NVIDIA hardware. The different designs utilize the internal registers, shared memory and multiprocessors in different ways. We find that optimal performance is achieved with the design that minimizes global memory reads on recent generations of hardware. The GPU-based correlator outperforms a single-threaded CPU equivalent by a factor of 60 for a 32 antenna array, and runs on commodity PC hardware. The extra compute capability provided by the GPU maximises the correlation capability of a PC while retaining the fast development time associated with using standard hardware, networking and programming languages. In this way, a GPU-based correlation system represents a middle ground in design space between high performance, custom built hardware and pure CPU-based software correlation. The correlator was deployed at the Murchison Widefield Array 32 antenna prototype system where it ran in real-time for extended periods. We briefly describe the data capture, streaming and correlation system for the prototype array.
💡 Research Summary
The paper presents a complete design, implementation, and performance evaluation of a real‑time digital correlator for the 32‑antenna prototype of the Murchison Widefield Array (MWA), built on commodity graphics processing units (GPUs). Traditional correlators rely on custom ASICs or FPGAs, which are expensive, slow to develop, and inflexible, while pure CPU‑based software correlators cannot keep up with the O(N²) computational load of modern arrays. The authors therefore explore a middle ground: a software correlator that exploits the massive parallelism of modern GPUs.
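The O(N²) load comes from the number of antenna pairs that must be cross‑multiplied: an N‑antenna array has N(N−1)/2 unique baselines. A quick plain‑Python sketch of that growth (for illustration only, not code from the paper):

```python
# Illustration of the O(N^2) correlation load: the number of unique
# cross-correlation baselines grows quadratically with array size.

def unique_baselines(n_antennas: int) -> int:
    """Number of unique antenna pairs, excluding autocorrelations."""
    return n_antennas * (n_antennas - 1) // 2

for n in (32, 128, 512):
    print(n, "antennas ->", unique_baselines(n), "baselines")
# 32 antennas -> 496 baselines
# 128 antennas -> 8128 baselines
# 512 antennas -> 130816 baselines
```

The 32‑antenna prototype thus sits at 496 baselines, while the full 128‑antenna MWA already requires over sixteen times the correlation work.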
The system architecture consists of three main components. First, a data capture and streaming module receives 8‑bit complex samples from each antenna at a 2 MHz bandwidth and forwards them over a 10 GbE link to a standard PC. Second, the core correlator is written in NVIDIA’s CUDA language and runs on a GPU. Three distinct design strategies are investigated: (A) a naïve implementation that reads every sample directly from global memory for each multiplication, (B) a version that stages data in shared memory to increase reuse within a thread block, and (C) an optimized version that pre‑loads data into registers, minimizes global memory traffic, and carefully balances occupancy and register pressure. Third, a post‑processing pipeline stores the visibility matrices and provides basic visualization.
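The operation at the heart of all three design strategies is the same complex multiply‑accumulate that forms each visibility. The sketch below is a minimal plain‑Python reference of that computation, not the authors' CUDA code; on the GPU, designs A, B, and C compute exactly this sum but differ in whether samples are fetched repeatedly from global memory, staged in shared memory, or held in registers:

```python
# Reference correlation: for each antenna pair (i, j), accumulate
#   V_ij = sum over t of s_i(t) * conj(s_j(t))
# This plain-Python version mirrors the arithmetic of the GPU designs;
# the GPU variants differ only in the memory tier used to stage samples.

def correlate(samples):
    """samples: list of per-antenna sequences of complex voltage samples.
    Returns a dict mapping (i, j) with i <= j to the accumulated visibility."""
    n = len(samples)
    vis = {}
    for i in range(n):
        for j in range(i, n):  # include autocorrelations (i == j)
            acc = 0 + 0j
            for si, sj in zip(samples[i], samples[j]):
                acc += si * sj.conjugate()  # complex multiply-accumulate
            vis[(i, j)] = acc
    return vis
```

For example, two antennas with single samples 1+1j and 1−1j yield autocorrelations of 2 and a cross‑correlation of 2j.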
Performance tests were carried out on two generations of NVIDIA hardware: a Fermi‑based GTX 480 and a Kepler‑based GTX 780. For a 32‑antenna array (496 unique baselines) the authors measured throughput, memory‑bandwidth usage, and power consumption. The register‑heavy design (C) consistently outperformed the other two, especially on the newer Kepler GPU, where global memory latency is higher relative to compute capability. On the Kepler device the correlator achieved a speed‑up of roughly 60× compared with a single‑threaded CPU reference implementation, processing a full 1‑second integration in under 30 ms. Global memory reads were reduced by more than 80% relative to design A, eliminating the primary bottleneck and allowing the GPU to sustain the real‑time input data rate without buffer overflow.
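Using only the figures quoted in this summary (32 antennas, 2 MHz complex bandwidth), the sustained correlation workload can be estimated with back‑of‑the‑envelope arithmetic; the single‑polarization assumption below is ours, not a figure from the paper:

```python
# Rough estimate of the correlation workload implied by the summary's
# figures. Assumes one polarization per antenna (our assumption, for
# illustration); each complex multiply-accumulate costs ~8 real FLOPs.

N = 32
SAMPLE_RATE = 2_000_000                 # complex samples/s per antenna (2 MHz)

cross_baselines = N * (N - 1) // 2      # 496 unique antenna pairs
pairs = cross_baselines + N             # plus 32 autocorrelations -> 528
cmacs_per_s = pairs * SAMPLE_RATE       # complex multiply-accumulates per second
flops = cmacs_per_s * 8                 # ~8 real floating-point ops per CMAC

print(cross_baselines, "baselines")     # 496 baselines
print(cmacs_per_s, "CMACs/s")           # 1056000000 CMACs/s
print(flops / 1e9, "GFLOP/s")           # ~8.45 GFLOP/s
```

A sustained workload on the order of 10 GFLOP/s is modest for a GPU but already stresses a single‑threaded CPU, which is consistent with the roughly 60× speed‑up reported above.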
The authors deployed the system on the actual MWA 32‑antenna prototype and operated it continuously for several hours. The streaming pipeline delivered data without loss, the GPU processed the visibilities in real time, and the system recovered gracefully from software restarts in a matter of minutes. This real‑world validation demonstrates that a commodity PC equipped with a modern GPU can replace much more expensive dedicated hardware while retaining the rapid development cycle associated with high‑level programming languages and standard networking stacks.
In the discussion, the paper highlights the advantages of the GPU approach: low capital cost, short development time, scalability (additional GPUs can be added to handle larger arrays), and relatively modest power consumption compared with FPGA farms. Limitations include the finite on‑board memory of a single GPU, the need for careful memory‑access pattern optimization, and the fact that future arrays such as the full 128‑antenna MWA or the SKA‑Low will demand multi‑GPU clusters and high‑speed interconnects (NVLink, InfiniBand) to keep latency low.
Overall, the work establishes that GPU‑based software correlators occupy a valuable niche between custom hardware correlators and pure CPU solutions. By demonstrating a 60× speed‑up on a real astronomical instrument, the authors provide a compelling case that future large‑scale low‑frequency radio telescopes can achieve high‑performance real‑time correlation with off‑the‑shelf hardware, reducing both financial and engineering barriers while preserving flexibility for algorithmic upgrades.