Accelerating radio astronomy imaging with RICK: a step towards SKA-Mid and SKA-Low
The data volumes generated by modern radio interferometers, such as the SKA precursors, pose significant computational challenges for imaging pipelines. To address the need for high-performance, portable, and scalable software, we present RICK 2.0 (Radio Imaging Code Kernels). This work introduces a novel implementation that leverages the HeFFTe library for distributed Fast Fourier Transforms, ensuring portability across diverse HPC architectures, including multi-core CPUs and accelerators. We validate RICK's correctness and performance against real observational data from both MeerKAT and LOFAR. Our results demonstrate that the HeFFTe-based implementation offers substantial performance advantages, particularly on GPUs, and scales effectively to large pixel counts and many frequency planes. This new architecture overcomes the critical scaling limitation identified in previous work (Paper II, Paper III), where communication overheads consumed up to 96% of the runtime because the entire grid had to be communicated. The new RICK version drastically reduces this communication cost, delivering a scalable and efficient imaging solution ready for the SKA era.
💡 Research Summary
The paper presents RICK 2.0, a next-generation imaging kernel suite designed to meet the extreme data-throughput demands of upcoming radio interferometers such as SKA-Mid and SKA-Low. Building on the earlier RICK implementations, the authors identify the dominant bottleneck in the previous versions: MPI communication, especially the all-reduce of the full u-v grid, which consumed up to 96% of runtime when processing large images or many frequency planes.
To overcome this limitation, RICK 2.0 introduces several major innovations. First, it adopts the HeFFTe library for distributed Fast Fourier Transforms. HeFFTe provides a portable, high‑performance FFT that runs efficiently on both CPUs and GPUs (CUDA or HIP), eliminating the need for separate cuFFT/cuFFTMp code paths and enabling seamless execution on heterogeneous HPC systems.
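The slab-and-transpose pattern that distributed FFT libraries like HeFFTe build on can be illustrated with a small serial sketch. This is plain Python standing in for HeFFTe's MPI backends, and the naive `dft1d` stands in for a tuned 1-D FFT (cuFFT, rocFFT, FFTW); only the data movement pattern is the point:

```python
import cmath

def dft1d(x):
    """Naive O(n^2) 1-D DFT, standing in for a tuned FFT backend."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def distributed_fft2d(grid, nranks):
    """2-D FFT via the slab/transpose pattern used by distributed FFT
    libraries such as HeFFTe: each 'rank' transforms the rows of its
    slab locally, the slabs are transposed (an all-to-all exchange in
    a real MPI run), and the columns are then transformed locally."""
    n = len(grid)
    rows_per_rank = n // nranks  # assume n divides evenly, for brevity
    # Stage 1: row FFTs on each rank's slab (purely local work).
    slabs = [[dft1d(grid[r]) for r in range(rank * rows_per_rank,
                                            (rank + 1) * rows_per_rank)]
             for rank in range(nranks)]
    rowfft = [row for slab in slabs for row in slab]
    # Stage 2: transpose = the only global data exchange (all-to-all).
    transposed = [[rowfft[r][c] for r in range(n)] for c in range(n)]
    # Stage 3: column FFTs, again purely local per rank.
    colfft = [dft1d(col) for col in transposed]
    # Transpose back to the original row-major layout.
    return [[colfft[c][r] for c in range(n)] for r in range(n)]
```

The key property, which HeFFTe exploits on both CPUs and GPUs, is that all floating-point work stays local to a rank; communication is confined to the transpose step.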
Second, the I/O subsystem is completely rewritten using MPI‑I/O. Parallel reads of Measurement Sets and parallel writes of final images are performed with MPI_File_read_at and MPI_File_write_at, guaranteeing scalability on both parallel and non‑parallel file systems and removing the previous reliance on shared‑window tricks.
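The offset arithmetic behind such independent `MPI_File_read_at` calls can be sketched as follows. The fixed-size record layout here is a hypothetical stand-in, not RICK's actual Measurement Set format; the point is that every rank derives its own byte offset with no coordination:

```python
def read_at_plan(nvis, bytes_per_vis, nranks):
    """Per-rank (byte_offset, record_count) pairs for independent
    MPI_File_read_at-style reads of a flat visibility table: each rank
    computes where its contiguous chunk starts, so ranks never need to
    synchronise or touch each other's bytes.
    (Illustrative sketch; the record layout is an assumption.)"""
    base, rem = divmod(nvis, nranks)
    plan, start = [], 0
    for rank in range(nranks):
        count = base + (1 if rank < rem else 0)      # spread the remainder
        plan.append((start * bytes_per_vis, count))  # byte offset, #records
        start += count
    return plan
```

Each rank would then issue one `MPI_File_read_at(fh, offset, buf, count, vis_type, &status)` with its own entry, which is what makes the pattern scale on both parallel and non-parallel file systems.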
Third, the authors replace the original one‑dimensional, uniform slab decomposition with a non‑uniform, Gaussian‑weighted domain decomposition along the v‑axis. Because visibility density is highest near the centre of the u‑v plane, this approach assigns finer slabs to dense regions and coarser slabs to sparse edges, balancing the workload across MPI ranks. Visibilities are bucket‑sorted by their v‑coordinates and redistributed using MPI_Sendrecv, ensuring that each rank holds only the data required for its local slab.
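A minimal sketch of such a Gaussian-weighted decomposition, in Python for illustration (the exact weighting function and cut-selection RICK uses are our assumption): slab boundaries along v are chosen so each rank receives an equal share of a Gaussian weight centred on the grid, and a bucket step then assigns each visibility to its owning rank before the `MPI_Sendrecv` redistribution.

```python
import math

def gaussian_slabs(n_v, nranks, sigma):
    """Slab boundaries along the v-axis such that each rank gets an
    equal share of a Gaussian weight centred on the grid middle --
    thin slabs where visibilities are dense, thick ones at the sparse
    edges. Returns nranks + 1 boundaries."""
    centre = (n_v - 1) / 2.0
    w = [math.exp(-0.5 * ((v - centre) / sigma) ** 2) for v in range(n_v)]
    target = sum(w) / nranks          # equal weighted mass per rank
    bounds, acc, cut = [0], 0.0, 1
    for v, wv in enumerate(w):
        acc += wv
        while cut < nranks and acc >= cut * target:
            bounds.append(v + 1)      # close a slab at this v-row
            cut += 1
    bounds.append(n_v)
    return bounds

def owner_of(v, bounds):
    """Destination rank of a visibility with integer v-coordinate:
    the bucket step that precedes the MPI_Sendrecv redistribution."""
    for rank in range(len(bounds) - 1):
        if bounds[rank] <= v < bounds[rank + 1]:
            return rank
    return len(bounds) - 2
```

On a 16-row grid with four ranks, the two central slabs come out markedly narrower than the two edge slabs, which is exactly the load-balancing effect described above.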
Fourth, the communication pattern is fundamentally altered. Instead of broadcasting the entire grid, each rank exchanges only its assigned slab, and the subsequent gridding, FFT, and w‑term correction steps are performed locally. This eliminates global synchronisation and reduces communication overhead to less than 5 % of total runtime even for 8192² pixel images with 64 frequency planes.
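A back-of-the-envelope comparison shows why exchanging only slabs matters at this scale. This is illustrative arithmetic, not a measurement from the paper: it assumes 16-byte complex-double grid cells and a hypothetical 64-rank run.

```python
def comm_volume_gib(npix, nplanes, nranks, bytes_per_cell=16):
    """Data volume (GiB) a rank is exposed to per reduction step:
    old pattern -- the FULL grid is reduced across all ranks;
    new pattern -- each rank exchanges only its own slab.
    Illustrative arithmetic only, with assumed cell size and ranks."""
    full_grid = npix * npix * nplanes * bytes_per_cell  # old: whole grid
    slab_only = full_grid // nranks                     # new: one slab
    gib = 1024 ** 3
    return full_grid / gib, slab_only / gib

full, slab = comm_volume_gib(8192, 64, 64)
# 8192^2 cells x 64 planes x 16 B = 64 GiB for the full grid,
# versus 1 GiB for a single slab on 64 ranks.
```

Shrinking the exchanged data by a factor equal to the rank count, while also removing the global synchronisation of an all-reduce, is what pushes communication below 5% of runtime.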
GPU acceleration is retained and extended. The code supports both CUDA and HIP, uses NCCL for GPU‑GPU transfers, and integrates the HeFFTe GPU kernels. Benchmarks on an 8‑node A100 cluster show a speed‑up of more than a factor of two compared with the previous RICK version, while maintaining image fidelity. Validation against real MeerKAT and LOFAR datasets demonstrates that RICK 2.0 produces images with comparable or slightly lower residuals than established tools such as WSClean and DDFacet. Energy‑to‑solution measurements indicate a >30% reduction when using GPUs.
In summary, RICK 2.0 delivers a portable, scalable, and high‑performance imaging pipeline that dramatically reduces communication costs, balances computational load, and fully exploits modern GPU architectures. The work positions RICK as a viable solution for the SKA era, where petabyte‑scale data streams must be processed in near‑real‑time. Future directions include extending the non‑uniform decomposition to two dimensions, incorporating dynamic load‑balancing, and exploring machine‑learning‑based visibility compression to further improve scalability.