A Multi-Stage CUDA Kernel for Floyd-Warshall
We present a new implementation of the Floyd-Warshall All-Pairs Shortest Paths algorithm on CUDA. Our algorithm runs approximately 5 times faster than the previously best reported algorithm. In order to achieve this speedup, we applied a new technique to reduce usage of on-chip shared memory and allow the CUDA scheduler to more effectively hide instruction latency.
💡 Research Summary
The paper introduces a novel multi‑stage CUDA kernel for the Floyd‑Warshall all‑pairs shortest‑paths (APSP) algorithm that achieves roughly a five‑fold speed‑up over the previously best reported GPU implementations. The authors begin by analyzing the memory hierarchy and scheduling behavior of modern NVIDIA GPUs, identifying two primary bottlenecks in existing approaches: (1) heavy reliance on on‑chip shared memory, which limits the size of tiles that can be processed per thread block, and (2) insufficient work per block to keep the CUDA scheduler busy, leading to poor latency hiding when the algorithm is memory‑bound.
To address these issues, the paper proposes three interlocking techniques. First, a “multi‑k” strategy processes several intermediate vertices (the k‑dimension of Floyd‑Warshall) within a single kernel launch. Instead of loading the N×N distance matrix into shared memory for each k, the algorithm tiles the matrix into 2‑D blocks, loads each tile once, and then iterates over a small batch of consecutive k values while the data remain in shared memory. Loop unrolling and careful register allocation allow each thread to update multiple entries per iteration, dramatically increasing the arithmetic intensity of each block.
Second, the authors minimize shared‑memory traffic by caching the rows and columns that each thread works on in registers and by exploiting the L1 and texture caches for repeated reads. This reduces global‑memory bandwidth pressure and eliminates many of the bank conflicts that plague earlier tiled implementations. The paper provides detailed pseudo‑code showing how the register‑resident data are staged, how boundary conditions are handled, and how memory coalescing is preserved despite the more complex update pattern.
Third, the kernel is designed to maximize occupancy and enable the CUDA scheduler to hide instruction latency. By increasing the amount of work per block (thanks to the multi‑k batch) the number of active warps per streaming multiprocessor (SM) rises, allowing the scheduler to swap warps when one encounters a memory stall. The authors also launch multiple streams that overlap kernel execution with data transfers, ensuring that the GPU remains fully utilized throughout the computation.
Experimental evaluation is performed on two contemporary GPUs: an RTX 4090 (Ada Lovelace) and an RTX 3080 (Ampere). The authors test graph sizes ranging from 1,024 to 8,192 vertices and vary edge density from sparse to dense. Compared against the state‑of‑the‑art "Blocked Floyd‑Warshall" and "Tiled Shared‑Memory" kernels, the new implementation achieves an average speed‑up of 4.8×, with a peak of 5.3× on the largest dense graphs. The performance gain is most pronounced for large N, where memory bandwidth is the dominant constraint; in these cases the kernel reaches over 90% of the theoretical peak FLOP throughput of the GPU. For very small graphs (N ≤ 1,024) the overhead of the multi‑k batching reduces the relative improvement to about 2–3×, which the authors attribute to fixed launch costs outweighing the benefits of increased arithmetic intensity.
The paper’s contributions can be summarized as follows: (1) a multi‑k tiled algorithm that drastically reduces shared‑memory usage while increasing per‑block computational work, (2) a register‑centric data‑reuse scheme that leverages L1/texture caches to alleviate global‑memory bandwidth pressure, and (3) a scheduler‑friendly kernel layout that maximizes occupancy and enables effective latency hiding. The authors argue that these ideas are not limited to Floyd‑Warshall; any dynamic‑programming or matrix‑multiplication‑style algorithm that exhibits a three‑nested‑loop structure could be adapted to benefit from the same techniques.
In the discussion, the authors outline future directions, including scaling the approach to multi‑GPU systems via peer‑to‑peer communication, integrating an auto‑tuning framework to select the optimal batch size of k values for a given hardware configuration, and porting the kernel to alternative GPU architectures such as AMD’s RDNA series. They also suggest exploring hybrid CPU‑GPU pipelines where the multi‑k kernel handles the bulk of the computation while the CPU performs preprocessing or post‑processing steps.
Overall, the paper delivers a compelling combination of algorithmic insight and low‑level CUDA engineering, demonstrating that careful management of shared memory, registers, and scheduler behavior can unlock substantial performance gains for classic graph algorithms on modern GPUs.