Computing Optimal Cycle Mean in Parallel on CUDA
Computation of the optimal cycle mean in a directed weighted graph has many applications in program analysis, performance verification in particular. In this paper we propose a data-parallel algorithmic solution to the problem and show how the computation of the optimal cycle mean can be efficiently accelerated by means of CUDA technology. We decompose the computation into a sequence of data-parallel graph primitives and show how these primitives can be implemented and optimized for CUDA. Finally, we report a fivefold experimental speed-up on graphs representing models of distributed systems when compared to the best sequential algorithms.
💡 Research Summary
The paper addresses the problem of computing the optimal cycle mean (OCM) in a directed weighted graph, a metric that captures the minimal average weight over all cycles and is widely used for performance analysis, worst‑case resource consumption estimation, and latency evaluation in systems modeled as Petri nets, process graphs, or data‑flow graphs. Recognizing that existing sequential algorithms (Karp, Howard, Young‑Tarjan, etc.) become impractical for graphs with millions of vertices and billions of edges, the authors propose a data‑parallel solution that leverages NVIDIA’s CUDA platform to exploit massive SIMD parallelism on modern GPUs.
The authors first formalize OCM: given a graph G = (V, E) with weight function w: E → ℝ, the mean of a cycle π is µ(π) = w(π)/|π|, and the optimal cycle mean is µ* = min_{π∈Z_G} µ(π). They introduce a parametric weight function w_Λ = w − Λ and show that a value Λ is feasible (no cycle has negative weight under w_Λ) if and only if Λ ≤ µ*. Consequently, µ* equals the largest feasible Λ; for a graph with several strongly connected components this is the Λ that remains feasible in every component, i.e., the minimum of the per‑component optima. This characterization enables a binary search over Λ combined with a feasibility test.
For parallelization, the authors evaluate three families of OCM algorithms—shortest‑path‑feasibility (SPF) based, cycle‑search based, and non‑linear‑optimization based—and select the SPF approach because its core operation is a “scan” that updates tentative distances for all vertices simultaneously. The scan can be expressed as a vector operation: for each edge e = (u, v) ∈ E, if π(u) + w(e) < π(v) then π(v) ← π(u) + w(e). This operation maps naturally to a CUDA kernel in which each thread processes one edge, so the whole edge set is examined in parallel.
GPU memory constraints dictate a compact graph representation. The authors store the graph in CSR format using two one‑dimensional arrays: A_i (size |V|+1) holds the start index of each vertex’s adjacency list in A_t (size |E|), which contains target vertex IDs. To support backward reachability required by many OCM algorithms, they also store a reverse edge list, effectively duplicating the edge set. Although forward‑only reachability can be simulated with additional logic, experiments showed prohibitive slowdown when restricting the scan to a subgraph with a single outgoing edge per vertex; thus both forward and backward arrays are kept.
The parallel OCM algorithm proceeds as follows:
- Initialize lower and upper bounds for Λ (e.g., min and max edge weights).
- While the interval is larger than a tolerance, set Λ = (low + high)/2 and compute w_Λ.
- Launch a CUDA kernel that performs the SPF scan on the entire graph using w_Λ. The kernel relaxes edges iteratively until no updates occur or a fixed iteration bound sufficient for convergence is reached.
- After the scan, a reduction determines whether any vertex still has a negative distance, indicating infeasibility of the current Λ.
- Adjust the binary‑search interval accordingly.
Key implementation details include:
- Using shared memory to cache a block of edges and reduce global memory traffic.
- Organizing threads into warps to ensure coalesced memory accesses and hide latency by switching warps.
- Minimizing kernel launch overhead by batching multiple scan iterations within a single kernel launch.
- Employing atomic operations only when necessary (e.g., to set a global “found” flag).
The authors integrate their CUDA OCM engine into the DIVINE model‑checking tool, extending its modeling language (DVE) to annotate transitions with cost and time. They evaluate the approach on realistic case studies, notably a client‑server PDF editor model, and on synthetic large sparse graphs. Compared with the best sequential algorithms, the CUDA implementation achieves an average speed‑up of 4.8× to 5.2× on graphs with up to several million vertices and tens of millions of edges, while staying within the memory limits of a typical GPU (3 GB of global memory). The experimental results also demonstrate that the parallel algorithm scales well with graph size, and that the overhead of data transfer between host and device is amortized by the massive parallel work.
In conclusion, the paper shows that OCM—traditionally considered a sequentially intensive problem—can be transformed into a data‑parallel scan suitable for GPUs. The work provides practical guidelines for graph layout, kernel design, and memory management on CUDA, and validates the approach on both synthetic benchmarks and a real‑world performance‑analysis scenario. Future directions suggested include multi‑GPU scaling, incremental updates for dynamic graphs, and extending the methodology to related metrics such as maximal cycle mean or cycle ratio problems.