D-PDLP: Scaling PDLP to Distributed Multi-GPU Systems
We present a distributed framework for the Primal-Dual Hybrid Gradient (PDHG) algorithm to solve massive-scale linear programming (LP) problems. Although PDHG-based solvers demonstrate strong performance on single-node GPU architectures, their applicability to industrial-scale instances is often limited by the memory capacity and computational throughput of a single GPU. To overcome these challenges, we propose D-PDLP, the first Distributed PDLP framework, which extends PDHG to a multi-GPU setting via a practical two-dimensional grid partitioning of the constraint matrix. To improve load balance and computational efficiency, we introduce a block-wise random permutation strategy combined with nonzero-aware matrix partitioning. By distributing the intensive computation required in PDHG iterations, the proposed framework harnesses multi-GPU parallelism to achieve substantial speedups with relatively low communication overhead. Extensive experiments on standard LP benchmarks (including MIPLIB and Mittelmann instances) as well as huge-scale real-world datasets show that our distributed implementation, built upon cuPDLPx, achieves strong scalability and high performance while preserving full FP64 numerical accuracy.
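The load-balancing idea described above can be sketched in a few lines. The following is a simplified illustration, not the paper's implementation: the function name, the fixed block size, and the greedy cut placement are all assumptions; the paper's actual permutation and partitioning rules may differ.

```python
import numpy as np
import scipy.sparse as sp

def nnz_aware_row_partition(A, num_parts, block_size=64, seed=0):
    """Hypothetical sketch of block-wise random permutation followed by
    nonzero-aware partitioning of the rows of a sparse constraint matrix."""
    A = sp.csr_matrix(A)
    m = A.shape[0]
    rng = np.random.default_rng(seed)

    # Block-wise random permutation: shuffle whole row blocks rather than
    # individual rows, so local sparsity structure is largely preserved.
    blocks = [np.arange(s, min(s + block_size, m)) for s in range(0, m, block_size)]
    rng.shuffle(blocks)
    perm = np.concatenate(blocks)
    A_perm = A[perm, :]

    # Nonzero-aware partitioning: cut the permuted rows where the cumulative
    # nonzero count crosses multiples of nnz / num_parts, so each part
    # (one row strip per GPU) carries roughly equal SpMV work.
    row_nnz = np.diff(A_perm.indptr)          # nnz per row (CSR)
    cum = np.cumsum(row_nnz)
    target = cum[-1] / num_parts
    cuts = [int(np.searchsorted(cum, target * k)) for k in range(1, num_parts)]
    bounds = [0] + cuts + [m]
    parts = [np.arange(bounds[k], bounds[k + 1]) for k in range(num_parts)]
    return A_perm, parts
```

Shuffling block-wise rather than row-wise is a common compromise: it breaks up pathological row orderings that concentrate dense rows on one device, while keeping neighboring rows together for memory locality.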
💡 Research Summary
The paper introduces D‑PDLP, a distributed framework that extends the Primal‑Dual Hybrid Gradient (PDHG) algorithm to multi‑GPU systems for solving massive linear programming (LP) problems. While recent first‑order solvers such as PDLP, cuPDLP.jl, cuPDLP‑C, and cuPDLPx have demonstrated impressive performance on single‑GPU nodes, their scalability is fundamentally limited by the memory capacity and compute throughput of a single device. D‑PDLP addresses this bottleneck by decomposing the constraint matrix A into a two‑dimensional grid of blocks and distributing these blocks across a mesh of GPUs.
Two‑dimensional grid partitioning
The authors define a logical device mesh with |R| rows and |C| columns. Row partitions correspond to subsets of constraints (dual variables y) and column partitions correspond to subsets of primal variables x. Each GPU located at grid coordinate (i, j) stores the local sparse block A_ij.
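The mapping from a flat GPU rank to a mesh coordinate and its local block can be illustrated as follows. This is a minimal sketch under simplifying assumptions (row-major rank ordering and equal-sized strips; the paper's nonzero-aware cuts would replace the uniform ones), and the function name is ours:

```python
import numpy as np
import scipy.sparse as sp

def local_block(A, rank, grid_rows, grid_cols):
    """Illustrative sketch: map a flat GPU rank to 2D mesh coordinates
    (i, j) and slice out the local block A_ij that this device stores.
    Uses uniform strip boundaries for simplicity."""
    m, n = A.shape
    i, j = divmod(rank, grid_cols)  # row-major rank -> (i, j)
    row_cuts = np.linspace(0, m, grid_rows + 1, dtype=int)
    col_cuts = np.linspace(0, n, grid_cols + 1, dtype=int)
    return A[row_cuts[i]:row_cuts[i + 1], col_cuts[j]:col_cuts[j + 1]]
```

Under this layout, GPU (i, j) holds the primal slice x_j and dual slice y_i, computes the local products A_ij x_j and A_ij^T y_i each PDHG iteration, and the partial results are reduced along mesh rows and columns respectively.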