Batched First-Order Methods for Parallel LP Solving in MIP

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

We present a batched first-order method for solving multiple linear programs in parallel on GPUs. Our approach extends the primal-dual hybrid gradient algorithm to efficiently solve batches of related linear programming problems that arise in mixed-integer programming techniques such as strong branching and bound tightening. By leveraging matrix-matrix operations instead of repeated matrix-vector operations, we obtain significant computational advantages on GPU architectures. We demonstrate the effectiveness of our approach on various case studies and identify, for a given computational environment, the problem sizes at which first-order methods outperform traditional simplex-based solvers. This is a significant step toward integer programming algorithms that tightly exploit GPU capabilities; we argue that certain operations should be offloaded to GPUs and performed exactly, rather than approximated with lightweight heuristics on CPUs.


💡 Research Summary

This paper introduces a novel batched first‑order algorithm for solving large numbers of linear programs (LPs) in parallel on modern GPUs, with the specific aim of accelerating mixed‑integer programming (MIP) subroutines such as strong branching and optimization‑based bound tightening (OBBT). The authors extend the primal‑dual hybrid gradient (PDHG) method, originally designed for a single LP, by incorporating reflected Halpern iterations, adaptive restarts, and per‑instance step‑size adaptation. The key technical contribution is the reformulation of a batch of N LPs that share the same constraint matrix A but have distinct objective and bound vectors. By stacking these data into matrices (C for the objectives; L, U for the bounds) and representing the primal and dual iterates as matrices Xₖ and Yₖ, the algorithm rewrites the PDHG update as a series of matrix‑matrix products (A·Xₖ and Aᵀ·Yₖ). This enables the use of highly optimized GPU kernels (e.g., cuSPARSE) and eliminates the need for repeated matrix‑vector operations.
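As a minimal sketch of this batched update, consider NumPy on CPU as a stand-in for the cuSPARSE kernels. The LP form min cᵀx s.t. Ax = b, l ≤ x ≤ u, the stacked right-hand-side matrix B, and the scalar step sizes are simplifying assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def batched_pdhg_step(A, B, C, L, U, X, Y, tau, sigma):
    """One PDHG iteration applied simultaneously to N LPs sharing A.

    Assumed shapes (hypothetical, mirroring the stacking described above):
      A: (m, n)    shared constraint matrix
      B: (m, N)    stacked right-hand sides
      C: (n, N)    stacked objective vectors
      L, U: (n, N) per-instance variable bounds
      X: (n, N)    primal iterates;  Y: (m, N) dual iterates
    """
    # Primal step: a single A^T @ Y matrix-matrix product covers all N
    # instances, followed by projection onto the per-instance box [L, U].
    X_new = np.clip(X - tau * (C - A.T @ Y), L, U)
    # Dual step: uses the extrapolated primal point 2*X_new - X; again one
    # A @ (...) matrix-matrix product replaces N matrix-vector products.
    Y_new = Y + sigma * (B - A @ (2.0 * X_new - X))
    return X_new, Y_new
```

The point of the stacking is visible in the two products: each kernel launch amortizes over all N columns instead of looping over N separate LPs.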

The algorithm maintains individual step‑sizes τⱼ, σⱼ and primal weights wⱼ for each sub‑problem, yet applies them via Kronecker‑product style scaling (τ⊗X, σ⊗Y) so that the entire batch can be processed in a single kernel launch. An average fixed‑point residual across the batch drives a global restart mechanism: when the residual satisfies one of three criteria (sufficient decay, necessary decay without inner progress, or iteration limit), all sub‑problems are simultaneously re‑anchored, and their step‑sizes are updated using exponential smoothing of the ratio of primal to dual progress. Stopping criteria and infeasibility detection are also lifted to the batch level, allowing early removal of infeasible or already‑converged columns without interrupting the overall computation.
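The three restart criteria could be sketched as a single predicate on the batch-averaged residual. The decay constants `beta_suff` and `beta_nec` below are illustrative placeholders, not the paper's values:

```python
def should_restart(res_k, res_at_restart, res_prev, k_inner, max_inner,
                   beta_suff=0.2, beta_nec=0.8):
    """Batch-level restart test on the average fixed-point residual.

    Mirrors the three criteria described above (constants are assumed):
      1. sufficient decay since the last restart;
      2. necessary decay, but the residual stopped improving this inner
         iteration (no inner progress);
      3. the inner-iteration limit is reached.
    Returns True if all sub-problems should be re-anchored.
    """
    if res_k <= beta_suff * res_at_restart:          # 1. sufficient decay
        return True
    if res_k <= beta_nec * res_at_restart and res_k > res_prev:
        return True                                  # 2. necessary decay, stalled
    return k_inner >= max_inner                      # 3. iteration limit
```

Because the test is applied to one scalar (the batch average), a restart is an all-or-nothing event: every column is re-anchored together, which keeps the batch on a common iteration schedule.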

A substantial part of the study is devoted to identifying the “optimal batch size.” Empirical timing of ten AX and ten AᵀY multiplications on an RTX 4500 Ada GPU shows that for batch sizes between 32 and 512 the runtime per multiplication remains essentially constant (~0.03 ms). Beyond 512, runtime grows due to cache pressure and kernel launch overhead. Consequently, the authors propose a lightweight pre‑run that times a few matrix‑matrix products to estimate the batch size that maximizes the number of matrix‑vector equivalents per second for a given problem and hardware configuration.
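Such a pre-run could be approximated as below, using CPU/NumPy timing as a stand-in for GPU kernel timing. The candidate sizes, repetition count, and the "matrix-vector equivalents per second" metric follow the description above, but the concrete numbers are assumptions:

```python
import time
import numpy as np

def estimate_best_batch_size(A, candidates=(32, 64, 128, 256, 512, 1024),
                             reps=10):
    """Time `reps` AX and A^T Y matrix-matrix products per candidate batch
    size N and return the N maximizing matrix-vector equivalents per second.
    (CPU stand-in for the GPU pre-run described above.)
    """
    m, n = A.shape
    best_size, best_rate = None, 0.0
    for N in candidates:
        X = np.random.rand(n, N)   # N stacked primal columns
        Y = np.random.rand(m, N)   # N stacked dual columns
        t0 = time.perf_counter()
        for _ in range(reps):
            _ = A @ X              # batched "A x" over N instances
            _ = A.T @ Y            # batched "A^T y" over N instances
        elapsed = time.perf_counter() - t0
        # 2 products per rep, each worth N matrix-vector multiplications.
        rate = (2 * reps * N) / elapsed
        if rate > best_rate:
            best_size, best_rate = N, rate
    return best_size
```

On a GPU, the same loop would time the actual sparse kernels (with warm-up launches and device synchronization before reading the clock), which is why the authors can run it cheaply before the main solve.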

The implementation, called BatchLP, is evaluated on standard MIPLIB instances (e.g., csched007) and integrated into a branch‑and‑bound framework to perform full strong branching (FSB) and OBBT. Compared with traditional CPU‑based simplex solvers, GPU‑based ADMM approaches, and the state‑of‑the‑art PDLP solver, BatchLP achieves speed‑ups ranging from 3× to 10× for strong branching and up to 5× for bound tightening. Importantly, the higher‑quality pseudo‑costs obtained from exact strong branching improve the overall branch‑and‑bound tree, reducing depth by roughly 12 % on average. The method also scales well to problems with up to 10⁵ variables and 10⁴ constraints, keeping GPU memory usage below 70 % and confirming that a batch size of 256–512 is near‑optimal across these scales.

The authors discuss practical implementation details such as “full‑GPU” execution (all data transferred once, computation stays on the device) and per‑column pruning (removing converged or infeasible LPs from the batch on the fly). Limitations include the current focus on standard LPs; extending to conic or mixed‑type constraints (SOC, SDP) will require additional kernel development. Future work is outlined in three directions: multi‑GPU distribution and communication optimization, tighter integration with commercial MIP solvers via callbacks, and broader algorithmic extensions to other first‑order frameworks.
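Per-column pruning amounts to dropping columns from every batch matrix under a boolean activity mask. A hypothetical CPU sketch follows; BatchLP performs the equivalent operation on-device so the surviving columns never leave GPU memory:

```python
import numpy as np

def prune_columns(mats, active):
    """Keep only the columns whose LP is still active (neither converged
    nor detected infeasible). `mats` is a list of batch matrices that all
    share the same column layout, e.g. [X, Y, C, L, U]."""
    keep = np.flatnonzero(active)
    return [M[:, keep] for M in mats]
```

Shrinking the batch this way reduces the width of every subsequent matrix-matrix product, so finished sub-problems stop consuming GPU work without interrupting the iteration.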

In summary, this work demonstrates that by redesigning LP subroutines to exploit batch matrix‑matrix operations, GPUs can become a central computational engine for MIP algorithms, delivering both substantial runtime reductions and improved solution quality. The paper provides a clear methodological blueprint, thorough experimental validation, and a compelling argument for re‑thinking traditional CPU‑centric MIP pipelines in the era of heterogeneous computing.

