From Sequential to Parallel: Reformulating Dynamic Programming as GPU Kernels for Large-Scale Stochastic Combinatorial Optimization

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

A major bottleneck in scenario-based Sample Average Approximation (SAA) for stochastic programming (SP) is the cost of solving an exact second-stage problem for every scenario, especially when each scenario contains an NP-hard combinatorial structure. This has led much of the SP literature to restrict the second stage to linear or simplified models. We develop a GPU-based framework that makes full-fidelity integer second-stage models tractable at scale. The key innovation is a set of hardware-aware, scenario-batched GPU kernels that expose parallelism across scenarios, dynamic-programming (DP) layers, and route or action options, enabling Bellman updates to be executed in a single pass over more than 1,000,000 realizations. We evaluate the approach in two representative SP settings: a vectorized split operator for stochastic vehicle routing and a DP for inventory reinsertion. The implementation scales nearly linearly in the number of scenarios and achieves speedups ranging from one to five orders of magnitude. This computational leverage directly improves decision quality: far larger scenario sets and many more first-stage candidates can be evaluated within fixed time budgets, consistently yielding stronger SAA solutions. Our results show that full-fidelity integer second-stage models are tractable at scales previously considered impossible, providing a practical path to large-scale, realistic stochastic discrete optimization.


💡 Research Summary

The paper tackles a fundamental bottleneck in scenario‑based Sample Average Approximation (SAA) for stochastic programming: the need to solve an exact, integer‑valued second‑stage problem for every scenario. When each scenario embeds an NP‑hard combinatorial structure (e.g., vehicle routing, inventory reinsertion), traditional CPU‑based solvers become computationally impractical beyond a few thousand scenarios, forcing researchers to simplify the second stage (e.g., via linear relaxations) and sacrifice solution quality.

The authors propose a GPU‑accelerated framework that reformulates dynamic programming (DP) recursions as min‑plus matrix‑vector products and executes them in a “scenario‑batched” fashion. The key technical contributions are:

  1. Transition‑Based Matrix Formulation – The Bellman recursion is expressed as Jₜ₊₁ = (Aₜ^ω)ᵀ ⊗ Jₜ, where ⊗ denotes the min‑plus (tropical) matrix‑vector product and Aₜ^ω contains the cost of every feasible state‑to‑state transition for scenario ω. Infeasible transitions are masked with +∞, preserving correctness while enabling regular tensor shapes.

  2. Multidimensional GPU Kernels – Three levels of parallelism are exploited simultaneously: (a) the scenario dimension (hundreds of thousands to millions of independent ω), (b) the predecessor‑state dimension within each DP stage, and (c) the set of alternative route or replenishment options associated with a transition. CUDA blocks are assigned to scenarios, warps to predecessor states, and a second reduction across options is performed when needed. The kernels use warp‑level reductions, shared‑memory tiling, and numerically safe masking to achieve high throughput.

  3. Two Representative Applications

    • Stochastic Vehicle Routing Split Operator (CVRPSD) – The DP splits a giant tour into capacity‑feasible routes under stochastic demand. The GPU implementation evaluates over 10⁶ demand realizations, delivering a 10‑ to 100‑fold speedup over a CPU baseline and up to 10⁴‑fold over a Gurobi MILP formulation.
    • Dynamic Stochastic Inventory Reinsertion Problem (DSIRP) – For each customer, the DP decides daily whether to deliver and how much, balancing holding, stock‑out, and routing costs. The transition matrix includes multiple routing alternatives, leading to a 3‑D parallelism pattern. The GPU achieves 10⁴‑10⁵‑fold acceleration, making previously intractable scenario sets solvable.
  4. Scalability and Memory Study – Empirical results show near‑linear scaling with the number of scenarios. A standard 11 GB GPU can store the full data structures for 10⁶ scenarios, confirming that memory is not the limiting factor.

  5. Impact on Decision Quality – Because the GPU enables evaluation of far more scenarios and far more first‑stage candidate solutions within a fixed time budget, SAA solutions consistently improve. The expected objective value decreases monotonically as scenario count rises, validating the theoretical benefit of larger sample sizes.

  6. General Recipe for GPU‑DP Conversion – The authors outline a systematic process: pad state spaces, construct masked transition tensors, implement warp‑level min reductions, and orchestrate kernel launches across DP stages. This recipe is applicable to a broad class of discrete stochastic optimization problems beyond the two case studies.
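The core of this recipe — masked transition tensors driven through scenario‑batched min‑plus Bellman updates — can be sketched in a few lines of array code. The following is a minimal NumPy illustration, not the authors' CUDA implementation; all shapes, sizes, and the random cost tensor are illustrative assumptions:

```python
import numpy as np

INF = np.inf
S = 4        # number of DP states (padded to a regular shape)
W = 3        # number of scenarios omega; in the paper this reaches 10^6
T = 5        # number of DP stages

rng = np.random.default_rng(0)

# A[w, i, j] = cost of the transition from state i to state j under
# scenario w. Infeasible transitions are masked with +inf, which the
# min-plus product treats as "no edge" while keeping tensor shapes regular.
A = rng.uniform(1.0, 10.0, size=(W, S, S))
A[:, :, 0] = INF          # example mask: no transition back into state 0

# J[w, i] = cost-to-go of state i under scenario w; all scenarios start
# in state 0 with zero cost.
J = np.full((W, S), INF)
J[:, 0] = 0.0

for t in range(T):
    # One Bellman update as a min-plus matrix-vector product, batched
    # over every scenario at once: J'[w, j] = min_i ( J[w, i] + A[w, i, j] ).
    J = np.min(J[:, :, None] + A, axis=1)

# SAA-style estimate: average the best terminal cost over all scenarios.
expected_cost = np.min(J, axis=1).mean()
```

On a GPU, the same three‑dimensional tensor expression maps naturally onto the block/warp decomposition described above: the `w` axis to blocks, the reduction over `i` to warp‑level min reductions.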

In summary, the paper introduces a novel “scenario‑batched GPU DP” paradigm that bridges the gap between realistic integer second‑stage models and large‑scale stochastic optimization. By exposing and exploiting hidden parallelism across scenarios, DP layers, and action options, the framework delivers orders‑of‑magnitude speedups, linear scalability, and tangible improvements in SAA solution quality, opening new avenues for high‑fidelity stochastic combinatorial optimization on modern hardware.
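As a concrete illustration of this paradigm applied to the first case study, a heavily simplified, scenario‑batched split operator can be written in vectorized form. The sketch below assumes a fixed giant tour, a toy route cost (fixed vehicle cost plus segment load), and a hard capacity; it is illustrative only and does not reproduce the paper's CVRPSD operator or its CUDA kernels:

```python
import numpy as np

INF = np.inf
W, n, Q = 1000, 6, 10.0    # scenarios, customers on the giant tour, capacity
FIXED = 5.0                # per-route fixed cost (illustrative)

rng = np.random.default_rng(1)
D = rng.uniform(1.0, 4.0, size=(W, n))   # stochastic demands, one row per scenario
# Prefix sums of demand, so any segment load is a difference of two entries.
P = np.concatenate([np.zeros((W, 1)), np.cumsum(D, axis=1)], axis=1)

# J[w, k] = cheapest cost of serving the first k customers of the tour
# under scenario w, where each route covers a contiguous segment (j+1..k).
J = np.full((W, n + 1), INF)
J[:, 0] = 0.0
for k in range(1, n + 1):
    j = np.arange(k)                          # candidate route start points
    load = P[:, [k]] - P[:, j]                # (W, k) segment loads
    cost = np.where(load <= Q, FIXED + load, INF)   # mask capacity violations
    J[:, k] = np.min(J[:, j] + cost, axis=1)  # batched Bellman update

saa_estimate = J[:, n].mean()   # SAA objective: average split cost over scenarios
```

All 1,000 scenario DPs advance in lockstep through the same stages, which is exactly the regularity the paper's kernels exploit when the scenario count grows to millions.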

