Latent Spherical Flow Policy for Reinforcement Learning with Combinatorial Actions
Reinforcement learning (RL) with combinatorial action spaces remains challenging because feasible action sets are exponentially large and governed by complex feasibility constraints, making direct policy parameterization impractical. Existing approaches embed task-specific value functions into constrained optimization programs or learn deterministic structured policies, sacrificing generality and policy expressiveness. We propose a solver-induced \emph{latent spherical flow policy} that brings the expressiveness of modern generative policies to combinatorial RL while guaranteeing feasibility by design. Our method, LSFlow, learns a \emph{stochastic} policy in a compact continuous latent space via spherical flow matching, and delegates feasibility to a combinatorial optimization solver that maps each latent sample to a valid structured action. To improve efficiency, we train the value network directly in the latent space, avoiding repeated solver calls during policy optimization. To address the piecewise-constant and discontinuous value landscape induced by solver-based action selection, we introduce a smoothed Bellman operator that yields stable, well-defined learning targets. Empirically, our approach outperforms state-of-the-art baselines by an average of 20.6% across a range of challenging combinatorial RL tasks.
💡 Research Summary
This paper tackles reinforcement learning (RL) in environments where each decision must satisfy complex combinatorial constraints, leading to an exponentially large discrete action space. Existing methods either embed a learned value function into a mixed‑integer program (requiring solver‑compatible network architectures) or learn deterministic structured policies (limiting exploration). The authors introduce LSFlow, a “solver‑induced latent spherical flow policy.” The core idea is to separate stochasticity from feasibility: a state‑conditional distribution πθ(c | s) is learned over cost vectors c that lie on the unit sphere S^{m‑1}. A standard combinatorial optimization (CO) solver then maps each sampled cost direction to a feasible binary action a* = arg min_{a∈A(s)} cᵀa. Because the linear objective is scale‑invariant, only the direction of c matters, justifying the spherical restriction and yielding a compact latent space.
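The solver-induced mapping from a cost direction to a structured action can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the feasible set is enumerated explicitly as a toy stand-in for a real CO solver, and all function names are illustrative.

```python
import numpy as np
from itertools import combinations

def sample_cost_direction(m, rng):
    """Draw a point uniformly on the unit sphere S^{m-1} (a stand-in for pi_theta(c|s))."""
    c = rng.standard_normal(m)
    return c / np.linalg.norm(c)

def solver_induced_action(c, feasible_actions):
    """Map a cost direction to the feasible action minimizing c^T a.
    Here A(s) is enumerated; a real CO solver would replace this argmin."""
    costs = feasible_actions @ c
    return feasible_actions[np.argmin(costs)]

rng = np.random.default_rng(0)
# Toy feasible set A(s): binary vectors selecting exactly 2 of 4 items.
A = np.array([[1 if i in combo else 0 for i in range(4)]
              for combo in combinations(range(4), 2)])
c = sample_cost_direction(4, rng)
a_star = solver_induced_action(c, A)
# Scale invariance of the linear objective: only the direction of c matters.
assert np.array_equal(a_star, solver_induced_action(2.5 * c, A))
```

The final assertion illustrates why restricting c to the sphere loses nothing: rescaling the cost vector never changes the argmin.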
The policy over the sphere is trained using spherical flow matching, a continuous‑time generative model that integrates a projected ODE d c_t/dt = Π_{c_t} vθ(c_t, s, t) where Π projects onto the tangent space. This provides a rich, multimodal distribution without enumerating actions.
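The projected ODE above can be integrated with a simple Euler scheme that projects the velocity onto the tangent space at each step and re-normalizes to stay on the sphere. A minimal sketch, with a hand-coded rotational velocity field standing in for the learned vθ(c, s, t):

```python
import numpy as np

def tangent_project(c, v):
    """Pi_c v = v - (c . v) c: project v onto the tangent space of the sphere at c."""
    return v - np.dot(c, v) * c

def integrate_spherical_flow(c0, v_field, n_steps=100):
    """Euler integration of dc/dt = Pi_c v(c, t), re-normalizing to correct drift."""
    c = c0 / np.linalg.norm(c0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        c = c + dt * tangent_project(c, v_field(c, t))
        c = c / np.linalg.norm(c)  # project back onto S^{m-1}
    return c

# Toy velocity field: unit-speed rotation in the (x, y) plane.
def v_field(c, t):
    return np.array([-c[1], c[0], 0.0])

c1 = integrate_spherical_flow(np.array([1.0, 0.0, 0.0]), v_field)
```

For this field the exact flow rotates the start point by one radian, so c1 should be close to (cos 1, sin 1, 0); the learned version conditions the velocity on the state s as well.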
To avoid costly end‑to‑end differentiation through the solver, LSFlow learns a critic Qφ(s, c) directly in the latent cost space. During policy improvement, proposals c₁ ∼ π_k(· | s) are re‑weighted by w(s, c₁) ∝ exp(λ Qφ(s, c₁)), so this step requires no solver calls. Minimizing the resulting weighted spherical flow‑matching loss is shown to be equivalent to a KL‑regularized policy update (a trust‑region step akin to PPO/TRPO).
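The exponential re-weighting of proposals can be sketched as a softmax over critic values. A minimal, self-contained illustration (the Q-values and λ below are hypothetical, not taken from the paper):

```python
import numpy as np

def improvement_weights(q_values, lam=1.0):
    """Normalized exponential weights w_i ∝ exp(lam * Q_i) over a batch of proposals.
    Subtracting the max before exponentiating keeps the computation stable."""
    logits = lam * (q_values - q_values.max())
    w = np.exp(logits)
    return w / w.sum()

q = np.array([1.0, 2.0, 0.5, 2.0])   # hypothetical Q_phi(s, c_i) for 4 proposals
w = improvement_weights(q, lam=2.0)
assert np.isclose(w.sum(), 1.0)
assert w[1] == w[3] > w[0] > w[2]    # higher-Q proposals receive larger weight
```

Used as sample weights in the flow-matching loss, these implement the exp(λQ)-tilted target distribution; λ controls how greedy the update is relative to the current policy π_k.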
A major technical challenge is that the solver induces a piecewise‑constant, discontinuous value landscape over the sphere, destabilizing Bellman backups. The authors address this by smoothing the cost‑space values with a von Mises–Fisher kernel Kκ(· | c), producing a smoothed value \tilde Q(s, c). They prove that this smoothed Bellman operator admits a unique fixed point and yields a continuous value function, improving learning stability.
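The vMF smoothing can be approximated with samples: weight nearby cost directions by exp(κ c·cᵢ) and average their values. A minimal Monte Carlo sketch (normalizing over the samples cancels the vMF constant; the piecewise-constant Q below is a toy stand-in for the solver-induced landscape):

```python
import numpy as np

def vmf_smoothed_value(c, samples, q_values, kappa=10.0):
    """Smooth Q over the sphere with a von Mises-Fisher kernel K_kappa(c_i | c) ∝ exp(kappa * c·c_i).
    Returns a sample-based estimate of the smoothed value at direction c."""
    logits = kappa * samples @ c
    logits -= logits.max()          # numerical stability
    w = np.exp(logits)
    w /= w.sum()
    return float(w @ q_values)

rng = np.random.default_rng(1)
samples = rng.standard_normal((256, 3))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)
# Piecewise-constant values, mimicking the discontinuity the solver induces.
q_values = np.where(samples[:, 0] > 0, 1.0, 0.0)
c = np.array([1.0, 0.0, 0.0])
q_smooth = vmf_smoothed_value(c, samples, q_values, kappa=5.0)
```

Because the kernel weights vary continuously with c, the smoothed estimate is continuous even though q_values jumps across the hemisphere boundary; the concentration κ trades off smoothing against bias, matching the bandwidth sensitivity noted in the limitations.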
Experiments on several public combinatorial RL benchmarks (routing, subset selection, scheduling) and a real‑world STI testing application demonstrate that LSFlow outperforms state‑of‑the‑art baselines by an average of 20.6% in cumulative reward, while requiring far fewer solver calls during training.
The paper’s contributions are: (1) a novel flow‑based stochastic policy that guarantees feasibility via a downstream solver; (2) an efficient training scheme that learns the critic in the latent cost space and uses vMF smoothing to handle solver‑induced discontinuities; (3) empirical evidence of superior performance and scalability. Limitations include the reliance on a potentially expensive solver and sensitivity to the smoothing bandwidth κ, suggesting future work on approximate solvers, parallelization, and multi‑objective cost vectors.