Black-Box Combinatorial Optimization with Order-Invariant Reinforcement Learning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We introduce an order-invariant reinforcement learning framework for black-box combinatorial optimization. Classical estimation-of-distribution algorithms (EDAs) often rely on learning explicit variable dependency graphs, which can be costly and fail to capture complex interactions efficiently. In contrast, we parameterize a multivariate autoregressive generative model trained without a fixed variable ordering. By sampling random generation orders during training, a form of information-preserving dropout, the model is encouraged to be invariant to variable order; this promotes search-space diversity and shapes the model to focus on the most relevant variable dependencies, improving sample efficiency. We adapt Group Relative Policy Optimization (GRPO) to this setting, providing stable policy-gradient updates from scale-invariant advantages. Across a wide range of benchmark algorithms and problem instances of varying sizes, our method frequently achieves the best performance and consistently avoids catastrophic failures.


💡 Research Summary

The paper introduces a novel framework for black‑box combinatorial optimization that unifies ideas from Estimation‑of‑Distribution Algorithms (EDAs) and modern reinforcement learning (RL). Traditional discrete EDAs rely on explicit dependency graphs (e.g., Bayesian networks) to factorize the joint distribution of solution variables. Learning such graphs is computationally expensive (often NP‑hard) and can be brittle when only a limited number of samples are available, especially in high‑dimensional combinatorial spaces.

To overcome these limitations, the authors propose to replace the explicit graph with a neural autoregressive generative model whose parameters are shared across all possible variable orderings. During training, a random permutation of the variable indices is sampled for each episode; the model then predicts the next variable conditioned on the already‑selected subset. This “order‑invariant” training acts as an information‑preserving dropout: the model must succeed regardless of which subset of conditioning variables it sees, thereby focusing on the most salient dependencies while avoiding over‑fitting to transient patterns. The generative policy π_θ is implemented as a collection of multilayer perceptrons (one per variable) that output either a sigmoid (binary case) or a softmax (categorical case).
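The random-order sampling loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `logits_fn` is a hypothetical stand-in for the learned per-variable MLPs, and only the binary (sigmoid) case is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_solution(n_vars, logits_fn):
    """Sample a binary solution under a random generation order.

    `logits_fn(i, partial, mask)` stands in for a learned per-variable
    network: it returns a logit for variable i given the partially
    filled solution `partial` and a 0/1 `mask` of which entries are set.
    """
    order = rng.permutation(n_vars)   # fresh random generation order per episode
    x = np.zeros(n_vars)              # partially constructed solution
    mask = np.zeros(n_vars)           # which variables have been assigned
    for i in order:
        # Sigmoid head for the binary case; a softmax would replace
        # this for categorical variables.
        p = 1.0 / (1.0 + np.exp(-logits_fn(i, x, mask)))
        x[i] = float(rng.random() < p)
        mask[i] = 1.0
    return x

# Toy stand-in for the learned networks (an assumption for this sketch):
# the logit depends only on how many variables are already set to 1.
toy_logits = lambda i, partial, mask: 0.5 * partial.sum() - 1.0
sol = sample_solution(8, toy_logits)
```

Because the conditioning mask differs on every episode, the networks cannot rely on any fixed ordering, which is exactly the dropout-like regularization effect the paper attributes to this scheme.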

The optimization process is cast as an episodic Markov Decision Process (MDP). A state consists of the partially constructed solution together with the current permutation; an action selects the value of the next variable. Rewards are zero for incomplete states and equal to the black‑box objective value for complete solutions. The policy is updated using a variant of Group Relative Policy Optimization (GRPO). Instead of raw returns, GRPO computes a “group‑relative advantage” based on the rank of each sample within its batch, yielding a scale‑invariant advantage estimator. This estimator is unbiased, aligns with the natural gradient direction in probability‑distribution space, and provides stable gradients even when the reward distribution is highly skewed.
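A rank-based, group-relative advantage of the kind described above can be sketched in a few lines. This is an illustrative reading of the summary, not the paper's exact estimator: rewards in a sampled batch ("group") are replaced by their within-batch ranks, then centered and normalized, so any strictly monotone transform of the objective yields identical advantages.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Rank-based, scale-invariant advantages for a batch of solutions.

    Replacing raw returns with within-batch ranks makes the estimator
    insensitive to the scale and skew of the reward distribution.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Double argsort converts values to ranks (0 = worst, n-1 = best),
    # assuming no ties in the batch.
    ranks = rewards.argsort().argsort().astype(float)
    adv = ranks - ranks.mean()          # center: advantages sum to zero
    return adv / (adv.std() + 1e-8)     # normalize to unit scale

# Scale invariance: a monotone transform (here exp) of the rewards
# leaves the advantages unchanged.
a1 = group_relative_advantages([3.0, -1.0, 10.0, 2.0])
a2 = group_relative_advantages(np.exp([3.0, -1.0, 10.0, 2.0]))
```

These advantages would then weight the log-probabilities of the sampled solutions in a PPO/GRPO-style policy-gradient update.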

Theoretical contributions include: (1) a proof that training across all permutations forces the learned conditional distributions to converge to a permutation‑independent joint model; (2) an analysis showing that the rank‑based, scale‑invariant advantage is equivalent to a natural‑gradient step on the underlying distribution; (3) an interpretation of random‑order training as a structured dropout that preserves information while regularizing the model.

Empirically, the method is evaluated on a broad suite of benchmark problems: binary MAXSAT, NK‑landscape, various routing and scheduling tasks, and synthetic combinatorial functions of increasing dimensionality. Baselines include classic EDAs (CMA‑ES, PBIL, BOA), Bayesian optimization, recent RL‑based construction methods, and a standard PPO implementation. Across almost all settings, the proposed order‑invariant RL‑EDA achieves the highest final objective values, converges faster, and exhibits far fewer “catastrophic failures” (i.e., runs that get stuck in poor local optima). Ablation studies confirm that both the random‑order training and the GRPO‑style advantage contribute significantly to performance gains.

In summary, the paper makes three key contributions: (i) a principled way to incorporate permutation‑invariant training into PPO‑style policy gradient methods; (ii) the demonstration that random generation orders act as an effective regularizer akin to dropout but without discarding information; (iii) a scale‑invariant, rank‑based reward formulation that can be interpreted as a group‑relative advantage, providing theoretical and practical benefits. The work bridges the gap between classical EDAs and modern RL, offering a powerful, sample‑efficient tool for black‑box combinatorial optimization and opening avenues for future research such as integrating more expressive sequence models (e.g., Transformers), handling mixed continuous‑discrete spaces, and scaling to distributed training environments.

