CADO: From Imitation to Cost Minimization for Heatmap-based Solvers in Combinatorial Optimization
Heatmap-based solvers have emerged as a promising paradigm for Combinatorial Optimization (CO). However, we argue that the dominant Supervised Learning (SL) training paradigm suffers from a fundamental objective mismatch: minimizing imitation loss (e.g., cross-entropy) does not guarantee solution cost minimization. We dissect this mismatch into two deficiencies: Decoder-Blindness (being oblivious to the non-differentiable decoding process) and Cost-Blindness (prioritizing structural imitation over solution quality). We empirically demonstrate that these intrinsic flaws impose a hard performance ceiling. To overcome this limitation, we propose CADO (Cost-Aware Diffusion models for Optimization), a streamlined Reinforcement Learning fine-tuning framework that formulates the diffusion denoising process as an MDP to directly optimize the post-decoded solution cost. We introduce Label-Centered Reward, which repurposes ground-truth labels as unbiased baselines rather than imitation targets, and Hybrid Fine-Tuning for parameter-efficient adaptation. CADO achieves state-of-the-art performance across diverse benchmarks, validating that objective alignment is essential for unlocking the full potential of heatmap-based solvers.
💡 Research Summary
The paper investigates a fundamental flaw in the prevailing training paradigm for heatmap‑based combinatorial optimization (CO) solvers. These solvers, typically built on diffusion models, generate a full‑size probability heatmap in a single forward pass and then apply a non‑differentiable decoder to obtain a discrete feasible solution. Most recent works train the heatmap generator with supervised learning (SL), minimizing a surrogate loss such as cross‑entropy against optimal solutions. The authors argue that this approach suffers from two intrinsic deficiencies:
- Decoder‑Blindness – the SL loss ignores the decoder f, which can dramatically alter the final solution; the model receives no gradient signal about how its continuous heatmap will be discretized.
- Cost‑Blindness – the SL loss is oblivious to the true cost function c. Even if the heatmap is structurally close to the optimal label (e.g., low Hamming distance), the resulting cost may not improve, as demonstrated empirically.
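The two deficiencies can be made concrete with a toy version of the pipeline. The sketch below is illustrative only (it is not DIFUSCO's actual decoder): a continuous edge heatmap `H` is discretized by a greedy, non-differentiable decoder `f` into a tour, whose cost `c` the SL loss never observes. The instance, heatmap values, and decoding rule are all assumptions made for this example.

```python
# Toy heatmap -> decode -> cost pipeline (illustrative, not DIFUSCO's decoder).
import math

coords = [(0, 0), (1, 0), (1, 1), (0, 1)]        # hypothetical 4-city instance
H = [[0.0, 0.9, 0.1, 0.8],                       # hypothetical edge heatmap:
     [0.9, 0.0, 0.7, 0.2],                       # H[i][j] ~ P(edge i-j in tour)
     [0.1, 0.7, 0.0, 0.6],
     [0.8, 0.2, 0.6, 0.0]]

def decode(H):
    """Greedy decoder f: starting at city 0, always follow the hottest unused edge."""
    tour, seen = [0], {0}
    while len(tour) < len(H):
        i = tour[-1]
        nxt = max((j for j in range(len(H)) if j not in seen),
                  key=lambda j: H[i][j])
        tour.append(nxt); seen.add(nxt)
    return tour

def cost(tour):
    """Cost c: total Euclidean length of the closed tour."""
    return sum(math.dist(coords[tour[k]], coords[tour[(k + 1) % len(tour)]])
               for k in range(len(tour)))

tour = decode(H)
print(tour, cost(tour))   # [0, 1, 2, 3] 4.0
# An infinitesimally small change to H can flip an argmax inside decode() and
# change the tour (and its cost) discontinuously: the decoding step carries no
# useful gradient, which is exactly what the SL loss is blind to.
```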
Through extensive experiments with the state‑of‑the‑art diffusion solver DIFUSCO, the authors show that reductions in SL loss correlate only weakly with reductions in Hamming distance, and that Hamming distance shows almost no correlation with the final tour length (for TSP) or objective value (for MIS). Hence, the SL objective is not a reliable proxy for the true CO objective.
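The weak Hamming-distance/cost link can be reproduced on a toy instance (this is a constructed illustration, not the paper's experiment): enumerating all tours of a small hypothetical 5-city TSP shows that tours at the same Hamming distance from the optimum's edge set can have very different costs, so structural similarity does not pin down solution quality.

```python
# Toy check: same Hamming distance to the optimal edge set, different costs.
import itertools, math

cities = [(0, 0), (1, 0), (2, 0), (2, 2), (0, 2)]   # hypothetical instance

def tour_cost(tour):
    """Total Euclidean length of a closed tour given as a city permutation."""
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def edge_set(tour):
    """Undirected edge set of a tour, for Hamming-style comparison."""
    return {frozenset((tour[i], tour[(i + 1) % len(tour)]))
            for i in range(len(tour))}

def hamming(t1, t2):
    """Number of edges present in exactly one of the two tours."""
    return len(edge_set(t1) ^ edge_set(t2))

opt = min(itertools.permutations(range(5)), key=tour_cost)   # brute-force optimum

# Bucket every tour by its Hamming distance to the optimum, collect costs.
dist_to_cost = {}
for t in itertools.permutations(range(5)):
    dist_to_cost.setdefault(hamming(opt, t), set()).add(round(tour_cost(t), 3))

for d, costs in sorted(dist_to_cost.items()):
    print(d, sorted(costs))
# Several buckets contain more than one distinct cost: equal structural
# distance to the label, unequal solution quality.
```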
To remedy this mismatch, the paper proposes CADO (Cost‑Aware Diffusion models for Optimization), a reinforcement‑learning (RL) fine‑tuning framework that aligns the training objective with the actual cost minimization goal. The key ideas are:
- MDP formulation of denoising – The reverse diffusion process is cast as a Markov Decision Process whose state consists of the problem instance, the remaining timesteps, and the current noisy heatmap. Actions are the denoising steps, and the reward is zero at intermediate steps and the negative cost of the decoded solution at the final step. This directly incorporates both the decoder f and the cost function c into the learning signal.
- Policy gradient via REINFORCE – The gradient of the expected return is estimated as the sum of the log‑probabilities of the sampled denoising steps weighted by the terminal reward, allowing the pre‑trained diffusion model πθ^SL to be refined without altering its architecture.
- Reward designs – Two variants are explored:
- Standard Reward (SR) – simply the negative solution cost −c, with batch‑wise normalization; suitable when no labeled data are available.
- Label‑Centered Reward (LCR) – uses the cost b of the ground‑truth label as an instance‑specific baseline, defining the reward as b − c. This treats the label as an unbiased baseline rather than an imitation target, preserving the correctness of the gradient estimate even when the label is sub‑optimal.
- Hybrid Fine‑Tuning (Hybrid‑FT) – To keep RL training stable and memory‑efficient, the authors apply Low‑Rank Adaptation (LoRA) to the input layer and the first 11 GNN layers, while fully fine‑tuning the final GNN layer and the output layer (Selective‑FT). This yields a small set of trainable parameters while retaining the expressive power of the pre‑trained model.
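The RL recipe above can be sketched end to end on a deliberately tiny surrogate. In this sketch, a one-step independent-Bernoulli "heatmap" policy stands in for the multi-step diffusion denoiser, the decoder is a threshold with a feasibility penalty, and the reward is the Label-Centered Reward b − c. Everything here (the 4-edge problem, weights, penalty, learning rate) is an assumption for illustration, not the paper's implementation.

```python
# Toy REINFORCE fine-tuning with a Label-Centered Reward (illustrative only).
import math, random

random.seed(0)
weights = [1.0, 5.0, 1.0, 5.0]   # per-edge costs; edges 0 and 2 are cheap
label = [1, 0, 1, 0]             # ground-truth solution: the two cheap edges
NEED = 2                         # feasibility: exactly two edges selected

def cost(sol):
    if sum(sol) != NEED:         # infeasible decodes get a fixed penalty cost
        return 12.0
    return sum(w for w, s in zip(weights, sol) if s)

baseline = cost(label)           # label cost b: the Label-Centered baseline

theta = [0.0] * 4                # heatmap logits = policy parameters
lr = 0.02
for _ in range(10000):
    p = [1 / (1 + math.exp(-t)) for t in theta]
    action = [1 if random.random() < q else 0 for q in p]  # sample a "heatmap"
    reward = baseline - cost(action)   # LCR: improvement over the label's cost
    for i in range(4):                 # REINFORCE: d(log-prob)/d(theta_i) = a_i - p_i
        theta[i] += lr * reward * (action[i] - p[i])

print([round(t, 2) for t in theta])
# The cheap edges' logits are driven up and the expensive edges' down, because
# the reward (not imitation loss) ranks decoded solutions by their actual cost.
```

Note the role of the baseline: subtracting b recenters the reward per instance without biasing the gradient, which is exactly why a sub-optimal label still yields a correct update direction.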
Extensive experiments on Traveling Salesman Problem (TSP) and Maximum Independent Set (MIS) benchmarks demonstrate that CADO consistently outperforms the original SL‑trained diffusion models and prior RL baselines. The LCR variant especially shines when optimal labels are scarce: it leverages the available label costs as baselines, achieving faster convergence and larger cost reductions than SR. Ablation studies confirm that the hybrid LoRA/Selective‑FT scheme reduces GPU memory consumption without sacrificing performance.
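The Hybrid-FT parameter split can be sketched with a minimal NumPy stand-in: LoRA layers keep the pre-trained weight W frozen and train only a low-rank update A·B, while the final layer is trained in full. Layer widths, the 12-layer depth, and the rank are assumptions for illustration; the point shown is the standard LoRA zero-initialization property, under which fine-tuning starts exactly from the pre-trained model's behavior.

```python
# Minimal NumPy sketch of the Hybrid-FT split (illustrative shapes and depth).
import numpy as np

rng = np.random.default_rng(0)

class LoRALinear:
    """y = x @ W + (x @ A) @ B: W is frozen; only the rank-r factors A, B train."""
    def __init__(self, d_in, d_out, rank=4):
        self.W = rng.normal(size=(d_in, d_out))        # frozen pre-trained weight
        self.A = rng.normal(size=(d_in, rank)) * 0.01  # trainable down-projection
        self.B = np.zeros((rank, d_out))               # zero-init: no change at start
    def __call__(self, x):
        return x @ self.W + (x @ self.A) @ self.B

class FullLinear:
    """Fully fine-tuned layer (the Selective-FT part): all of W is trainable."""
    def __init__(self, d_in, d_out):
        self.W = rng.normal(size=(d_in, d_out))
    def __call__(self, x):
        return x @ self.W

d = 8
# Hybrid-FT stack: LoRA on the first 11 layers, full fine-tuning on the last.
layers = [LoRALinear(d, d) for _ in range(11)] + [FullLinear(d, d)]

x = rng.normal(size=(2, d))
h = x
for layer in layers:
    h = layer(h)          # h keeps shape (2, 8) through the square layers

# With B zero-initialized, each LoRA layer reproduces its frozen base weight,
# so the adapted network initially matches the pre-trained one exactly.
```

The memory saving comes from the optimizer state: only the small A/B factors and one full layer carry gradients, instead of every weight matrix in the network.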
In summary, CADO provides a principled solution to the “objective mismatch” problem of heatmap‑based CO solvers. By formulating diffusion denoising as an MDP, incorporating the decoder and true cost into the reward, and employing a label‑centered baseline together with parameter‑efficient fine‑tuning, the framework aligns training with the ultimate optimization goal. This work not only sets new state‑of‑the‑art results across several combinatorial tasks but also establishes a general recipe for turning any SL‑trained heatmap generator into a cost‑aware optimizer.