Solver-in-the-Loop: MDP-Based Benchmarks for Self-Correction and Behavioral Rationality in Operations Research
Operations Research practitioners routinely debug infeasible models through an iterative process: analyzing Irreducible Infeasible Subsystems (IIS), identifying constraint conflicts, and systematically repairing formulations until feasibility is achieved. Yet existing LLM benchmarks evaluate OR as one-shot translation – given a problem description, generate solver code – ignoring this diagnostic loop entirely. We introduce two benchmarks that place the solver in the evaluation loop. OR‑Debug‑Bench evaluates iterative self-correction through 5,000+ problems spanning 9 error types; each repair action triggers solver re-execution and IIS recomputation, providing deterministic, verifiable feedback. OR‑Bias‑Bench evaluates behavioral rationality through 2,000 newsvendor instances (1,000 ID + 1,000 OOD), measuring systematic deviations from closed-form optimal policies. Across 26 models and 12,000+ samples, we find that domain-specific RLVR training enables an 8B model to surpass frontier APIs: 95.3% vs 86.2% recovery rate (+9.1 points), 62.4% vs 47.8% diagnostic accuracy (+14.6 points), and 2.25 vs 3.78 steps to resolution (1.7× faster). On OR‑Bias‑Bench, curriculum training achieves the only negative ID→OOD bias drift among the models evaluated (−9.6%), reducing systematic bias by 48% (from 20.0% to 10.4%). These results demonstrate that process-level evaluation with verifiable oracles enables targeted training that outperforms scale.
💡 Research Summary
The paper addresses a critical gap in the evaluation of large language models (LLMs) for Operations Research (OR): existing benchmarks treat OR as a one‑shot translation task, ignoring the iterative debugging process that practitioners actually follow. In real‑world OR, when a model returns “Infeasible,” analysts must compute an Irreducible Infeasible Subsystem (IIS), pinpoint the conflicting constraints, and iteratively repair the formulation until feasibility is restored. This loop provides deterministic, verifiable feedback that can be leveraged for self‑correction, yet current benchmarks do not incorporate it.
To fill this void, the authors introduce two novel benchmarks that embed the solver directly into the evaluation loop:
- OR‑Debug‑Bench – a debugging benchmark comprising over 5,000 linear programming (LP) instances corrupted in nine distinct error categories (A–I). Each instance supplies a natural‑language description, sabotaged Gurobi Python code, the ground‑truth original code, the IIS, and the error type. An agent interacts with the environment by selecting from six actions (e.g., "Get IIS", "Check Slack", "Relax Constraint", "Drop Constraint", "Rewrite Code", "Submit"). After each action the Gurobi solver is re‑run, producing a new IIS and updated feasibility status. The benchmark is formalized as a Markov Decision Process (MDP) with a composite reward that balances outcome (optimality), diagnostic quality, and efficiency. Evaluation metrics include Recovery Rate at k steps (RR@k), Diagnostic Accuracy (DA), and Optimality Preservation (OP).
- OR‑Bias‑Bench – a decision‑making benchmark built on the classic newsvendor problem. It generates 2,000 instances (1,000 in‑distribution, 1,000 out‑of‑distribution) with controlled critical ratios (CR). The closed‑form optimal order quantity Q* = F⁻¹(CR) serves as an oracle. Models are assessed on Rationality (how close the predicted order Q is to Q*) and on systematic "pull‑to‑center" bias, measured as the absolute difference in average Q/Q* between high‑CR and low‑CR regimes (Bias Diff). An ID→OOD drift metric captures generalization loss.
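The action–observation loop of OR‑Debug‑Bench can be illustrated with a minimal sketch. This is not the benchmark's Gurobi environment: to keep the example self‑contained, the "solver" is a trivial interval‑feasibility check on a single scalar variable, and the class and action names (`ToyDebugEnv`, `get_iis`, `drop_constraint`, `submit`) are hypothetical stand‑ins for the benchmark's six actions.

```python
from dataclasses import dataclass, field

@dataclass
class ToyDebugEnv:
    """Toy stand-in for the solver-in-the-loop MDP (illustrative only).
    Constraints are lower/upper bounds on a single scalar x; the model is
    infeasible exactly when max(lower bounds) > min(upper bounds)."""
    lower: dict = field(default_factory=dict)   # name -> lower bound on x
    upper: dict = field(default_factory=dict)   # name -> upper bound on x
    steps: int = 0                              # actions taken so far

    def feasible(self) -> bool:
        lo = max(self.lower.values(), default=float("-inf"))
        hi = min(self.upper.values(), default=float("inf"))
        return lo <= hi

    def get_iis(self):
        """Return a minimal conflicting pair, mimicking IIS recomputation."""
        self.steps += 1
        if self.feasible():
            return []
        lo_name = max(self.lower, key=self.lower.get)
        hi_name = min(self.upper, key=self.upper.get)
        return [lo_name, hi_name]

    def drop_constraint(self, name: str):
        """Repair action: remove one constraint by name."""
        self.steps += 1
        self.lower.pop(name, None)
        self.upper.pop(name, None)

    def submit(self) -> bool:
        """Re-run the 'solver' and report feasibility."""
        self.steps += 1
        return self.feasible()

# A sabotaged model: demand >= 10 conflicts with capacity <= 8.
env = ToyDebugEnv(lower={"demand": 10.0}, upper={"capacity": 8.0})
iis = env.get_iis()          # -> ["demand", "capacity"]
env.drop_constraint(iis[0])  # repair by dropping one conflicting constraint
ok = env.submit()            # feasible again, in 3 actions total
print(iis, ok, env.steps)
```

The deterministic re‑check after every action is what makes the loop usable as verifiable reward for RR@k‑style metrics.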
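The newsvendor oracle and the bias metrics above have direct closed‑form implementations. A minimal sketch, assuming normally distributed demand (the benchmark's exact demand distributions are not stated here); `bias_diff` mirrors the Bias Diff definition, and the final line computes an ID→OOD drift, where a negative value means the bias shrinks out of distribution.

```python
from statistics import NormalDist, mean

def optimal_order(mu: float, sigma: float, cr: float) -> float:
    """Closed-form newsvendor oracle: Q* = F^-1(CR) for Normal(mu, sigma)."""
    return NormalDist(mu, sigma).inv_cdf(cr)

def bias_diff(ratios_high_cr, ratios_low_cr) -> float:
    """Pull-to-center bias: |mean(Q/Q*) at high CR - mean(Q/Q*) at low CR|."""
    return abs(mean(ratios_high_cr) - mean(ratios_low_cr))

# Oracle sanity check: CR = 0.5 gives the distribution median.
assert abs(optimal_order(100, 20, 0.5) - 100) < 1e-9

# A pull-to-center model under-orders at high CR and over-orders at low CR
# (Q/Q* ratios below are made-up illustrative numbers, ideal value is 1.0):
high = [0.90, 0.92, 0.88]
low = [1.15, 1.10, 1.12]
bias_id = bias_diff(high, low)

# ID -> OOD drift: negative means the bias shrinks on OOD instances.
bias_ood = 0.15  # hypothetical OOD Bias Diff
drift = bias_ood - bias_id
print(round(bias_id, 3), round(drift, 3))
```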
Training proceeds via two parallel tracks:
- Debugging Track – Starting from Qwen‑3‑8B‑Instruct, the authors collect 696 high‑quality expert debugging trajectories (derived from GPT‑5.2‑chat, o4‑mini, DeepSeek‑R1) and perform supervised fine‑tuning (SFT). They then apply Group Relative Policy Optimization (GRPO) with a composite reward (0.5 outcome, 0.3 diagnosis, 0.2 efficiency). A Process Reward Model (PRM) provides step‑level supervision, improving DA by 4.7% at a modest RR@5 cost.
- Bias‑Mitigation Track – After SFT, a three‑stage curriculum is used: (1) extreme CR values to learn directionality, (2) boundary CR values to refine magnitude, and (3) the full CR distribution for consolidation. A faithfulness penalty discourages "lucky" fixes that do not address the true IIS. This curriculum reduces Bias Diff from 20% to 10.4% (a 48% reduction) and uniquely yields a negative ID→OOD drift (−9.6%), indicating genuine generalization rather than memorization.
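The composite reward in the debugging track weights outcome, diagnosis, and efficiency at 0.5/0.3/0.2. The sketch below shows how such a reward might be assembled; the weights come from the paper, but the per‑term scoring is an illustrative assumption, not the authors' exact definition.

```python
def composite_reward(solved: bool, optimal: bool,
                     diagnosed_correctly: bool,
                     steps: int, max_steps: int = 5) -> float:
    """Weighted reward: 0.5 * outcome + 0.3 * diagnosis + 0.2 * efficiency.
    Weights are from the paper; the per-term scoring below is an
    illustrative assumption."""
    # Outcome: full credit for an optimal repair, partial for mere feasibility.
    outcome = 1.0 if (solved and optimal) else (0.5 if solved else 0.0)
    # Diagnosis: did the agent identify the true error (cf. DA)?
    diagnosis = 1.0 if diagnosed_correctly else 0.0
    # Efficiency: fewer solver round-trips earn a higher score.
    efficiency = max(0.0, 1.0 - (steps - 1) / max_steps)
    return 0.5 * outcome + 0.3 * diagnosis + 0.2 * efficiency

# An optimal repair with a correct diagnosis in 2 steps:
print(round(composite_reward(True, True, True, steps=2), 3))
```

Under GRPO this scalar is compared across a group of sampled trajectories, so the relative weighting of diagnosis versus efficiency directly shapes which debugging behavior is reinforced.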
Experimental Findings
The authors evaluate 26 models (three locally fine‑tuned 8B models and 22 frontier APIs) on more than 12,000 samples. Key results include:
- OR‑Debug‑Bench – The fine‑tuned Qwen‑3‑8B‑GRPO achieves RR@5 = 95.3%, DA = 62.4%, and an average of 2.25 steps to optimality. The best single API reaches a higher RR@5 of 97.8%, but only at roughly 3.5–5.0 steps per repair; no API matches the fine‑tuned model's combination of diagnostic accuracy and efficiency. Other baselines such as Llama‑3.1‑8B likewise trade higher RR@5 for more steps, highlighting the efficiency advantage of the proposed training pipeline.
- OR‑Bias‑Bench – Curriculum‑trained models are the only ones to exhibit a negative ID→OOD drift and achieve a 48% bias reduction. Baseline APIs retain a Bias Diff around 20%, confirming that standard instruction‑tuned models systematically over‑order low‑CR items and under‑order high‑CR items.
The paper demonstrates that deterministic solver feedback can be harnessed to construct MDP‑based benchmarks that evaluate both self‑correction (through iterative debugging) and behavioral rationality (through bias measurement). By integrating structured rewards, diagnostic penalties, and curriculum learning, the authors show that a modest 8B model can surpass much larger commercial APIs on both fronts. The work opens a pathway for future research to extend these ideas to integer programming, non‑linear optimization, and real‑time operational settings, and to explore richer human‑LLM collaborative debugging interfaces.