HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization
While Large Language Models (LLMs) have demonstrated significant advancements in reasoning and agent-based problem-solving, current evaluation methodologies fail to adequately assess their capabilities: existing benchmarks either rely on closed-ended questions prone to saturation and memorization, or on subjective comparisons that lack consistency and rigor. In this work, we introduce HeuriGym, an agentic framework designed for evaluating heuristic algorithms generated by LLMs for combinatorial optimization problems, characterized by clearly defined objectives and expansive solution spaces. HeuriGym empowers LLMs to propose heuristics, receive evaluative feedback via code execution, and iteratively refine their solutions. We evaluate nine state-of-the-art models on nine problems across domains such as computer systems, logistics, and biology, exposing persistent limitations in tool use, planning, and adaptive reasoning. To quantify performance, we propose the Quality-Yield Index (QYI), a metric that captures both solution pass rate and quality. Even top models like GPT-o4-mini-high and Gemini-2.5-Pro attain QYI scores of only 0.6, well below the expert baseline of 1. Our open-source benchmark aims to guide the development of LLMs toward more effective and realistic problem-solving in scientific and engineering domains.
💡 Research Summary
HeuriGym introduces an agentic benchmark designed to evaluate large language models (LLMs) on their ability to generate, execute, and iteratively refine heuristic algorithms for combinatorial optimization problems. The authors argue that existing evaluation paradigms fall into two problematic categories: (1) closed‑ended, ground‑truth benchmarks that quickly saturate and are vulnerable to memorization, and (2) subjective, pairwise comparison systems that suffer from high variance and lack rigorous, domain‑specific metrics. To bridge this gap, HeuriGym selects combinatorial optimization as a testbed because these problems have well‑defined objectives, large solution spaces, and are computationally hard, making memorization infeasible while still offering clear quantitative evaluation criteria.
The framework presents each problem with three components: a background description, a formal mathematical definition of the objective and constraints, and precise input/output specifications. LLMs receive only a function signature and must produce a complete, self‑contained program—including data structures, algorithmic logic, and any required library calls—without any template scaffolding. The generated code is compiled or interpreted in a sandbox, then passed through three verification stages: (I) successful execution (no compile/runtime errors), (II) production of a non‑empty, correctly formatted output within a time limit, and (III) satisfaction of problem‑specific constraints enforced by a custom verifier. After each iteration, execution logs, verification results, and objective scores are appended to the prompt as feedback, enabling the model to learn from its mistakes in a few‑shot, in‑context manner.
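The three-stage verification pipeline described above can be sketched as a simple harness. This is a minimal illustration, not HeuriGym's actual implementation: the time limit, the dictionary layout of the result, and the toy Stage III verifier (which just checks for an integer no smaller than the input) are all assumptions for the example.

```python
import os
import subprocess
import sys
import tempfile

TIME_LIMIT_S = 10  # assumption: per-instance wall-clock budget


def run_candidate(program: str, input_text: str) -> dict:
    """Stages I-II: execute a candidate heuristic in a subprocess and
    check for crashes, timeouts, and empty output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], input=input_text,
            capture_output=True, text=True, timeout=TIME_LIMIT_S)
    except subprocess.TimeoutExpired:
        return {"ok": False, "stage": "II", "log": "time limit exceeded"}
    finally:
        os.unlink(path)
    if proc.returncode != 0:           # Stage I: compile/runtime error
        return {"ok": False, "stage": "I", "log": proc.stderr}
    if not proc.stdout.strip():        # Stage II: no well-formed output
        return {"ok": False, "stage": "II", "log": "empty output"}
    return {"ok": True, "stage": "II", "output": proc.stdout}


def verify(output: str) -> bool:
    """Stage III placeholder: a problem-specific constraint check."""
    return output.strip().isdigit() and int(output) >= 5


# Toy candidate program: reads an integer from stdin, prints it plus one.
candidate = "import sys\nprint(int(sys.stdin.read()) + 1)\n"
result = run_candidate(candidate, "5")
feasible = result["ok"] and verify(result["output"])
```

In the full loop, the `log` field (and the objective score from the evaluator) would be appended to the next prompt so the model can revise its heuristic in context.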
Recognizing that traditional PASS@k does not capture iterative reasoning, the authors propose a new suite of metrics. SOLVEs@i measures the proportion of instances solved within i iterations, while QUALITY and YIELD separately assess solution optimality (relative to expert solutions) and the fraction of instances that pass verification. The Quality‑Yield Index (QYI) combines these aspects into a single score ranging from 0 (no feasible or low‑quality solutions) to 1 (expert‑level performance). This metric captures both the ability to produce a feasible solution and how close that solution is to the best known.
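The metric suite can be made concrete with a short sketch. The summary does not spell out the exact formulas, so the following are assumptions chosen to match the stated properties: QUALITY is taken as the mean expert-to-model cost ratio over feasible instances (assuming minimization), YIELD as the fraction of instances passing verification, and QYI as their harmonic mean, which gives 0 when either term is 0 and 1 only at expert-level performance on every instance.

```python
def solve_at(iters_to_solve, i):
    """SOLVEs@i: fraction of instances solved within i iterations
    (None marks an instance never solved). Assumed semantics."""
    return sum(1 for n in iters_to_solve if n is not None and n <= i) \
        / len(iters_to_solve)


def quality(costs, expert_costs):
    """QUALITY: mean expert/model cost ratio over feasible instances,
    capped at 1 (assumption: minimization objective)."""
    ratios = [min(e / c, 1.0)
              for c, e in zip(costs, expert_costs) if c is not None]
    return sum(ratios) / len(ratios) if ratios else 0.0


def yield_rate(costs):
    """YIELD: fraction of instances that pass all verification stages."""
    return sum(1 for c in costs if c is not None) / len(costs)


def qyi(q, y):
    """Quality-Yield Index: harmonic-mean combination (assumption)."""
    return 2 * q * y / (q + y) if q + y > 0 else 0.0


# Toy run over four instances; None = no feasible solution produced.
costs = [10.0, 12.0, None, 15.0]
expert_costs = [10.0, 10.0, 9.0, 12.0]
q = quality(costs, expert_costs)
y = yield_rate(costs)   # 0.75
score = qyi(q, y)
```

Note how the harmonic mean penalizes imbalance: a model that yields many feasible but poor solutions, or few excellent ones, scores well below one that does both.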
Empirically, nine state‑of‑the‑art LLMs—including GPT‑o4‑mini‑high and Gemini‑2.5‑Pro—are evaluated on nine diverse problems spanning computer systems (operator scheduling), logistics (vehicle routing), computational biology (protein interaction design), and electronic design automation. Across all tasks, the best models achieve QYI ≈ 0.6, far below the expert baseline of 1.0. Detailed analysis reveals systematic weaknesses: limited tool use (e.g., invoking external solvers), insufficient multi‑step planning (failure to decompose problems effectively), and poor adaptive reasoning when presented with runtime feedback. For instance, models often compile successfully but violate resource constraints or produce sub‑optimal cost values, and additional refinement iterations yield diminishing returns.
HeuriGym’s problem set is open‑source, each accompanied by a domain‑specific verifier and evaluator, allowing researchers to extend the benchmark with new instances or constraints. The authors also distinguish a small “demo” set used for in‑context learning from a larger held‑out evaluation set, mitigating data contamination risks. Compared to contemporaneous efforts like NPHardEval or GraphArena, HeuriGym emphasizes realistic engineering workflows, requiring models to design bespoke heuristics rather than fill in pre‑defined templates.
The paper’s contributions are threefold: (1) an open‑source benchmark suite that tests LLMs on genuine algorithmic creativity and engineering rigor; (2) a novel metric suite (SOLVEs@i, QUALITY, YIELD, QYI) that quantifies both feasibility and optimality in an iterative setting; and (3) a comprehensive empirical study exposing current models’ limitations and offering concrete directions for future work—such as incorporating multi‑objective constraints, meta‑prompt optimization, and hybrid human‑LLM collaboration. HeuriGym thus provides a critical infrastructure for steering LLM development toward truly useful problem‑solving capabilities in scientific and engineering domains.