Are Your Generated Instances Truly Useful? GenBench-MILP: A Benchmark Suite for MILP Instance Generation
Machine learning-based methods for Mixed-Integer Linear Programming (MILP) instance generation have proliferated, driven by the need for diverse training datasets. However, a critical question remains: are these generated instances truly useful and realistic? Current evaluation protocols often rely on superficial structural metrics or simple solvability checks, which frequently fail to capture the true computational complexity of real-world problems. To bridge this gap, we introduce GenBench-MILP, a comprehensive benchmark suite designed for the standardized and objective evaluation of MILP generators. Our framework assesses instance quality across four key dimensions: mathematical validity, structural similarity, computational hardness, and utility in downstream tasks. A distinctive innovation of GenBench-MILP is the analysis of solver-internal features, including root-node gaps, heuristic success rates, and cutting-plane usage. By treating the solver's dynamic behavior as an expert assessment, we reveal nuanced computational discrepancies that static graph features miss. Our experiments on recent generative models demonstrate that instances with high structural similarity scores can still exhibit drastically divergent solver interactions and difficulty levels. By providing this multifaceted evaluation toolkit, GenBench-MILP aims to facilitate rigorous comparisons and guide the development of high-fidelity instance generators.
💡 Research Summary
The paper addresses a critical gap in the rapidly growing field of machine learning‑based Mixed‑Integer Linear Programming (MILP) instance generation: the lack of a standardized, comprehensive evaluation framework. Existing assessments rely mainly on superficial structural metrics (e.g., graph‑based feature distributions) or simple feasibility checks, which do not capture the true computational difficulty that real‑world MILP problems exhibit.
To fill this void, the authors introduce GenBench‑MILP, a benchmark suite that evaluates generated MILP instances across four orthogonal dimensions:
- Mathematical Validity – feasibility and boundedness of the instance.
- Structural Similarity – a score based on Jensen‑Shannon divergence, computed over eleven graph‑derived features.
- Computational Hardness – solver‑dependent metrics such as branch‑and‑bound node count, solving‑time gap, and relative error.
- Downstream Utility – impact on tasks like hyper‑parameter tuning (SMAC3) and other ML pipelines.
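The structural‑similarity dimension above rests on Jensen‑Shannon divergence between feature distributions of original and generated instances. A minimal pure‑Python sketch of such a score, using a single hypothetical feature histogram rather than the paper's eleven graph features:

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence between two discrete distributions (log base 2)
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # Jensen-Shannon divergence: symmetric and bounded in [0, 1] with log base 2
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def similarity(p, q):
    # Map divergence to a similarity score in [0, 1]; 1 means identical distributions
    return 1.0 - js_divergence(p, q)

# Hypothetical normalized histograms of one graph feature (e.g., variable degree)
orig = [0.2, 0.5, 0.3]
gen = [0.25, 0.45, 0.3]
print(round(similarity(orig, gen), 3))
```

In practice the per‑feature scores would be aggregated (e.g., averaged) across all eleven features; the aggregation rule here is an assumption, not the paper's exact formula.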
A distinctive contribution is the incorporation of solver‑internal features (root node gap, heuristic success rate, cut‑plane usage) as expert assessments of instance difficulty. By treating the solver’s dynamic behavior as a proxy for human expert evaluation, the framework uncovers discrepancies that static graph metrics miss.
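One way such solver‑internal comparisons can be operationalized is as per‑feature order‑of‑magnitude discrepancies. The sketch below uses made‑up stat dictionaries (in practice these values would come from a solver's statistics after solving, e.g., via SCIP's API):

```python
import math

# Hypothetical solver-internal features for an original and a generated instance
original = {"root_gap": 0.12, "heuristic_success_rate": 0.40, "cuts_applied": 85}
generated = {"root_gap": 0.95, "heuristic_success_rate": 0.02, "cuts_applied": 3}

def log_ratio_report(a, b):
    # Per-feature discrepancy |log10(b/a)|: near 0 means comparable solver
    # behavior; >= 1 means at least an order-of-magnitude difference.
    return {k: abs(math.log10(b[k] / a[k])) for k in a}

for feature, disc in log_ratio_report(original, generated).items():
    print(f"{feature}: {disc:.2f} orders of magnitude apart")
```

The log‑ratio form is symmetric in direction and robust to features living on very different scales, which is why it is a natural sketch for "differ by orders of magnitude" comparisons; the exact discrepancy measure used by the framework is not specified here.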
The experimental protocol compares three recent generative models—G2MILP, ACM‑MILP, and DIG‑MILP—against a baseline consisting of synthetic instances generated with the Ecole library (Set Cover, Combinatorial Auction, Capacitated Facility Location, Independent Set) and two challenging real‑world families from the ML4CO competition.
Key findings include:
- Feasibility is generally high (>99 %) across models, with minor drops for G2MILP on Maximum Independent Set (MIS) at high mask ratios.
- Structural similarity scores range from 0.47 to 0.99, indicating that some generators (e.g., G2MILP on Set Cover) can closely mimic original graph statistics, while others diverge substantially.
- Solver‑dependent hardness varies dramatically. For instance, G2MILP‑generated Independent Set instances cause branch‑and‑bound node counts to explode (>50,000 % relative error) and often time out, whereas ACM‑MILP on Combinatorial Auction yields a 26,000 % increase in node count. DIG‑MILP remains closest to baseline hardness.
- Solver‑internal metrics reveal that even when structural similarity is high, root gaps, heuristic success rates, and cut‑plane usage can differ by orders of magnitude, confirming that static graph features are insufficient proxies for computational behavior.
- Downstream utility experiments show that hyper‑parameter tuning outcomes can be dramatically altered: tuning SMAC3 on ACM‑MILP‑generated Combinatorial Auction instances improves performance by over 2400 %, while G2MILP yields modest gains (~20 %).
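The hardness figures above boil down to a simple relative‑error statistic over a solver metric. A sketch with made‑up node counts showing how a ">50,000 %" figure can arise:

```python
def relative_error_pct(reference, measured):
    # Percentage deviation of a generated-instance metric from the
    # original-instance reference (e.g., branch-and-bound node count).
    return 100.0 * abs(measured - reference) / reference

# Made-up numbers: an original instance solved in 200 B&B nodes vs. a
# generated counterpart that needed 110,000 nodes.
print(relative_error_pct(200, 110_000))  # 54900.0, i.e., a >50,000 % blow-up
```

The specific node counts are illustrative only; the paper reports aggregate relative errors, not these values.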
The authors also demonstrate the framework’s modular extensibility: new metrics, datasets, or solvers can be added with minimal effort, and the code is publicly released. Cross‑solver experiments (Gurobi, SCIP, HiGHS) confirm that high‑quality commercial solvers provide more precise internal feature measurements, yet the methodology remains applicable to open‑source alternatives.
Finally, the paper critiques the prevailing graph‑based generation paradigm, arguing that representing MILP instances solely as bipartite graphs may overlook intricate constraint interactions and integer variable properties crucial for feasibility and hardness. Future work is suggested to shift toward models that directly capture the underlying mathematical structure or solution‑space characteristics.
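The bipartite encoding criticized here maps each variable and each constraint to a node, with an edge for every nonzero coefficient. A minimal sketch for a hypothetical dense constraint matrix makes clear what this view keeps and what it drops:

```python
def milp_to_bipartite(A):
    # A: dense constraint matrix (rows = constraints, cols = variables).
    # Returns edges (constraint i, variable j, coefficient) for nonzeros.
    # Note what this view omits: variable integrality, bounds, objective
    # coefficients, and right-hand sides must all be attached as separate
    # node features -- the graph topology alone does not carry them.
    return [(i, j, a) for i, row in enumerate(A)
            for j, a in enumerate(row) if a != 0]

# Hypothetical 2-constraint, 3-variable instance
A = [[1, 0, 2],
     [0, 3, 1]]
print(milp_to_bipartite(A))  # [(0, 0, 1), (0, 2, 2), (1, 1, 3), (1, 2, 1)]
```

This illustrates the paper's point: two instances with identical bipartite topology can differ in integrality and bound data that drive feasibility and hardness.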
In summary, GenBench‑MILP offers a rigorous, multi‑dimensional benchmark that standardizes the evaluation of MILP instance generators, highlights the limitations of current structural metrics, and provides actionable insights for developing higher‑fidelity, computationally realistic synthetic MILP datasets.