OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling
Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling, fostering a rapid expansion of new methodologies and evaluation benchmarks. However, the boundaries of their capabilities in automated formulation and problem solving remain poorly understood, particularly when extending to complex, real-world tasks. To bridge this gap, we propose OPT-ENGINE, an extensible benchmark framework designed to evaluate LLMs on optimization modeling with controllable and scalable difficulty levels. OPT-ENGINE spans 10 canonical tasks across operations research, comprising five Linear Programming and five Mixed-Integer Programming problems. Utilizing OPT-ENGINE, we conduct an extensive study of LLMs’ reasoning capabilities, addressing two critical questions: (1) Does LLMs’ performance remain robust when generalizing to out-of-distribution optimization tasks that scale in complexity beyond current benchmark levels? and (2) At what stage, from problem interpretation to solution generation, do current LLMs encounter the most significant bottlenecks? Our empirical results yield two key insights: first, tool-integrated reasoning with external solvers exhibits significantly higher robustness as task complexity escalates, while pure-text reasoning reaches a ceiling; second, the automated formulation of constraints constitutes the primary performance bottleneck. These findings provide actionable guidance for developing next-generation LLMs for advanced optimization. Our code is publicly available at https://github.com/Cardinal-Operations/OPTEngine.
💡 Research Summary
The paper addresses the growing interest in using large language models (LLMs) to automatically translate natural‑language problem statements into precise mathematical optimization models and to solve them. While recent LLMs have shown impressive performance on mathematical reasoning benchmarks, their limits in real‑world, high‑dimensional optimization tasks remain unclear. To fill this gap, the authors introduce OPT‑Engine, an extensible benchmark framework that can generate optimization instances with controllable difficulty and linguistic variation.
OPT‑Engine covers ten canonical operations‑research problems: five linear programming (LP) classes (inventory, portfolio allocation, production, transportation, pollution control) and five mixed‑integer programming (MIP) classes (traveling salesman, knapsack, bin packing, job‑shop scheduling, minimum‑cost network flow). For each class, a set of structural parameters (e.g., number of cities, number of assets, constraint density) can be tuned to scale the instance size from trivial to industrial‑scale.
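To make the scaling idea concrete, here is a minimal sketch of how a parameterized generator for one of the MIP classes (0/1 knapsack) might look. The function names and the choice of capacity rule are our own illustrative assumptions, not OPT-Engine's actual implementation; the point is that a single size parameter lets instances grow from trivial to large while an exact solver labels each one with its true optimum.

```python
import random

def make_knapsack_instance(n_items: int, seed: int = 0) -> dict:
    """Generate a random 0/1 knapsack instance whose size scales with n_items.

    Hypothetical sketch of a parameterized generator in the spirit of
    OPT-Engine; value/weight ranges and the capacity rule are assumptions.
    """
    rng = random.Random(seed)  # fixed seed -> reproducible instances
    values = [rng.randint(1, 100) for _ in range(n_items)]
    weights = [rng.randint(1, 50) for _ in range(n_items)]
    capacity = sum(weights) // 2  # roughly half the total weight fits
    return {"values": values, "weights": weights, "capacity": capacity}

def solve_knapsack(inst: dict) -> int:
    """Exact dynamic-programming solver used to label the instance."""
    v, w, cap = inst["values"], inst["weights"], inst["capacity"]
    dp = [0] * (cap + 1)
    for vi, wi in zip(v, w):
        for c in range(cap, wi - 1, -1):  # iterate backwards: 0/1 choice
            dp[c] = max(dp[c], dp[c - wi] + vi)
    return dp[cap]

small = make_knapsack_instance(5)    # low difficulty level
large = make_knapsack_instance(50)   # higher difficulty level
```

Scaling `n_items` (or analogous parameters such as number of cities or assets) is what lets the benchmark probe out-of-distribution complexity.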
The generation pipeline consists of four stages. (1) Numeric instance generation creates feasible numeric data and computes the exact optimal objective using an external solver (Gurobi, COPT, etc.). (2) Canonical problem construction maps the numeric data to a formal mathematical description (variables, objective, constraints) via a template. (3) Problem augmentation uses an LLM agent to rewrite the canonical description into diverse, domain‑specific natural‑language narratives, thereby testing linguistic robustness. (4) Instance validation combines an LLM‑based judge with rule‑based checks to ensure that the augmented narrative preserves the original mathematical structure; if not, the augmentation is repeated. This loop guarantees that every generated instance comes with a verified optimal solution and a well‑formed textual description.
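The validate-and-retry loop of stage (4) can be sketched as follows. The `augment` callable stands in for the LLM rewriting agent and the `checks` list for the LLM judge plus rule-based validators; all names here are illustrative stand-ins, not the paper's actual interfaces.

```python
def generate_validated_instance(canonical, augment, checks, max_tries=3):
    """Retry augmentation until every validation check passes (stage 4 sketch).

    `augment(canonical, attempt)` stands in for the LLM rewriting agent;
    each check in `checks` stands in for an LLM-judge or rule-based test
    that the narrative preserves the canonical mathematical structure.
    """
    for attempt in range(1, max_tries + 1):
        narrative = augment(canonical, attempt)
        if all(check(canonical, narrative) for check in checks):
            return narrative, attempt  # validated narrative + tries used
    raise RuntimeError("augmentation failed validation after retries")

# Toy stand-ins: an "augmenter" that drops a number on its first try,
# and a rule-based check that every numeral survives the rewrite.
def toy_augment(canonical: str, attempt: int) -> str:
    return canonical.replace("7", "") if attempt == 1 else canonical.upper()

def numbers_preserved(canonical: str, narrative: str) -> bool:
    return all(tok in narrative for tok in canonical.split() if tok.isdigit())

narrative, tries = generate_validated_instance(
    "ship 7 units from plant 1 to depot 2", toy_augment, [numbers_preserved])
```

In this toy run the first augmentation loses a coefficient, fails the check, and the loop retries, mirroring how the pipeline guarantees that every released instance pairs a verified optimum with a faithful narrative.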
Using OPT‑Engine, the authors conduct a systematic comparison of two prevailing reasoning paradigms. Pure‑Text Reasoning (PTR) asks the LLM to solve the problem end‑to‑end via chain‑of‑thought prompting, extracting the optimal objective directly from the final reasoning step. Tool‑Integrated Reasoning (TIR) asks the LLM to output executable code (e.g., a Python snippet) that encodes the formulated model; this code is then executed, invoking an external solver, to obtain the objective. Both paradigms are evaluated on ten difficulty levels per problem class (varying dimensionality) using a pass@1 metric defined as relative error < 10⁻³.
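The pass@1 criterion is simple enough to state as code. The relative-error rule matches the summary's description; the fallback to absolute error when the true optimum is zero is our own assumption to keep the check well defined.

```python
def pass_at_1(predicted: float, optimal: float, tol: float = 1e-3) -> bool:
    """Score a single attempt: relative error below tol counts as a pass.

    Mirrors the pass@1 criterion described above (relative error < 1e-3);
    the absolute-error fallback for a zero optimum is an assumption.
    """
    if optimal == 0:
        return abs(predicted) < tol
    return abs(predicted - optimal) / abs(optimal) < tol
```

For example, a predicted objective of 100.05 against a true optimum of 100 passes (relative error 5 × 10⁻⁴), while 101 fails.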
The empirical results reveal two key insights. First, as problem size grows, PTR performance collapses sharply (from ~70‑80 % accuracy on small LPs to below 30 % on larger MIPs), whereas TIR maintains high accuracy (> 90 %) across all scales. This demonstrates that external solvers effectively off‑load the combinatorial search that LLMs struggle with. Second, error analysis shows that more than half of all failures stem from the “constraint auto‑formulation” stage: the LLM often mis‑identifies variables, mis‑writes inequality directions, or drops constraints when converting natural language into a formal model. By contrast, the earlier “problem interpretation” stage (identifying objective and decision variables) is relatively reliable.
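The flipped-inequality failure mode lends itself to a simple rule-based check of the kind the benchmark's validators could apply: extract the sequence of constraint senses from a model listing and compare it against the ground truth. The regex and the text format below are simplified assumptions for illustration, not the paper's actual validation code.

```python
import re

def constraint_senses(model_text: str) -> list:
    """Extract the ordered inequality/equality senses from a model listing.

    A minimal rule-based check of the kind that catches flipped inequality
    directions; the one-constraint-per-line format is an assumption.
    """
    return re.findall(r"<=|>=|==", model_text)

ground_truth = "x1 + x2 <= 10\n2*x1 - x2 >= 3"
llm_output   = "x1 + x2 >= 10\n2*x1 - x2 >= 3"  # first sense flipped

# Indices where the LLM-formulated sense disagrees with the ground truth.
mismatches = [i for i, (a, b) in
              enumerate(zip(constraint_senses(ground_truth),
                            constraint_senses(llm_output))) if a != b]
```

Dropped constraints would show up analogously as a length mismatch between the two extracted sequences.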
The authors conclude that while LLMs possess strong deductive reasoning abilities, their bottleneck lies in precise structural modeling, especially constraint generation. Tool integration is therefore essential for robust performance on realistic, high‑dimensional optimization tasks.
Limitations of the current work include the focus on ten problem types, which may not capture the full diversity of industrial optimization (e.g., non‑linear, stochastic, or dynamic problems). The LLM‑Judge validation, though automated, may miss subtle modeling errors that a human expert would catch. Future directions suggested are (a) expanding the benchmark to cover non‑linear and multi‑objective problems, (b) incorporating human‑in‑the‑loop correction mechanisms for constraint formulation, and (c) exploring multimodal inputs (tables, graphs, images) to bridge the gap between synthetic benchmarks and real‑world data.
In sum, OPT‑Engine provides a rigorous, scalable platform for probing LLM capabilities in optimization modeling, demonstrates the superiority of tool‑integrated reasoning under increasing complexity, and pinpoints constraint formulation as the primary obstacle to autonomous, industrial‑scale optimization by LLMs.