Automatic Design of Optimization Test Problems with Large Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

The development of black-box optimization algorithms depends on the availability of benchmark suites that are both diverse and representative of real-world problem landscapes. Widely used collections such as BBOB and CEC remain dominated by hand-crafted synthetic functions and provide limited coverage of the high-dimensional space of Exploratory Landscape Analysis (ELA) features, which in turn biases evaluation and hinders training of meta-black-box optimizers. We introduce Evolution of Test Functions (EoTF), a framework that automatically generates continuous optimization test functions whose landscapes match a specified target ELA feature vector. EoTF adapts LLM-driven evolutionary search, originally proposed for heuristic discovery, to evolve interpretable, self-contained NumPy implementations of objective functions by minimizing the distance between sampled ELA features of generated candidates and a target profile. In experiments on 24 noiseless BBOB functions and a contamination-mitigating suite of 24 MA-BBOB hybrid functions, EoTF reliably produces non-trivial functions with closely matching ELA characteristics and preserves optimizer performance rankings under fixed evaluation budgets, supporting their validity as surrogate benchmarks. While a baseline neural-network-based generator achieves higher accuracy in 2D, EoTF substantially outperforms it in 3D and exhibits stable solution quality as dimensionality increases, highlighting favorable scalability. Overall, EoTF offers a practical route to scalable, portable, and interpretable benchmark generation targeted to desired landscape properties.


💡 Research Summary

The paper introduces Evolution of Test Functions (EoTF), a novel framework that leverages large language models (LLMs) to automatically generate continuous optimization benchmark functions whose landscapes match a user‑specified Exploratory Landscape Analysis (ELA) feature vector. The authors argue that existing benchmark suites such as BBOB and CEC are dominated by hand‑crafted synthetic functions, which only sparsely cover the high‑dimensional space of ELA descriptors. This limited coverage can bias algorithm evaluation and hamper the training of meta‑black‑box optimizers that rely on diverse problem instances.

EoTF formulates test‑function generation as a feature‑matching problem: given a target ELA vector ϕ*, the goal is to find a function f∈F (where F is the space of human‑readable Python expressions) that minimizes the Euclidean distance ‖ϕ_f − ϕ*‖₂. The pipeline consists of four stages. First, target ELA vectors are obtained by averaging the ELA descriptors of 100 independent samples from each reference benchmark function (24 noiseless BBOB functions and 24 MA‑BBOB hybrid functions). The selected eight descriptors include linear and quadratic regression adjusted R² (with and without interaction terms), skewness, nearest‑better‑graph correlation, nearest‑better‑standard‑deviation ratio, and fitness standard deviation. Second, an initial population of candidate functions is generated by prompting an LLM (primarily Gemini 2.0 Flash) to produce self‑contained NumPy‑based Python functions. Third, evolutionary operators—one initialization, two exploration, and three mutation operators—are applied. Each operator is implemented as a tailored prompt that asks the LLM to modify the code (e.g., insert a new term, replace a coefficient, or change a functional form). Fourth, each candidate is sampled (250·D points, where D is the problem dimension) and its ELA vector is computed using the pflacco library. The fitness of a candidate is the distance to the target vector, and the evolutionary loop proceeds for a fixed number of generations.
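The sampling and fitness step of this pipeline can be sketched in a few lines. The snippet below is a simplified stand-in: it uses only two hand-computed descriptors (skewness and a normalized fitness standard deviation) instead of the paper's eight pflacco descriptors, and `candidate` is an invented example function, not one produced by EoTF.

```python
import numpy as np

def candidate(x):
    # Invented candidate in the style of generated functions (illustrative only).
    return np.sum(x**2, axis=1) + np.sum(np.sin(3.0 * x), axis=1)

def simple_features(f, dim, n_per_dim=250, seed=0):
    """Toy stand-in for the paper's ELA vector: skewness and a
    normalized fitness standard deviation of a uniform sample.
    The paper computes eight descriptors with pflacco instead."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-5.0, 5.0, size=(n_per_dim * dim, dim))  # 250*D points
    y = f(X)
    mu, sigma = y.mean(), y.std()
    skewness = np.mean(((y - mu) / sigma) ** 3)
    norm_std = sigma / (np.abs(y).max() + 1e-12)
    return np.array([skewness, norm_std])

def fitness(f, phi_target, dim):
    # Candidate fitness = Euclidean distance to the target feature vector.
    return float(np.linalg.norm(simple_features(f, dim) - phi_target))
```

In the full framework, this fitness value drives the LLM-based evolutionary loop: candidates with smaller distance to the target vector survive and are mutated further.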

The authors evaluate EoTF on 2‑D and 3‑D instances of the 48 target functions. Results show that the generated functions achieve an average ELA distance below 0.07. The neural‑network‑based generator of Prager et al. is more accurate in 2‑D (attaining ~0.04) but degrades to ~0.15 in 3‑D, where EoTF substantially outperforms it. Moreover, when a portfolio of optimizers (CMA‑ES, Differential Evolution, Particle Swarm Optimization, etc.) is run on both the original and the EoTF‑generated functions under identical evaluation budgets, the ranking correlation exceeds 0.96, indicating that the surrogate benchmarks preserve algorithm performance orderings.
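The ranking-preservation check amounts to a rank correlation between optimizer scores on the original and surrogate functions. A minimal Spearman implementation (Pearson correlation of ranks, without tie handling) is sketched below; the score arrays are made-up numbers for illustration, not results from the paper.

```python
import numpy as np

def spearman(a, b):
    # Spearman rank correlation = Pearson correlation of the ranks.
    # (No tie handling; ties would need average ranks.)
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

# Hypothetical mean losses of four optimizers on an original function
# vs. its EoTF surrogate (invented numbers for illustration only).
orig_scores = np.array([0.12, 0.45, 0.30, 0.80])
surrogate_scores = np.array([0.10, 0.50, 0.28, 0.95])
print(spearman(orig_scores, surrogate_scores))  # identical orderings give 1.0
```

A correlation near 1, as reported in the paper (>0.96), means the surrogate induces essentially the same ordering of optimizers as the original benchmark.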

Scalability is a key advantage of the LLM approach. Because the generated code uses vectorized NumPy operations, the same symbolic expression can be evaluated in any dimension without retraining or redesign, whereas the NN‑based method requires a separate model per dimensionality and suffers from performance loss as D grows. Experiments with newer Gemini models (2.5 Flash and 3.0 Flash) reveal modest gains in convergence speed and final ELA distance, at the cost of higher token pricing and latency.
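This dimension-independence is easy to see in code: a symbolic expression written with vectorized NumPy runs unchanged for any D. The example below uses the classic Rastrigin function as a stand-in for an EoTF output (it is not a function from the paper).

```python
import numpy as np

def eotf_style_function(x):
    """Illustrative EoTF-style test function (Rastrigin; not from the paper).
    Vectorized NumPy makes the same expression evaluable in any
    dimension D, with no retraining or per-dimension model."""
    x = np.atleast_2d(x)  # shape (n_points, D)
    return np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x), axis=1) + 10.0 * x.shape[1]

# The identical code runs for D = 2, 3, or 30:
for d in (2, 3, 30):
    print(eotf_style_function(np.zeros((1, d))))  # optimum value 0 at the origin
```

An NN-based generator, by contrast, fixes D in its input layer, which is why the baseline needs one trained model per dimensionality.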

Interpretability is another major contribution. Unlike NN surrogates that yield opaque weight matrices, EoTF outputs concise, human‑readable Python functions (e.g., a combination of quadratic, trigonometric, and cubic terms). This transparency enables direct analytical investigations—gradient analysis, curvature studies, or theoretical classification of landscape motifs—facilitating deeper insight into problem difficulty and algorithm behavior.
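Because the output is a closed-form expression, such analyses can be done by hand. The sketch below uses a hypothetical generated function mixing quadratic, trigonometric, and cubic terms (invented for illustration), writes down its gradient analytically, and sanity-checks it with central finite differences.

```python
import numpy as np

# Hypothetical generated function combining quadratic, trigonometric,
# and cubic terms (illustrative; not taken from the paper).
def f(x):
    return x[0]**2 + 0.5 * np.sin(3.0 * x[1]) + 0.1 * x[0]**3

def grad_f(x):
    # The human-readable form lets us differentiate by hand --
    # not possible with an opaque NN surrogate's weight matrices.
    return np.array([2.0 * x[0] + 0.3 * x[0]**2,
                     1.5 * np.cos(3.0 * x[1])])

# Sanity check against central finite differences.
x0 = np.array([0.7, -1.2])
h = 1e-6
fd = np.array([(f(x0 + h * e) - f(x0 - h * e)) / (2 * h)
               for e in np.eye(2)])
print(np.allclose(grad_f(x0), fd, atol=1e-5))  # True
```

The same closed form also exposes curvature (via the Hessian) and landscape motifs such as periodic ridges directly from the expression.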

The paper also discusses limitations. The quality of generated functions depends on prompt engineering and the underlying LLM; different models produce varying results. There is a risk of “data contamination” because many papers describing BBOB ELA analyses are part of the LLM’s training corpus, potentially biasing generation toward known functions. To mitigate this, the authors supplement the benchmark set with randomly generated ELA vectors and MA‑BBOB functions not widely documented.

In conclusion, EoTF demonstrates that LLM‑driven evolutionary search can automatically produce diverse, scalable, and interpretable benchmark functions that faithfully reproduce desired landscape characteristics and preserve optimizer performance rankings. This opens a practical pathway for constructing customized benchmark suites tailored to specific research needs, advancing the development and evaluation of black‑box and meta‑optimization algorithms. Future work may explore richer ELA descriptor sets, higher‑dimensional problems, automated prompt optimization, and systematic taxonomy of the generated function families.

