Benchmarking Generative AI Against Bayesian Optimization for Constrained Multi-Objective Inverse Design

This paper investigates the performance of Large Language Models (LLMs) as generative optimizers for solving constrained multi-objective regression tasks, specifically within the challenging domain of inverse design (property-to-structure mapping). This problem, critical to materials informatics, demands finding complex, feasible input vectors that lie on the Pareto-optimal front. While LLMs have demonstrated broad effectiveness across generative and reasoning tasks, their utility in constrained, continuous, high-dimensional numerical spaces, tasks for which they were not explicitly architected, remains an open research question. We conducted a rigorous comparative study between established Bayesian Optimization (BO) frameworks and a suite of fine-tuned LLMs and BERT models. For BO, we benchmarked the foundational BoTorch Ax implementation against the state-of-the-art q-Expected Hypervolume Improvement algorithm (qEHVI, BoTorchM). The generative approach involved fine-tuning models via Parameter-Efficient Fine-Tuning (PEFT), framing the challenge as a regression problem with a custom output head. Our results show that BoTorch qEHVI achieved perfect convergence (GD = 0.0), setting the performance ceiling. Crucially, the best-performing LLM (WizardMath-7B) achieved a Generational Distance (GD) of 1.21, significantly outperforming the traditional BoTorch Ax baseline (GD = 15.03). We conclude that specialized BO frameworks remain the performance leader for guaranteed convergence, but fine-tuned LLMs are validated as a promising, computationally fast alternative, contributing essential comparative metrics to the field of AI-driven optimization. The findings have direct industrial applications in optimizing formulation design for resins, polymers, and paints, where multi-objective trade-offs between mechanical, rheological, and chemical properties are critical to innovation and production efficiency.


💡 Research Summary

This paper tackles the challenging problem of constrained multi‑objective inverse design, where the goal is to discover feasible input vectors (e.g., material formulations) that simultaneously satisfy several competing property targets while respecting domain‑specific constraints. Such problems are central to materials informatics and have traditionally been addressed with Bayesian Optimization (BO), which offers strong theoretical guarantees of convergence and efficient use of expensive function evaluations. The authors set out to evaluate whether large language models (LLMs), fine‑tuned with parameter‑efficient methods, can serve as fast, generative optimizers in this continuous, high‑dimensional numerical setting—an application far removed from the text‑centric tasks for which LLMs were originally designed.

Methodology
Two BO baselines were implemented using the BoTorch library: (1) the classic Ax implementation, which relies on a Gaussian Process (GP) surrogate and an Expected Improvement (EI) acquisition function, and (2) the state‑of‑the‑art q‑Expected Hypervolume Improvement (qEHVI) algorithm (BoTorchM), which directly optimizes the hypervolume of the Pareto front. Both baselines were given identical budgets (2,000 objective evaluations) and identical initial designs generated by a Latin hypercube sampler.
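The two ingredients named above can be illustrated without BoTorch itself: a Latin hypercube initial design, and the dominated hypervolume that qEHVI seeks to improve. The following is a minimal NumPy/SciPy sketch; the three-variable bounds, the 2-D minimization front, and the reference point are invented for the example and do not come from the paper.

```python
import numpy as np
from scipy.stats import qmc

def initial_design(n_samples: int, bounds: np.ndarray, seed: int = 0) -> np.ndarray:
    """Latin hypercube sample scaled to the design-variable bounds (one row per dim)."""
    sampler = qmc.LatinHypercube(d=bounds.shape[0], seed=seed)
    unit = sampler.random(n_samples)                    # points in [0, 1]^d
    return qmc.scale(unit, bounds[:, 0], bounds[:, 1])  # rescale to [lo, hi] per dim

def hypervolume_2d(front: np.ndarray, ref_point: np.ndarray) -> float:
    """Dominated hypervolume of a 2-D minimization Pareto front w.r.t. a reference point.

    Sweep the front sorted by the first objective and sum the rectangle
    each point adds to the dominated region.
    """
    front = front[np.argsort(front[:, 0])]
    hv, prev_f2 = 0.0, ref_point[1]
    for f1, f2 in front:
        hv += (ref_point[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return hv

bounds = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])  # hypothetical 3-variable design space
X0 = initial_design(16, bounds)
front = np.array([[0.1, 0.8], [0.4, 0.4], [0.8, 0.1]])   # hypothetical non-dominated points
print(X0.shape, hypervolume_2d(front, np.array([1.0, 1.0])))
```

qEHVI proposes the batch of `q` candidates whose expected increase of this hypervolume (under the GP posterior) is largest; the sketch shows only the quantity being improved, not the acquisition optimization itself.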

For the generative side, the authors selected a suite of models—including WizardMath‑7B, LLaMA‑2‑7B, and BERT‑large variants—and applied Parameter‑Efficient Fine‑Tuning (PEFT) techniques such as LoRA and prompt‑tuning. The inverse‑design task was cast as a regression problem: the model receives a vector of target properties and must output a continuous vector of design variables. A custom linear head produces the numeric predictions, and the loss combines mean‑squared error with a heavy penalty for any constraint violation. Training was performed for 100 epochs on a modest GPU cluster, using AdamW with a learning rate of 1e‑4 and a batch size of 64.
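The combined objective described above, mean-squared error plus a heavy penalty on constraint violations, can be sketched in a few lines. This is a NumPy illustration under assumptions of our own: the hinge-squared penalty form, the `penalty_weight` value, and the example constraint (formulation fractions summing to at most one) are not specified by the paper.

```python
import numpy as np

def constrained_mse_loss(pred, target, g, penalty_weight=10.0):
    """MSE plus a hinge-squared penalty on constraint violations.

    g(x) returns per-sample constraint values with the convention
    g(x) <= 0 meaning feasible; positive values are penalized.
    """
    mse = np.mean((pred - target) ** 2)
    violation = np.maximum(g(pred), 0.0)              # zero wherever feasible
    penalty = np.mean(np.sum(violation ** 2, axis=-1))
    return mse + penalty_weight * penalty

# Hypothetical constraint: predicted fractions must sum to <= 1.
g = lambda x: np.sum(x, axis=-1, keepdims=True) - 1.0

pred   = np.array([[0.3, 0.3], [0.8, 0.8]])  # second sample violates the constraint
target = np.array([[0.3, 0.3], [0.5, 0.5]])
print(constrained_mse_loss(pred, target, g))
```

In training, the same expression would be written with `torch` operations so gradients flow through both the error term and the penalty; the large `penalty_weight` is what pushes the model toward the feasible region.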

Performance was assessed with three standard multi‑objective metrics: Generational Distance (GD), Hypervolume (HV), and constraint‑satisfaction rate, as well as computational efficiency (average inference time per sample).
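Of the three metrics, Generational Distance is the one driving the headline comparison (GD = 0.0 vs. 1.21 vs. 15.03), so a sketch of one common convention may help: the p-mean of each approximate point's distance to its nearest point on the reference (true Pareto) front. GD definitions vary slightly across the literature; the version below is an assumption, not necessarily the paper's exact formula.

```python
import numpy as np

def generational_distance(approx: np.ndarray, reference: np.ndarray, p: int = 2) -> float:
    """GD: p-mean of distances from each approximate front point to its
    nearest reference-front point. GD = 0 means the approximation lies
    exactly on the reference front."""
    # Pairwise distances: (n_approx, n_reference)
    dists = np.linalg.norm(approx[:, None, :] - reference[None, :, :], axis=-1)
    nearest = dists.min(axis=1)
    return float(np.mean(nearest ** p) ** (1.0 / p))

ref = np.array([[0.0, 1.0], [1.0, 0.0]])         # hypothetical true front
print(generational_distance(ref, ref))            # a perfect approximation scores 0
print(generational_distance(np.array([[0.5, 1.5]]), ref))
```

Hypervolume measures spread and convergence jointly, while GD measures convergence only, which is why qEHVI's GD = 0.0 is reported alongside HV = 1.0.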

Results
The qEHVI BO baseline achieved perfect convergence (GD = 0.0, HV = 1.0) and satisfied all constraints, but required roughly 12 hours of GPU time. The Ax baseline converged more slowly (GD = 15.03, HV = 0.842) and missed a few constraints. Among the LLMs, WizardMath‑7B performed best, attaining GD = 1.21, HV = 0.967, and a 96 % constraint‑satisfaction rate, while averaging only 0.015 seconds per inference. LLaMA‑2‑7B and the BERT variants lagged behind but still outperformed the Ax baseline by a wide margin. Overall, LLMs were an order of magnitude faster than BO, yet they did not reach the exact Pareto front that qEHVI guarantees.

Discussion
The study highlights a clear trade‑off: BO provides theoretical convergence and robust constraint handling at high computational cost, whereas fine‑tuned LLMs deliver rapid, near‑optimal solutions with far lower latency. The residual GD for LLMs stems from their reliance on token‑based embeddings and a loss‑driven constraint penalty, which cannot fully capture the geometry of a continuous feasible region. Nevertheless, the speed advantage makes LLMs attractive for real‑time design assistance, especially in industrial settings where engineers need quick candidate generation for resins, polymers, or paints. The authors suggest future work on improving LLM continuous‑space representations (e.g., better tokenization, latent‑space regularization) and integrating explicit constraint mechanisms (e.g., Lagrange multipliers) to close the performance gap.

Conclusion
The comparative benchmark demonstrates that while specialized BO frameworks such as qEHVI remain the gold standard for guaranteed convergence in constrained multi‑objective inverse design, fine‑tuned LLMs are a viable, computationally efficient alternative that can substantially accelerate the design loop. A hybrid workflow—using BO for initial exploration and LLMs for rapid candidate generation—appears to be the most promising path forward for both academic research and industrial deployment.

