An Exploratory Study of Bayesian Prompt Optimization for Test-Driven Code Generation with Large Language Models
We consider the task of generating functionally correct code using large language models (LLMs). The correctness of generated code is influenced by the prompt used to query the given base LLM. We formulate the problem of finding an appropriate prompt as a combinatorial search process and propose a Bayesian optimization (BO) approach referred to as *BO for Code GENeration (BODE-GEN)*. BODE-GEN performs an adaptive, data-driven search over prompts guided by training data in the form of prompts tried and the functional accuracy of the generated code over a set of given test cases. The key insight is to perform BO in a continuous embedding space by using an auxiliary LLM to bridge the gap between the discrete prompt space and the continuous embedding space. We leverage two synergistic ideas, namely random projections and dimensionality-scaled priors, to build effective Gaussian process based surrogate models over the high-dimensional embedding space. Our experiments on the HumanEval+ benchmark using multiple base LLMs show that BODE-GEN improves code generation accuracy compared to fixed prompts and manual prompt engineering. Additionally, we show that BODE-GEN is sample-efficient, requiring relatively few BO iterations to achieve improvements in code accuracy.
💡 Research Summary
The paper tackles the problem of improving functional correctness of code generated by large language models (LLMs) through automated prompt optimization. While recent work has shown that the wording of a prompt can dramatically affect the quality of generated code, existing approaches rely on manual prompt engineering or static templates, which are costly and sub‑optimal. The authors formulate prompt selection as a combinatorial optimization problem and propose a Bayesian optimization (BO) framework called BODE‑GEN (BO for Code GENeration).
Key ideas:
- Continuous embedding search – Instead of searching directly in the discrete token space, candidate prompts are represented as high‑dimensional vectors (embeddings) produced by an auxiliary LLM (e.g., LLaMA‑2). BO operates on these continuous vectors, which enables gradient‑free optimization techniques.
- Embedding‑to‑text conversion – A second auxiliary LLM (LLM_aux) receives a concatenated embedding consisting of a fixed instruction, the candidate embedding, and the embedding of an initial prompt. LLM_aux then generates a human‑readable prompt that is fed to the base LLM (LLM_base) for code synthesis. This bridges the gap between the continuous search space and the discrete prompt space.
- Scalable Gaussian Process surrogate – The embedding dimension (≈4096) is far larger than typical BO settings. To keep Gaussian Process (GP) modeling tractable, the authors apply random projections to reduce dimensionality and employ dimensionality‑scaled priors that adjust variance per dimension. This yields a GP that can provide reliable mean and uncertainty estimates even in high‑dimensional spaces.
- Acquisition function – Expected Improvement (EI) is used to balance exploration and exploitation. At each iteration, EI is maximized to select the next embedding, which is then turned into a prompt, evaluated, and used to update the GP.
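To make the high-dimensional GP idea above concrete, the sketch below shows a random projection from a LLaMA-2-sized embedding (4096 dims) down to a low-dimensional space suitable for a GP surrogate. This is an illustrative stand-in, not the paper's implementation: the function names, the target dimension of 32, and the plain Gaussian projection matrix are assumptions for the example.

```python
import math
import random

def random_projection(x, d_low, seed=0):
    """Project a high-dimensional embedding x (list of floats) down to
    d_low dimensions with a fixed Gaussian random matrix. A hypothetical
    stand-in for the paper's projection step; the scaling by 1/sqrt(d_low)
    preserves squared norms in expectation (Johnson-Lindenstrauss style)."""
    rng = random.Random(seed)          # fixed seed -> same projection every call
    d_high = len(x)
    scale = 1.0 / math.sqrt(d_low)
    z = []
    for _ in range(d_low):
        row = [rng.gauss(0.0, 1.0) for _ in range(d_high)]
        z.append(scale * sum(r * xi for r, xi in zip(row, x)))
    return z

# A 4096-dim embedding (as an auxiliary LLM might produce) squeezed to 32 dims.
r = random.Random(1)
emb = [r.gauss(0.0, 1.0) for _ in range(4096)]
low = random_projection(emb, 32)
print(len(low))  # 32
```

Because the projection is fixed across BO iterations, the GP sees a consistent 32-dimensional view of every candidate embedding.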
Algorithmically, BODE‑GEN proceeds as follows:
- Embed the initial prompt and a fixed instruction.
- Initialize a GP surrogate with randomly sampled embeddings.
- For a fixed number of iterations (or until a target accuracy is reached):
  - Optimize EI to obtain a candidate embedding.
  - Concatenate it with the fixed-instruction and initial-prompt embeddings.
  - Pass the combined embedding to LLM_aux to obtain a textual prompt.
  - Query LLM_base with this prompt, generate code, and run the developer-provided test suite.
  - Record the fraction of passed tests as the objective value and update the GP.
- Return the prompt that achieved the highest test-pass rate.
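The steps above can be sketched as a runnable toy loop. Everything LLM-related here is stubbed: `embed`, `to_prompt`, and `pass_rate` are hypothetical placeholders for the auxiliary embedding model, LLM_aux, and the LLM_base-plus-test-suite evaluation, and the surrogate is a crude distance-weighted approximation rather than a real GP. Only the control flow mirrors the algorithm.

```python
import math
import random

rng = random.Random(0)

# --- Hypothetical stand-ins for the paper's components -------------------
def embed(text):            # auxiliary-LLM embedding (stubbed, 8-dim toy space)
    return [rng.gauss(0, 1) for _ in range(8)]

def to_prompt(z):           # LLM_aux: embedding -> textual prompt (stubbed)
    return "prompt@" + ",".join(f"{v:.2f}" for v in z[:3])

def pass_rate(prompt):      # run LLM_base + test suite (stubbed objective)
    return rng.random()

# --- A toy surrogate and Expected Improvement ----------------------------
def surrogate(z, X, y):
    """Distance-weighted mean/std; a crude stand-in for the GP posterior."""
    w = [math.exp(-sum((a - b) ** 2 for a, b in zip(z, x))) for x in X]
    s = sum(w) or 1e-9
    mu = sum(wi * yi for wi, yi in zip(w, y)) / s
    sigma = 1.0 / (1.0 + s)            # less certain far from observed data
    return mu, sigma

def expected_improvement(z, X, y, best):
    mu, sigma = surrogate(z, X, y)
    u = (mu - best) / sigma
    phi = math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)   # N(0,1) pdf
    Phi = 0.5 * (1 + math.erf(u / math.sqrt(2)))            # N(0,1) cdf
    return sigma * (u * Phi + phi)

# --- The BODE-GEN-style loop ---------------------------------------------
X = [embed("initial prompt")]                 # observed embeddings
y = [pass_rate(to_prompt(X[0]))]              # observed test-pass rates
best_prompt, best_score = to_prompt(X[0]), y[0]

for _ in range(15):                            # a handful of BO iterations
    # Maximize EI over random candidates (a real system would use a
    # gradient-based or evolutionary inner optimizer here).
    cands = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(64)]
    z = max(cands, key=lambda c: expected_improvement(c, X, y, best_score))
    p = to_prompt(z)
    score = pass_rate(p)                       # fraction of tests passed
    X.append(z); y.append(score)               # update the surrogate's data
    if score > best_score:
        best_prompt, best_score = p, score

print(best_score >= y[0])  # True: the incumbent never gets worse
```

The loop structure matches the summary: propose via EI, convert to text, evaluate against tests, update, and return the incumbent best prompt.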
Experiments are conducted on the HumanEval+ benchmark, covering a variety of coding tasks (e.g., implementing algorithms, data‑structure manipulations). Multiple base LLMs are evaluated, including GPT‑3.5, LLaMA‑2‑7B, and CodeLlama. Results show that BODE‑GEN consistently outperforms static prompts and manually engineered prompts, achieving 4–7 percentage‑point gains in pass rate on average. Smaller models benefit the most, sometimes exceeding a 10‑point improvement. Importantly, the BO loop converges after only 10–15 iterations, demonstrating strong sample efficiency and reducing the monetary/computational cost of prompt search.
Limitations are acknowledged: the quality of the embedding‑to‑prompt conversion depends on the auxiliary LLM, which may introduce noise; the current objective focuses solely on test‑case pass rate, ignoring other quality dimensions such as runtime efficiency, memory usage, or code readability; and the method relies on an initial prompt and fixed instruction that may bias the search.
Future directions suggested include: (1) using ensembles of auxiliary LLMs to improve robustness of the conversion step; (2) extending the BO framework to multi‑objective optimization to jointly consider correctness, performance, and security; (3) leveraging meta‑learning to initialize the surrogate model from prior tasks, thereby accelerating convergence on new problems; and (4) integrating an interactive loop where developers can add or modify test cases on the fly, allowing the optimizer to adapt in real time.
In summary, BODE‑GEN demonstrates that Bayesian optimization over continuous prompt embeddings, coupled with a language‑model‑driven bridge back to textual prompts, can automatically discover high‑quality prompts that markedly increase the functional correctness of LLM‑generated code. The approach is model‑agnostic, sample‑efficient, and opens avenues for automated prompt engineering across a broad range of generative AI applications.