Simple Baselines are Competitive with Code Evolution
Code evolution is a family of techniques that use large language models to search the space of computer programs by evolving or mutating existing code. Many proposed code-evolution pipelines report impressive performance but are rarely compared to simpler baselines. We test two simple baselines across three domains: finding better mathematical bounds, designing agentic scaffolds, and machine learning competitions. We find that the simple baselines match or exceed much more sophisticated methods in all three. Analyzing these results reveals several shortcomings in how code evolution is developed and used. For the mathematical bounds, a problem's search space and the domain knowledge in the prompt chiefly dictate a search's performance ceiling and efficiency; the code evolution pipeline is secondary. The primary challenge in finding improved bounds is therefore designing good search spaces, which is done by domain experts, not the search itself. When designing agentic scaffolds, we find that high variance in the scaffolds coupled with small validation sets leads to suboptimal scaffolds being selected; as a result, hand-designed majority-vote scaffolds perform best. We propose better evaluation methods that reduce evaluation stochasticity while keeping code evolution economically feasible. We close with a discussion of avenues and best practices to enable more rigorous code evolution in future work.
💡 Research Summary
This paper critically examines whether the added complexity of modern code‑evolution pipelines—systems that iteratively mutate, recombine, and evolve code using large language models (LLMs)—actually yields superior results compared to very simple baselines. The authors introduce two minimalist baselines: (1) IID Random Sampling (IID RS), which prompts an LLM to generate a batch of independent programs for a task and then selects the best‑performing one after execution; and (2) Sequential Conditioned Sampling (SCS), which builds on IID RS by conditioning subsequent generations on a random subset of previously successful programs, optionally restarting after a few generations. Neither baseline employs explicit fitness‑based selection; they rely purely on execution outcomes.
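The two baselines can be sketched as follows. The paper does not specify an implementation, so this is a minimal toy: `generate_program` and `evaluate` are hypothetical stand-ins (here, "programs" are random numbers scored by their own value rather than LLM-generated code), and the batch sizes, context size, and prompt format are illustrative. The optional SCS restarts are noted but omitted for brevity.

```python
import random


def generate_program(prompt, rng):
    """Toy stand-in for an LLM call: in real usage this would prompt an LLM
    for a program and execute it. Here a 'program' is just a random float."""
    return rng.random()


def evaluate(program):
    """Toy evaluator: the program's execution score is its own value."""
    return program


def iid_random_sampling(task_prompt, n, rng):
    """IID RS: draw n independent programs, return the best by execution score."""
    candidates = [generate_program(task_prompt, rng) for _ in range(n)]
    return max(candidates, key=evaluate)


def sequential_conditioned_sampling(task_prompt, rounds, batch, context_k, rng):
    """SCS: each round conditions generation on a random subset of previously
    successful programs. No explicit fitness-based selection is used; the
    optional periodic restart (clearing the archive) is omitted here."""
    archive = []
    for _ in range(rounds):
        context = rng.sample(archive, min(context_k, len(archive)))
        prompt = task_prompt + " | prior programs: " + ", ".join(
            f"{c:.3f}" for c in context
        )
        batch_progs = [generate_program(prompt, rng) for _ in range(batch)]
        # Keep only programs that executed successfully (here: positive score).
        archive.extend(p for p in batch_progs if evaluate(p) > 0)
    return max(archive, key=evaluate)
```

In this toy setting both baselines reduce to best-of-n sampling; the point is that the control flow really is this simple, with execution outcomes doing all the selection work.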
The study evaluates these baselines across three distinct domains, each with a different primary constraint: (a) finding tighter mathematical bounds under a fixed API‑cost budget, (b) designing agentic scaffolds where the number of function evaluations (≈100 validation examples) is limited, and (c) competing in machine‑learning challenges where wall‑clock time is the bottleneck.
In the mathematical‑bounds experiments, nine problems from Novikov et al. (2025) spanning analysis, combinatorics, and geometry were tackled. Using a $20 budget, the authors compared against ShinkaEvolve (an open‑source analogue of AlphaEvolve) and, for reference, the proprietary AlphaEvolve results. SCS matched or exceeded ShinkaEvolve on six of nine problems, while IID RS did so on four. Both baselines often reached the same performance ceiling as ShinkaEvolve when the search space and prompt‑embedded domain knowledge were held constant. The authors demonstrate that reformulating a problem—i.e., changing its search space—has a far larger impact on the final bound than the sophistication of the evolution pipeline itself. Moreover, modest prompt tweaks (adding a helpful initial function) can boost performance, whereas overly informative prompts can bias the search toward suboptimal regions.
For agentic scaffolds, the authors replicate the setting of Hu et al. (2024), where LLM‑generated scaffolds are evaluated on roughly 100 examples. The limited validation set induces high variance; consequently, the more elaborate evolution pipelines tend to overfit to noisy signals and select scaffolds that generalize poorly. A simple hand‑crafted majority‑vote scaffold consistently outperformed all automatically evolved candidates. The paper proposes more robust evaluation practices—such as increasing validation sample size, employing cross‑validation, and using bootstrap confidence intervals—while keeping API costs manageable.
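One of the proposed fixes, bootstrap confidence intervals over the small validation set, can be sketched as below. The function name and the percentile-bootstrap choice are assumptions for illustration, not the paper's exact procedure; per-example scores are assumed to be 0/1 correctness values.

```python
import random


def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of
    per-example scores. Resamples the validation set with replacement
    n_boot times and takes the alpha/2 and 1 - alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

With ~100 validation examples and a 60% accuracy scaffold, the resulting interval spans roughly ±0.1, which illustrates why noisy rankings over such small sets can select scaffolds that generalize poorly.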
In the machine‑learning competition domain, the baselines’ inherent parallelism allowed them to finish within 1–3 hours per problem, far faster than ShinkaEvolve’s ~10 hours, while achieving comparable or better scores under the same wall‑clock constraint. Cost‑efficiency analyses (budget‑normalized probability of beating ShinkaEvolve) show the baselines dominate across the entire budget range.
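One way a budget-normalized probability-of-beating metric could be computed from repeated runs is sketched below; the data layout (per-budget lists of run scores) and function names are assumptions rather than the paper's exact analysis, and higher scores are assumed to be better.

```python
def prob_beats_reference(baseline_scores, reference_score):
    """Empirical probability that a single baseline run beats the
    reference method's score at the same budget."""
    return sum(s > reference_score for s in baseline_scores) / len(baseline_scores)


def beat_curve(runs_by_budget, reference_by_budget):
    """Probability-of-beating curve over a shared grid of budgets,
    e.g. dollars spent or wall-clock hours."""
    return {
        budget: prob_beats_reference(runs_by_budget[budget],
                                     reference_by_budget[budget])
        for budget in sorted(runs_by_budget)
    }
```

Dominance "across the entire budget range" then corresponds to this curve staying at or above 0.5 at every budget point.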
From these findings, the authors draw several key insights: (1) the design of the search space and the amount of domain knowledge encoded in prompts are the primary determinants of success; (2) added algorithmic complexity (selection, crossover, multi‑model ensembles) does not guarantee better outcomes and can even increase stochasticity; (3) evaluation methodology must be statistically sound to avoid selecting overfit solutions, especially when evaluation budgets are tight.
The paper concludes with concrete best‑practice recommendations for future code‑evolution research: (i) always benchmark against simple IID and sequential baselines under identical resource constraints; (ii) involve domain experts early to craft expressive yet tractable search spaces; (iii) adopt robust statistical evaluation (e.g., bootstrapping, confidence intervals) and allocate sufficient validation data; (iv) release code, prompts, and raw results to ensure reproducibility. By following these guidelines, the community can more reliably assess the true value of sophisticated code‑evolution pipelines and focus effort on the aspects—search‑space engineering and evaluation rigor—that truly drive progress.