EvoGPT: Leveraging LLM-Driven Seed Diversity to Improve Search-Based Test Suite Generation
Search-Based Software Testing (SBST) is a well-established approach for automated unit test generation, yet it often suffers from premature convergence and limited diversity in the generated test suites. Recently, Large Language Models (LLMs) have emerged as an alternative technique for unit test generation. We present EvoGPT, a hybrid test generation system that integrates LLM-based test generation with SBST-based test suite optimization. EvoGPT uses LLMs to generate an initial population of test suites, then applies an Evolutionary Algorithm (EA) to further optimize this population. A distinguishing feature of EvoGPT is its explicit enforcement of diversity, achieved through the use of multiple temperatures and prompt instructions during test generation. In addition, each LLM-generated test is refined using a generation-repair loop and coverage-guided assertion generation. To address evolutionary plateaus, EvoGPT also detects stagnation during search and injects additional LLM-generated tests aimed at previously uncovered branches; here, too, diversity is enforced via multiple temperatures and prompt instructions. We evaluate EvoGPT on Defects4J, a standard benchmark for test generation. The results show that EvoGPT achieves, on average, a 10% improvement in both code coverage and mutation score compared to both TestART, an LLM-only baseline, and EvoSuite, a standard SBST baseline. An ablation study indicates that explicitly enforcing diversity, both at initialization and during the search, is key to effectively leveraging LLMs for automated unit test generation.
💡 Research Summary
The paper introduces EvoGPT, a hybrid test generation framework that combines Large Language Model (LLM)‑driven seed creation with traditional Search‑Based Software Testing (SBST) optimization. The authors identify two complementary weaknesses: SBST excels at systematic structural coverage but often suffers from premature convergence and a lack of semantic diversity; LLMs can generate human‑like, semantically rich unit tests but lack iterative feedback mechanisms and struggle to explore deep input spaces. EvoGPT addresses these issues by explicitly enforcing diversity at both the initialization and plateau‑escape stages.
In the initialization phase, five distinct prompt templates (e.g., edge‑case focus, deep object chains, creative assertions) are paired with multiple temperature settings. Five asynchronous queries per prompt template yield a total of 25 diverse test suites, each embodying different semantic characteristics. Every generated test undergoes a lightweight generation‑repair loop: compilation or runtime failures trigger stack‑trace capture, rule‑based fixes, and a re‑prompt to the LLM, ensuring that the initial population consists of valid, runnable tests.
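The seed-generation scheme described above can be sketched as a simple request builder. The prompt wordings, temperature values, and function name below are illustrative assumptions; the summary only states that five templates and multiple temperatures are combined to produce 25 suites:

```python
# Hypothetical prompt templates echoing the styles named in the summary
# (edge-case focus, deep object chains, creative assertions, ...);
# the exact wording used by EvoGPT is not given here.
PROMPT_TEMPLATES = [
    "Write JUnit tests that exercise edge cases of {cls}.",
    "Write JUnit tests that build deep object chains before calling {cls}.",
    "Write JUnit tests with creative, behavior-revealing assertions for {cls}.",
    "Write JUnit tests covering boundary inputs of {cls}.",
    "Write JUnit tests that combine several methods of {cls} in one test.",
]
TEMPERATURES = [0.2, 0.5, 0.8, 1.0, 1.2]  # assumed values for illustration

def build_seed_requests(cls_name, queries_per_template=5):
    """Pair each prompt template with a temperature, one request per
    (template, query) combination: 5 templates x 5 queries = 25 suites."""
    requests = []
    for template in PROMPT_TEMPLATES:
        for q in range(queries_per_template):
            requests.append({
                "prompt": template.format(cls=cls_name),
                "temperature": TEMPERATURES[q],  # vary temperature per query
            })
    return requests
```

Each request would then be sent asynchronously to the LLM, and every returned test suite fed through the generation-repair loop before joining the initial population.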
A coverage‑enhancement step follows, where uncovered branches are identified via static analysis and the LLM is asked to add tests targeting those branches, inserting assertions in a manner inspired by TestART. The resulting pool of 25 suites is fed into an Evolutionary Algorithm (EA). The EA employs standard selection, crossover, and mutation operators, but its fitness function is a weighted sum of three normalized metrics: Line Coverage of Correct Tests (LCCT), Branch Coverage of Correct Tests (BCCT), and Mutation Score of Correct Tests (MSCT). This multi‑objective fitness encourages both structural thoroughness and fault‑detection capability.
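The weighted-sum fitness described above can be written compactly. The equal default weights below are an assumption for illustration; the summary only says the three normalized metrics are combined as a weighted sum:

```python
def fitness(lcct, bcct, msct, weights=(1/3, 1/3, 1/3)):
    """Weighted sum of three normalized metrics, each in [0, 1]:
    LCCT (line coverage), BCCT (branch coverage), and MSCT (mutation
    score) of the correct tests. Equal weights are assumed here; the
    actual weights used by EvoGPT may differ."""
    w_l, w_b, w_m = weights
    return w_l * lcct + w_b * bcct + w_m * msct
```

A suite with perfect coverage and mutation score thus scores 1.0, and shifting weight toward MSCT would bias the search toward fault-detection ability over raw coverage.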
During evolution, EvoGPT monitors for stagnation: if no significant fitness improvement occurs over a configurable number of generations, a plateau‑detection mechanism triggers. At this point the system re‑invokes the LLM with the same diverse prompt‑temperature configurations, this time explicitly asking for tests that cover the still‑uncovered branches and mutants. The newly generated tests are injected into the current best suite, allowing the search to escape local optima—a strategy borrowed from CodaMOSA.
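The stagnation check can be sketched as a sliding window over the best fitness per generation. The class name, `patience`, and `epsilon` parameters below are illustrative assumptions; the summary only says the threshold number of generations is configurable:

```python
from collections import deque

class PlateauDetector:
    """Track best fitness over a sliding window and signal a plateau when
    the improvement across the last `patience` generations falls below
    `epsilon`. Names and defaults are illustrative, not from the paper."""

    def __init__(self, patience=5, epsilon=1e-3):
        self.patience = patience
        self.epsilon = epsilon
        self.history = deque(maxlen=patience + 1)

    def update(self, best_fitness):
        """Record this generation's best fitness; return True on plateau."""
        self.history.append(best_fitness)
        if len(self.history) <= self.patience:
            return False  # not enough generations observed yet
        return (self.history[-1] - self.history[0]) < self.epsilon
```

When `update` returns True, EvoGPT would re-query the LLM with the diverse prompt-temperature configurations, targeting the still-uncovered branches and surviving mutants, and inject the new tests into the current best suite.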
The authors evaluate EvoGPT on the Defects4J benchmark, comparing it against EvoSuite (a state‑of‑the‑art SBST tool) and TestART (an LLM‑only baseline). Across all projects, EvoGPT achieves an average 10% improvement in line coverage, branch coverage, and mutation score relative to both baselines. The gains are especially pronounced on classes with many conditional branches, confirming that semantic diversity in the seed population expands the search space effectively.
An ablation study isolates three key components: (1) removing prompt/temperature diversity, (2) omitting the evolutionary optimization (LLM‑only), and (3) disabling the plateau‑escape injection. Each removal leads to a statistically significant drop in performance, with the loss of seed diversity reducing overall gains to under 6%. This empirically validates the authors' claim that diversity at both stages is essential for leveraging LLMs within SBST.
The paper’s contributions are: (i) a novel method for generating semantically diverse test seeds via multi‑prompt, multi‑temperature LLM queries; (ii) a tightly integrated hybrid pipeline that combines LLM‑driven generation, repair, coverage‑guided augmentation, and EA‑based optimization; (iii) a plateau‑escape mechanism that re‑uses LLM diversity to overcome evolutionary stagnation; and (iv) a comprehensive empirical evaluation demonstrating statistically significant improvements over strong baselines.
Limitations include the computational cost of repeated LLM calls, the reliance on manually crafted prompt templates (which may not generalize across languages or domains), and the focus on Java projects within Defects4J. Future work is suggested on cost‑effective LLM scheduling, automated prompt synthesis, extending the approach to other programming languages, and incorporating richer feedback (e.g., symbolic execution results) into the repair loop. Overall, EvoGPT showcases how carefully managed semantic diversity can bridge the gap between generative AI and classic evolutionary search, yielding higher‑quality automated unit tests.