LLMs for Automated Unit Test Generation and Assessment in Java: The AgoneTest Framework

Reading time: 5 minutes

📝 Original Info

  • Title: LLMs for Automated Unit Test Generation and Assessment in Java: The AgoneTest Framework
  • ArXiv ID: 2511.20403
  • Date: 2025-11-27
  • Authors: Andrea Lops, Fedelucio Narducci, Azzurra Ragone, Michelantonio Trizio, Claudio Bartolini

📝 Abstract

Unit testing is an essential but resource-intensive step in software development, ensuring individual code units function correctly. This paper introduces AgoneTest, an automated evaluation framework for Large Language Model (LLM)-generated unit tests in Java. AgoneTest does not aim to propose a novel test generation algorithm; rather, it supports researchers and developers in comparing different LLMs and prompting strategies through a standardized end-to-end evaluation pipeline under realistic conditions. We introduce the Classes2Test dataset, which maps Java classes under test to their corresponding test classes, and a framework that integrates advanced evaluation metrics, such as mutation score and test smells, for a comprehensive assessment. Experimental results show that, for the subset of tests that compile, LLM-generated tests can match or exceed human-written tests in terms of coverage and defect detection. Our findings also demonstrate that enhanced prompting strategies contribute to test quality. AgoneTest clarifies the potential of LLMs in software testing and offers insights for future improvements in model design, prompt engineering, and testing practices.


📄 Full Content

LLMs for Automated Unit Test Generation and Assessment in Java: The AGONETEST Framework

Andrea Lops∗‡, Fedelucio Narducci∗, Azzurra Ragone†, Michelantonio Trizio‡, Claudio Bartolini‡
∗Polytechnic University of Bari, Bari, Italy. Email: {andrea.lops, fedelucio.narducci}@poliba.it
†University of Bari, Bari, Italy. Email: azzurra.ragone@uniba.it
‡Wideverse, Bari, Italy. Email: {andrea.lops, michelantonio.trizio, claudio.bartolini.consultant}@wideverse.com

Abstract—Unit testing is an essential but resource-intensive step in software development, ensuring individual code units function correctly. This paper introduces AGONETEST, an automated evaluation framework for Large Language Model (LLM)-generated unit tests in Java. AGONETEST does not aim to propose a novel test generation algorithm; rather, it supports researchers and developers in comparing different LLMs and prompting strategies through a standardized end-to-end evaluation pipeline under realistic conditions. We introduce the CLASSES2TEST dataset, which maps Java classes under test to their corresponding test classes, and a framework that integrates advanced evaluation metrics, such as mutation score and test smells, for a comprehensive assessment. Experimental results show that, for the subset of tests that compile, LLM-generated tests can match or exceed human-written tests in terms of coverage and defect detection. Our findings also demonstrate that enhanced prompting strategies contribute to test quality. AGONETEST clarifies the potential of LLMs in software testing and offers insights for future improvements in model design, prompt engineering, and testing practices.

Index Terms—Software Testing, Large Language Model, Automatic Assessment and Evaluation, Assessment and Evaluation in Software Testing

I. INTRODUCTION

Software testing is a critical step in the software development lifecycle, essential for ensuring code correctness and reliability. Unit testing, in particular, verifies the proper functioning of individual code units. However, designing and building unit tests is a costly and labor-intensive process that requires significant time and specialized skills [1]. Automating this process is an active area of research and development.

Automated tools for generating unit tests can reduce the workload of test engineers and software developers. These tools typically use static code analysis methods to generate test suites. For example, EvoSuite [2], a popular tool that combines static code analysis with evolutionary search, has demonstrated the ability to achieve adequate coverage. Large Language Models (LLMs), efficiently exploited in various aspects of software development, could also handle the automatic generation of unit tests. Several empirical studies on LLMs have highlighted their ability to generate tests for simple scenarios, often limited to single methods [3]–[6]. Though directionally useful, these explorations often focus on independent, small-scale test units and rely on manual integration into projects, providing a limited view of LLM performance in real-world software development scenarios [6], [7]. This manual process restricts the number of tests that can be executed and reduces overall efficiency.

To address these gaps, we have developed a framework explicitly focused on the evaluation of unit test suites generated by LLMs. Rather than proposing a novel generation method, our contribution lies in providing an end-to-end pipeline that standardizes how LLM-based test suites can be assessed in realistic software projects. A simple use case illustrates how AGONETEST can be applied in a real-world scenario. Imagine a developer or a researcher who needs to evaluate which LLM and prompting strategy performs best for generating unit tests. Doing this manually would require repeated project setup, test execution, and metric collection, making the process slow and error-prone. With AGONETEST's standardized end-to-end pipeline, the developer can automate the workflow and directly compare LLMs under different prompting strategies. The framework produces reliable and reproducible metrics, revealing, for instance, that one model generates more compilable tests while another achieves higher coverage. In this way, AGONETEST turns ad hoc experimentation into systematic benchmarking.

Our approach focuses on class-level test code evaluation, which is closer to real-world practices as it covers method interactions and shared state, reducing code redundancy [8]. For instance, a simple ItemManager class with two methods, addItem() and getItemCount(), illustrates this point. A method-level test for getItemCount() in isolation might only verify its behavior on an empty list, ignoring how the state changes when new items are added. In contrast, a class-level test naturally exercises the interaction between addItem() and getItemCount(), ensuring that the internal state is updated consistently across method calls. This example highlights thre
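The ItemManager scenario above can be sketched in code. Only the class and method names come from the text; this minimal implementation and the test around it are illustrative assumptions, not material from the paper:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal implementation of the ItemManager class from the
// example above; only the two method names appear in the original text.
class ItemManager {
    private final List<String> items = new ArrayList<>();

    void addItem(String item) {
        items.add(item);
    }

    int getItemCount() {
        return items.size();
    }
}

public class ItemManagerClassLevelTest {
    public static void main(String[] args) {
        ItemManager manager = new ItemManager();

        // Method-level view: getItemCount() checked in isolation on a
        // fresh instance only verifies the empty-list behavior.
        if (manager.getItemCount() != 0) {
            throw new AssertionError("expected an empty manager");
        }

        // Class-level view: exercise the interaction between addItem()
        // and getItemCount() through the shared internal state.
        manager.addItem("widget");
        manager.addItem("gadget");
        if (manager.getItemCount() != 2) {
            throw new AssertionError("state not updated across method calls");
        }
        System.out.println("class-level interaction verified");
    }
}
```

A real suite would express these checks as JUnit test methods; the plain `main` form is used here only to keep the sketch self-contained.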

…(Full text truncated)…
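Among the metrics the paper names, mutation score is conventionally the fraction of injected faults (mutants) that a test suite detects, i.e. killed mutants divided by total mutants. A generic sketch of that arithmetic follows; the class and method names are hypothetical and this is not AGONETEST's own implementation:

```java
public class MutationScoreSketch {
    // Conventional definition: killed mutants / total generated mutants.
    static double mutationScore(int killedMutants, int totalMutants) {
        if (totalMutants <= 0) {
            throw new IllegalArgumentException("need at least one mutant");
        }
        return (double) killedMutants / totalMutants;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: a suite that kills 42 of 60 mutants.
        System.out.println(mutationScore(42, 60)); // prints 0.7
    }
}
```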

Reference

This content is AI-processed based on ArXiv data.
