📝 Original Info
- Title: LLMs for Automated Unit Test Generation and Assessment in Java: The AgoneTest Framework
- ArXiv ID: 2511.20403
- Date: 2025-11-27
- Authors: Andrea Lops, Fedelucio Narducci, Azzurra Ragone, Michelantonio Trizio, Claudio Bartolini
📝 Abstract
Unit testing is an essential but resource-intensive step in software development, ensuring individual code units function correctly. This paper introduces AgoneTest, an automated evaluation framework for Large Language Model (LLM)-generated unit tests in Java. AgoneTest does not aim to propose a novel test generation algorithm; rather, it supports researchers and developers in comparing different LLMs and prompting strategies through a standardized end-to-end evaluation pipeline under realistic conditions. We introduce the Classes2Test dataset, which maps Java classes under test to their corresponding test classes, and a framework that integrates advanced evaluation metrics, such as mutation score and test smells, for a comprehensive assessment. Experimental results show that, for the subset of tests that compile, LLM-generated tests can match or exceed human-written tests in terms of coverage and defect detection. Our findings also demonstrate that enhanced prompting strategies contribute to test quality. AgoneTest clarifies the potential of LLMs in software testing and offers insights for future improvements in model design, prompt engineering, and testing practices.
📄 Full Content
LLMs for Automated Unit Test Generation and Assessment in Java: The AGONETEST Framework
Andrea Lops∗‡, Fedelucio Narducci∗, Azzurra Ragone†, Michelantonio Trizio‡, Claudio Bartolini‡
∗Polytechnic University of Bari, Bari, Italy
Email: {andrea.lops, fedelucio.narducci}@poliba.it
†University of Bari, Bari, Italy
Email: azzurra.ragone@uniba.it
‡Wideverse, Bari, Italy
Email: {andrea.lops, michelantonio.trizio, claudio.bartolini.consultant}@wideverse.com
Abstract—Unit testing is an essential but resource-intensive step in software development, ensuring individual code units function correctly. This paper introduces AGONETEST, an automated evaluation framework for Large Language Model (LLM)-generated unit tests in Java. AGONETEST does not aim to propose a novel test generation algorithm; rather, it supports researchers and developers in comparing different LLMs and prompting strategies through a standardized end-to-end evaluation pipeline under realistic conditions. We introduce the CLASSES2TEST dataset, which maps Java classes under test to their corresponding test classes, and a framework that integrates advanced evaluation metrics, such as mutation score and test smells, for a comprehensive assessment. Experimental results show that, for the subset of tests that compile, LLM-generated tests can match or exceed human-written tests in terms of coverage and defect detection. Our findings also demonstrate that enhanced prompting strategies contribute to test quality. AGONETEST clarifies the potential of LLMs in software testing and offers insights for future improvements in model design, prompt engineering, and testing practices.

Index Terms—Software Testing, Large Language Model, Automatic Assessment and Evaluation, Assessment and Evaluation in Software Testing
I. INTRODUCTION
Software testing is a critical step in the software development lifecycle, essential for ensuring code correctness and reliability. Unit testing, in particular, verifies the proper functioning of individual code units. However, designing and building unit tests is a costly and labor-intensive process that requires significant time and specialized skills [1]. Automating this process is an active area of research and development.

Automated tools for generating unit tests can reduce the workload of test engineers and software developers. These tools typically use static code analysis methods to generate test suites. For example, EvoSuite [2], a popular tool that combines static code analysis with evolutionary search, has demonstrated the ability to achieve adequate coverage.

Large Language Models (LLMs), efficiently exploited in various aspects of software development, could also handle the automatic generation of unit tests. Several empirical studies on LLMs have highlighted their ability to generate tests for simple scenarios, often limited to single methods [3]–[6]. Though directionally useful, these explorations often focus on independent, small-scale test units and rely on manual integration into projects, providing a limited view of LLM performance in real-world software development scenarios [6], [7]. This manual process restricts the number of tests that can be executed and reduces overall efficiency.
To address these gaps, we have developed a framework explicitly focused on the evaluation of unit test suites generated by LLMs. Rather than proposing a novel generation method, our contribution lies in providing an end-to-end pipeline that standardizes how LLM-based test suites can be assessed in realistic software projects. A simple use case illustrates how AGONETEST can be applied in a real-world scenario. Imagine a developer or a researcher who needs to evaluate which LLM and prompting strategy performs best for generating unit tests. Doing this manually would require repeated project setup, test execution, and metric collection, making the process slow and error-prone. With AGONETEST's standardized end-to-end pipeline, the developer can automate the workflow and directly compare LLMs under different prompting strategies. The framework produces reliable and reproducible metrics, revealing, for instance, that one model generates more compilable tests while another achieves higher coverage. In this way, AGONETEST turns ad hoc experimentation into systematic benchmarking. Our approach focuses on class-level test code evaluation, which is closer to real-world practices as it covers method interactions and shared state, reducing code redundancy [8].
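The benchmarking workflow described above can be sketched as a grid over models and prompting strategies, with one metrics record collected per pair. Every name and number below is an illustrative stub, not the actual AGONETEST API: a real pipeline would prompt the LLM, inject the generated test class into the project, compile it, and gather coverage and mutation metrics with tools such as JaCoCo and PIT.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a model-by-strategy benchmarking grid.
class BenchmarkSketch {
    // Standardized metrics a pipeline like AGONETEST would report per run.
    record Metrics(double compileRate, double lineCoverage, double mutationScore) {}

    // Stub standing in for "generate tests with this model/strategy, then measure".
    static Metrics evaluate(String model, String strategy) {
        // Placeholder numbers; a real run would derive these from actual test execution.
        double bonus = strategy.equals("few-shot") ? 0.1 : 0.0;
        return new Metrics(0.6 + bonus, 0.5 + bonus, 0.4 + bonus);
    }

    // Run every (model, strategy) combination and collect results for comparison.
    static Map<String, Metrics> runGrid(List<String> models, List<String> strategies) {
        Map<String, Metrics> results = new LinkedHashMap<>();
        for (String model : models) {
            for (String strategy : strategies) {
                results.put(model + "/" + strategy, evaluate(model, strategy));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, Metrics> results =
                runGrid(List.of("model-a", "model-b"), List.of("zero-shot", "few-shot"));
        results.forEach((config, metrics) -> System.out.println(config + " -> " + metrics));
    }
}
```

Collecting all configurations into one map is what turns ad hoc trial runs into a side-by-side comparison: the same metrics, computed the same way, for every model and prompting strategy.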
For instance, a simple ItemManager class with two methods, addItem() and getItemCount(), illustrates this point. A method-level test for getItemCount() in isolation might only verify its behavior on an empty list, ignoring how the state changes when new items are added. In contrast, a class-level test naturally exercises the interaction between addItem() and getItemCount(), ensuring that the internal state is updated consistently across method calls.
This example highlights thre
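The ItemManager example can be made concrete with a minimal sketch. The paper names only the two methods, so the list-backed implementation and the test class below are assumptions for illustration; the point is that the test reads and writes shared state across both methods rather than checking one method in isolation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reconstruction of the ItemManager class described in the text.
class ItemManager {
    private final List<String> items = new ArrayList<>();

    void addItem(String item) {
        items.add(item);
    }

    int getItemCount() {
        return items.size();
    }
}

// Class-level test: unlike a method-level test of getItemCount() alone,
// it verifies that state written by addItem() is observed by getItemCount().
class ItemManagerClassLevelTest {
    static void check(boolean condition, String message) {
        if (!condition) throw new AssertionError(message);
    }

    public static void main(String[] args) {
        ItemManager manager = new ItemManager();
        check(manager.getItemCount() == 0, "fresh manager should report zero items");
        manager.addItem("widget");
        manager.addItem("gadget");
        check(manager.getItemCount() == 2, "count must reflect both insertions");
        System.out.println("class-level interaction test passed");
    }
}
```

A method-level test suite could pass with a getItemCount() that always returns zero on a fresh instance; only the class-level test above catches a broken interaction between the two methods.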
…(Full text truncated)…
Reference
This content is AI-processed based on ArXiv data.