MASTEST: A Multi-Agent System for LLM-Based API Test Automation
📝 Abstract
Testing RESTful APIs is increasingly important in the quality assurance of cloud-native applications. Recent advances in machine learning (ML) techniques have demonstrated that various testing activities can be performed automatically by large language models (LLMs) with reasonable accuracy. This paper develops a multi-agent system called MASTEST that combines LLM-based and programmed agents to form a complete tool chain covering the whole API testing workflow: generating unit and system test scenarios from API specifications in the OpenAPI (Swagger) format, generating Pytest test scripts, executing those scripts to interact with web services, and analysing web service response messages to determine test correctness and calculate test coverage. The system also supports the incorporation of human testers, who review and correct LLM-generated test artefacts to ensure the quality of testing activities. The MASTEST system is evaluated with two LLMs, GPT-4o and DeepSeek V3.1 Reasoner, on five public APIs. The performance of the LLMs on the various testing activities is measured by a wide range of metrics: unit and system test scenario coverage and API operation coverage for the quality of generated test scenarios; data type correctness, status code coverage, and script syntax correctness for the quality of LLM-generated test scripts; and the bug detection ability and usability of the generated scenarios and scripts. Experimental results demonstrate that both DeepSeek and GPT-4o achieve high overall performance. DeepSeek excels in data type correctness and status code detection, while GPT-4o performs best in API operation coverage. For both models, the LLM-generated test scripts maintained 100% syntax correctness and required only minimal manual edits for semantic correctness. These findings indicate the effectiveness and feasibility of MASTEST.
📄 Content
RESTful web services have been widely adopted by cloud-native applications due to their flexibility, scalability, reliability, and efficiency. However, the structural complexity and dynamic behaviour of cloud-native applications, especially those built on a microservices architecture, pose a grave challenge to their testing and quality assurance. The testing methods widely employed in practice combine manual tasks with automated tools: testers design test cases, prepare test data, invoke web services, and verify the correctness of the responses from the web service under test, while automated testing tools support various activities of the process but require manual development of test frameworks and/or test scripts. Despite significant efficiency improvements over the past decade, API testing remains labour-intensive and error-prone. Automated testing has been intensively researched and practically exercised for several decades [1], but has not completely replaced manual testing. It is therefore highly desirable to advance the technology of automated testing for cloud-native applications.
With the rapid advances of machine learning (ML) techniques, especially large language models (LLMs), researchers have explored the capabilities of these techniques in performing various software development tasks, especially their potential to assist in API testing. For instance, empirical studies have demonstrated the applicability of LLMs to generating diverse input parameters [2]–[4], interpreting the upstream and downstream dependencies between API operations and the restrictions on API responses [5], [6], and generating test code from program source code [7], business requirements [8], and API specifications [9]. However, a wide gap remains to practical uses of LLMs, since their imperfect performance does not scale to automatically completing the whole workflow of testing large and complex cloud-native applications.
In this paper, we propose a multi-agent system, called MASTEST, that integrates a group of agents, each empowered by an LLM or implemented in a programming language, to perform testing activities autonomously and to interact with human users through graphical user interfaces. Together they conduct testing activities covering the whole workflow of RESTful API testing. In particular, LLM-based intelligent agents are employed for labour-intensive but creative tasks such as generating test scenarios, generating test scripts, and analysing the response messages received from web services during testing to check correctness and calculate test coverage. Programmed agents perform routine tasks such as parsing the API specification and invoking test scripts against the web services under test. Graphical user interfaces bring human testers into the loop for quality assurance tasks such as reviewing the output generated by the LLM-based agents and correcting their errors.
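The agent pipeline described above can be sketched as follows. The function names and data shapes are illustrative, not MASTEST's actual interfaces: a programmed agent parses the API specification, a stub stands in for the LLM-based scenario-generation agent, and a review step models the human-in-the-loop correction.

```python
# Illustrative sketch of the agent roles described above; names and
# interfaces are hypothetical, not MASTEST's actual implementation.

def spec_parser_agent(spec: dict) -> list[dict]:
    """Programmed agent: extract (method, path) operations from an
    OpenAPI-style specification dictionary."""
    ops = []
    for path, methods in spec.get("paths", {}).items():
        for method, meta in methods.items():
            ops.append({"method": method.upper(), "path": path,
                        "summary": meta.get("summary", "")})
    return ops

def scenario_agent(operations: list[dict]) -> list[str]:
    """Stand-in for the LLM-based agent: in MASTEST this step would
    prompt an LLM; here we emit one nominal scenario per operation."""
    return [f"{op['method']} {op['path']}: expect 2xx for valid input"
            for op in operations]

def human_review(scenarios: list[str], edits: dict[int, str]) -> list[str]:
    """Human-in-the-loop step: testers replace flawed scenarios by index."""
    return [edits.get(i, s) for i, s in enumerate(scenarios)]

spec = {"paths": {"/pets": {"get": {"summary": "List pets"},
                            "post": {"summary": "Create pet"}}}}
ops = spec_parser_agent(spec)
scenarios = human_review(scenario_agent(ops), {})
```

The split mirrors the paper's division of labour: deterministic parsing stays in plain code, while the creative scenario-drafting step is the one delegated to an LLM and then vetted by a human.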
The paper is organised as follows. Section II reviews related work. Section III presents the design and implementation of the MASTEST system. Section IV reports the experiments that evaluate the system. Section V concludes the paper with a discussion of the limitations and future work.
arXiv:2511.18038v1 [cs.SE] 22 Nov 2025
This section reviews the existing methods and tools for testing RESTful APIs in industry as well as related research, including traditional testing approaches and those utilising ML and LLMs.
Manual testing of RESTful APIs is common practice in industry, usually accomplished with testing tools or by manually writing test code. Postman 1 is one of the most widely adopted tools and is often preferred by API developers and testers. It provides an intuitive graphical interface that supports the creation and management of API requests, batch execution with parameterisation, customisation of input parameters and assertions, and visual inspection of responses, thereby facilitating unified testing and management. Furthermore, automated testing can be achieved by organising requests into collections. However, it has limitations when handling complex scenarios. Postman's free plan supports collaboration among up to three users; larger teams and advanced collaboration features require a paid subscription, and complex business workflows involving conditional branching or cyclic dependencies require custom scripts. Additionally, it does not support direct integration with dynamic data sources such as databases. Compared to code-level custom frameworks, it is less flexible and scalable for complex APIs.
Swagger UI 2 is another widely used tool in the industry, designed for visualising and interacting with APIs defined by the OpenAPI Specification 3. It enables developers and testers to explore API functionality and perform quick manual verification. However, it lacks support for parameterisation, assertions, and automation.
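The same OpenAPI document that Swagger UI renders also declares, per operation, which response status codes are expected, which is what makes metrics like the status code coverage used later in this paper computable. A small sketch (illustrative helper names, not a Swagger UI feature) of deriving that coverage from a spec fragment:

```python
# Compute status-code coverage for one operation: the fraction of
# status codes declared in an OpenAPI-style spec that were actually
# observed during testing. Helper names are illustrative.

def declared_codes(spec: dict, method: str, path: str) -> set:
    """Status codes listed under an operation's 'responses' object."""
    responses = spec["paths"][path][method]["responses"]
    return set(responses.keys())

def status_code_coverage(spec, method, path, observed) -> float:
    """Fraction of declared codes seen among the observed responses."""
    declared = declared_codes(spec, method, path)
    hit = declared & {str(c) for c in observed}
    return len(hit) / len(declared)

spec = {"paths": {"/pets": {"get": {"responses": {
    "200": {"description": "OK"},
    "400": {"description": "Bad request"},
    "404": {"description": "Not found"}}}}}}

# Tests observed 200 and 404 but never triggered 400.
cov = status_code_coverage(spec, "get", "/pets", [200, 404])
```

A coverage below 1.0 flags declared behaviours, such as the 400 error path here, that the generated test scripts never exercised.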