Detecting and Mitigating Flakiness in REST API Fuzzing

Man Zhang 1, Chongyang Shen 1, Andrea Arcuri 2, and Tao Yue 1
1 Beihang University
2 Kristiania University of Applied Sciences and Oslo Metropolitan University

Abstract

Test flakiness is a common problem in industry, which hinders the reliability of automated build and testing workflows. Most existing research on test flakiness has primarily focused on unit and small-scale integration tests. In contrast, flakiness in system-level testing, such as the testing of REST APIs, is comparatively under-explored. A large body of literature has been dedicated to fuzzing REST APIs, whereas relatively little attention has been paid to detecting and mitigating the negative effects of flakiness in this context. To fill this major gap, in this paper we study the flakiness of tests generated by one of the most widely applied REST API fuzzers in the literature, namely EvoMaster, and conduct empirical studies on a corpus of 36 REST APIs to understand flakiness in REST APIs. Based on the results of these empirical studies, we categorize and analyze flakiness sources by inspecting nearly 3,000 failing tests. Building on this understanding, we propose FlakyCatch to detect and mitigate flakiness in REST APIs, and empirically evaluate its performance. Results show that FlakyCatch is effective in detecting and handling flakiness in tests generated by both white-box and black-box fuzzers.

Keywords: REST API, Flaky Test, API Fuzzing

1 Introduction

Flaky tests are automated tests that exhibit non-deterministic behavior, i.e., they produce different outcomes when executed under the same conditions [46]. A rich body of literature has investigated flaky tests and identified their negative impact on developer productivity, trust in test suites, and continuous integration pipelines [56, 41, 31, 43, 51, 53].
Based on a recent survey conducted within the BMW Group [30], flaky tests are both common and severe in industrial settings, which reinforces the need for systematic approaches to detect and mitigate flakiness. While prior work primarily focuses on conventional test suites, flakiness in REST API testing remains relatively under-explored. REST API tests can be specified as sequences of requests interacting with stateful services. As a result, flakiness occurs when identical request sequences yield different responses across repeated executions, despite no deliberate changes to the system under test (SUT). Compared to traditional test settings, REST APIs are typically more susceptible to such behavior due to their stateful interactions, distributed nature, and dependence on runtime environments.

This challenge is particularly critical in the context of API fuzzing, a widely adopted technique for automatically generating and executing API requests to uncover faults. Existing approaches predominantly operate in black-box settings, leveraging API specifications (e.g., OpenAPI [1]) or request–response analysis, while some exploit white-box execution information to guide test generation. A variety of REST API testing techniques have been proposed, such as APIF [70], AutoRestTest [37], DeepRest [19], EmRest [72], LLamaRestTest [36], LogiaAgent [73], Nautilus [20], RESTest [48], RestTestGen [68], and WuppieFuzz [62]. However, the impact of flakiness on API fuzzing remains largely unexplored, particularly in terms of how frequently it occurs and what causes it. To address this gap, in this paper we systematically investigate flakiness in REST API fuzzing by first addressing the following two research questions (RQs):

RQ1: How frequently do flaky white-box and black-box tests exist in REST APIs?
RQ2: What are the primary sources of flakiness in REST APIs?
To answer these RQs, we conducted an empirical study on 36 real-world REST APIs using the fuzzer EvoMaster [13] in both its black-box and white-box modes, under two different runtime environments. We chose EvoMaster [4] because it is a state-of-the-art tool that supports both black-box and white-box testing. Furthermore, many academic REST API fuzzing prototypes only make HTTP calls and are unable to generate executable test suites with assertions (e.g., in JUnit or Pytest format), so they would have been unusable for this study. Our analysis reveals that flakiness is widespread: flaky tests were observed in 31 out of 36 APIs across both environments. White-box fuzzing exhibits flaky behavior more frequently than black-box fuzzing, suggesting that it exposes non-deterministic behaviors that remain hidden in black-box settings. At the same time, white-box fuzzing tends to expose failures more consistently, whereas black-box fuzzing more often leads to intermittently failing tests.

@GetMapping("/price/estimate")
fun estimatePrice(@RequestParam base: Int): Map<String, Int> {
    val randomJitter = randomInt(10)
    val total = base + randomJitter
    return mapOf("base" to base, "jitter" to randomJitter, "total" to total)
}

Figure 1: Snippet of an example of an endpoint with flakiness

To better understand the root causes of flakiness, we further conducted a manual analysis of nearly 3,000 failing tests and identified nine categories of flakiness sources. Among these, the most prominent ones are runtime environment variations, runtime-dependent messages, and stateful resources, highlighting the strong influence of external conditions and system state on REST API behavior. Based on these insights, we propose FlakyCatch, a novel approach for detecting and handling flakiness in REST API fuzzing.
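To make the non-determinism of the Figure 1 endpoint concrete, the following is a minimal, self-contained Java sketch of the same logic (the paper's snippet is in Kotlin; the class and method names here are illustrative, not taken from EvoMaster or the studied APIs): because the response embeds a random jitter, two identical requests can produce different payloads.

```java
import java.util.Map;
import java.util.Random;

// Illustrative re-implementation of the endpoint logic from Figure 1:
// the response embeds a fresh random jitter on every call, so identical
// requests can disagree on "jitter" and "total" across executions.
public class PriceEstimate {
    private static final Random RNG = new Random();

    static Map<String, Integer> estimatePrice(int base) {
        int jitter = RNG.nextInt(10); // random value in [0, 9]
        return Map.of("base", base, "jitter", jitter, "total", base + jitter);
    }

    public static void main(String[] args) {
        // Two calls with the same input: "base" is always stable,
        // while "jitter" and "total" may differ between the two maps.
        System.out.println(estimatePrice(666));
        System.out.println(estimatePrice(666));
    }
}
```

Only the base field is deterministic here; any assertion that pins jitter or total to a concrete value can fail intermittently.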
FlakyCatch employs a re-execution-based detection mechanism augmented with lightweight inference to efficiently identify flaky tests. Detected flaky tests are then automatically post-processed to improve the stability and reusability of generated test suites. In particular, FlakyCatch preserves the structural and behavioral coverage of flaky tests while mitigating their non-deterministic failures. Finally, we evaluate FlakyCatch by addressing RQ3: How effective is FlakyCatch in detecting and mitigating flakiness in tests? Our results show that FlakyCatch effectively identifies flaky tests and improves the stability of test suites.

Contributions. This paper first presents an empirical study of flakiness in REST API testing, based on 36 real-world APIs and fuzzing-generated test cases under both black-box and white-box settings. We also introduce a manually labeled dataset of failing test cases, providing a valuable resource for benchmarking and further research. From this analysis, we derive a taxonomy of nine distinct sources of flakiness and propose FlakyCatch, a novel technique for detecting and mitigating selected flakiness issues in REST API fuzzing. All experimental artifacts are made publicly available to support reproducibility and serve as a baseline for future work.

Structure. In Section 2, we briefly discuss flaky tests and EvoMaster, followed by the related work in Section 3. In Section 4, we present the empirical study for answering RQ1 and RQ2. In Section 5, we present FlakyCatch and its evaluation. Section 6 provides threats to validity, and Section 7 concludes the paper.

2 Background

In this section, we briefly introduce the concept of flakiness in software testing of REST APIs. We also provide more information on the fuzzer used in this study, i.e., EvoMaster.

2.1 Flaky Test

A flaky test may pass in one execution and fail in another, even when it exercises the same code under the same environment and test input.
Primary causes of flaky tests are related to various factors, such as asynchronous waits, concurrency issues, test order dependencies, dependencies on external resources, time, randomness, and algorithmic non-determinism [46, 16, 23, 56]. In the context of REST API testing, a test can be regarded as a sequence of requests. Therefore, a flaky test manifests as the same sequence of requests producing different responses when executed on the same code. The example shown in Figure 1 illustrates a GET endpoint that estimates a price based on a given base value provided as a request parameter. The endpoint introduces a random jitter by generating a random integer using randomInt(10), which is then added to the base value to compute the total price. As a result, even when the same request is repeatedly sent with identical input parameters, the returned jitter and total values may differ across executions. This dependency on randomness introduces non-deterministic behavior into the API. For example, the test shown in Figure 2, which interacts with this endpoint, may pass in some executions and fail in others, even when sending the same request.

2.2 EvoMaster

EvoMaster is an open-source evolutionary fuzzer for automated REST API testing, which has been actively maintained since its inception in 2016 [4]. It is one of the few fuzzers that support both black-box and white-box testing of REST APIs, enabling test generation with and without access to the source code. Its white-box mode is built on instrumentation that is developed to automatically collect runtime execution information, such as statement coverage and executed SQL commands.

@Test
@Timeout(60)
fun test_8_getOnEstimateReturnsObject() {
    given().accept("*/*")
        .header("x-EMextraHeader123", "")
        .get("${baseUrlOfSut}/api/flakinessdetect/price/estimate?" +
             "base=666&" +
             "EMextraParam123=_EM_2_XYZ_")
        .then()
        .statusCode(200)
        .assertThat()
        .contentType("application/json")
        .body("'base'", numberMatches(666))
        .body("'jitter'", numberMatches(3)) // may vary due to randomness
        .body("'total'", numberMatches(669)) // depends on the generated jitter
}

Figure 2: Snippet of an example test linked to flaky endpoints

Based on this runtime feedback, EvoMaster employs an evolutionary algorithm, the Many Independent Objective (MIO) algorithm [5], which is specifically designed for system-level testing to effectively handle many and potentially conflicting testing objectives; e.g., EvoMaster treats each predicate, branch, and line of code as a separate testing objective. In its black-box mode, EvoMaster adopts random strategies by default to guide test generation based on the API schema (e.g., OpenAPI specifications for REST APIs) and request–response coverage. Over the years, EvoMaster has been integrated with various advanced techniques, e.g., testability transformation [7], SQL handling [6, 10], resource-based strategies [79], adaptive hypermutation [74], mocking [66], and security testing [9]. SQL handling and mocking for managing web service interactions can help mitigate flakiness in Web APIs. Beyond REST APIs, EvoMaster also supports fuzzing of GraphQL APIs [15] and RPC-based APIs [76, 78, 77]. Empirical studies on REST API fuzzers applied to open-source REST APIs show that EvoMaster in white-box mode achieves state-of-the-art performance in terms of code coverage and fault detection [38, 75].
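The assertions in Figure 2 pin the jitter field to a concrete value, so the test passes only when the endpoint happens to draw that value. Assuming the jitter is drawn uniformly from [0, 9] as in Figure 1, a quick simulation (a hypothetical sketch, not part of EvoMaster) shows the expected pass rate of such an assertion is only about 10%:

```java
import java.util.Random;

// Rough sketch: the test in Figure 2 asserts jitter == 3, but the endpoint
// draws jitter uniformly from [0, 9], so the assertion holds in roughly
// one run out of ten. We estimate that pass rate by seeded simulation.
public class FlakyPassRate {
    static double estimatePassRate(int trials, int expectedJitter, long seed) {
        Random rng = new Random(seed);
        int passes = 0;
        for (int i = 0; i < trials; i++) {
            if (rng.nextInt(10) == expectedJitter) passes++; // assertion "passes"
        }
        return (double) passes / trials;
    }

    public static void main(String[] args) {
        double rate = estimatePassRate(100_000, 3, 42L);
        System.out.printf("estimated pass rate: %.3f%n", rate); // close to 0.1
    }
}
```

This kind of value-pinning assertion on a non-deterministic field is exactly what re-execution-based detection can flag.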
Moreover, its effectiveness in both black-box and white-box modes has been demonstrated in industrial settings, including large-scale enterprise systems at companies such as Volkswagen [8, 58] and Meituan [76, 77].

3 Related Work

In this section, we discuss the related work from three aspects: API fuzzing, and flakiness at the unit testing and system testing levels.

3.1 API Fuzzing

API fuzzing (or API fuzz testing, API testing) has received significant attention in both academia and industry. In the literature, most approaches target black-box testing, often relying on API specifications (e.g., OpenAPI) or analyzing requests and responses. In contrast, few studies have explored white-box testing by utilizing runtime execution information to guide test generation. In API testing, REST APIs remain the dominant research problem, likely due to their early introduction and widespread adoption in modern web services [29, 38, 75]. Researchers have proposed various techniques for REST API testing, e.g., APIF [70], APIRL [27], ARAT-RL [35], ASTRA [67], AutoRestTest [37], bBOXRT [39], DeepRest [19], EmRest [72], EvoMaster [13], KAT [40], LLamaRestTest [36], LogiaAgent [73], MINER [47], Morest [45], Nautilus [20], OpenAPI-Fuzzer [26], RAFT [63], RestCT [71], RESTest [48], RESTler [14], RestTestGen [68], Schemathesis [32], VoAPI2 [22], and WuppieFuzz [62]. For example, RESTler employs dynamic analysis of request–response dependencies to guide test generation. Morest and RestTestGen construct models or graphs to drive test generation, while Schemathesis is developed in the context of property-based testing. Recent works such as APIF [70], APIRL [27], ARAT-RL [35], ASTRA [67], DeepRest [19], KAT [40], LLamaRestTest [36], and LogiaAgent [73] leverage artificial intelligence techniques to improve test effectiveness. EvoMaster [13] is one of the few approaches that support both black-box and white-box testing.
GraphQL API testing has also been attracting attention [59]. GraphQL APIs, which allow clients to flexibly query structured data, introduce new testing challenges due to their complex query schemas and nested request structures. Recent work has proposed using search-based testing techniques [15], property-based testing [34], and mutation-based testing [55] for GraphQL APIs. Remote Procedure Call (RPC) is widely adopted in industry for high-performance communication in microservice architectures [77, 78, 76, 44]. Testing RPC-based APIs is challenging due to the diversity of frameworks (e.g., gRPC, Thrift, and Dubbo) and their complex inter-service dependencies. Existing studies have proposed various techniques to address these challenges, including white-box fuzzing [76, 77], seeding and mocking [78], traffic recording and replay [44], and Protobuf-schema-based testing approaches [69].

3.2 Flakiness at the Unit Testing Level

In the literature, there is a rich body of research that investigates the challenge of flaky unit tests across various programming languages, including Java, Python, and .NET. A comprehensive pipeline for the detection, classification, prediction, and mitigation of flaky unit tests has been established by leveraging techniques ranging from traditional static and dynamic analyses to advanced AI techniques such as LLMs. A comprehensive survey [56] has been conducted to summarize the existing body of literature on flaky tests; hence, in this section we mainly focus on the most recent studies published since then. Leesatapornwongsa et al. [41] proposed FlakeRepro to help developers reproduce failed executions of flaky tests caused by concurrency, by combining both static and dynamic analysis. Akli et al. [2] proposed FlakyCat, a method for predicting flaky test categories by relying on CodeBERT. Fatima et al.
[25] proposed FlakyFix, a method for predicting a fix category (out of 13 in total) for a flaky test by analyzing the test code alone with CodeBERT and UniXcoder, and leveraging these labels to guide LLMs in generating successful automated repairs. Along the same line, Rahman and Shi proposed FlakeSync [61] to repair async flaky tests by introducing synchronization for a specific test execution. Gruber et al. [31] conducted a comprehensive empirical study by analyzing over 6,000 Java and Python projects, in which each test generated by two test generation tools was executed 200 times. Their results reveal that: 1) automated test generation tools produce flaky tests even more frequently than developers do; 2) generated flaky tests are often caused by randomness and unspecified behaviors, while flaky tests written by developers are more often caused by concurrency and networking operations; and 3) flakiness suppression mechanisms are effective in reducing the number of flaky tests, but also reveal previously unknown types of flakiness. Li et al. [43] proposed HiFlaky, a method that detects and classifies flaky tests by considering hierarchical dependencies among the root causes of flakiness in test code. Their empirical study shows that HiFlaky achieves higher flaky test detection accuracy than two baselines, FlakeFlagger [3] and Flakify [24], which are both flaky test predictors. Moreover, observing the increasing use of LLMs for classifying flaky tests, Rahman et al. [60] proposed FlakyQ, a framework that utilizes quantized LLMs for feature extraction coupled with traditional machine learning classifiers, reducing prediction time and memory usage without compromising accuracy. Parry et al.
[57] introduced the concept of FLIMsiness, a form of non-determinism in unit tests that is intentionally induced by applying mutation operators to the code under test, and demonstrated that FLIMsiness can detect significantly more flaky tests than traditional rerunning strategies. Schroeder et al. [65] conducted a preliminary study of fixed flaky tests in Rust projects and identified nine common root causes of test flakiness. Note that the above-mentioned body of work focuses specifically on unit testing, addressing non-determinism through automated detection, root cause categorization, and code repair. Our work focuses on system testing, where tests are composed of sequences of HTTP calls towards the tested API. Assertions are based on what is returned in these calls, e.g., HTTP headers and body payloads. Calls over a network, and interactions with databases, might introduce further sources of flakiness not commonly seen in unit testing.

3.3 Flakiness at the System Testing Level

While a rich body of research has been established for managing flakiness at the unit testing level, studies focusing on system testing remain comparatively scarce. In the rest of this section, we discuss some of these works, which investigate flakiness in complex domains such as database management systems and autonomous driving simulators. Morán et al. [51] proposed FlakyLoc, a spectrum-based localization technique to identify the root causes of flakiness in web applications by analyzing how uncontrolled environmental factors (e.g., network latency, memory) trigger inconsistent test outputs. Dong et al. [21] proposed FlakeScanner to detect flaky tests for Android apps by systematically exploring event orders, along with a benchmark named FlakyAppRepo for enabling the study of GUI test flakiness. Ngo et al.
[52] provided a review of how academia and industry handle test flakiness, and highlighted that current research trends indicate a concentration on unit-level flakiness, leaving end-to-end system testing comparatively under-explored. Osikowicz et al. [53] recently conducted an empirical study to investigate flaky tests in simulation-based autonomous driving testing, and observed that one-third of driving scenarios executed in CARLA are potentially flaky due to unintentional non-determinism in simulators (e.g., bugs and the use of rendering engines). Berndt et al. [17] recently conducted a study to investigate the flakiness of LLM-generated tests for database management systems. Their study found that LLM-generated tests are often more prone to flakiness than original test suites, largely due to non-deterministic data ordering, and that flakiness transfer (from existing tests to the newly generated ones via prompts) is more prevalent in closed-source database systems than in open-source ones. To our knowledge, there is limited work that systematically studies and addresses flakiness in the context of web services such as REST APIs, despite the substantial focus on REST API fuzzing in prior work. In contrast, our approach FlakyCatch specifically targets the detection and mitigation of flakiness in REST API fuzzing.

4 Empirical Analysis on Flakiness of REST APIs

To better understand the flakiness existing in REST APIs, we carried out a comprehensive empirical analysis to answer these two questions:

RQ1: How frequently do flaky white-box and black-box tests exist in REST APIs?
RQ2: What are the primary sources of flakiness in REST APIs?

4.1 Experiment Settings and Design

Open-Source REST APIs. Several benchmarks exist for evaluating fuzzing techniques (e.g., [33, 42, 49, 18, 54, 50]). In our case, as we focus on REST APIs, we selected the Web Fuzzing Dataset (WFD) [64], previously known as EMB [12].
This dataset is the most used in the research literature on fuzzing REST APIs, and has been extended each year with new APIs since 2017. With 36 APIs, it currently provides the largest publicly available collection of REST APIs with source code and experiment infrastructure. Table 1 shows these 36 REST APIs along with their descriptive statistics, e.g., the number of source files (#Files), lines of code (LoCs), endpoints (#End.), runtime environments, and databases used. These APIs cover a broad range of characteristics, e.g., LoCs ranging from 117 to 174,781, 1 to 258 endpoints, and from no database usage to multiple connections across different databases. Overall, our analysis covers 36 REST APIs comprising 6,465 source files, 657,162 LoCs, and 1,487 endpoints, with 25 out of 36 APIs interacting with databases.

Fuzzer Selection. Since test flakiness can only be identified through repeated executions of test cases, its analysis requires executable tests that exercise diverse inputs, request sequences, and state-dependent interactions, in order to increase the likelihood of exposing non-deterministic behavior. Fuzzers that can automatically generate such tests are well suited for our flakiness analysis. Moreover, the flakiness characteristics of tests generated by white-box and black-box fuzzers may differ. For example, tests generated by white-box approaches may manipulate internal system states, including external web interactions [66], as well as SQL [6, 11] and NoSQL databases [28], which may either introduce or reduce flakiness. Furthermore, empirical studies involving both black-box and white-box fuzzers have shown that EvoMaster, when used in white-box mode, achieves the best performance in terms of code coverage and fault detection [38, 75]. Hence, we selected EvoMaster because it supports both testing modes and enables the analysis of flakiness in tests generated by different fuzzing techniques.

Experiment Settings.
We applied EvoMaster in both black-box (denoted as BB) and white-box mode (denoted as WB) to each of the 36 APIs, using a one-hour search budget, which is the most commonly adopted configuration in the REST API fuzzing literature [38, 75]. Considering the randomness of the fuzzer, we ran the test generation 10 times for each configuration. As the runtime environment may also affect test behavior, we executed the generated test cases under two execution environments:

• FuzzEnv: a DELL Precision 7875 Tower equipped with an AMD Ryzen Threadripper PRO 7975WX (64 cores, up to 5.35 GHz), 128 GB of RAM, running 64-bit Ubuntu 22.04.5;
• ExecEnv: a ThinkStation P620 equipped with an AMD Ryzen Threadripper PRO 5995WX (128 cores, up to 4.58 GHz), 256 GB of RAM, running 64-bit Ubuntu 24.04.3.

To observe the non-deterministic behavior of these test cases, we compiled and executed the tests generated by each configuration 100 times in both FuzzEnv and ExecEnv.

4.2 Results of Flaky Tests

Table 2 reports, for each API on each environment, the average failure rate (FR%) and the standard error (sdr) across the 10 runs for both BB and WB modes of EvoMaster. To characterize the stability of flakiness, we also report the numbers of failed tests (#F), consistently failed tests (#Fc), and unstable failed tests (#Fu). Overall, we can observe that, across the two execution environments and with two generation strategies, flaky tests were observed for 31 of the 36 APIs in both FuzzEnv and ExecEnv, indicating that flakiness is common rather than exceptional in REST APIs. Comparing the two execution environments, flakiness results are similar for most APIs in both modes. However, several APIs exhibit noticeable differences, suggesting that flakiness can be strongly influenced by the environment. More specifically, ExecEnv yields higher failure rates than FuzzEnv for WB on blogapi, catwatch,

Table 1: Descriptive information of the REST APIs employed
SUT #Files #LOCs #End.
Runtime Databases
bibliothek            33    2176    8    JDK 17  MongoDB
blogapi               89    4787    52   JDK 8   MySQL
catwatch              106   9636    14   JDK 8   H2
cwa-verification      47    3955    5    JDK 11  H2
erc20-rest-service    7     1378    13   JDK 8
familie-ba-sak        1089  143556  183  JDK 17  PostgreSQL
features-service      39    2275    18   JDK 8   H2
genome-nexus          405   30004   23   JDK 8   MongoDB
gestaohospital        33    3506    20   JDK 8   MongoDB
http-patch-spring     30    1450    6    JDK 11
languagetool          1385  174781  2    JDK 8
market                124   9861    13   JDK 11  H2
microcks              471   66186   88   JDK 21  MongoDB
ocvn                  526   45521   258  JDK 8   H2, MongoDB
ohsome-api            87    14166   134  JDK 17  OSHDB
pay-publicapi         377   34576   10   JDK 11  Redis
person-controller     16    1112    12   JDK 21  MongoDB
proxyprint            73    8338    74   JDK 8   H2
quartz-manager        129   5068    11   JDK 11
reservations-api      39    1853    7    JDK 11  MongoDB
rest-ncs              9     605     6    JDK 8
rest-news             11    857     7    JDK 8   H2
rest-scs              13    862     11   JDK 8
restcountries         24    1977    22   JDK 8
scout-api             93    9736    49   JDK 8   H2
session-service       15    1471    8    JDK 8   MongoDB
spring-actuator-demo  5     117     2    JDK 8
spring-batch-rest     65    3668    5    JDK 8
spring-ecommerce      58    2223    26   JDK 8   MongoDB, Redis, Elasticsearch
spring-rest-example   32    1426    9    JDK 17  MySQL
swagger-petstore      23    1631    19   JDK 8
tiltaksgjennomforing  472   27316   79   JDK 17  PostgreSQL
tracking-system       87    5947    67   JDK 11  H2
user-management       69    4274    21   JDK 8   MySQL
webgoat               355   27638   204  JDK 21  H2
youtube-mock          29    3229    1    JDK 8
Total (36)            6465  657162  1487         25

Table 2: Results of the number of generated tests (#T), the failure rate (FR%), the number of failed tests (#F), the number of consistently failed tests (#Fc), and the number of unstable failed tests (#Fu) on the FuzzEnv and ExecEnv environments.
                                           FuzzEnv                                ExecEnv
SUT                   Mode  #T      Lines%  FR% (sdr)     #F (#Fc, #Fu)            FR% (sdr)     #F (#Fc, #Fu)
bibliothek            BB    10.0    27.1    0.0 (0.0)     0.0 (0.0, 0.0)           0.0 (0.0)     0.0 (0.0, 0.0)
                      WB    13.4    26.7    0.0 (0.0)     0.0 (0.0, 0.0)           0.0 (0.0)     0.0 (0.0, 0.0)
blogapi               BB    178.1   24.6    9.0 (0.2)     16.0 (16.0, 0.0)         9.0 (0.2)     16.0 (16.0, 0.0)
                      WB    201.2   31.5    ▽3.4 (0.8)    ▽6.9 (▽6.9, 0.0)         △10.4 (1.5)   △21.0 (△21.0, 0.0)
catwatch              BB    56.1    29.7    7.1 (2.3)     4.0 (2.0, 2.0)           7.1 (2.4)     4.0 (2.2, 1.8)
                      WB    115.5   44.8    ▽4.7 (2.4)    △5.4 (△5.4, ▼0.0)        △9.4 (16.1)   △11.1 (△5.7, △5.4)
cwa-verification      BB    5.0     37.9    100.0 (0.0)   5.0 (5.0, 0.0)           100.0 (0.0)   5.0 (5.0, 0.0)
                      WB    5.0     16.7    100.0 (0.0)   5.0 (5.0, 0.0)           100.0 (0.0)   5.0 (5.0, 0.0)
erc20-rest-service    BB    13.0    25.4    0.0 (0.0)     0.0 (0.0, 0.0)           0.0 (0.0)     0.0 (0.0, 0.0)
                      WB    20.4    31.2    0.0 (0.0)     0.0 (0.0, 0.0)           0.0 (0.0)     0.0 (0.0, 0.0)
familie-ba-sak        BB    1326.5  14.4    0.6 (0.1)     8.4 (8.4, 0.0)           3.1 (7.8)     41.9 (8.4, 33.5)
                      WB    187.6   17.5    △1.1 (0.0)    ▽2.0 (▽2.0, 0.0)         △8.9 (1.6)    ▽17.0 (△17.0, ▼0.0)
features-service      BB    26.3    57.0    4.9 (3.3)     1.3 (1.0, 0.3)           4.9 (3.3)     1.3 (1.0, 0.3)
                      WB    17.8    39.9    ▽0.6 (1.8)    ▽0.1 (▽0.1, ▼0.0)        ▽0.6 (1.8)    ▽0.1 (▽0.1, ▼0.0)
genome-nexus          BB    43.9    28.1    28.2 (2.9)    12.4 (7.8, 4.6)          20.9 (1.1)    9.2 (8.8, 0.4)
                      WB    76.3    35.8    ▽23.2 (6.3)   △18.1 (△17.4, ▽0.7)      △22.6 (6.5)   △17.7 (△17.7, ▼0.0)
gestaohospital        BB    38.6    42.5    37.5 (9.6)    15.0 (14.5, 0.5)         37.5 (9.6)    15.0 (14.5, 0.5)
                      WB    121.0   43.0    ▽7.2 (4.3)    ▽8.7 (▽6.6, △2.1)        ▽7.6 (4.0)    ▽9.2 (▽8.7, 0.5)
http-patch-spring     BB    22.8    55.1    25.4 (3.4)    5.8 (5.0, 0.8)           25.4 (3.4)    5.8 (5.0, 0.8)
                      WB    26.6    57.9    △38.1 (5.1)   △10.2 (△10.2, ▼0.0)      △46.1 (4.9)   △12.3 (△12.3, ▼0.0)
languagetool          BB    3.9     9.0     0.0 (0.0)     0.0 (0.0, 0.0)           0.0 (0.0)     0.0 (0.0, 0.0)
                      WB    37.8    26.6    ▲0.2 (0.6)    ▲0.1 (0.0, ▲0.1)         ▲1.0 (2.0)    ▲0.5 (▲0.1, ▲0.4)
market                BB    53.5    25.9    21.4 (3.8)    11.5 (5.5, 6.0)          21.4 (3.8)    11.5 (5.6, 5.9)
                      WB    43.9    44.8    ▽21.0 (3.1)   ▽9.2 (▽5.1, ▽4.1)        △36.1 (4.2)   △15.7 (△11.6, ▽4.1)
microcks              BB    289.2   11.4    13.7 (1.1)    39.7 (36.4, 3.3)         14.8 (1.1)    42.7 (39.4, 3.3)
                      WB    94.4    23.0    ▽1.5 (0.5)    ▽1.4 (▽1.4, ▼0.0)        ▽2.5 (0.5)    ▽2.4 (▽2.4, ▼0.0)
ocvn                  BB    2045.1  19.8    28.6 (0.5)    584.7 (96.2, 488.5)      28.6 (0.5)    584.8 (97.2, 487.6)
                      WB    1315.5  21.0    △55.2 (1.5)   △725.2 (▽56.6, △668.6)   △67.3 (0.8)   △887.3 (△887.3, ▼0.0)
ohsome-api            BB    627.8   41.2    0.0 (0.0)     0.0 (0.0, 0.0)           0.0 (0.0)     0.0 (0.0, 0.0)
                      WB    640.0   23.9    ▲0.1 (0.3)    ▲0.8 (▲0.8, 0.0)         ▲0.1 (0.3)    ▲0.8 (▲0.8, 0.0)
pay-publicapi         BB    60.4    15.3    66.7 (6.2)    40.5 (0.0, 40.5)         0.0 (0.0)     0.0 (0.0, 0.0)
                      WB    38.4    13.0    ▽1.8 (1.7)    ▽0.7 (▲0.7, ▼0.0)        ▲1.8 (1.7)    ▲0.7 (▲0.7, 0.0)
person-controller     BB    14.0    54.0    34.3 (6.6)    4.8 (4.8, 0.0)           34.3 (6.6)    4.8 (4.8, 0.0)
                      WB    32.3    62.0    ▽29.9 (3.8)   △9.6 (△9.6, 0.0)         ▽29.9 (3.8)   △9.6 (△9.6, 0.0)
proxyprint            BB    556.6   6.8     83.8 (0.2)    466.5 (466.5, 0.0)       83.8 (0.2)    466.7 (466.7, 0.0)
                      WB    380.4   30.3    ▽2.9 (1.5)    ▽11.1 (▽10.7, ▲0.4)      ▽45.4 (1.2)   ▽172.6 (▽172.3, ▲0.3)
quartz-manager        BB    27.0    28.2    29.6 (0.0)    8.0 (2.0, 6.0)           29.6 (0.0)    8.0 (2.0, 6.0)
                      WB    50.8    37.7    ▽13.1 (6.7)   ▽6.6 (△6.6, ▼0.0)        ▽13.1 (6.7)   ▽6.6 (△6.6, ▼0.0)
reservations-api      BB    23.0    44.8    12.6 (1.4)    2.9 (2.9, 0.0)           12.6 (1.4)    2.9 (2.9, 0.0)
                      WB    44.1    54.0    △37.9 (42.2)  △16.1 (△4.8, ▲11.3)      △63.8 (46.8)  △27.6 (△4.8, ▲22.8)
rest-ncs              BB    22.1    60.5    0.0 (0.0)     0.0 (0.0, 0.0)           0.0 (0.0)     0.0 (0.0, 0.0)
                      WB    40.3    79.9    0.0 (0.0)     0.0 (0.0, 0.0)           0.0 (0.0)     0.0 (0.0, 0.0)
rest-news             BB    18.1    47.5    40.3 (7.6)    7.3 (5.9, 1.4)           40.3 (7.6)    7.3 (5.9, 1.4)
                      WB    45.8    61.8    ▽3.7 (1.7)    ▽1.7 (▽1.7, ▼0.0)        ▽3.7 (1.7)    ▽1.7 (▽1.7, ▼0.0)
rest-scs              BB    15.7    56.7    0.0 (0.0)     0.0 (0.0, 0.0)           0.0 (0.0)     0.0 (0.0, 0.0)
                      WB    76.7    71.6    ▲0.1 (0.4)    ▲0.1 (▲0.1, 0.0)         ▲0.1 (0.4)    ▲0.1 (▲0.1, 0.0)
restcountries         BB    62.4    76.0    0.0 (0.0)     0.0 (0.0, 0.0)           0.0 (0.0)     0.0 (0.0, 0.0)
                      WB    307.9   70.5    ▲18.7 (5.2)   ▲58.0 (▲58.0, 0.0)       ▲18.7 (5.2)   ▲58.0 (▲58.0, 0.0)
scout-api             BB    267.4   30.5    32.7 (5.5)    86.8 (78.9, 7.9)         32.7 (5.5)    86.8 (78.9, 7.9)
                      WB    145.8   35.1    ▽2.3 (1.7)    ▽3.4 (▽1.8, ▽1.6)        ▽2.3 (1.7)    ▽3.4 (▽1.8, ▽1.6)
session-service       BB    66.2    56.7    0.0 (0.0)     0.0 (0.0, 0.0)           0.0 (0.0)     0.0 (0.0, 0.0)
                      WB    138.5   71.5    ▲0.1 (0.3)    ▲0.2 (▲0.2, 0.0)         ▲0.1 (0.3)    ▲0.2 (▲0.2, 0.0)
spring-actuator-demo  BB    8.2     87.1    0.0 (0.0)     0.0 (0.0, 0.0)           0.0 (0.0)     0.0 (0.0, 0.0)
                      WB    7.7     83.9    0.0 (0.0)     0.0 (0.0, 0.0)           0.0 (0.0)     0.0 (0.0, 0.0)
spring-batch-rest     BB    12.1    34.2    0.0 (0.0)     0.0 (0.0, 0.0)           0.0 (0.0)     0.0 (0.0, 0.0)
                      WB    22.2    66.2    ▲31.1 (5.6)   ▲6.9 (▲6.9, 0.0)         ▲31.1 (5.6)   ▲6.9 (▲6.8, ▲0.1)
spring-ecommerce      BB    62.1    28.0    2.1 (0.8)     1.3 (1.0, 0.3)           1.6 (0.0)     1.0 (1.0, 0.0)
                      WB    121.7   42.8    △3.9 (1.5)    △4.8 (△4.8, ▼0.0)        △5.1 (2.0)    △6.3 (△6.3, 0.0)
spring-rest-example   BB    19.5    48.1    0.5 (1.6)     0.1 (0.1, 0.0)           0.5 (1.6)     0.1 (0.1, 0.0)
                      WB    128.5   58.4    △3.5 (4.6)    △4.6 (△4.6, 0.0)         △3.5 (4.6)    △4.6 (△4.6, 0.0)
swagger-petstore      BB    53.1    66.6    39.0 (2.8)    20.7 (18.4, 2.3)         39.0 (2.8)    20.7 (18.4, 2.3)
                      WB    84.7    68.3    ▽17.4 (4.3)   ▽14.7 (▽14.7, ▼0.0)      ▽17.4 (4.3)   ▽14.7 (▽14.7, ▼0.0)
tiltaksgjennomforing  BB    429.9   9.0     0.0 (0.0)     0.0 (0.0, 0.0)           0.0 (0.0)     0.0 (0.0, 0.0)
                      WB    84.3    9.6     0.0 (0.0)     0.0 (0.0, 0.0)           0.0 (0.0)     0.0 (0.0, 0.0)
tracking-system       BB    209.8   33.9    74.4 (1.0)    156.2 (147.6, 8.6)       74.4 (1.0)    156.2 (147.7, 8.5)
                      WB    201.4   35.4    ▽46.8 (13.6)  ▽101.9 (▽94.7, ▽7.2)     ▽46.8 (13.6)  ▽101.9 (▽94.7, ▽7.2)
user-management       BB    45.9    38.1    38.7 (3.0)    17.8 (17.6, 0.2)         38.7 (3.0)    17.8 (17.6, 0.2)
                      WB    67.5    45.1    ▽8.9 (4.8)    ▽6.1 (▽6.1, ▼0.0)        ▽8.9 (4.8)    ▽6.1 (▽6.1, ▼0.0)
webgoat               BB    619.2   52.0    36.6 (0.2)    226.5 (224.6, 1.9)       37.9 (0.3)    234.6 (232.7, 1.9)
                      WB    262.1   22.2    ▽0.4 (0.0)    ▽1.0 (▽1.0, ▼0.0)        ▽0.4 (0.0)    ▽1.0 (▽1.0, ▼0.0)
youtube-mock          BB    21.2    45.6    0.9 (2.9)     0.1 (0.1, 0.0)           0.9 (2.9)     0.1 (0.1, 0.0)
                      WB    24.5    43.7    ▼0.0 (0.0)    ▼0.0 (▼0.0, 0.0)         ▼0.0 (0.0)    ▼0.0 (▼0.0, 0.0)
#API                  △                     11 (10, 10)                            16 (17, 12)
                      ▽                     16 (15, 15)                            13 (11, 14)
                      BB                    25 (24, 17)                            24 (24, 16)
                      WB                    30 (29, 9)                             30 (30, 9)
                      ALL                   31 (30, 20)                            31 (31, 20)

familie-ba-sak,
http-patch-spring, languagetool, market, ocvn, proxyprint, and reservations-api, as well as for BB on familie-ba-sak, microcks, and webgoat. The largest differences are observed for proxyprint in WB, whose failure rate increases from 2.9% in FuzzEnv to 45.4% in ExecEnv; reservations-api in WB, which rises from 37.9% to 63.8%; and ocvn in WB, which increases from 55.2% to 67.3%. In contrast, cases where FuzzEnv yields higher failure rates than ExecEnv are less common, occurring mainly for genome-nexus in both BB and WB and for pay-publicapi in BB. The most substantial such difference is observed for pay-publicapi in BB, whose failure rate drops from 66.7% in FuzzEnv to 0.0% in ExecEnv.

When comparing white-box and black-box tests, we observed flakiness in more APIs under WB than under BB, i.e., 30 vs. 25 on FuzzEnv and 30 vs. 24 on ExecEnv. This may indicate that WB can expose additional flaky behaviors that remain hidden from BB. For instance, there are six APIs, i.e., languagetool, ohsome-api, rest-scs, rest-ncs, session-service, and spring-batch-rest, where BB shows no flakiness but WB introduces non-deterministic behavior. In addition, we observed higher failure rates under WB on ocvn (rising from 28.6% with BB to 55.2% in FuzzEnv and 67.3% in ExecEnv), http-patch-spring (from 25.4% to 38.1% and 46.1%), and reservations-api (from 12.6% to 37.9% in FuzzEnv and 63.8% in ExecEnv). On these APIs, except for ohsome-api, WB achieved higher code coverage, suggesting that it exercises deeper behaviors that are more likely to trigger non-deterministic failures. However, among flaky APIs, WB more often reduces flakiness in FuzzEnv, while its effect is more balanced in ExecEnv. Specifically, in FuzzEnv, WB decreases the failure rate for 16 APIs and increases it for 11; in ExecEnv, it decreases the rate for 13 APIs and increases it for 16. Several APIs show substantially lower flakiness under WB than under BB.
Representative examples include rest-news (40.3% vs. 3.7%), scout-api (32.7% vs. 2.3%), microcks (13.7%/14.8% vs. 1.5%/2.5%), user-management (38.7% vs. 8.9%), webgoat (36.6%/37.9% vs. 0.4%), and proxyprint in FuzzEnv (83.8% vs. 2.9%). A small number of flaky cases appear only in BB: on youtube-mock, BB exhibits non-zero failure rates while WB eliminates flakiness entirely.

Looking at the stability of flakiness, we found that WB exhibits fewer unstable flaky cases than BB, i.e., 9 vs. 17 in FuzzEnv and 9 vs. 16 in ExecEnv. One plausible explanation is that white-box tests better exercise internal states and hidden dependencies, leading to failures that occur more consistently during test execution.

Findings of RQ1

Flaky tests are widespread in REST APIs under both WB and BB modes. Across 36 APIs, flakiness was observed in 31 APIs in both execution environments, showing that it is common rather than exceptional. WB exhibits flaky behavior more frequently than BB, affecting 30 APIs vs. 25 in FuzzEnv and 30 vs. 24 in ExecEnv, which suggests that WB can expose non-determinism hidden from BB. However, this does not mean that WB always increases flakiness: among already flaky APIs, WB reduces failure rates in many cases. Moreover, WB yields nearly half as many unstable flaky cases as BB under both runtime environments, indicating that WB more often exposes failures consistently, whereas BB more frequently produces intermittently failing tests.

4.3 Flakiness Analysis

We analyzed failing tests by manually inspecting the source code of these test cases and debugging them (i.e., running them in debug mode inside an IDE) against the corresponding APIs. Based on the results of RQ1, we observed flakiness in 31 of the 36 APIs. As WB exposes more flaky cases and produces more stable tests, which are easier to debug, we selected, for each API, the WB run with the largest number of flaky tests for detailed analysis.
For the API where flakiness was eliminated by EvoMaster in WB mode, we instead analyzed the BB tests. In total, we performed manual analysis on 2991 failed WB tests (i.e., 1297 on FuzzEnv and 1694 on ExecEnv) and two failed BB tests (i.e., 1 on FuzzEnv and 1 on ExecEnv). Based on this analysis, we identified 9 categories of flakiness sources across the 31 studied APIs, as summarized in Table 3.

Table 3: Classification of test flakiness sources

- Time-Dependent (Time): Business logic or fields in responses that depend on the current system time (e.g., timestamps).
- Randomness-Dependent (Rand): Nondeterministic values such as UUIDs and random seeds.
- Cryptographic Validation (Crypt): Domain-specific verification logic (e.g., hashing and encryption), primarily semantic and partially influenced by randomness.
- Unordered Collection (Unord): Assertions assume a fixed element order in collections with undefined iteration order (e.g., Set, HashSet) or undocumented ordering guarantees.
- Runtime-Dependent Message (RunMsg): Dynamically generated error messages containing runtime-dependent elements, such as memory addresses, stack traces, instance identifiers, reflection-related information, and InputStream references.
- Stateful Resource (State): Dependence on mutable persistent or in-memory state, e.g., databases, caches, in-memory stores, and static singletons.
- Runtime Environment (Env): Variations in the execution environment, including locale and language settings, JAR metadata, classpath scanning, module loading, framework-managed endpoints, and runtime configuration values derived from the host environment, such as cache locations, temporary directories, and filesystem paths.
- Unclassified (Unk): Failures whose root causes could not be clearly identified based on the current investigation.
- Generation Error (GenErr): Failures caused by improperly generated test cases, such as incorrect request paths, invalid request sequences, unsatisfied state dependencies, brittle assertions, environment-dependent interactions, or insufficient cleanup between tests.

4.3.1 Env

Overall, Env is the most dominant category in terms of the number of flaky tests, accounting for 1,161 flaky tests in 15 APIs. This dominance is mainly driven by a few APIs with a very large number of environment-related flaky tests, most notably ocvn (906) and proxyprint (175). However, we found that, on these two APIs, all 1081 flaky tests occur because the test asserts against a localized validation message whose exact text depends on the runtime environment's Locale. The same @Pattern constraint can produce different default messages across environments: in an English locale, the message appears as "must match "^[a-zA-Z0-9]*$"", whereas under a different locale, the phrase "must match" is replaced by its localized equivalent. As a result, the test may pass when executed under one locale and fail under another, even though the validation logic itself behaves identically.

On languagetool and catwatch, we also observed sources of flakiness that depend on properties of the execution environment, such as classpath resource descriptors and temporary folders, as in the examples below:

// languagetool
.post(baseUrlOfSut + "/v2/check")
.then()
.statusCode(400)
.assertThat()
.body(containsString("Error: 'KPj5J' is not a language code known to LanguageTool. Supported language codes are: ar, ast-ES, be-BY, br-FR, ca-ES, ca-ES-valencia, da-DK, de, de-AT, de-CH, de-DE, de-DE-x-simple-language, el-GR, en, en-AU, en-CA, en-GB, en-NZ, en-US, en-ZA, eo, es, fa, fr, ga-IE, gl-ES, it, ja-JP, km-KH, pl-PL, pt, pt-AO, pt-BR, pt-MZ, pt-PT, ro-RO, ru-RU, sk-SK, sl-SI, sv, ta-IN, tl-PH, uk-UA, zh-CN. The list of languages is read from META-INF/org/languagetool/language-module.properties in the Java classpath. See https://dev.languagetool.org/java-api for details."));

// catwatch
.get(baseUrlOfSut + "/config?EMextraParam123=42")
.then()
.statusCode(200)
.assertThat()
.contentType("application/json")
.body("'cache.path'", containsString("/home/user/workspace/temp/tmp_catwatch/cache_10062"))

Moreover, we observed additional flakiness in framework-provided endpoints such as /actuator/health on spring-ecommerce. The structure and semantics of these responses often depend on runtime configuration, registered health contributors, security settings, and the current state of external infrastructure, which can lead to flaky tests when the execution environment differs.

4.3.2 RunMsg

RunMsg is the second largest source in terms of affected APIs, i.e., across 11 APIs. One main cause of RunMsg relates to runtime-dependent elements in response messages, as illustrated in the example below:

// proxyprint
.body("{" +
"\"id\":530," +
"\"roles\":{" +
"\"EM_tainted_map\":\"_EM_0_XYZ_\"" +
"}" +
"}")
.post(baseUrlOfSut + "/admin/register")
.then()
.statusCode(400)
.assertThat()
.contentType("application/json")
.body("'status'", numberMatches(400.0))
.body("'error'", containsString("Bad Request"))
.body("'exception'", containsString("org.springframework.http.converter.HttpMessageNotReadableException"))
.body("'message'", containsString("Could not read document: Cannot deserialize instance of java.util.HashSet out of START_OBJECT token\n at [Source: java.io.ByteArrayInputStream@72c11c70; line: 1, column: 20] (through reference chain: io.github.proxyprint.kitchen.models.Admin[\"roles\"]); nested exception is com.fasterxml.jackson.databind.JsonMappingException: Cannot deserialize instance of java.util.HashSet out of START_OBJECT token\n at [Source: java.io.ByteArrayInputStream@72c11c70; line: 1, column: 20] (through reference chain: io.github.proxyprint.kitchen.models.Admin[\"roles\"])"))
.body("'path'", containsString("/admin/register"));
/* Actual:
Could not read document: Cannot deserialize instance of java.util.HashSet out of START_OBJECT token\n at [Source: java.io.PushbackInputStream@67d4bd48; line: 1, column: 26] (through reference chain: io.github.proxyprint.kitchen.models.Admin["roles"]); nested exception is com.fasterxml.jackson.databind.JsonMappingException: Cannot deserialize instance of java.util.HashSet out of START_OBJECT token\n at [Source: java.io.PushbackInputStream@67d4bd48; line: 1, column: 26] (through reference chain: io.github.proxyprint.kitchen.models.Admin["roles"])
*/

This example also exhibits flakiness relating to the reported exception source. During test generation, the JSON is likely deserialized directly from an in-memory byte array, which leads to ByteArrayInputStream appearing in the exception message. During actual execution, by contrast, the framework reads the payload from the HTTP request body, where it may be wrapped in additional stream layers such as PushbackInputStream. As a result, the same deserialization error can produce slightly different exception texts across runs. This source of flakiness affects various APIs (i.e., gestaohospital, market, tracking-system, rest-news, proxyprint, and user-management), as JSON parsing is a fundamental mechanism used throughout REST APIs.

Table 4: The number of flaky tests for each API across the nine flakiness categories on FuzzEnv (F) and ExecEnv (E).

SUT | Time | Rand | Crypt | Unord | RunMsg | State | Env | Unk | GenErr | Total
blogapi | (10, 6) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 21) | (0, 0) | (0, 0) | (1, 2)
catwatch | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (11, 11) | (0, 56) | (0, 0) | (1, 2)
cwa-verification | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (5, 5) | (1, 1)
familie-ba-sak | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 1) | (0, 0) | (2, 11) | (0, 0) | (0, 0) | (1, 2)
features-service | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (1, 1) | (0, 0) | (0, 0) | (0, 0) | (1, 1)
genome-nexus | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (3, 3) | (23, 23) | (0, 0) | (2, 2)
gestaohospital | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (16, 16) | (0, 0) | (2, 2) | (0, 0) | (0, 0) | (2, 2)
http-patch-spring | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (13, 13) | (0, 2) | (0, 0) | (0, 0) | (1, 2)
languagetool | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (1, 3) | (0, 0) | (0, 0) | (1, 1)
market | (0, 0) | (0, 0) | (0, 0) | (4, 4) | (7, 7) | (1, 1) | (0, 7) | (0, 0) | (0, 0) | (3, 4)
microcks | (2, 2) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (1, 1) | (0, 0) | (0, 0) | (2, 2)
ocvn | (0, 0) | (0, 0) | (0, 0) | (753, 906) | (0, 0) | (0, 0) | (0, 906) | (0, 0) | (0, 0) | (1, 2)
ohsome-api | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (2, 2) | (0, 0) | (0, 0) | (0, 0) | (3, 3) | (2, 2)
pay-publicapi | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (2, 2) | (0, 0) | (0, 0) | (1, 1)
person-controller | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (11, 11) | (0, 0) | (0, 0) | (1, 1)
proxyprint | (18, 1) | (0, 0) | (0, 0) | (0, 0) | (7, 5) | (0, 3) | (0, 175) | (0, 0) | (0, 0) | (2, 4)
quartz-manager | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (2, 2) | (0, 0) | (13, 13) | (0, 0) | (2, 2)
reservations-api | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (5, 5) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (1, 1)
rest-news | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (4, 4) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (1, 1)
rest-scs | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (1, 1) | (1, 1)
restcountries | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (89, 89) | (1, 1)
scout-api | (5, 5) | (0, 0) | (0, 0) | (3, 3) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (1, 1) | (3, 3)
session-service | (0, 0) | (1, 1) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (1, 1)
spring-batch-rest | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (9, 9) | (0, 0) | (0, 0) | (0, 0) | (1, 1)
spring-ecommerce | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (7, 10) | (0, 0) | (0, 0) | (1, 1)
spring-rest-example | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (16, 16) | (0, 0) | (0, 0) | (0, 0) | (1, 1)
swagger-petstore | (3, 3) | (0, 0) | (0, 0) | (0, 0) | (2, 2) | (7, 7) | (0, 0) | (8, 8) | (0, 0) | (4, 4)
tracking-system | (0, 0) | (0, 0) | (1, 0) | (12, 11) | (34, 46) | (69, 68) | (0, 0) | (34, 35) | (0, 0) | (5, 4)
user-management | (0, 0) | (1, 1) | (0, 0) | (0, 0) | (4, 4) | (10, 10) | (0, 0) | (4, 4) | (0, 0) | (4, 4)
webgoat | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (1, 1) | (0, 0) | (0, 0) | (1, 1)
youtube-mock | (0, 0) | (1, 1) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (0, 0) | (1, 1)
#F | (38, 17) | (3, 3) | (1, 0) | (772, 924) | (81, 92) | (128, 130) | (41, 1166) | (82, 139) | (99, 99) |
#APIs | (5, 5) | (3, 3) | (1, 0) | (4, 4) | (9, 10) | (9, 10) | (10, 15) | (5, 6) | (5, 5) |

4.3.3 State

State is also common in the open-source APIs, affecting 11 APIs and accounting for 130 flaky tests. We found that, on 7 APIs, the flaky assertions are primarily due to internal state stored in databases.
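To make this category concrete, the following minimal sketch (our own illustration, not code from any of the studied APIs; all names are hypothetical) shows how an endpoint backed by mutable state yields a flaky assertion: a generated test that hard-codes the identifier returned by the first call fails once the suite is re-executed against a store that was not reset.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of State flakiness: an endpoint backed by mutable in-memory state.
// A generated test asserting on the response of createUser() passes on its
// first run, but fails when the suite is re-executed without resetting the store.
public class StatefulEndpointSketch {
    private final List<String> users = new ArrayList<>(); // persists unless reset between runs

    // Simulates POST /users: returns the id assigned to the new user.
    public int createUser(String name) {
        users.add(name);
        return users.size(); // id depends on how many users already exist
    }

    public static void main(String[] args) {
        StatefulEndpointSketch api = new StatefulEndpointSketch();
        int first = api.createUser("alice");   // fuzzing phase observes id 1
        int second = api.createUser("alice");  // re-execution observes id 2
        // An assertion hard-coding "id == 1" would therefore be flaky.
        System.out.println(first + " " + second); // prints "1 2"
    }
}
```

Resetting the database (or the in-memory store) before each test execution removes this source of nondeterminism, which is what white-box tooling with database control can automate.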
White-box fuzzers may mitigate this source of flakiness if they are able to handle database state effectively. However, this depends on the capabilities of the specific fuzzer. In our study, for example, EvoMaster failed to mitigate database-related flakiness in 7 APIs (i.e., features-service, spring-batch-rest, market, spring-rest-example, swagger-petstore, tracking-system, and user-management), although it was effective for the remaining 18 database-interacting APIs. For black-box fuzzers, by contrast, such cases are generally not feasible to handle. Another typical source of State flakiness is external dependencies, but this is handled by WB EvoMaster and was therefore not observed in our analysis. Moreover, we identified two APIs, namely http-patch-spring and quartz-manager, whose flaky behavior is related to in-memory state, as illustrated below:

// http-patch-spring
private static final List<Contact> CONTACTS = new ArrayList<>();

// quartz-manager
@Getter
private List<Class<? extends AbstractQuartzManagerJob>> jobClasses = new ArrayList<>();
private List<String> jobClassPackages = new ArrayList<>();

4.3.4 Unord

We observed Unord in only 4 APIs, yet it accounts for 924 flaky tests. The main source of this flakiness is unstable ordering in the response content, which may be caused either by nondeterministic ordering in the underlying data source or by unordered collections returned by the API, as illustrated by the examples below:

// scout-api
.get(baseUrlOfSut + "/api/v1/users?EMextraParam123=42")
.then()
.statusCode(200)
.assertThat()
.contentType("application/json")
.body("size()", equalTo(3))
.body("[0].'name'", containsString("INTEGRATION TEST MODERATOR"))

// ocvn
.body("'exception'", containsString("org.springframework.validation.BindException"))
...
.body("'errors'[0].'codes'", hasItems("EachPattern.yearFilterPagingRequest.contrMethod[1]", "EachPattern.yearFilterPagingRequest.contrMethod", "EachPattern.contrMethod[1]", "EachPattern.contrMethod", "EachPattern.java.lang.String", "EachPattern"))

Upon further investigation, we found that 914 of these flaky tests on market, ocvn, and tracking-system (as in the ocvn example above) were caused by unordered error messages in the responses.

4.3.5 Time, Rand, Crypt and Unk

Several categories are less frequent but still recurrent across APIs. Time appears in 5 APIs with 38 flaky tests, and Rand appears in 3 APIs with 3 flaky tests. These categories are thus relatively common in terms of API coverage, but they typically contribute only a limited number of flaky tests per API. For example, Time-related flakiness is spread across APIs such as blogapi, microcks, proxyprint, scout-api, and swagger-petstore, while Rand-related flakiness is observed in session-service, user-management, and youtube-mock. Crypt is the rarest category, appearing in only one API, i.e., tracking-system, indicating that it is not a major source of flakiness in open-source REST APIs. The example below from tracking-system illustrates Crypt flakiness: the stored password hash differs across executions.

.body("{" +
"\"credentialId\":908," +
"\"enabled\":true," +
"\"password\":\"_EM_2055_XYZ_\"," +
"\"role\":\"OU\"," +
"\"username\":\"iFKw1\"" +
"}")
.put(baseUrlOfSut + "/app/api/credentials?" +
"password=Nm2Z2&" +
"username=R")
.then()
.statusCode(200)
.assertThat()
.contentType("application/json")
.body("'credentialId'", numberMatches(16.0))
.body("'username'", containsString("iFKw1"))
.body("'password'", containsString("$2a$10$dpyE.RibfQza7.9TD65vT.dzi.OGm2VzqKDYNjMMkIH7obCbsD6.W"))
.body("'enabled'", equalTo(true))
.body("'role'", containsString("OU"));
/* java.lang.AssertionError: 1 expectation failed.
JSON path 'password' doesn't match.
Expected: a string containing "$2a$10$dpyE.RibfQza7.9TD65vT.dzi.OGm2VzqKDYNjMMkIH7obCbsD6.W"
Actual: $2a$10$Nv9OKJP1TjI9uQfwWdYZsumNi0tLOC2a/q5Dco4klHcOHsUZVACQi
*/

In our analysis, we also encountered cases whose sources of flakiness remain unclear (Unk), as we were unable to reproduce them during debugging. Nevertheless, this category still appears in 6 APIs and accounts for 139 flaky tests, for example in catwatch (56). As illustrated by the examples below, it is unclear how the system state evolves during fuzzing and why the same request yields different outcomes across executions.

// catwatch
.get(baseUrlOfSut + "/statistics/projects?" +
"start_date=dhkj3TsEAFY&" +
"access_token=&" +
"offset=Zn5IQ&" +
"language=gR")
.then()
.statusCode(400)
.assertThat()
.contentType("application/json")
.body("'error'", containsString("invalid_request"))
.body("'error_description'", containsString("Access Token not valid"));
/* 1 expectation failed.
Expected status code <400> but was <500>. */

// quartz-manager
.get(baseUrlOfSut + "/quartz-manager/scheduler/run?EMextraParam123=_EM_21_XYZ_")
.then()
.statusCode(500) // it/fabioformosa/quartzmanager/api/services/SchedulerService_25_start
.assertThat()
.contentType("application/json")
.body("'status'", numberMatches(500.0))
.body("'error'", containsString("Internal Server Error"))
.body("'path'", containsString("/quartz-manager/scheduler/run"));
/* java.lang.AssertionError: 1 expectation failed.
Expected status code <500> but was <204>. */

This suggests that identifying the root causes of flakiness in REST APIs remains challenging, because failures may depend on system states that are reached during fuzzing but are difficult to reconstruct exactly during later execution.

4.3.6 GenErr

Besides REST API flakiness, the fuzzer itself can also introduce failing tests and false positives that are mistakenly identified as flaky tests. In this study, GenErr appears in 5 APIs, but when it occurs, it can be substantial, as shown by restcountries with 89 flaky tests.

// restcountries
.get(baseUrlOfSut + "/rest/v2/capital/the%20valley?fields=4f185dVbVe")
.then()
.statusCode(200)
.assertThat()
.contentType("application/json")
.body("size()", equalTo(1))
.body("'[0]'.isEmpty()", is(true)); // intended: "[0].isEmpty()"

The assertion fails because '[0]' is treated as a string literal rather than as the first element of the JSON array; the check therefore tests whether the string "[0]" is empty, which is always false.

Among the APIs, 16 are affected by multiple sources of flakiness and the other 15 are impacted by a single source. For example, tracking-system involves five categories (Crypt, Unord, RunMsg, State, and Unk), while market, proxyprint, swagger-petstore, and user-management each involve four. Overall, these results suggest that flakiness in REST APIs is heterogeneous in nature: some APIs are primarily affected by a single source, whereas many others are influenced by multiple interacting sources. Among the sources, environment-related issues (Env) and runtime-dependent outputs (RunMsg and Unord) recur frequently and account for a large share of flaky tests in the open-source APIs. This may also relate to fuzzers' ability to expose error scenarios, thereby increasing the likelihood of observing flaky behavior.

Findings of RQ2

The primary sources of flakiness in REST APIs are runtime environment variations (Env), stateful resources (State), unordered responses (Unord), and runtime-dependent messages (RunMsg). Among them, Env is the dominant source in terms of the number of flaky tests, followed by Unord, while State and RunMsg are also common across APIs. Other causes, such as Time and Rand, appear less frequently, and Crypt is rare. In addition, flakiness in REST APIs is not caused by a single dominant factor, but rather arises from a combination of diverse sources.
5 Detection and Handling of Flaky Tests

Based on our flakiness-analysis results, we propose FlakyCatch, a post-processing approach for detecting and handling flaky tests. Among the nine identified sources of flakiness, some, such as State and Env, generally require access to source code or execution in different environments to diagnose. In contrast, others, including Time, Rand, Crypt, and Unord, can be identified or inferred from observable HTTP responses. To be applicable to both black-box and white-box fuzzers, FlakyCatch relies only on observable HTTP responses and does not require access to source code or internal execution traces. This design also makes the approach well suited to industrial settings where implementation visibility is limited. In the rest of the section, we first discuss the flakiness detection capability of FlakyCatch (Section 5.1) and then its strategy for handling flaky tests (Section 5.2). Sections 5.3 and 5.4 report the empirical study conducted to evaluate FlakyCatch and its results.

5.1 Flakiness Detection

To identify flaky tests, we perform a post-processing phase after fuzzing. All test cases generated by each fuzzer are re-executed under the same configuration and environment. We then compare the responses obtained during the fuzzing phase with those collected during the post-processing phase. If the responses differ according to predefined comparison rules, the corresponding test case is marked as potentially flaky. This approach is based on the assumption that deterministic tests should produce equivalent responses across repeated executions under controlled conditions.

[Figure 3: The overview of FlakyCatch. The flakiness detection phase re-executes the fuzzer-generated test suite and compares observations (status code, headers, body) through a parsing engine, complemented by lightweight inference; the flakiness handling phase then generates assertions and handles flaky ones.]
Deviations from this behavior indicate potential sources of nondeterminism. Each HTTP response is represented as a tuple R = ⟨C, H, B⟩, where C denotes the status code, H the response headers, and B the response body. Given two responses R_f and R_r obtained during the fuzzing and re-execution phases, respectively, we compare their corresponding elements, i.e., the status code, the response headers, and the response body. If any of these comparisons fails, the corresponding flakiness information is recorded and later used for assertion generation. Figure 3 presents the response comparison procedure.

Lightweight Inference. As a complement to re-execution-based detection, we introduce a lightweight inference mechanism as a secondary handler. This is necessary because re-execution may fail to expose some forms of flakiness. For example, random or time-dependent values may remain unchanged if re-execution occurs too close to the original run, and a single repeated execution may still produce the same run-dependent identifier by chance, as we observed in reservations-api:

".. default message [email],[Ljavax.validation.constraints.Pattern$Flag;@5372cc34,.*];.."

Moreover, some unstable response elements may vary only under conditions not exercised by one immediate re-execution. Therefore, when re-execution does not report flakiness, our approach applies a lightweight inference procedure. The inference mechanism is implemented using pattern-based matching rules that replace potentially run-dependent elements with predefined placeholders, e.g., EM_POTENTIAL_OBJECT_FLAKINESS. Currently, our approach targets time-dependent values, randomness-dependent identifiers, cryptographic artifacts, and runtime-dependent messages, following widely adopted standards and specifications [1].
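This pattern-based normalization can be sketched as follows. The placeholder name follows the paper, but the concrete regular expressions below are our own simplified assumptions; the actual rule set of the tool is more extensive.

```java
import java.util.regex.Pattern;

// Sketch of the lightweight inference step: run-dependent substrings in a
// response body are rewritten to a fixed placeholder, so that two bodies can
// be compared modulo such volatile elements.
public class FlakinessInferenceSketch {
    private static final String PLACEHOLDER = "EM_POTENTIAL_OBJECT_FLAKINESS";

    // Illustrative rules only: ISO 8601 timestamps, RFC 4122 UUIDs, and
    // Java Object.toString() identity hashes such as "@5372cc34".
    private static final Pattern[] RULES = {
        Pattern.compile("\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}(\\.\\d+)?"),
        Pattern.compile("[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"),
        Pattern.compile("@[0-9a-f]{6,8}")
    };

    public static String normalize(String body) {
        for (Pattern rule : RULES) {
            body = rule.matcher(body).replaceAll(PLACEHOLDER);
        }
        return body;
    }

    public static void main(String[] args) {
        String s = "java.io.PushbackInputStream@67d4bd48 at 2026-12-03T06:38:31.272230";
        // Both the identity hash and the timestamp are replaced by the placeholder.
        System.out.println(normalize(s));
    }
}
```

Two responses are then considered equivalent for detection purposes when their normalized forms are identical, even if volatile fragments such as timestamps or object identity hashes differ.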
Note that when re-execution detection is enabled, the inference procedure is only triggered when re-execution does not report flakiness. However, the inference handler can also be employed independently to reduce the execution overhead associated with repeated test runs.

5.2 Handling Detected Flaky Tests

After identifying potentially flaky test cases, we apply automated post-processing to improve the stability and reusability of the generated test suites. Rather than discarding flaky tests entirely, our approach preserves their structural and behavioral coverage while mitigating nondeterministic failures. Specifically, we further analyze response differences to localize unstable assertions within each test. Our goal is to preserve stable checks while disabling only those assertions that depend on nondeterministic response fields. We handle assertions that validate the status code, headers, and response body content, e.g., .body(...) statements. Regarding body content, for each flaky test t, we extract all body-related assertions of the form .body(path, matcher), where path denotes a JSON field and matcher specifies the expected value. During assertion generation, we compare the values (i.e., v_f and v_r in Figure 3) at each path of the response bodies obtained in the fuzzing and post-processing phases. If the value associated with path differs between v_f and v_r, the corresponding assertion is commented out and annotated with additional flakiness information, including the path, the value observed during fuzzing, and the value observed during re-execution. The following illustrates an assertion before and after the post-processing:

Original assertion:

.body("'calculatedPastTime'", containsString("2026-12-03T06:38:31.272230"))

After post-processing:

// Flaky value of field "'calculatedPastTime'": 2026-12-03T06:38:31.272230 vs. 2026-12-03T06:38:30.713502
// .body("'calculatedPastTime'", containsString("2026-12-03T06:38:31.272230"))

[1] ISO 8601 for date and time representations, IEEE POSIX for Unix timestamps, RFC 4122 for UUIDs, RFC 4648 and RFC 7519 for Base64 and JSON Web Tokens, RFC 1321 and FIPS 180-4 for cryptographic hash functions, and the Java Platform Specification for runtime-generated identifiers and stack trace formats.

Figure 3 presents our procedure for selectively disabling flaky assertions. As in the example, assertions involving volatile fields are replaced with commented statements, while stable assertions are preserved. This allows the test to continue validating deterministic aspects of the API behavior. In addition, all modified assertions are explicitly marked in the generated code to facilitate manual inspection and future refinement by developers. This also ensures transparency and prevents the masking of genuine defects.

5.3 Experiment Settings

To evaluate the effectiveness of FlakyCatch in detecting and handling test flakiness, we integrated it into EvoMaster and conducted an experiment to answer the following RQ:

RQ3: How effective is FlakyCatch in detecting and mitigating flakiness in tests?

Experiment Settings. To study the performance on black-box and white-box tests, we enabled our approach in both modes of EvoMaster and ran each mode with a one-hour search budget, repeated 10 times, to generate tests. Since flakiness detection was performed only in the fuzzing environment, our approach could not capture environment-dependent flakiness. Therefore, we executed the generated tests only in FuzzEnv and repeated each test execution 100 times.
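Before turning to the results, the assertion-handling transformation described in Section 5.2 can be sketched as follows. This is our own simplified illustration, not the tool's actual code, and the class and method names are hypothetical.

```java
// Sketch of the handling step: a body assertion whose value differs between
// the fuzzing phase (vF) and re-execution (vR) is commented out and annotated
// with both observed values, while a stable assertion is left untouched.
public class FlakyAssertionHandlerSketch {

    public static String handle(String assertionLine, String path, String vF, String vR) {
        if (vF.equals(vR)) {
            return assertionLine; // stable: keep the assertion as-is
        }
        return "// Flaky value of field \"" + path + "\": " + vF + " vs. " + vR + "\n"
             + "// " + assertionLine;
    }

    public static void main(String[] args) {
        String line = ".body(\"'calculatedPastTime'\", containsString(\"2026-12-03T06:38:31.272230\"))";
        System.out.println(handle(line, "'calculatedPastTime'",
                "2026-12-03T06:38:31.272230", "2026-12-03T06:38:30.713502"));
    }
}
```

Because the transformation only comments out the unstable assertion and records both observed values, a developer can later decide whether to delete the check, relax it (e.g., match only a stable prefix), or restore it after fixing the underlying nondeterminism.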
5.4 Results of Flakiness Detection and Handling

By applying our flakiness-handling strategies, we were able to identify and handle a substantial number of flaky tests across the studied APIs. In Table 5, #RF denotes the number of resolved flaky assertions in tests, i.e., the assertions identified as flaky and handled by commenting them out. Note that, to avoid flakiness completely, a trivial solution would be to comment out every single assertion: a test cannot fail if it does not assert anything. However, such test cases would become useless for regression testing purposes. Our goal is to comment out only the minimal set of assertions needed to prevent flakiness, while still maintaining test case effectiveness. Table 5 shows that FlakyCatch is effective in both BB and WB settings. For example, on ocvn, our approach mitigates on average 6678.3 flaky assertions with BB and 10357.6 with WB over the 10 generations. Across all 31 flaky APIs, FlakyCatch resolves, on average, 10631.4 flaky assertions with WB and 9343.2 with BB. It also handles flaky assertions in 18 APIs with WB, compared with 13 APIs with BB. In Table 5, we also report the overall failure rate (FR%), the rate of consistent failures (FRc = #Fc/#F %), and the rate of unstable failures (FRu = #Fu/#F %), together with comparisons against tests generated without handling in terms of Â12 and relative improvement. In terms of FR%, with BB, the remaining flaky rates are reduced to FRc = 18.2% and FRu = 12.4%, while with WB they further decrease to 10.8% and 5.5%, respectively. Moreover, with the WB setting, our approach reduced FR in 18 of the 31 APIs, with 8 of these reductions being statistically significant (i.e., Â12 < 0.5 and p < 0.05), and completely resolved flakiness in 2 APIs (i.e., proxyprint and spring-ecommerce). With the BB setting, our approach reduced FR in 12 APIs, 10 of which showed statistically significant reductions.
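As an illustration of how these per-API rates can be computed from rerun verdicts, the sketch below classifies each test from its repeated executions. We assume here that the rates are taken relative to the total number of tests, so that FRc + FRu = FR as in the summary rows of Table 5; the class and method names are illustrative, not part of FlakyCatch.

```java
/**
 * Sketch: classify each test from its rerun verdicts and compute the
 * failure-rate metrics. A test is a consistent failure if it fails in
 * every run, and an unstable (flaky) failure if it fails in some runs
 * but not all. Rates are assumed to be percentages over all tests.
 */
public class FailureRates {

    // outcomes[i][j] == true means test i failed on run j
    public static double[] rates(boolean[][] outcomes) {
        int total = outcomes.length;
        int consistent = 0, unstable = 0;
        for (boolean[] runs : outcomes) {
            int fails = 0;
            for (boolean f : runs) if (f) fails++;
            if (fails == runs.length) consistent++;   // fails in every rerun
            else if (fails > 0) unstable++;           // fails only sometimes
        }
        double frc = 100.0 * consistent / total;
        double fru = 100.0 * unstable / total;
        return new double[] { frc + fru, frc, fru };  // {FR%, FRc%, FRu%}
    }

    public static void main(String[] args) {
        // 4 tests, 3 reruns each: one consistent failure, one unstable, two passing
        boolean[][] outcomes = {
            { true, true, true },
            { true, false, true },
            { false, false, false },
            { false, false, false },
        };
        double[] r = rates(outcomes);
        System.out.printf("FR=%.1f%% FRc=%.1f%% FRu=%.1f%%%n", r[0], r[1], r[2]);
    }
}
```

In our experiments each test would be re-executed 100 times rather than 3; the classification logic is otherwise the same.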
For consistent failures, the approach reduces FRc in 15 APIs in BB and 19 APIs in WB. Among these reductions, 9 in BB and 7 in WB are statistically significant. Moreover, the remaining consistent-failure rate is reduced to zero in 3 APIs with WB, whereas no API is reduced to zero in BB. For unstable failures, the approach reduces FRu in 7 APIs in both BB and WB. However, WB eliminates unstable failures completely in 9 APIs, compared with only 3 in BB, although only 2 WB reductions and 4 BB reductions are statistically significant. There are also cases where the failure rate increases, as observed in features-service, gestaohospital, microcks, spring-ecommerce, and webgoat. This suggests that some forms of flakiness cannot be effectively resolved without access to source code and internal state.

Table 5: Results for the number of resolved flaky assertions (#RF), failure rate (FR%), consistent-failure rate (FRc%), and unstable-failure rate (FRu%), together with their comparison results in terms of Â12 and relative improvement (Rel%).

SUT | Mode | #RF | FR% | Â12 | Rel% | FRc% | Â12 | Rel% | FRu% | Â12 | Rel%
blogapi | BB | 0.0 | 9.0 | 0.5 | 0.0 | 9.0 | 0.5 | 0.0 | 0.0 | 0.5 | NaN
blogapi | WB | 14.6 | 0.1 | 0.0 | -97.1 | 0.1 | 0.0 | -97.1 | 0.0 | 0.5 | NaN
catwatch | BB | 0.3 | 5.3 | 0.2 | -25.6 | 2.8 | 0.5 | -20.3 | 2.5 | 0.3 | -30.8
catwatch | WB | 0.0 | 4.5 | 0.5 | -5.8 | 4.5 | 0.5 | -5.8 | 0.0 | 0.5 | NaN
cwa-verification | BB | 0.0 | 100.0 | 0.5 | 0.0 | 100.0 | 0.5 | 0.0 | 0.0 | 0.5 | NaN
cwa-verification | WB | 0.0 | 100.0 | 0.5 | 0.0 | 100.0 | 0.5 | 0.0 | 0.0 | 0.5 | NaN
familie-ba-sak | BB | 0.0 | 0.6 | 0.5 | -0.7 | 0.6 | 0.5 | -0.7 | 0.0 | 0.5 | NaN
familie-ba-sak | WB | 7.2 | 1.1 | 0.5 | +0.1 | 1.1 | 0.5 | +0.1 | 0.0 | 0.5 | NaN
features-service | BB | 0.0 | 6.1 | 0.9 | +24.4 | 3.2 | 0.6 | -17.1 | 2.9 | 0.6 | +171.8
features-service | WB | 0.0 | 0.6 | 0.5 | +5.9 | 0.6 | 0.5 | +5.9 | 0.0 | 0.5 | NaN
genome-nexus | BB | 0.0 | 29.3 | 0.6 | +3.7 | 20.6 | 1.0 | +15.9 | 8.7 | 0.4 | -17.1
genome-nexus | WB | 0.1 | 22.0 | 0.4 | -5.1 | 21.0 | 0.4 | -5.7 | 1.0 | 0.5 | +11.0
gestaohospital | BB | 3.6 | 27.7 | 0.1 | -26.1 | 23.9 | 0.1 | -33.9 | 3.8 | 0.7 | +191.5
gestaohospital | WB | 0.3 | 11.1 | 0.7 | +54.2 | 4.0 | 0.3 | -26.3 | 7.2 | 0.7 | +289.4
http-patch-spring | BB | 0.0 | 0.0 | 0.0 | -100.0 | 0.0 | 0.0 | -100.0 | 0.0 | 0.2 | -100.0
http-patch-spring | WB | 2.8 | 38.8 | 0.5 | +1.8 | 38.8 | 0.5 | +1.8 | 0.0 | 0.5 | NaN
languagetool | BB | 0.0 | 0.0 | 0.5 | NaN | 0.0 | 0.5 | NaN | 0.0 | 0.5 | NaN
languagetool | WB | 0.0 | 0.0 | 0.4 | -100.0 | 0.0 | 0.5 | NaN | 0.0 | 0.4 | -100.0
market | BB | 5.4 | 15.1 | 0.0 | -29.6 | 8.7 | 0.3 | -15.0 | 6.3 | 0.1 | -43.1
market | WB | 12.0 | 16.2 | 0.2 | -23.2 | 9.8 | 0.3 | -15.4 | 6.3 | 0.2 | -32.8
microcks | BB | 15.5 | 16.0 | 0.8 | +16.9 | 1.3 | 0.0 | -89.3 | 14.7 | 1.0 | +1187.2
microcks | WB | 74.8 | 1.5 | 0.5 | +0.3 | 1.5 | 0.5 | +0.3 | 0.0 | 0.5 | NaN
ocvn | BB | 6678.3 | 23.1 | 0.0 | -19.3 | 3.3 | 0.1 | -29.8 | 19.8 | 0.0 | -17.2
ocvn | WB | 10357.6 | 46.8 | 0.0 | -15.3 | 1.6 | 0.0 | -63.1 | 45.2 | 0.0 | -11.3
ohsome-api | BB | 0.0 | 0.0 | 0.5 | NaN | 0.0 | 0.5 | NaN | 0.0 | 0.5 | NaN
ohsome-api | WB | 0.0 | 0.0 | 0.3 | -100.0 | 0.0 | 0.3 | -100.0 | 0.0 | 0.5 | NaN
pay-publicapi | BB | 0.0 | 69.1 | 0.6 | +3.7 | 0.0 | 0.5 | NaN | 69.1 | 0.6 | +3.7
pay-publicapi | WB | 0.0 | 1.5 | 0.3 | -18.2 | 1.5 | 0.3 | -18.2 | 0.0 | 0.5 | NaN
person-controller | BB | 0.0 | 31.7 | 0.3 | -7.6 | 31.7 | 0.3 | -7.6 | 0.0 | 0.5 | NaN
person-controller | WB | 4.3 | 23.8 | 0.0 | -20.3 | 23.8 | 0.0 | -20.3 | 0.0 | 0.5 | NaN
proxyprint | BB | 2472.0 | 83.3 | 0.2 | -0.6 | 75.0 | 0.0 | -10.5 | 8.3 | 1.0 | +Inf
proxyprint | WB | 11.4 | 0.0 | 0.0 | -99.1 | 0.0 | 0.0 | -100.0 | 0.0 | 0.4 | -74.8
quartz-manager | BB | 0.0 | 29.6 | 0.5 | 0.0 | 7.4 | 0.5 | 0.0 | 22.2 | 0.5 | 0.0
quartz-manager | WB | 0.0 | 16.1 | 0.6 | +22.9 | 16.1 | 0.6 | +22.9 | 0.0 | 0.5 | NaN
reservations-api | BB | 0.0 | 24.8 | 0.6 | +96.6 | 12.6 | 0.5 | 0.0 | 12.2 | 0.6 | +Inf
reservations-api | WB | 41.6 | 45.2 | 0.5 | +19.2 | 10.0 | 0.4 | -11.0 | 35.2 | 0.5 | +31.8
rest-news | BB | 0.7 | 26.3 | 0.1 | -34.9 | 26.3 | 0.3 | -19.4 | 0.0 | 0.0 | -100.0
rest-news | WB | 0.0 | 3.7 | 0.5 | +0.4 | 3.7 | 0.5 | +0.4 | 0.0 | 0.5 | NaN
rest-scs | BB | 0.0 | 0.0 | 0.5 | NaN | 0.0 | 0.5 | NaN | 0.0 | 0.5 | NaN
rest-scs | WB | 0.0 | 0.4 | 0.6 | +209.3 | 0.4 | 0.6 | +209.3 | 0.0 | 0.5 | NaN
restcountries | BB | 0.0 | 0.0 | 0.5 | NaN | 0.0 | 0.5 | NaN | 0.0 | 0.5 | NaN
restcountries | WB | 0.0 | 18.0 | 0.4 | -4.0 | 18.0 | 0.4 | -4.0 | 0.0 | 0.5 | NaN
scout-api | BB | 75.4 | 20.0 | 0.0 | -39.0 | 14.9 | 0.1 | -50.0 | 5.0 | 0.6 | +77.4
scout-api | WB | 11.0 | 1.4 | 0.3 | -37.6 | 0.7 | 0.5 | -39.1 | 0.7 | 0.4 | -35.9
session-service | BB | 0.0 | 0.0 | 0.5 | NaN | 0.0 | 0.5 | NaN | 0.0 | 0.5 | NaN
session-service | WB | 1.0 | 0.0 | 0.4 | -100.0 | 0.0 | 0.4 | -100.0 | 0.0 | 0.5 | NaN
spring-batch-rest | BB | 0.0 | 0.0 | 0.5 | NaN | 0.0 | 0.5 | NaN | 0.0 | 0.5 | NaN
spring-batch-rest | WB | 29.3 | 22.0 | 0.1 | -29.3 | 21.0 | 0.1 | -32.5 | 1.0 | 0.6 | +Inf
spring-ecommerce | BB | 2.0 | 3.2 | 0.9 | +52.4 | 3.2 | 1.0 | +97.4 | 0.0 | 0.3 | -100.0
spring-ecommerce | WB | 6.1 | 0.0 | 0.0 | -100.0 | 0.0 | 0.0 | -100.0 | 0.0 | 0.5 | NaN
spring-rest-example | BB | 0.0 | 1.6 | 0.6 | +215.8 | 1.6 | 0.6 | +215.8 | 0.0 | 0.5 | NaN
spring-rest-example | WB | 0.0 | 1.3 | 0.3 | -61.9 | 1.3 | 0.3 | -61.9 | 0.0 | 0.5 | NaN
swagger-petstore | BB | 0.0 | 43.0 | 0.7 | +10.5 | 34.3 | 0.5 | -1.0 | 8.7 | 0.8 | +102.1
swagger-petstore | WB | 5.8 | 15.3 | 0.4 | -12.2 | 15.3 | 0.4 | -12.2 | 0.0 | 0.5 | NaN
tracking-system | BB | 62.8 | 59.1 | 0.0 | -20.6 | 40.8 | 0.0 | -42.0 | 18.3 | 1.0 | +346.6
tracking-system | WB | 43.0 | 45.3 | 0.4 | -3.2 | 42.8 | 0.4 | -1.1 | 2.5 | 0.2 | -28.7
user-management | BB | 6.5 | 27.4 | 0.0 | -29.2 | 25.2 | 0.0 | -34.3 | 2.3 | 0.9 | +425.5
user-management | WB | 8.5 | 2.8 | 0.1 | -68.0 | 2.8 | 0.1 | -68.0 | 0.0 | 0.5 | NaN
webgoat | BB | 19.9 | 89.6 | 1.0 | +144.8 | 9.3 | 0.0 | -74.4 | 80.3 | 1.0 | +25953.0
webgoat | WB | 0.0 | 0.4 | 0.5 | +0.0 | 0.4 | 0.5 | +0.0 | 0.0 | 0.5 | NaN
youtube-mock | BB | 0.8 | 1.7 | 0.5 | +84.6 | 1.7 | 0.5 | +84.6 | 0.0 | 0.5 | NaN
youtube-mock | WB | 0.0 | 0.0 | 0.5 | NaN | 0.0 | 0.5 | NaN | 0.0 | 0.5 | NaN
Summary-BB | -- | 9343.2 | 30.6 | -- | -- | 18.2 | -- | -- | 12.4 | -- | --
Handled APIs (BB) | -- | 13 | -- | -- | -- | -- | -- | -- | -- | -- | --
Reduced APIs (significant) (BB) | -- | -- | 12 (10) | -- | -- | 15 (9) | -- | -- | 7 (4) | -- | --
Reduced to 0 (BB) | -- | -- | 0 | -- | -- | 0 | -- | -- | 3 | -- | --
Summary-WB | -- | 10631.4 | 16.3 | -- | -- | 10.8 | -- | -- | 5.5 | -- | --
Handled APIs (WB) | -- | 18 | -- | -- | -- | -- | -- | -- | -- | -- | --
Reduced APIs (significant) (WB) | -- | -- | 18 (8) | -- | -- | 19 (7) | -- | -- | 7 (2) | -- | --
Reduced to 0 (WB) | -- | -- | 2 | -- | -- | 3 | -- | -- | 9 | -- | --

Findings of RQ3: FlakyCatch effectively reduces flaky tests in both BB and WB settings, with better overall performance in WB. It reduces both consistent and unstable failures, and can completely resolve flakiness for some APIs. However, its effectiveness varies across APIs, and some residual flakiness remains difficult to address without access to source code and internal state.

6 Threats To Validity

Construct validity. Our taxonomy of flakiness sources in REST API testing is derived from a manual review of nearly 3,000 failing tests. This process may introduce human errors and subjective bias. To mitigate this threat, each classification was independently assessed by two authors, with each disagreement resolved through discussion. Internal validity.
Due to the inherent randomness of fuzzing, experimental results may vary across runs. To mitigate this threat, we repeated each experiment 10 times and analyzed observations across repetitions. External validity. Our empirical evaluation is based on a dataset of 36 open-source APIs. While relatively large, this sample is by no means representative of all APIs, particularly those developed and deployed in industrial settings. Therefore, we acknowledge that the generalizability of our findings may be limited. Conclusion validity. Our study focuses on test cases generated using a single fuzzer: EvoMaster . As different fuzzing or test generation techniques may exhibit different characteristics, our findings may not directly generalize to other approaches. Nevertheless, our observations provide empirical evidence that can serve as a baseline for comparison in future studies. To support this, we make all collected data publicly available, including the manually labeled failing test cases, enabling replication and further investigation by the research community. 7 Conclusions In this paper, we presented, to the best of our knowledge, the first systematic study of flakiness in test cases generated by fuzzing techniques for REST APIs. Our empirical study on 36 APIs, based on test cases generated by EvoMaster under both black-box and white-box settings, led to the identification of a taxonomy comprising nine distinct sources of flakiness. Building on this taxonomy, we designed and evaluated FlakyCatch to detect and mitigate flakiness in tests generated by white-box and black-box fuzzers for REST APIs. As future work, we plan to extend our analysis to test cases generated by other techniques and to develop more advanced detection and mitigation strategies to reduce the impact of flakiness while preserving test effectiveness. Acknowledgments This work is supported by the National Science Foundation of China (grant agreement No. 62502022). 
Andrea Arcuri is funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (EAST project, grant agreement No. 864972).

Data Availability Statement

All our analysis results, and the code extension to EvoMaster to handle flakiness, are available at: https://anonymous.4open.science/r/FlakyCatch-8312.

References

[1] [n. d.]. Open API Specification. https://swagger.io/specification/.
[2] Amal Akli, Guillaume Haben, Sarra Habchi, Mike Papadakis, and Yves Le Traon. 2023. FlakyCat: Predicting flaky tests categories using few-shot learning. In 2023 IEEE/ACM International Conference on Automation of Software Test (AST). IEEE, 140--151.
[3] Abdulrahman Alshammari, Christopher Morris, Michael Hilton, and Jonathan Bell. 2021. FlakeFlagger: Predicting flakiness without rerunning tests. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 1572--1584.
[4] Andrea Arcuri. 2017. RESTful API Automated Test Case Generation. In IEEE International Conference on Software Quality, Reliability and Security (QRS). IEEE, 9--20.
[5] Andrea Arcuri. 2018. Test suite generation with the Many Independent Objective (MIO) algorithm. Information and Software Technology 104 (2018), 195--206.
[6] Andrea Arcuri and Juan P Galeotti. 2020. Handling SQL databases in automated system test generation. ACM Transactions on Software Engineering and Methodology (TOSEM) 29, 4 (2020), 1--31.
[7] Andrea Arcuri and Juan P Galeotti. 2021. Enhancing Search-based Testing with Testability Transformations for Existing APIs. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 1 (2021), 1--34.
[8] A. Arcuri, A. Poth, and O. Rrjolli. 2025. Introducing Black-Box Fuzz Testing for REST APIs in Industry: Challenges and Solutions. In IEEE International Conference on Software Testing, Verification and Validation (ICST).
[9] Andrea Arcuri, Omur Sahin, and Man Zhang. 2025.
Fuzzing for Detecting Access Policy Violations in REST APIs. In IEEE International Symposium on Software Reliability Engineering (ISSRE).
[10] Andrea Arcuri, Man Zhang, and Juan Pablo Galeotti. 2023. Advanced White-Box Heuristics for Search-Based Fuzzing of REST APIs. arXiv preprint arXiv:2309.08360 (2023).
[11] Andrea Arcuri, Man Zhang, and Juan Pablo Galeotti. 2024. Advanced White-Box Heuristics for Search-Based Fuzzing of REST APIs. ACM Transactions on Software Engineering and Methodology (TOSEM) (2024). doi: 10.1145/3652157
[12] Andrea Arcuri, Man Zhang, Amid Golmohammadi, Asma Belhadi, Juan P Galeotti, Bogdan Marculescu, and Susruthan Seran. 2023. EMB: A curated corpus of web/enterprise applications and library support for software testing research. In 2023 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 433--442.
[13] Andrea Arcuri, Man Zhang, Susruthan Seran, Juan Pablo Galeotti, Amid Golmohammadi, Onur Duman, Agustina Aldasoro, and Hernan Ghianni. 2025. Tool report: EvoMaster—black and white box search-based fuzzing for REST, GraphQL and RPC APIs. Automated Software Engineering 32, 1 (2025), 1--11.
[14] Vaggelis Atlidakis, Patrice Godefroid, and Marina Polishchuk. 2019. RESTler: Stateful REST API Fuzzing. In ACM/IEEE International Conference on Software Engineering (ICSE). 748--758.
[15] Asma Belhadi, Man Zhang, and Andrea Arcuri. 2023. Random Testing and Evolutionary Testing for Fuzzing GraphQL APIs. ACM Transactions on the Web (2023).
[16] Jonathan Bell, Owolabi Legunsen, Michael Hilton, Lamyaa Eloussi, Tifany Yung, and Darko Marinov. 2018. DeFlaker: Automatically detecting flaky tests. In Proceedings of the 40th International Conference on Software Engineering. 433--444.
[17] Alexander Berndt, Thomas Bach, Rainer Gemulla, Marcus Kessel, and Sebastian Baltes. 2026. On the Flakiness of LLM-Generated Tests for Industrial and Open-Source Database Management Systems. arXiv preprint arXiv:2601.08998 (2026).
[18] Marcel Böhme, László Szekeres, and Jonathan Metzman. 2022. On the reliability of coverage-based fuzzer benchmarking. In Proceedings of the 44th International Conference on Software Engineering. 1621--1633.
[19] Davide Corradini, Zeno Montolli, Michele Pasqua, and Mariano Ceccato. 2024. DeepREST: Automated Test Case Generation for REST APIs Exploiting Deep Reinforcement Learning. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 1383--1394.
[20] Gelei Deng, Zhiyi Zhang, Yuekang Li, Yi Liu, Tianwei Zhang, Yang Liu, Guo Yu, and Dongjin Wang. 2023. NAUTILUS: Automated RESTful API Vulnerability Detection. In 32nd USENIX Security Symposium (USENIX Security 23). 5593--5609.
[21] Zhen Dong, Abhishek Tiwari, Xiao Liang Yu, and Abhik Roychoudhury. 2020. Concurrency-related flaky test detection in android apps. arXiv preprint arXiv:2005.10762 (2020).
[22] Wenlong Du, Jian Li, Yanhao Wang, Libo Chen, Ruijie Zhao, Junmin Zhu, Zhengguang Han, Yijun Wang, and Zhi Xue. 2024. Vulnerability-oriented testing for RESTful APIs. In 33rd USENIX Security Symposium (USENIX Security 24). USENIX Association, 739--755.
[23] Moritz Eck, Fabio Palomba, Marco Castelluccio, and Alberto Bacchelli. 2019. Understanding flaky tests: The developer's perspective. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 830--840.
[24] Sakina Fatima, Taher A Ghaleb, and Lionel Briand. 2022. Flakify: A black-box, language model-based predictor for flaky tests. IEEE Transactions on Software Engineering 49, 4 (2022), 1912--1927.
[25] Sakina Fatima, Hadi Hemmati, and Lionel C Briand. 2024. FlakyFix: Using large language models for predicting flaky test fix categories and test code repair. IEEE Transactions on Software Engineering 50, 12 (2024), 3146--3171.
[26] Matúš Ferech and Pavel Tvrdík. 2023.
Efficient fuzz testing of web services. In 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security (QRS). IEEE, 291--300.
[27] Myles Foley and Sergio Maffeis. 2025. APIRL: Deep Reinforcement Learning for REST API Fuzzing. In Thirty-ninth Conference on Artificial Intelligence (AAAI 2025).
[28] Hernan Ghianni, Man Zhang, Juan P Galeotti, and Andrea Arcuri. 2025. Search-Based Fuzzing For RESTful APIs That Use MongoDB. arXiv preprint arXiv:2507.20848 (2025).
[29] Amid Golmohammadi, Man Zhang, and Andrea Arcuri. 2023. Testing RESTful APIs: A Survey. ACM Transactions on Software Engineering and Methodology (aug 2023). doi: 10.1145/3617175
[30] Martin Gruber and Gordon Fraser. 2022. A survey on how test flakiness affects developers and what support they need to address it. In 2022 IEEE Conference on Software Testing, Verification and Validation (ICST). 82--92. doi: 10.1109/ICST53961.2022.00020
[31] Martin Gruber, Muhammad Firhard Roslan, Owain Parry, Fabian Scharnböck, Phil McMinn, and Gordon Fraser. 2024. Do automatic test generation tools generate flaky tests?. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1--12.
[32] Zac Hatfield-Dodds and Dmitry Dygalo. 2022. Deriving Semantics-Aware Fuzzers from Web API Schemas. In 2022 IEEE/ACM 44th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE, 345--346.
[33] Ahmad Hazimeh, Adrian Herrera, and Mathias Payer. 2020. Magma: A ground-truth fuzzing benchmark. Proceedings of the ACM on Measurement and Analysis of Computing Systems 4, 3 (2020), 1--29.
[34] Stefan Karlsson, Adnan Čaušević, and Daniel Sundmark. 2020. Automatic Property-based Testing of GraphQL APIs. arXiv preprint arXiv:2012.07380 (2020).
[35] Myeongsoo Kim, Saurabh Sinha, and Alessandro Orso. 2023. Adaptive REST API testing with reinforcement learning.
In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 446--458.
[36] Myeongsoo Kim, Saurabh Sinha, and Alessandro Orso. 2025. LlamaRestTest: Effective REST API Testing with Small Language Models. In ACM Symposium on the Foundations of Software Engineering (FSE).
[37] Myeongsoo Kim, Tyler Stennett, Saurabh Sinha, and Alessandro Orso. 2025. A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs. ACM/IEEE International Conference on Software Engineering (ICSE) (2025).
[38] Myeongsoo Kim, Qi Xin, Saurabh Sinha, and Alessandro Orso. 2022. Automated Test Generation for REST APIs: No Time to Rest Yet. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (Virtual, South Korea) (ISSTA 2022). Association for Computing Machinery, New York, NY, USA, 289--301. doi: 10.1145/3533767.3534401
[39] Nuno Laranjeiro, João Agnelo, and Jorge Bernardino. 2021. A black box tool for robustness testing of REST services. IEEE Access 9 (2021), 24738--24754.
[40] Tri Le, Thien Tran, Duy Cao, Vy Le, Tien N Nguyen, and Vu Nguyen. 2024. KAT: Dependency-aware automated API testing with large language models. In 2024 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 82--92.
[41] Tanakorn Leesatapornwongsa, Xiang Ren, and Suman Nath. 2022. FlakeRepro: Automated and efficient reproduction of concurrency-related flaky tests. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1509--1520.
[42] Yuwei Li, Shouling Ji, Yuan Chen, Sizhuang Liang, Wei-Han Lee, Yueyao Chen, Chenyang Lyu, Chunming Wu, Raheem Beyah, Peng Cheng, et al. 2021. UNIFUZZ: A holistic and pragmatic Metrics-Driven platform for evaluating fuzzers. In 30th USENIX Security Symposium (USENIX Security 21). 2777--2794.
[43] Zheyuan Li, Zhenyu Wu, Yan Lei, Huan Xie, Maojin Li, and Jian Hu. 2025.
HiFlaky: Hierarchy-Aware Flakiness Classification. Journal of Systems and Software (2025), 112741.
[44] Jiangchao Liu, Jierui Liu, Peng Di, Alex X Liu, and Zexin Zhong. 2022. Record and replay of online traffic for microservices with automatic mocking point identification. In Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice. 221--230.
[45] Yi Liu, Yuekang Li, Gelei Deng, Yang Liu, Ruiyuan Wan, Runchao Wu, Dandan Ji, Shiheng Xu, and Minli Bao. 2022. Morest: Model-based RESTful API Testing with Execution Feedback. In ACM/IEEE International Conference on Software Engineering (ICSE).
[46] Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov. 2014. An empirical analysis of flaky tests. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 643--653.
[47] Chenyang Lyu, Jiacheng Xu, Shouling Ji, Xuhong Zhang, Qinying Wang, Binbin Zhao, Gaoning Pan, Wei Cao, Peng Chen, and Raheem Beyah. 2023. MINER: A Hybrid Data-Driven Approach for REST API Fuzzing. In 32nd USENIX Security Symposium (USENIX Security 23). 4517--4534.
[48] Alberto Martin-Lopez, Sergio Segura, and Antonio Ruiz-Cortés. 2021. RESTest: Automated Black-Box Testing of RESTful Web APIs. In ACM Int. Symposium on Software Testing and Analysis (ISSTA). ACM, 682--685.
[49] Jonathan Metzman, László Szekeres, Laurent Simon, Read Sprabery, and Abhishek Arya. 2021. FuzzBench: an open fuzzer benchmarking platform and service. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1393--1403.
[50] Miao Miao, Sriteja Kummita, Eric Bodden, and Shiyi Wei. 2025. Program Feature-Based Benchmarking for Fuzz Testing. Proceedings of the ACM on Software Engineering 2, ISSTA (2025), 527--549.
[51] Jesús Morán Barbón, Cristian Augusto Alonso, Antonia Bertolino, Claudio A Riva Álvarez, José Florentino García Tuya, et al. 2019. Debugging flaky tests on web applications. In Proceedings of the 15th International Conference on Web Information Systems and Technologies - Volume 1: APMDWE.
[52] Kiet Ngo, Vu Nguyen, and Tien Nguyen. 2022. Research on test flakiness: from unit to system testing. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1--4.
[53] Olek Osikowicz, Phil McMinn, and Donghwan Shin. 2025. Empirically evaluating flaky tests for autonomous driving systems in simulated environments. In 2025 IEEE/ACM International Flaky Tests Workshop (FTW). IEEE, 13--20.
[54] Jiradet Ounjai, Valentin Wüstholz, and Maria Christakis. 2023. Green fuzzer benchmarking. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1396--1406.
[55] Lianglu Pan, Shaanan Cohney, Toby Murray, and Van-Thuan Pham. 2025. Trailblazer: Practical End-to-end Web API Fuzzing (Registered Report). In Proceedings of the 34th ACM SIGSOFT International Symposium on Software Testing and Analysis. 143--152.
[56] Owain Parry, Gregory M Kapfhammer, Michael Hilton, and Phil McMinn. 2021. A survey of flaky tests. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 1 (2021), 1--74.
[57] Owain Parry, Gregory M Kapfhammer, Michael Hilton, and Phil McMinn. 2025. Test flimsiness: Characterizing flakiness induced by mutation to the code under test. In Proceedings of 2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE'26). ACM.
[58] Alexander Poth, Olsi Rrjolli, and Andrea Arcuri. 2025. Technology adoption performance evaluation applied to testing industrial REST APIs. Automated Software Engineering 32, 1 (2025), 5.
[59] Antonio Quiña-Mera, Pablo Fernandez, José María García, and Antonio Ruiz-Cortés. 2023. GraphQL: A systematic mapping study.
ACM Computing Surveys 55, 10 (2023), 1--35.
[60] Shanto Rahman, Abdelrahman Baz, Sasa Misailovic, and August Shi. 2024. Quantizing large-language models for predicting flaky tests. In 2024 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 93--104.
[61] Shanto Rahman and August Shi. 2024. FlakeSync: Automatically repairing async flaky tests. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1--12.
[62] Thomas Rooijakkers, Anne Nijsten, Cristian Daniele, Erieke Weitenberg, Ringo Groenewegen, and Arthur Melissen. 2025. WuppieFuzz: Coverage-Guided, Stateful REST API Fuzzing. arXiv preprint arXiv:2512.15554 (2025).
[63] Diptikalyan Saha, Devika Sondhi, Swagatam Haldar, and Saurabh Sinha. 2025. REST API Functional Tester. In Proceedings of the 18th Innovations in Software Engineering Conference. 1--11.
[64] Omur Sahin, Man Zhang, and Andrea Arcuri. 2025. WFC/WFD: Web Fuzzing Commons, Dataset and Guidelines to Support Experimentation in REST API Fuzzing. arXiv preprint arXiv:2509.01612 (2025).
[65] Tom Schroeder, Minh Phan, and Yang Chen. 2025. A Preliminary Study of Fixed Flaky Tests in Rust Projects on GitHub. In 2025 IEEE/ACM International Flaky Tests Workshop (FTW). IEEE, 21--22.
[66] Susruthan Seran, Man Zhang, Onur Duman, and Andrea Arcuri. 2025. Handling Web Service Interactions in Fuzzing with Search-Based Mock-Generation. ACM Transactions on Software Engineering and Methodology (2025).
[67] Devika Sondhi, Ananya Sharma, and Diptikalyan Saha. 2025. Utilizing API Response for Test Refinement. arXiv preprint arXiv:2501.18145 (2025).
[68] Emanuele Viglianisi, Michael Dallago, and Mariano Ceccato. 2020. RESTTESTGEN: Automated Black-Box Testing of RESTful APIs. In IEEE International Conference on Software Testing, Verification and Validation (ICST). IEEE.
[69] Wei Wang, Andrei Benea, and Franjo Ivancic. 2023. Zero-Config Fuzzing for Microservices.
In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 1840--1845.
[70] Yu Wang and Yue Xu. 2024. Beyond REST: Introducing APIF for Comprehensive API Vulnerability Fuzzing. In Proceedings of the 27th International Symposium on Research in Attacks, Intrusions and Defenses. 435--449.
[71] Huayao Wu, Lixin Xu, Xintao Niu, and Changhai Nie. 2022. Combinatorial Testing of RESTful APIs. In ACM/IEEE International Conference on Software Engineering (ICSE).
[72] Lixin Xu, Huayao Wu, Zhenyu Pan, Tongtong Xu, Shaohua Wang, Xintao Niu, and Changhai Nie. 2025. Effective REST APIs Testing with Error Message Analysis. Proceedings of the ACM on Software Engineering 2, ISSTA (2025), 1978--2000.
[73] Ke Zhang, Chenxi Zhang, Chong Wang, Chi Zhang, YaChen Wu, Zhenchang Xing, Yang Liu, Qingshan Li, and Xin Peng. 2025. LogiAgent: Automated Logical Testing for REST Systems with LLM-Based Multi-Agents. arXiv preprint arXiv:2503.15079 (2025).
[74] Man Zhang and Andrea Arcuri. 2021. Adaptive Hypermutation for Search-Based System Test Generation: A Study on REST APIs with EvoMaster. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 1 (2021).
[75] Man Zhang and Andrea Arcuri. 2023. Open Problems in Fuzzing RESTful APIs: A Comparison of Tools. ACM Transactions on Software Engineering and Methodology (TOSEM) (may 2023). doi: 10.1145/3597205
[76] Man Zhang, Andrea Arcuri, Yonggang Li, Yang Liu, and Kaiming Xue. 2023. White-Box Fuzzing RPC-Based APIs with EvoMaster: An Industrial Case Study. ACM Transactions on Software Engineering and Methodology 32, 5 (2023), 1--38.
[77] Man Zhang, Andrea Arcuri, Yonggang Li, Yang Liu, Kaiming Xue, Zhao Wang, Jian Huo, and Weiwei Huang. 2025. Fuzzing microservices: A series of user studies in industry on industrial systems with EvoMaster. Science of Computer Programming (2025), 103322.
[78] Man Zhang, Andrea Arcuri, Piyun Teng, Kaiming Xue, and Wenhao Wang. 2024.
Seeding and Mocking in White-Box Fuzzing Enterprise RPC APIs: An Industrial Case Study. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 2024--2034.
[79] Man Zhang, Bogdan Marculescu, and Andrea Arcuri. 2021. Resource and dependency based test case generation for RESTful Web services. Empirical Software Engineering 26, 4 (2021), 1--61.
