Cross-Project Flakiness: A Case Study of the OpenStack Ecosystem
Automated regression testing is a cornerstone of modern software development, underpinning both code review and Continuous Integration (CI). Yet some tests suffer from flakiness, where their outcomes vary non-deterministically. Flakiness erodes developer trust in test results, wastes computational resources, and undermines CI reliability. While prior research has examined test flakiness within individual projects, its broader ecosystem-wide impact remains largely unexplored. In this paper, we present an empirical study of test flakiness in the OpenStack ecosystem, which focuses on (1) cross-project flakiness, where flaky tests impact multiple projects, and (2) inconsistent flakiness, where a test exhibits flakiness in some projects but remains stable in others. By analyzing 649 OpenStack projects, we identify 1,535 cross-project flaky tests and 1,105 inconsistently flaky tests. We find that cross-project flakiness affects 55% of OpenStack projects and significantly increases both review time and computational costs. Surprisingly, unit tests account for 70% of cross-project flaky tests, challenging the assumption that they are insulated from the cross-module issues that affect integration and system-level tests. Through qualitative analysis, we observe that race conditions in CI, inconsistent build configurations, and dependency mismatches are the primary causes of inconsistent flakiness. These findings underline the need for better coordination across complex ecosystems, standardized CI configurations, and improved test isolation strategies.
💡 Research Summary
This paper presents the first large‑scale empirical investigation of test flakiness that propagates across projects within a software ecosystem, using OpenStack as a case study. The authors collected a year‑long snapshot (June 2023 – June 2024) of closed Gerrit code reviews from 649 OpenStack repositories, amounting to 29,175 reviews and 73,707 patch sets. By parsing CI build logs and review comments, they identified “flaky builds” (identical patch sets that sometimes pass and sometimes fail) and linked them to the underlying flaky tests. Two phenomena are defined: (1) cross‑project flakiness, where the same flaky test appears in multiple projects, and (2) inconsistent flakiness, where a test is flaky in some projects but stable in others.
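The flaky-build identification step described above can be sketched as a simple grouping exercise: builds of the same patch set that ran the same CI job but produced different outcomes are flagged. The record fields and values below are illustrative placeholders, not the paper's actual data schema.

```python
from collections import defaultdict

# Hypothetical build records: (change_id, patchset, job_name, result).
# Field names and values are illustrative, not the paper's schema.
builds = [
    ("I1a2b", 3, "openstack-tox-py3", "SUCCESS"),
    ("I1a2b", 3, "openstack-tox-py3", "FAILURE"),
    ("I1a2b", 3, "tempest-full", "SUCCESS"),
    ("I9f8e", 1, "tempest-full", "FAILURE"),
    ("I9f8e", 1, "tempest-full", "FAILURE"),
]

def flaky_builds(builds):
    """Group builds by (change, patchset, job); a group whose identical
    inputs produced both passing and failing runs is flagged as flaky."""
    outcomes = defaultdict(set)
    for change, patchset, job, result in builds:
        outcomes[(change, patchset, job)].add(result)
    return [key for key, results in outcomes.items() if len(results) > 1]

# Only the first group mixes SUCCESS and FAILURE, so only it is reported.
print(flaky_builds(builds))
```

Linking a flagged build back to the specific flaky test would then require parsing the job's test output, which the study does via CI build logs and review comments.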
The quantitative results are striking: 1,535 distinct tests exhibit cross‑project flakiness and 1,105 tests show inconsistent flakiness, affecting 55% of all OpenStack projects. The prevalence of cross‑project flakiness has grown over the studied year, indicating a cumulative ecosystem‑wide risk. When broken down by test scope, unit tests—traditionally considered isolated—are implicated in 70% of cross‑project flaky cases. API tests (64%) and scenario tests (41%) also show high propagation rates, while integration and system tests are less dominant but still contribute. This challenges the common assumption that unit tests are immune to ecosystem‑level instability.
A qualitative analysis, supported by a questionnaire answered by 15 core contributors and maintainers, uncovers the root causes of inconsistent flakiness. The dominant factor is race conditions within the CI pipeline (89% of cases), followed by mismatched CI configurations and dependency‑management issues (each accounting for about 21%). Race conditions arise when concurrent builds share mutable resources (e.g., shared virtual machines, databases, or temporary files) or when the order of initialization differs across projects. Configuration mismatches include divergent environment variables, differing versions of common libraries, and divergent job definitions inherited from shared templates (e.g., devstack‑tempest). Dependency mismatches stem from projects pinning different versions of the same library, leading to divergent test outcomes even when the test code itself is identical.
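The shared-mutable-resource failure mode can be illustrated in miniature. The sketch below (illustrative only, not taken from the paper or from OpenStack code) contrasts a test that hardcodes one shared path, and therefore can race when builds run concurrently on the same host, with an isolated variant that gives each run its own file.

```python
import os
import tempfile

# Illustrative sketch: two "tests" share one fixed path instead of a
# per-test temp file, so concurrent CI runs on the same host can race.
SHARED = os.path.join(tempfile.gettempdir(), "fixture.txt")

def racy_test(payload):
    with open(SHARED, "w") as f:   # every concurrent run writes here
        f.write(payload)
    with open(SHARED) as f:        # may observe another run's data
        return f.read() == payload

def isolated_test(payload):
    # Fix in the spirit of the test-isolation recommendations:
    # each run gets its own unique file, so runs cannot interfere.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "w") as f:
            f.write(payload)
        with open(path) as f:
            return f.read() == payload
    finally:
        os.remove(path)
```

Both functions pass when run alone, which is exactly why such tests look stable in one project's CI and flaky in another's: the outcome depends on what else happens to be running.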
A concrete example illustrates the problem: two separate code reviews—one in the Cinder project and another in the Glance project—triggered the same Tempest API test. In Cinder the CI job succeeded, while in Glance it failed, despite both builds inheriting the same job definition and running the same test code. The failing test was later observed to be flaky in 14 additional OpenStack projects, with the underlying cause traced to a Libvirt change that interfered with volume detachment during cleanup. This case exemplifies how a single flaky test can waste CI resources (over an hour of build time) and delay code review across multiple projects.
The authors quantify the impact on development workflow: flaky tests increase review turnaround time and inflate CI resource consumption, echoing prior industry reports that flakiness can consume 2–16 % of computational resources. Survey respondents confirmed that flaky tests are a major barrier to rapid deployment and suggested mitigation strategies, including: (a) moving away from the “recheck‑and‑wait” approach toward early detection and quarantine of flaky tests; (b) standardizing CI configurations across projects to eliminate environment drift; (c) enforcing strict dependency version pinning; and (d) improving test isolation through containerization or sandboxed execution environments.
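Mitigation (c), strict dependency pinning, is commonly implemented with a shared constraints file that every project's CI consumes, similar in spirit to OpenStack's upper-constraints mechanism. The fragment below is a hypothetical sketch; the package names are real OpenStack libraries but the pinned versions are placeholders, not actual recommendations.

```shell
# Illustrative only: pin shared dependencies once, reuse across projects.
cat > constraints.txt <<'EOF'
oslo.config==9.4.0     # placeholder versions, not real ecosystem pins
SQLAlchemy==2.0.30
EOF

# pip's -c flag applies the pins without making them direct requirements,
# so every project resolves to the same library versions in CI.
pip install -r requirements.txt -c constraints.txt
```

Because the constraints file is maintained centrally, bumping a library version becomes a single reviewed change rather than hundreds of independently drifting pins.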
To enable replication, the authors release a comprehensive dataset comprising raw CI logs, manually labeled flaky test instances, anonymized questionnaire responses, and the analysis scripts. They outline future work directions: (i) extending the study to other large ecosystems such as Kubernetes or Apache projects; (ii) developing automated detection tools that flag potential cross‑project flaky tests during CI; and (iii) formulating ecosystem‑wide CI design guidelines that mitigate race conditions and configuration divergence.
In conclusion, the study demonstrates that test flakiness is not confined to individual repositories; it can cascade through shared CI infrastructure, common test suites, and inter‑project dependencies, affecting a majority of projects in a complex ecosystem like OpenStack. Addressing this requires coordinated effort: standardized CI pipelines, rigorous dependency management, and stronger test isolation mechanisms. By highlighting the scale and cost of cross‑project flakiness, the paper provides a compelling call to action for both researchers and practitioners to treat flaky tests as a systemic, ecosystem‑level quality concern rather than an isolated nuisance.