Understanding and Detecting Flaky Builds in GitHub Actions

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Continuous Integration (CI) is widely used to provide rapid feedback on code changes; however, CI build outcomes are not always reliable. Builds may fail intermittently due to non-deterministic factors, leading to flaky builds that undermine developers’ trust in CI, waste computational resources, and threaten the validity of CI-related empirical studies. In this paper, we present a large-scale empirical study of flaky builds in GitHub Actions based on rerun data from 1,960 open-source Java projects. Our results show that 3.2% of builds are rerun, and 67.73% of these rerun builds exhibit flaky behavior, affecting 1,055 (51.28%) of the projects. Through an in-depth failure analysis, we identify 15 distinct categories of flaky failures, among which flaky tests, network issues, and dependency resolution issues are the most prevalent. Building on these findings, we propose a machine learning-based approach for detecting flaky failures at the job level. Compared with a state-of-the-art baseline, our approach improves the F1-score by up to 20.3%.


💡 Research Summary

This paper presents a large‑scale empirical investigation of flaky builds in GitHub Actions and proposes a machine‑learning approach for detecting flaky failures at the job level. The authors collected CI data from 1,960 open‑source Java repositories, amounting to 4,861,768 builds and over 15 million jobs, by crawling the GitHub REST API between September 2023 and August 2024. Projects were selected using the SEART GitHub Search Engine, filtered for Java language, a minimum of 150 stars and 250 commits, and recent activity to ensure that build histories were retained (GitHub retains three months of logs by default).
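The collection step described above can be sketched against the GitHub REST API's workflow-runs endpoint. This is a simplified illustration, not the authors' actual crawler: `in_study_window` is a hypothetical helper for the September 2023 to August 2024 window, and real use would add authentication, rate-limit handling, and retries.

```python
import json
import urllib.request

def list_workflow_runs(owner, repo, token=None, per_page=100):
    """Yield workflow runs for a repository via the GitHub REST API
    (GET /repos/{owner}/{repo}/actions/runs). Simplified: no rate-limit
    handling or retry logic."""
    page = 1
    while True:
        url = (f"https://api.github.com/repos/{owner}/{repo}/actions/runs"
               f"?per_page={per_page}&page={page}")
        req = urllib.request.Request(
            url, headers={"Accept": "application/vnd.github+json"})
        if token:
            req.add_header("Authorization", f"Bearer {token}")
        with urllib.request.urlopen(req, timeout=30) as resp:
            runs = json.load(resp).get("workflow_runs", [])
        if not runs:
            return
        yield from runs
        page += 1

def in_study_window(run, start="2023-09", end="2024-08"):
    """Hypothetical filter keeping runs created in the study period,
    comparing the YYYY-MM prefix of the API's created_at field."""
    return start <= run["created_at"][:7] <= end
```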

A build is considered a “rerun” if its run_attempt field exceeds 1, and a “flaky job” if it exhibits both success and failure across its attempts. A “flaky build” contains at least one flaky job. Using these definitions, the study found that 155,488 builds (3.2% of all builds) were rerun, and 67.73% of the rerun builds were flaky. Consequently, 1,055 projects (51.28% of the sample) experienced flaky builds, highlighting a substantial reliability problem in CI pipelines.
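The definitions above translate directly into code. A minimal sketch, assuming builds carry the API's run_attempt field and each job maps to its list of per-attempt conclusions:

```python
def is_rerun(build):
    """A build counts as a rerun if its run_attempt exceeds 1."""
    return build["run_attempt"] > 1

def is_flaky_job(attempt_conclusions):
    """A job is flaky if it both succeeded and failed across attempts."""
    outcomes = set(attempt_conclusions)
    return "success" in outcomes and "failure" in outcomes

def is_flaky_build(jobs):
    """A build is flaky if at least one of its jobs is flaky.
    `jobs` maps job name -> list of per-attempt conclusions."""
    return any(is_flaky_job(conclusions) for conclusions in jobs.values())

# Example: the build job failed on attempt 1 and passed on attempt 2.
jobs = {"build": ["failure", "success"], "lint": ["success", "success"]}
assert is_flaky_build(jobs)
```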

To understand the underlying causes, the authors designed a semi‑automated log‑analysis pipeline. They manually inspected a seed set of failure logs, identified variable tokens (file paths, versions, timestamps), and replaced them with wildcards to create generalized regular‑expression patterns. These patterns were applied to the full set of flaky jobs, and similar patterns were grouped into higher‑level failure categories. Fifteen distinct categories emerged, with the most prevalent being flaky tests, network connectivity problems, and dependency‑resolution failures. Additional categories included file‑system permission errors, cache corruption, virtual‑machine image updates, and others. The authors performed iterative sampling (five failures per category per iteration) until thematic saturation was reached, ensuring a comprehensive root‑cause analysis.
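The token-generalization step can be illustrated with a few regular-expression substitutions. The patterns and wildcards below are illustrative stand-ins, not the authors' actual rules; the idea is that once variable tokens are masked, log lines from different runs collapse onto shared failure patterns that can be grouped and counted:

```python
import re
from collections import Counter

# Illustrative generalizers: each (regex, wildcard) pair masks one class
# of variable token. Order matters: timestamps and paths are masked
# before the looser version/number patterns.
GENERALIZERS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*"), "<TIMESTAMP>"),
    (re.compile(r"(/[\w.\-]+)+"), "<PATH>"),
    (re.compile(r"\b\d+(\.\d+)+\b"), "<VERSION>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def generalize(line):
    """Replace variable tokens in a log line with wildcards."""
    for pattern, wildcard in GENERALIZERS:
        line = pattern.sub(wildcard, line)
    return line

def group_failures(lines):
    """Cluster log lines by their generalized pattern."""
    return Counter(generalize(line) for line in lines)
```

For instance, `generalize("Failed to download /home/runner/.m2/foo-1.2.3.jar at 2024-05-01T12:00:00Z")` yields `"Failed to download <PATH> at <TIMESTAMP>"`, so the same dependency-download failure from different runs falls into one bucket.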

Building on the empirical findings, the paper proposes a machine‑learning model to predict flaky failures before they manifest. Features were extracted from job‑level metadata (e.g., start/end timestamps, runner OS, Docker image, previous success/failure counts) and from textual log content (using TF‑IDF and n‑gram representations). The classifier is a Gradient Boosting Decision Tree (GBDT) with class‑imbalance handling via SMOTE and cost‑sensitive learning. The model was evaluated against a strong baseline from recent literature that relies on rule‑based detection of dependency and test failures. Results show an improvement of up to 20.3% in F1‑score, demonstrating that the learned model can more accurately flag flaky jobs, thereby reducing unnecessary reruns, saving computational resources, and shortening developer waiting time.
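A rough sketch of such a pipeline, using scikit-learn rather than the authors' exact stack: balanced sample weights stand in for the paper's SMOTE/cost-sensitive handling, and the feature names and toy training data are purely illustrative.

```python
# Sketch only: TF-IDF n-grams over log text concatenated with numeric
# job metadata, fed to a gradient-boosted tree classifier.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils.class_weight import compute_sample_weight

def build_features(logs, metadata, vectorizer=None):
    """logs: list of job-log strings; metadata: (n, k) numeric rows."""
    if vectorizer is None:
        vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=500)
        text = vectorizer.fit_transform(logs)
    else:
        text = vectorizer.transform(logs)
    return np.hstack([text.toarray(), np.asarray(metadata)]), vectorizer

# Toy labels: 1 = flaky failure, 0 = legitimate failure.
logs = ["connection timed out while downloading artifact",
        "assertion failed expected 3 but was 4",
        "could not resolve dependency connection reset",
        "compilation error missing symbol"]
meta = [[1, 120], [0, 30], [1, 200], [0, 45]]  # e.g. prior reruns, duration (s)
y = [1, 0, 1, 0]

X, vec = build_features(logs, meta)
clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
# Balanced sample weights approximate cost-sensitive learning.
clf.fit(X, y, sample_weight=compute_sample_weight("balanced", y))

X_new, _ = build_features(["read timed out fetching dependency"], [[1, 180]], vec)
pred = clf.predict(X_new)
```

In practice the text and metadata features would be far richer, and SMOTE (e.g., from imbalanced-learn) would resample the minority class before training rather than reweighting it.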

The authors contribute two publicly released datasets: (1) the full CI build dataset for the 1,960 Java projects, and (2) a curated flaky‑failure dataset derived from ten projects with detailed failure categories and root‑cause annotations. These resources enable future work on CI reliability, flaky‑test detection, and build outcome prediction.

Limitations include the focus on Java projects, which may not generalize to other ecosystems (e.g., Python, JavaScript), and the three‑month log retention policy of GitHub Actions that may bias the sample toward more recent activity. Moreover, the quality of the failure‑pattern extraction and labeling directly impacts model performance; misclassifications in the training data could propagate errors.

Future directions suggested by the authors involve extending the study to multi‑language repositories, integrating the detection model into a real‑time GitHub Actions plugin, and leveraging developer feedback loops to refine labeling and improve model robustness. Overall, the paper provides a thorough quantification of flaky builds in a widely used CI platform, a taxonomy of their causes, and a practical, data‑driven solution to mitigate their impact on software development workflows.

