Can We Classify Flaky Tests Using Only Test Code? An LLM-Based Empirical Study
Flaky tests yield inconsistent results when they are repeatedly executed on the same code revision. They interfere with automated quality assurance of code changes and hinder efficient software testing. Previous work evaluated approaches to train machine learning models to classify flaky tests based on identifiers in the test code. However, the resulting classifiers have been shown to lack generalizability, hindering their applicability in practical environments. Recently, pre-trained Large Language Models (LLMs) have shown the capability to generalize across various tasks. Thus, they represent a promising approach to address the generalizability problem of previous approaches. In this study, we evaluated three LLMs (two general-purpose models, one code-specific model) using three prompting techniques on two benchmark datasets from prior studies on flaky test classification. Furthermore, we manually investigated 50 samples from the given datasets to determine whether classifying flaky tests based only on test code is feasible for humans. Our findings indicate that LLMs struggle to classify flaky tests given only the test code. The results of our best prompt-model combination were only marginally better than random guessing. In our manual analysis, we found that the test code does not necessarily contain sufficient information for a flakiness classification. Our findings motivate future work to evaluate LLMs for flakiness classification with additional context, for example, using retrieval-augmented generation or agentic AI.
💡 Research Summary
Background and Motivation
Flaky tests—tests that produce inconsistent outcomes on the same code revision—pose a serious threat to continuous integration pipelines, developer confidence, and overall testing efficiency. Prior work has attempted to predict flakiness directly from test code using machine‑learning classifiers (e.g., bag‑of‑words + random forest). While these models achieve high scores on static benchmark splits, they fail to generalize across projects or future revisions because they over‑fit to the lexical vocabulary of the training set. Recent studies have explored fine‑tuned large language models (LLMs) such as CodeBERT, but some of those results suffered from data leakage, leaving the true potential of LLMs unclear.
Research Questions
The authors formulate three questions:
- How well do off‑the‑shelf LLMs (without any task‑specific fine‑tuning) classify flaky vs. non‑flaky tests when only the test code is provided? They explore three prompting strategies: zero‑shot, zero‑shot with chain‑of‑thought (CoT), and few‑shot with CoT.
- What degree of non‑determinism (output variability) is observed when the same model is run repeatedly on the same inputs?
- Can humans reliably label flakiness based solely on test code?
Methodology
Three LLMs are evaluated: GPT‑4o (proprietary, high‑capacity), GPT‑OSS‑120b (open‑source), and Qwen3‑Coder‑480b (code‑focused). All models are run with temperature 0 (greedy decoding) to suppress stochasticity. The experiments use two publicly available datasets:
- IDoFT – 3,813 Java/Python tests (85 % flaky).
- FlakeBench – 8,574 Java tests (3 % flaky).
Four metrics are reported: weighted precision, weighted recall, weighted F1, and the Matthews Correlation Coefficient (MCC), which accounts for class imbalance. Baselines include an "always flaky" classifier and a random 50 % guess. For the few‑shot setting, six labeled examples (three flaky, three non‑flaky) are embedded in the prompt, together with three explicit reasoning steps per example (semantic description, critical‑line analysis, final verdict).
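The value of MCC on imbalanced data is easy to see in code: a constant "always flaky" predictor can score high weighted recall, but its MCC is zero. A minimal pure-Python sketch (the standard binary MCC formula, not the authors' evaluation code):

```python
import math

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (1 = flaky)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # A zero denominator means at least one row/column of the confusion
    # matrix is empty (e.g. a constant predictor); MCC is 0 by convention.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

y_true = [1, 1, 1, 0, 1, 0]
always_flaky = [1] * len(y_true)
print(mcc(y_true, always_flaky))  # 0.0: a constant predictor carries no signal
```

This is why MCC ≈ 0 in the results below signals "no better than a trivial baseline" even when the weighted F1 looks respectable.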
To assess non‑determinism, the authors repeat each experiment and compute the normalized Hamming distance between the resulting binary label vectors.
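The normalized Hamming distance used here is simply the fraction of tests whose predicted label flips between two runs. A minimal sketch:

```python
def normalized_hamming(labels_a, labels_b):
    """Fraction of positions where two equal-length binary label vectors disagree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    return sum(a != b for a, b in zip(labels_a, labels_b)) / len(labels_a)

run_1 = [1, 0, 1, 1, 0, 0, 1, 0]
run_2 = [1, 0, 0, 1, 0, 1, 1, 0]
print(normalized_hamming(run_1, run_2))  # 0.25: two of eight labels flipped
```

A distance of 0 would mean perfectly repeatable predictions; the 0.07–0.12 reported below means that 7–12 % of labels change between identical runs.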
For the human study, 50 test samples from the two datasets are manually inspected by the authors, who record whether the code alone provides enough clues to decide flakiness.
Results
- LLM Performance – Across all model‑prompt combinations, performance hovers just above random guessing. The best configuration (GPT‑4o with few‑shot CoT on IDoFT) yields a weighted F1 of ~0.55 and an MCC of ≈ 0.04, barely above the random baselines (F1 ≈ 0.5, MCC ≈ 0). On the highly imbalanced FlakeBench, all configurations collapse to near‑random performance (F1 ≈ 0.10, MCC ≈ 0.00). Adding CoT reasoning or few‑shot demonstrations provides only marginal gains; the lexical signal in the test code alone is insufficient for the models to infer flakiness.
- Non‑Determinism – Even with temperature 0, repeated runs produce label vectors that differ by a normalized Hamming distance of 0.07–0.12, indicating that the models are not fully deterministic under the same prompt and hardware conditions. This variability would complicate any production deployment where consistent predictions are required.
- Human Judgment – The manual inspection reveals that many flaky tests rely on external factors (e.g., utility functions, timing, environment configuration) not visible in the test body. Consequently, human annotators also struggle to label many samples correctly based solely on the code snippet. Only a subset of flakiness types—such as unordered collection usage or explicit sleep calls—are readily identifiable.
Discussion and Implications
The study demonstrates that “test‑code‑only” flakiness classification is currently beyond the reach of both off‑the‑shelf LLMs and human experts. The authors argue that the missing contextual information (execution logs, dependency versions, recent commits, runtime environment) is essential for accurate detection. They propose several avenues for future work:
- Retrieval‑Augmented Generation (RAG) – Dynamically fetch relevant execution artifacts (e.g., recent failure logs, stack traces) and feed them to the LLM alongside the test code.
- Agentic AI – Deploy autonomous agents that can query version control, run the test in a sandbox, and synthesize observations before making a flakiness decision.
- Dataset Enrichment – Extend existing benchmarks with richer annotations (e.g., flakiness root‑cause taxonomy, environmental dependencies) to enable more nuanced learning.
- Determinism Strategies – Explore decoding techniques (multiple‑sample voting, temperature‑0 with top‑k=1) to reduce output variance.
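The multiple-sample voting idea above can be sketched in a few lines: sample the classifier several times and keep the majority label. The label strings and tie-breaking rule are illustrative assumptions.

```python
from collections import Counter

def majority_vote(predictions):
    """Reduce several sampled classifications to one label.

    Ties default to "not flaky" (an assumed, conservative tie-break).
    """
    counts = Counter(predictions)
    return "flaky" if counts["flaky"] > counts["not flaky"] else "not flaky"

# Three samples from the same prompt; the lone disagreement is voted away.
print(majority_vote(["flaky", "not flaky", "flaky"]))  # flaky
```

Voting trades extra inference cost for lower output variance; it mitigates, but does not eliminate, the non-determinism measured above.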
Conclusion
The empirical evidence presented confirms that large language models, when used without fine‑tuning and supplied only with the test source, cannot reliably distinguish flaky from stable tests. The marginal improvement over random guessing, coupled with observable non‑determinism, suggests that test code alone lacks the discriminative signal required for this task. Future research must therefore incorporate additional contextual data—through retrieval‑augmented pipelines, agentic execution, or richer labeled datasets—to unlock the potential of LLMs for flaky‑test detection in real‑world software engineering environments.