Evaluating software defect prediction performance: an updated benchmarking study
Accurately predicting defective software units helps practitioners prioritize their quality-assurance effort. Prior studies use machine-learning models to detect faulty code. We revisit these studies, identify potential improvements, and propose a revised benchmarking configuration. The configuration adds several previously under-explored dimensions: class-distribution sampling, evaluation metrics, and testing procedures. The study also includes new datasets and models. Our findings suggest that predictive accuracy is generally good, but that predictive power depends heavily on the choice of evaluation metric and testing procedure (frequentist or Bayesian). Classifier results also vary across software projects. Because no single classifier is clearly best, researchers should consider multiple dimensions to avoid potential bias.
💡 Research Summary
This paper revisits the field of software defect prediction (SDP) and addresses several methodological shortcomings that have limited the reliability and practical relevance of prior benchmarking studies. The authors propose an updated benchmarking configuration that expands the experimental design along three previously under‑explored dimensions: class‑distribution sampling, evaluation metrics, and testing procedures.
First, the study introduces a comprehensive sampling strategy that combines over‑sampling (SMOTE), under‑sampling, and hybrid approaches. By applying each strategy to the same set of projects, the authors quantify how sampling influences model performance and reveal that single‑method sampling can lead to inflated results.
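To make the two basic strategies concrete, here is a minimal sketch of random over-sampling and random under-sampling in plain Python. Note this is only an illustration of the balancing idea: the paper's SMOTE strategy synthesizes new minority examples by interpolation rather than duplicating existing ones, and the function names here are ours.

```python
import random

def oversample_minority(X, y, minority=1, seed=0):
    """Random over-sampling: duplicate minority examples until classes balance."""
    rng = random.Random(seed)
    maj = [(x, l) for x, l in zip(X, y) if l != minority]
    mino = [(x, l) for x, l in zip(X, y) if l == minority]
    while len(mino) < len(maj):
        mino.append(rng.choice(mino))          # duplicate a random minority example
    data = maj + mino
    rng.shuffle(data)
    Xb, yb = zip(*data)
    return list(Xb), list(yb)

def undersample_majority(X, y, minority=1, seed=0):
    """Random under-sampling: drop majority examples until classes balance."""
    rng = random.Random(seed)
    maj = [(x, l) for x, l in zip(X, y) if l != minority]
    mino = [(x, l) for x, l in zip(X, y) if l == minority]
    maj = rng.sample(maj, len(mino))           # keep only as many majority as minority
    data = maj + mino
    rng.shuffle(data)
    Xb, yb = zip(*data)
    return list(Xb), list(yb)
```

A hybrid strategy in the paper's sense combines both directions, e.g. over-sampling the minority part of the way and under-sampling the majority the rest.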
Second, the evaluation framework moves beyond the usual reliance on accuracy and AUC. In addition to these, the authors compute F1‑Score, Matthews Correlation Coefficient (MCC), and a cost‑sensitive metric that explicitly balances the cost of missed defects against the cost of false alarms. This multi‑metric view uncovers trade‑offs that are invisible when only a single metric is reported.
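The metrics above all derive from the confusion matrix, so a small sketch shows how they differ. The cost ratio in `expected_cost` is illustrative (the paper's actual cost-sensitive metric and its weights are not specified here); the 5:1 ratio simply encodes that a missed defect is assumed costlier than a false alarm.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 = defective."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def f1_score(tp, tn, fp, fn):
    """Harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient: balanced even under class skew."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def expected_cost(tp, tn, fp, fn, c_fn=5.0, c_fp=1.0):
    """Average misclassification cost; c_fn > c_fp penalizes missed defects
    more heavily than false alarms (the 5:1 ratio here is an assumption)."""
    n = tp + tn + fp + fn
    return (c_fn * fn + c_fp * fp) / n
```

Two models with identical accuracy can diverge sharply on MCC or expected cost, which is exactly the trade-off a single-metric evaluation hides.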
Third, the testing procedure is dual‑layered. Traditional frequentist hypothesis testing is performed using the Friedman‑Nemenyi test to detect statistically significant differences across classifiers. Complementarily, a Bayesian signed‑rank analysis provides posterior probabilities of superiority, effect sizes, and uncertainty intervals. The authors demonstrate that the two approaches can lead to divergent conclusions, and they argue that Bayesian results are more informative for practitioners who need probabilistic guidance.
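The Friedman test ranks the classifiers on each dataset and asks whether the average ranks differ more than chance would allow. As a minimal sketch, here is the test statistic itself; comparing it against a χ² distribution and running the Nemenyi post-hoc comparison are omitted.

```python
def friedman_statistic(scores):
    """Friedman chi-squared statistic.
    scores: N rows (one per dataset), each with k classifier scores;
    higher is better. Ranks are 1..k per row, with ties given average rank."""
    N, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])   # best first
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            # extend j over a run of tied scores
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1                          # average 1-based rank
            for m in range(i, j + 1):
                ranks[order[m]] = avg
            i = j + 1
        for j2 in range(k):
            rank_sums[j2] += ranks[j2]
    return 12.0 / (N * k * (k + 1)) * sum(r * r for r in rank_sums) \
        - 3.0 * N * (k + 1)
```

A statistic near zero means the classifiers' ranks are indistinguishable; a large value triggers the post-hoc pairwise comparison.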
The empirical evaluation uses an expanded dataset collection. In addition to the classic NASA, PROMISE, and AEEEM repositories, ten recent open‑source projects (e.g., Apache Hadoop, Eclipse, Spring) are added, yielding a total of over 30,000 modules across diverse domains and sizes. All projects undergo identical preprocessing (missing‑value handling, normalization, feature selection) and a 5‑fold cross‑validation protocol.
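The 5-fold protocol partitions each project's modules into five disjoint folds, training on four and testing on the held-out one. A minimal index-level sketch (function name ours; the paper's preprocessing steps are applied before the split and are not shown):

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and yield (train, test) index lists for k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]      # k near-equal disjoint folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test
```

Each module appears in exactly one test fold, so every prediction used for evaluation comes from a model that never saw that module during training.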
A broad set of classifiers is examined: logistic regression, random forest, support vector machine, gradient boosting, a transformer‑based code embedding model, and a graph neural network (GNN) that captures structural dependencies in source code. Results show that while most models achieve high AUC (>0.75), their performance on other metrics varies considerably. Hybrid sampling produces the most balanced cost‑sensitive scores, whereas pure over‑sampling often inflates recall at the expense of precision.
Statistical analysis reveals that, under frequentist testing, random forest and GNN are not significantly different, but Bayesian analysis assigns a 78 % posterior probability that GNN outperforms random forest, with a moderate effect size. Moreover, performance is project‑dependent: GNN excels on large, highly modular systems (e.g., Hadoop), while ensemble methods dominate on smaller codebases.
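The paper's Bayesian signed-rank analysis produces a posterior probability that one classifier outperforms another. As a rough, non-equivalent illustration of that idea, a plain bootstrap over paired per-project metric differences estimates a comparable probability of superiority (this is not the Bayesian signed-rank procedure itself, and the function name is ours):

```python
import random

def prob_superiority(diffs, n_boot=2000, seed=0):
    """Bootstrap estimate of P(mean paired difference > 0).
    diffs: per-project metric differences (model A minus model B)."""
    rng = random.Random(seed)
    n = len(diffs)
    wins = 0
    for _ in range(n_boot):
        sample = [rng.choice(diffs) for _ in range(n)]  # resample with replacement
        if sum(sample) / n > 0:
            wins += 1
    return wins / n_boot
```

Unlike a p-value, this kind of probability directly answers the practitioner's question ("how likely is A better than B?"), which is the paper's argument for the Bayesian layer.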
The discussion emphasizes three practical take‑aways. (1) Metric selection must align with the organization’s risk profile; cost‑sensitive metrics are essential when false positives and false negatives have asymmetric impacts. (2) Employing both frequentist and Bayesian testing yields a more robust assessment, with Bayesian posterior probabilities offering clearer decision support. (3) No single classifier dominates across all contexts; meta‑learning or automated model selection pipelines are recommended to adapt to project‑specific characteristics.
Threats to validity include potential label noise in defect logs, subjectivity in feature engineering, and variability in computational environments. The authors suggest future work on standardized benchmarking platforms, meta‑learning approaches, and real‑time defect prediction integration into continuous integration pipelines.
In conclusion, the paper delivers a rigorously designed, multi‑dimensional benchmarking framework that substantially improves the reproducibility and practical relevance of software defect prediction research. By systematically varying sampling, metrics, and statistical testing, the study provides actionable guidance for both researchers and practitioners seeking to deploy reliable defect prediction models in real‑world software development.