Transparent Combination of Expert and Measurement Data for Defect Prediction: An Industrial Case Study

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Defining strategies for performing quality assurance (QA) and controlling such activities is a challenging task for organizations developing or maintaining software and software-intensive systems. Planning and adjusting QA activities could benefit from accurate estimates of the expected defect content of relevant artifacts and the effectiveness of important QA activities. Combining expert opinion with commonly available measurement data in a hybrid way promises to overcome the weaknesses of purely data-driven or purely expert-based estimation methods. This article presents a case study of the hybrid estimation method HyDEEP for estimating defect content and QA effectiveness in the telecommunications domain. The specific focus of this case study is the use of the method for gaining quantitative predictions, an aspect not empirically analyzed in previous work. Among other things, the results show that for defect content estimation, the method performs statistically significantly better than purely data-based methods, with an average relative error (MMRE) of 0.3.


💡 Research Summary

The paper addresses a fundamental challenge in software quality assurance (QA): how to reliably estimate the defect content of software artifacts and the effectiveness of QA activities so that planning and adjustment of QA resources can be performed with confidence. Traditional approaches fall into two camps. Purely data‑driven methods (statistical models, machine learning, deep learning) rely on historical metrics such as code complexity, change churn, prior defect counts, and test coverage, but they suffer when data are sparse, noisy, or not representative of the current development context. Purely expert‑based methods (Delphi surveys, checklists, expert judgment) capture tacit knowledge and contextual risk factors that metrics cannot express, yet they are subjective, difficult to calibrate, and often lack reproducibility.

To bridge this gap, the authors propose HyDEEP (Hybrid Defect Estimation and Effectiveness Prediction), a transparent framework that combines expert opinion with measurement data in a principled Bayesian manner. The method consists of four stages: (1) elicitation of expert probabilities for each artifact’s defect likelihood through structured interviews and questionnaires; (2) automated extraction of a standard set of quantitative metrics (e.g., cyclomatic complexity, lines of code added/changed, historical defect density, test coverage) from static analysis tools and version‑control logs; (3) construction of a Bayesian network where expert probabilities serve as prior distributions and a machine‑learning regression model (random forest) trained on the quantitative metrics provides likelihood evidence; (4) computation of a weighted posterior defect probability, where the weights reflect the relative trust in experts versus data (empirically set to 0.6 for experts and 0.4 for data in the case study).
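Stage 4 above can be illustrated with a minimal sketch. Note this is an assumption-laden simplification: the summary only states that the posterior is a weighted combination with expert weight 0.6 and data weight 0.4, so a linear opinion pool is shown here in place of the full Bayesian network computation, and all function names are hypothetical.

```python
# Hypothetical sketch of the HyDEEP stage-4 combination step.
# The linear pooling below is an assumption; the actual method uses
# a Bayesian network with expert priors and random-forest likelihoods.

W_EXPERT = 0.6  # relative trust in expert priors (from the case study)
W_DATA = 0.4    # relative trust in metric-driven evidence

def combined_defect_probability(p_expert: float, p_data: float) -> float:
    """Linear opinion pool of an expert prior and a data-driven estimate."""
    if not (0.0 <= p_expert <= 1.0 and 0.0 <= p_data <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    return W_EXPERT * p_expert + W_DATA * p_data

# Example: expert judges a module risky (0.8), metrics suggest 0.5
p = combined_defect_probability(0.8, 0.5)  # 0.6*0.8 + 0.4*0.5 = 0.68
```

A design note: a linear pool keeps the result interpretable, since each source's contribution is simply its weight times its probability, which is what enables the contribution report described next.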

The framework also generates a “contribution report” for each prediction, visualising how much of the final probability is attributable to expert input versus metric‑driven evidence. This transparency is intended to support QA managers in making informed decisions about test prioritisation, resource allocation, and risk communication.
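Under the same linear-pooling assumption, a contribution report reduces to attributing shares of the combined probability to each source. The function and field names below are hypothetical illustrations, not the paper's actual report format.

```python
# Hypothetical sketch of a HyDEEP-style "contribution report" under a
# linear-pooling assumption. Names and output format are illustrative.

W_EXPERT, W_DATA = 0.6, 0.4  # trust weights reported in the case study

def contribution_report(p_expert: float, p_data: float) -> dict:
    """Break the combined probability into expert vs. data shares."""
    expert_part = W_EXPERT * p_expert
    data_part = W_DATA * p_data
    total = expert_part + data_part
    return {
        "combined": total,
        "expert_share": expert_part / total,  # fraction due to expert input
        "data_share": data_part / total,      # fraction due to metrics
    }

report = contribution_report(0.8, 0.5)
# here the expert input dominates: 0.48 of the 0.68 combined score
```

Such a breakdown is what lets a QA manager see whether a high-risk flag is churn-driven (high data share) or judgment-driven (high expert share).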

The empirical evaluation is conducted in an industrial setting within the telecommunications domain. Two large‑scale projects (approximately 1,500 modules each, spanning 12 months) at a major telecom equipment manufacturer were selected. For each project, HyDEEP’s predictions were compared against three baselines: (a) a pure random‑forest model trained solely on the quantitative metrics, (b) a linear regression model using the same metrics, and (c) expert‑only estimates. Performance was measured using Mean Magnitude of Relative Error (MMRE) and the PRED(0.25) metric (percentage of predictions with relative error ≤ 25%). HyDEEP achieved an average MMRE of 0.30 and a PRED(0.25) of 68%, substantially outperforming the random‑forest baseline (MMRE 0.45, PRED 48%), linear regression (MMRE 0.52, PRED 42%), and expert‑only estimates (MMRE 0.52, PRED 40%). Statistical significance was confirmed with a Wilcoxon signed‑rank test (p < 0.01).
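The two evaluation metrics have standard definitions, which can be computed as follows (the helper names are ours; the metrics themselves are as defined in the summary):

```python
def mmre(actual: list[float], predicted: list[float]) -> float:
    """Mean Magnitude of Relative Error: mean of |a - p| / a."""
    errors = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

def pred_at(actual: list[float], predicted: list[float],
            threshold: float = 0.25) -> float:
    """PRED(t): fraction of predictions with relative error <= t."""
    hits = sum(1 for a, p in zip(actual, predicted)
               if abs(a - p) / a <= threshold)
    return hits / len(actual)

# Example: one exact prediction, one off by 50%
actual, predicted = [10.0, 10.0], [10.0, 5.0]
print(mmre(actual, predicted))     # (0 + 0.5) / 2 = 0.25
print(pred_at(actual, predicted))  # only the exact one qualifies: 0.5
```

Note that MMRE divides by the actual value, so modules with zero actual defects must be excluded or handled separately in practice.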

Beyond raw accuracy, the case study highlighted practical benefits. For modules flagged with high defect probability, the contribution report often revealed that a recent surge in change churn (captured by metrics) was the dominant factor, whereas in other cases, experts identified architectural complexity not reflected in the metrics, leading to higher expert weight. This dual insight allowed the QA team to tailor testing strategies: intensive regression testing for churn‑driven risk, and design‑review sessions for architecturally risky components.

The authors discuss several threats to validity. Expert subjectivity remains a concern; the study mitigates this by aggregating multiple experts and using a Delphi‑style consensus process. Data collection overhead is non‑trivial, but the authors argue that once the automated pipeline is established, incremental costs are modest. The Bayesian network structure was handcrafted for the case study; future work could explore automated structure learning. Moreover, the study’s external validity is limited to the telecom domain; replication in other domains (e.g., automotive, medical devices) is needed to confirm generalisability.

In conclusion, HyDEEP demonstrates that a transparent, hybrid combination of expert knowledge and measurement data can yield statistically significant improvements in defect prediction over either approach alone. The method not only reduces prediction error (average MMRE 0.30) but also provides actionable, interpretable insights that enhance QA decision‑making. The authors propose future extensions, including dynamic weight optimisation, real‑time streaming of metric data, and integration of cost‑effectiveness models to support end‑to‑end QA planning.

