TDSelector: A Training Data Selection Method for Cross-Project Defect Prediction
📝 Abstract
In recent years, cross-project defect prediction (CPDP) has attracted much attention and has been validated as a feasible way to address the problem of local data sparsity in newly created or inactive software projects. Unfortunately, the performance of CPDP is usually poor, and low-quality training data selection has been regarded as a major obstacle to achieving better prediction results. To the best of our knowledge, most existing approaches related to this topic are based only on instance similarity. Therefore, the objective of this work is to propose an improved training data selection method for CPDP that considers both similarity and the number of defects each training instance has (denoted by defects), which is referred to as TDSelector, and to demonstrate the effectiveness of the proposed method. Our experiments were conducted on 14 projects (including 15 data sets) collected from two public repositories. The results indicate that, in a specific CPDP scenario, the TDSelector-based bug predictor performs, on average, better than those based on the baseline methods, and the AUC (area under ROC curve) values are increased by up to 10.6% and 4.3%, respectively. Besides, an additional experiment shows that directly selecting those instances with more bugs as training data can further improve the performance of the bug predictor trained by our method.
📄 Content
Peng Hea, Yutao Mab,d,∗, Bing Lic,d
a School of Computer Science and Information Engineering, Hubei University, Wuhan 430062, China
b State Key Laboratory of Software Engineering, Wuhan University, Wuhan 430072, China
c International School of Software, Wuhan University, Wuhan 430079, China
d Research Center for Complex Network, Wuhan University, Wuhan 430072, China
Abstract:
Context: In recent years, cross-project defect prediction (CPDP) has attracted much attention and has been validated as a feasible way to address the problem of local data sparsity in newly created or inactive software projects. Unfortunately, the performance of CPDP is usually poor, and low-quality training data selection has been regarded as a major obstacle to achieving better prediction results. To the best of our knowledge, most existing approaches related to this topic are based only on instance similarity.
Objective: The objective of this work is to propose an improved training data selection method for CPDP that considers both similarity and the number of defects each training instance has (denoted by defects), which is referred to as TDSelector, and to demonstrate the effectiveness of the proposed method.
Method: First, TDSelector is constructed in terms of a linear weighted function of similarity and defects. Second, the basic defect predictor used in our experiments was built using the Logistic Regression classification algorithm. Third, we analyzed the impacts of different combinations of similarity and normalization of defects on prediction performance, and then compared our method (with the best combination) against two competing methods using statistical tests.
Results: Our experiments were conducted on 14 projects (including 15 data sets) collected from two public repositories. The results indicate that, in a specific CPDP scenario, the TDSelector-based bug predictor performs, on average, better than those based on the baseline methods, and the AUC (area under ROC curve) values are increased by up to 10.6% and 4.3%, respectively. Besides, an additional experiment shows that directly selecting those instances with more bugs as training data can further improve the performance of the bug predictor trained by our method.
Conclusion: The findings suggest that (1) the inclusion of defects is indeed helpful for selecting high-quality training instances for CPDP, so as to improve prediction performance, and (2) the combination of Euclidean distance and linear normalization is the preferred configuration for TDSelector.
Keywords: cross-project defect prediction, training data selection, empirical study, number of defects, scoring scheme.
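The core idea above — scoring each candidate training instance by a linear weighted combination of its similarity to the target project and its (linearly normalized) defect count, using Euclidean distance as the similarity basis — can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the weight name `alpha`, the distance-to-similarity mapping `1 / (1 + d)`, and the use of a mean metric vector to summarize the target project are all assumptions for the sketch.

```python
import numpy as np

def score_candidates(target_mean, candidates, defects, alpha=0.5):
    """Rank candidate training instances by a linear weighted function
    of similarity and defect count, in the spirit of TDSelector.

    target_mean: 1-D metric vector summarizing the target project
                 (an assumed summarization, not from the paper)
    candidates:  2-D array, one row of software metrics per candidate
    defects:     1-D array, number of defects per candidate instance
    alpha:       assumed name for the similarity/defects trade-off weight
    """
    # Euclidean distance to the target, mapped into a similarity in (0, 1]
    dist = np.linalg.norm(candidates - target_mean, axis=1)
    similarity = 1.0 / (1.0 + dist)

    # Linear (min-max) normalization of defect counts into [0, 1]
    span = defects.max() - defects.min()
    if span > 0:
        norm_defects = (defects - defects.min()) / span
    else:
        norm_defects = np.zeros_like(defects, dtype=float)

    # Linear weighted scoring function combining both factors
    return alpha * similarity + (1 - alpha) * norm_defects

# Toy usage: pick the top-2 scored instances as training data
target = np.array([1.0, 2.0])
cand = np.array([[1.0, 2.0], [5.0, 5.0], [1.1, 2.1]])
bugs = np.array([3, 10, 0])
scores = score_candidates(target, cand, bugs)
top2 = np.argsort(scores)[::-1][:2]
```

The selected instances would then be fed to a standard classifier (the paper uses Logistic Regression) to build the cross-project predictor; with `alpha = 0.5` both factors contribute equally, so a very similar instance with few bugs can still outrank a distant one with many.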
1 Introduction

The past decades have witnessed the development of software defect prediction, which is now one of the most active research topics in the field of Software Engineering. Due to the lack of training data available on the Internet, most early studies trained predictors (also known as prediction models) on the historical defect/bug data of the same software project and predicted defects in its upcoming release versions [1]. This type of approach is referred to as Within-Project Defect Prediction (WPDP). However, WPDP has an obvious drawback when a newly created or inactive project has little historical data on defects.

To address the above issue, researchers in this field have attempted to apply defect predictors built for one project to other projects [2-4]. This type of method is termed Cross-Project Defect Prediction (CPDP). The main purpose of CPDP is to predict defect-prone instances (such as classes) in a project based on the defect data collected from other projects in public software repositories such as PROMISE. The feasibility and potential usefulness of cross-project predictors built with a number of software metrics have been validated [1, 3, 5, 6], but how to improve the performance of CPDP models is still an open issue.

Peters et al. [5] argued that selecting appropriate training data from a software repository has become a major issue for CPDP. Moreover, some researchers have suggested that the success rate of CPDP models could be drastically improved when using a suitable training data set [1, 7]. That is to say, the selection of high-quality training data could be a key breakthrough on the above issue. On the other hand, there is no doubt that the labeled defect data available on the Internet will continue to grow steadily. Thus, the construction of an appropriate training data set gathered from a large number of projects in public software repositories is indeed a challenge for CPDP [7]. As far as we know, although previous studies on CPDP

∗Corresponding author. Tel.: +86 27 68776081. E-mail: penghe@hubu.edu.cn (P. He), {ytma (Y. Ma), bingli (B. Li)}@whu.edu.cn