DBBRBF- Convalesce optimization for software defect prediction problem using hybrid distribution base balance instance selection and radial basis Function classifier

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the Original ArXiv Source.

Software has become an indispensable part of human life, and the rapid development of software engineering demands that software be highly reliable. Reliability can be checked through efficient software testing methods that use historical software prediction data to develop quality software systems. Machine learning plays a vital role in optimizing the prediction of defect-prone modules in real-life software. Software defect prediction data suffer from a class imbalance problem, with a low ratio of the defective class to the non-defective class, which degrades classification performance and calls for an efficient machine learning classification technique. To alleviate this problem, this paper introduces a novel hybrid instance-based classification model (DBBRBF) that combines distribution base balance instance selection with a radial basis function neural network classifier to obtain better predictions than existing approaches. Class-imbalanced data sets from NASA, PROMISE, and Softlab were used for the experimental analysis. The experimental results, in terms of Accuracy, F-measure, AUC, Recall, Precision, and Balance, show the effectiveness of the proposed approach. Finally, statistical significance tests are carried out to assess the suitability of the proposed model.


💡 Research Summary

The paper addresses the pervasive class‑imbalance issue in software defect prediction by introducing a hybrid approach that combines a novel instance‑selection technique, Distributed‑Based Balancing (DBB), with a Radial Basis Function Neural Network (RBFNN) classifier, referred to as DBBRBF. The authors argue that traditional machine‑learning models suffer from degraded performance when the defective class is severely under‑represented, and that existing resampling methods such as SMOTE either increase computational cost or fail to preserve the original data distribution.

DBB operates in four steps: (1) it oversamples minority classes using a Poisson‑based approximation, (2) undersamples majority classes to a target count b, (3) removes overlapping regions between classes, and (4) fully balances all classes. To generate synthetic instances, a Gaussian distribution is employed, with a fixed quota of 30 new samples per class per iteration. This design aims to retain statistical properties while achieving rapid sampling.
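The paper's full DBB procedure (including the Poisson-based approximation and overlap removal) is not reproduced here, but its core balancing idea can be sketched in a few lines. The function names `gaussian_synthetic` and `balance_classes` are illustrative, not the authors' code: small classes are grown with Gaussian synthetic instances fitted to per-feature statistics, and large classes are randomly undersampled to the target count b.

```python
import random
import statistics

def gaussian_synthetic(samples, n_new, rng):
    # Fit a per-feature Gaussian to the existing class samples and
    # draw synthetic instances from it (a sketch of DBB's Gaussian
    # synthetic-instance generation; the paper uses a quota of 30
    # per class per iteration).
    dims = list(zip(*samples))
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) or 1e-9 for d in dims]
    return [[rng.gauss(m, s) for m, s in zip(means, stds)]
            for _ in range(n_new)]

def balance_classes(classes, target, rng=None):
    # classes: dict mapping label -> list of feature vectors.
    # Oversample small classes with Gaussian synthetics and
    # undersample large classes down to `target` instances each,
    # so every class ends up with exactly `target` instances.
    rng = rng or random.Random(0)
    balanced = {}
    for label, samples in classes.items():
        if len(samples) < target:
            samples = samples + gaussian_synthetic(
                samples, target - len(samples), rng)
        elif len(samples) > target:
            samples = rng.sample(samples, target)
        balanced[label] = samples
    return balanced
```

Because the synthetic instances are drawn from a Gaussian fitted to the class itself, the statistical properties of the class are approximately preserved, which is the design goal stated above.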

The classification component is an RBFNN with three layers: a linear input layer, a hidden layer of Gaussian radial basis functions, and a linear output layer. Hidden‑layer centers and spreads are initialized via K‑means clustering. Weight updates are performed using the Moore‑Penrose pseudo‑inverse, providing a closed‑form solution that avoids learning‑rate tuning, epoch selection, and local‑optimum traps. The network uses a single hidden layer, relying on the universal approximation theorem to guarantee sufficient expressive power given enough neurons.
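The closed-form training described above can be sketched compactly. This is a minimal illustration, not the authors' implementation: the centers here are simply the training points (the paper initializes them via K-means), and the output weights are obtained with the Moore-Penrose pseudo-inverse, so no learning rate or epoch count is needed.

```python
import numpy as np

def rbf_design_matrix(X, centers, spread):
    # Gaussian hidden-layer activations:
    # phi[i, j] = exp(-||x_i - c_j||^2 / (2 * spread^2))
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * spread ** 2))

def fit_rbf(X, y, centers, spread):
    # Closed-form output weights via the Moore-Penrose pseudo-inverse,
    # avoiding gradient descent entirely.
    Phi = rbf_design_matrix(X, centers, spread)
    return np.linalg.pinv(Phi) @ y

def predict_rbf(X, centers, spread, w):
    # Linear output layer over the Gaussian activations.
    return rbf_design_matrix(X, centers, spread) @ w
```

With one center per training point the design matrix is square and well-conditioned for distinct inputs, so the pseudo-inverse recovers an exact interpolant; with K-means centers (fewer than the number of points) the same formula gives the least-squares fit.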

Experimental evaluation employs publicly available datasets from three repositories (NASA: KC1, KC2, JM1, PC1, MW1, CM1; PROMISE: JEdit 4.2/4.3; Softlab: AR4‑AR6) and a 10‑fold cross‑validation protocol. Pre‑processing includes global missing‑value imputation and binary encoding of nominal attributes. Six performance metrics are reported: Accuracy, Precision, Recall, F‑measure, AUC, and a “Balance” measure designed for imbalanced data. Statistical significance is assessed with non‑parametric Kruskal‑Wallis tests followed by Mann‑Whitney post‑hoc analysis.
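The "Balance" measure is commonly defined in the defect-prediction literature as the normalized Euclidean distance from the ideal operating point (probability of false alarm pf = 0, probability of detection pd = 1), subtracted from 1; assuming that definition is the one used here, it can be computed from a confusion matrix as follows:

```python
import math

def balance(tp, fn, fp, tn):
    # pd: probability of detection (recall on the defective class)
    # pf: probability of false alarm (fraction of clean modules
    #     wrongly flagged as defective)
    pd = tp / (tp + fn)
    pf = fp / (fp + tn)
    # Distance from the ideal point (pf=0, pd=1), normalised by the
    # maximum possible distance sqrt(2), then inverted so that 1.0
    # is perfect and 0.0 is worst.
    return 1.0 - math.sqrt((0.0 - pf) ** 2 + (1.0 - pd) ** 2) / math.sqrt(2.0)
```

A perfect classifier (pd = 1, pf = 0) scores 1.0, which matches the near-1.0 Balance values reported for DBBRBF on the NASA datasets.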

Results show that DBBRBF consistently outperforms state‑of‑the‑art baselines such as Naïve Bayes + MICHAC, Random Forest + MICHAC, and RIPPER + MICHAC. On NASA datasets, DBBRBF reaches accuracies of 98.33 % (KC1) and 100 % (KC2, JM1, PC1), surpassing the next best methods by margins of 10‑15 %. The Balance metric attains values near 1.0 for five of the six NASA datasets, indicating near‑perfect handling of class imbalance. Comparable superiority is observed on JEdit and Softlab datasets, where DBBRBF achieves the highest average accuracy, precision, and recall among Logistic Regression, SVM, and Random Forest.

The authors acknowledge limitations: the experiments were conducted on modest hardware (i5 CPU, 4 GB RAM), raising questions about scalability to larger industrial datasets; DBB’s hyper‑parameters (target count b, number of synthetic samples) are fixed rather than adaptively tuned; and the RBFNN architecture relies on empirically chosen hidden‑neuron counts without automated optimization.

Future work is suggested in three directions: (1) developing adaptive or meta‑learning strategies for DBB parameter selection, (2) exploiting parallel or GPU‑accelerated training to handle larger data volumes, and (3) benchmarking DBBRBF against deep learning classifiers (e.g., CNNs, LSTMs) and ensemble methods to further validate its robustness.

In summary, the DBBRBF framework demonstrates that a carefully designed instance‑balancing pre‑processing step, coupled with a mathematically tractable RBF neural network, can substantially mitigate the detrimental effects of class imbalance and deliver superior defect‑prediction performance across a range of benchmark datasets.
