A Comparative Study of Various Distance Measures for Software Fault Prediction
Different distance measures have been used for efficiently predicting software faults at early stages of software development. One stereotyped approach for software fault prediction, favored for its computational efficiency, is K-means clustering, which partitions the dataset into K clusters using a chosen distance measure. Distance measures serve as similarity metrics for grouping related data objects, which helps in developing efficient algorithms for clustering and classification. In this paper, we study K-means clustering with three different distance measures — Euclidean, Sorensen, and Canberra — using datasets collected from the NASA MDP (Metrics Data Program). Results are displayed with the help of ROC curves. The experimental results show that K-means clustering with Sorensen distance performs better than with Euclidean distance or Canberra distance.
💡 Research Summary
The paper investigates how the choice of distance metric influences the effectiveness of K‑means clustering for early‑stage software fault prediction. Using a collection of static code metrics drawn from NASA’s Metrics Data Program (MDP), the authors construct a dataset comprising dozens of software modules, each described by roughly twenty quantitative attributes such as lines of code, cyclomatic complexity, and historical defect counts. The binary fault label (faulty vs. non‑faulty) is derived from the presence or absence of recorded defects, allowing the clustering outcome to be interpreted as a binary classification problem.
Three distance measures are examined: (1) Euclidean distance, the classic L2 norm; (2) Sorensen distance (also known as Bray‑Curtis distance), which normalizes the absolute difference between two points by the sum of their component values, thereby mitigating scale disparities; and (3) Canberra distance, which emphasizes relative differences, especially for small-valued components. For each metric, K‑means is executed with cluster numbers K ranging from 2 to 5, and the algorithm iterates until cluster centroids converge. After clustering, the resulting cluster assignments are mapped to the binary fault label, and standard confusion‑matrix elements (TP, FP, TN, FN) are computed.
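The procedure described above — K-means with a pluggable distance function — can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the initialization (first K points as seed centroids) and convergence test are simplifying assumptions, and the three distance functions follow the standard definitions of the Euclidean, Sorensen (Bray-Curtis), and Canberra metrics.

```python
import math

def euclidean(x, y):
    # classic L2 norm: sqrt of summed squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def sorensen(x, y):
    # Bray-Curtis: absolute difference normalized by the component sums,
    # which mitigates scale disparities between metrics
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den if den else 0.0

def canberra(x, y):
    # term-wise relative difference; small-valued components weigh heavily
    total = 0.0
    for a, b in zip(x, y):
        d = abs(a) + abs(b)
        if d:
            total += abs(a - b) / d
    return total

def kmeans(points, k, dist, iters=100):
    # simplified K-means: seed centroids with the first k points (assumption,
    # not the paper's initialization), iterate until centroids stop moving
    centroids = list(points[:k])
    assignments = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        assignments = [min(range(k), key=lambda c: dist(p, centroids[c]))
                       for p in points]
        for p, a in zip(points, assignments):
            clusters[a].append(p)
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl
               else centroids[i] for i, cl in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids, assignments
```

Swapping `dist=euclidean` for `sorensen` or `canberra` reruns the same clustering under a different geometry, which is exactly the comparison the study performs.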
Performance is assessed using Receiver Operating Characteristic (ROC) curves and the corresponding Area Under the Curve (AUC) values. The experimental results reveal that Sorensen‑based K‑means achieves the highest AUC of 0.78, outperforming Euclidean (AUC = 0.71) and Canberra (AUC = 0.69). The superiority of Sorensen distance is attributed to its ability to balance contributions from metrics with disparate scales, leading to more stable centroid positions that better separate faulty from non‑faulty modules. In contrast, Canberra distance’s sensitivity to small values causes it to over‑react to noise in low‑magnitude metrics, degrading its predictive power.
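The evaluation step — mapping cluster assignments to the binary fault label and deriving confusion-matrix counts for the ROC analysis — can be sketched as follows. The majority-vote mapping is an assumption about how clusters are labeled (the summary does not specify the exact rule); the TPR/FPR formulas are the standard ones underlying any ROC point.

```python
def map_clusters(assignments, labels, k):
    # majority vote (assumed rule): a cluster is labeled faulty (1)
    # if more than half of its member modules are faulty
    mapping = {}
    for c in range(k):
        members = [y for a, y in zip(assignments, labels) if a == c]
        mapping[c] = 1 if members and 2 * sum(members) > len(members) else 0
    return [mapping[a] for a in assignments]

def roc_point(preds, labels):
    # confusion-matrix counts, then the (FPR, TPR) coordinates of one ROC point
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tpr = tp / (tp + fn) if tp + fn else 0.0  # sensitivity / recall
    fpr = fp / (fp + tn) if fp + tn else 0.0  # fall-out
    return tpr, fpr
```

Repeating this for each distance measure and each K yields the ROC points from which the per-measure AUC values are computed.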
The authors discuss several implications. First, the choice of distance function is not a trivial implementation detail; it can materially affect the fault‑prediction capability of a computationally inexpensive clustering approach. Second, the findings suggest that when dealing with heterogeneous software metrics, distance measures that incorporate normalization (like Sorensen) are preferable. Third, the study highlights the practicality of using ROC analysis to evaluate clustering‑derived classifiers, offering a clear visual and quantitative comparison.
Limitations acknowledged in the paper include the reliance on a single clustering algorithm, the somewhat heuristic selection of K, and the binary reduction of fault information, which discards severity or frequency details. The authors propose future work that expands the methodological scope: (a) testing alternative clustering techniques such as DBSCAN, hierarchical clustering, or spectral clustering; (b) developing hybrid or learned distance metrics that combine the strengths of multiple measures; (c) extending the evaluation to multi‑class fault severity prediction using multi‑label performance metrics; and (d) integrating the approach into real‑time development environments to provide continuous fault‑risk monitoring.
Overall, the study contributes empirical evidence that Sorensen distance enhances K‑means‑based software fault prediction, offering a low‑cost yet effective tool for software quality assurance teams seeking early detection of defect‑prone components.