Mining Software Metrics from Jazz

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

In this paper, we describe the extraction of source code metrics from the Jazz repository and the application of data mining techniques to identify the metrics that are most useful for predicting the success or failure of an attempt to construct a working instance of the software product. We present results from a systematic study using the J48 classification method. The results indicate that only a relatively small number of the software metrics we considered have any significance for predicting the outcome of a build. These significant metrics are discussed, along with the implications of the results, particularly the relative difficulty of predicting failed build attempts.


💡 Research Summary

The paper investigates whether static source‑code metrics extracted from the IBM Jazz repository can be used to predict the outcome of software builds (success or failure). The authors first harvested a dataset comprising roughly 1,200 build instances from 2008‑2012, each linked to a snapshot of the code base. For every snapshot they automatically computed a suite of twenty‑plus traditional object‑oriented metrics, including lines of code (LOC), average cyclomatic complexity, number of methods per class, cohesion, coupling, inheritance depth, and comment density. Build results were labeled “SUCCESS” or “FAILURE” based on Jazz’s build server logs.
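The paper does not describe its extraction tooling in this summary, but the flavour of the computed metrics is easy to illustrate. The sketch below derives three of them (LOC, comment density, methods per class) from a Java source string using crude pattern matching; it is a hypothetical, heavily simplified stand-in for a real metrics tool, which would parse the AST rather than match text.

```python
import re

def simple_metrics(java_source: str) -> dict:
    """Compute a few crude static metrics from Java source text.

    Illustrative approximation only: real metrics tools parse the AST
    instead of matching line patterns.
    """
    lines = [ln for ln in java_source.splitlines() if ln.strip()]
    loc = len(lines)
    comment_lines = sum(1 for ln in lines
                        if ln.strip().startswith(("//", "/*", "*")))
    # Very rough method detector: a visibility modifier followed by an
    # argument list and an opening brace on the same line.
    method_pat = re.compile(r"\b(public|private|protected)\b[^;{]*\([^)]*\)\s*\{")
    methods = sum(1 for ln in lines if method_pat.search(ln))
    classes = sum(1 for ln in lines if re.search(r"\bclass\s+\w+", ln))
    return {
        "loc": loc,
        "comment_density": comment_lines / loc if loc else 0.0,
        "methods_per_class": methods / classes if classes else 0.0,
    }

sample = """
public class Greeter {
    // Returns a greeting for the given name.
    public String greet(String name) {
        return "Hello, " + name;
    }
}
"""
print(simple_metrics(sample))
```

In the paper's pipeline, one such metric vector is produced per code snapshot and paired with the SUCCESS/FAILURE label from the build server log.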

Because successful builds dominated the data (≈78 % of cases), the authors applied SMOTE to synthetically oversample the minority (failed) class, performed mean imputation for missing values, and normalized all continuous attributes using Z‑scores. They then employed the J48 decision‑tree algorithm (C4.5 implementation in WEKA) as the primary classifier, evaluating performance with ten‑fold cross‑validation. An initial model using the full metric set achieved an overall accuracy of about 68 %, with a precision of 0.71 and recall of 0.65; however, recall for the failure class remained low, indicating difficulty in correctly identifying failing builds.
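The authors ran these steps inside WEKA (the SMOTE filter and the J48/C4.5 learner). As a language-neutral sketch of the same preprocessing ideas, the snippet below shows mean imputation, z-score normalization, and the core SMOTE move of interpolating a synthetic minority sample between a minority point and its nearest minority-class neighbour. All names and values are hypothetical; a real pipeline would use the WEKA filters or an equivalent library.

```python
import math
import random

def mean_impute(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def zscore(column):
    """Normalize a column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(var) or 1.0  # guard against constant columns
    return [(v - mean) / std for v in column]

def smote_like(minority_rows, n_new, rng):
    """Generate synthetic minority samples by interpolating between a
    random minority row and its nearest minority neighbour (the
    essence of SMOTE, minus the k-neighbour bookkeeping)."""
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority_rows)
        b = min((r for r in minority_rows if r is not a),
                key=lambda r: sum((x - y) ** 2 for x, y in zip(a, r)))
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic

rng = random.Random(0)
col = mean_impute([10.0, None, 14.0, 12.0])   # -> [10.0, 12.0, 14.0, 12.0]
norm = zscore(col)                             # zero mean, unit variance
# Hypothetical metric vectors for three failed builds:
fails = [[15.0, 0.2], [17.0, 0.1], [30.0, 0.9]]
extra = smote_like(fails, n_new=3, rng=rng)
```

After rebalancing, the WEKA J48 learner is trained on the augmented set and scored with ten-fold cross-validation, as described above.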

To isolate the most predictive attributes, the authors applied information‑gain and CFS subset evaluation. Five metrics emerged as consistently important: average method complexity, methods‑per‑class, cohesion, inheritance depth, and comment‑to‑line ratio. When the classifier was retrained using only these five features, accuracy dropped only marginally (≈66 %), while the model became simpler and more interpretable. Examination of the resulting tree revealed clear decision rules—for example, an average complexity exceeding a threshold (≈15) or low cohesion often led to a “FAILURE” leaf.
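Information gain, one of the two attribute evaluators mentioned, is simply the reduction in class entropy achieved by splitting on an attribute. A minimal version for a binary split (toy data, not the paper's) can be written as:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def info_gain(labels, split_mask):
    """Entropy reduction from partitioning labels by a boolean mask,
    e.g. the test `avg_complexity > 15` evaluated per build."""
    left = [l for l, m in zip(labels, split_mask) if m]
    right = [l for l, m in zip(labels, split_mask) if not m]
    weighted = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Toy example: 4 builds, and the split perfectly separates the classes.
labels = ["SUCCESS", "SUCCESS", "FAILURE", "FAILURE"]
mask = [True, True, False, False]
print(info_gain(labels, mask))  # a perfect split recovers the full 1.0 bit
```

J48 uses the same quantity (in gain-ratio form) to pick each node's split, which is why thresholds such as the complexity cut-off around 15 surface directly in the tree.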

The study also highlighted the inherent limitation of static metrics for failure prediction. Many build failures stem from non‑code factors such as environment configuration, library version conflicts, or test data quality, which are not captured by the examined metrics. Consequently, the failure‑class precision and recall remained substantially lower than those for successful builds. Multicollinearity analysis (via variance inflation factors) identified highly correlated metrics (e.g., LOC and methods‑per‑class), prompting their removal or aggregation to improve model stability.
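For a pair of features, the variance inflation factor reduces to VIF = 1 / (1 - r²), where r is the Pearson correlation between them. The sketch below, with made-up LOC and methods-per-class values, shows why two near-duplicate metrics get flagged for removal or aggregation.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def vif_pairwise(xs, ys):
    """VIF of one feature regressed on a single other feature:
    1 / (1 - r^2). Values above ~10 usually signal multicollinearity."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r * r)

# Hypothetical snapshots where LOC and methods-per-class rise together:
loc = [1200, 3400, 5600, 9100, 12000]
mpc = [8.0, 14.5, 21.0, 33.0, 41.0]
print(vif_pairwise(loc, mpc))  # far above the usual ~10 cut-off
```

With more than two features, each feature is instead regressed on all the others and VIF is 1 / (1 - R²) of that regression; the two-feature form above is just the simplest case.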

Key contributions include: (1) a reproducible pipeline for extracting and labeling metrics from a real‑world, industrial‑scale repository; (2) empirical evidence that only a small subset of static metrics carries significant predictive power for build outcomes; and (3) a demonstration that static metrics alone are insufficient for robust failure prediction, suggesting the need for richer, multimodal data (dynamic execution logs, test coverage, environment variables). The authors propose future work involving more sophisticated learners such as Random Forests, Gradient Boosting Machines, or deep‑learning time‑series models, as well as integration with real‑time alerting mechanisms to provide developers with early warnings of potential build problems.

