Reduction of Redundant Rules in Association Rule Mining-Based Bug Assignment
Bug triaging is the process of deciding what to do with incoming bug reports. In this paper, we mine association rules to predict the assignee of a newly reported bug from four bug attributes: severity, priority, component, and operating system. To cope with large data sets, we divide the data into subsets using the K-means clustering algorithm. We use the Apriori algorithm in MATLAB to generate association rules, and we extract the rules for the top 5 assignees in each cluster. The proposed method has been empirically validated on 14,696 bug reports from three Mozilla open-source software projects, namely SeaMonkey, Firefox, and Bugzilla. The proposed method provides an improvement over existing techniques for the bug assignment problem.
💡 Research Summary
The paper addresses the long-standing challenge of automating bug triage, i.e., assigning newly reported software defects to the most appropriate developer. Traditional approaches rely on text classification or on a single machine-learning model that treats the entire bug repository as one monolithic dataset. While such methods can achieve reasonable accuracy, they suffer from two major drawbacks as the data volume grows: (1) computational cost escalates sharply, and (2) the generated decision rules become redundant and difficult to interpret. To overcome these issues, the authors propose a hybrid framework that combines K-means clustering with Apriori-based association-rule mining, followed by a systematic redundancy-elimination step.
First, each bug report is reduced to four categorical attributes: severity, priority, component, and operating system. After one‑hot encoding, the full dataset is partitioned into K clusters using the K‑means algorithm. The clustering step serves two purposes: it groups together bugs that share similar attribute profiles, and it dramatically reduces the search space for subsequent rule mining, because each cluster contains a much smaller, more homogeneous subset of the data. The number of clusters K is selected empirically (typically between 5 and 10) based on silhouette scores.
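The encoding and partitioning step can be illustrated with a minimal sketch. It is not the authors' MATLAB implementation: the sample reports, the helper names (`one_hot`, `sq_dist`, `kmeans`), the fixed seed, and the choice of k = 2 are all assumptions made for illustration, and the silhouette-based selection of K is omitted.

```python
import random

FIELDS = ("severity", "priority", "component", "os")

def one_hot(reports, fields):
    """Map each categorical report to a 0/1 vector, one column per (field, value)."""
    columns = sorted({(f, r[f]) for r in reports for f in fields})
    index = {col: i for i, col in enumerate(columns)}
    vectors = []
    for r in reports:
        v = [0.0] * len(columns)
        for f in fields:
            v[index[(f, r[f])]] = 1.0
        vectors.append(v)
    return vectors, columns

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical reports, invented for this example only.
reports = [
    {"severity": "critical", "priority": "P1", "component": "Networking", "os": "Linux"},
    {"severity": "critical", "priority": "P1", "component": "Networking", "os": "Linux"},
    {"severity": "minor", "priority": "P4", "component": "UI", "os": "Windows"},
    {"severity": "minor", "priority": "P4", "component": "UI", "os": "Windows"},
    {"severity": "minor", "priority": "P4", "component": "UI", "os": "macOS"},
]
vectors, columns = one_hot(reports, FIELDS)
labels = kmeans(vectors, k=2)
```

Each report sets exactly one column per attribute, so every vector sums to four, and reports with identical attribute profiles always receive the same label. Because K-means depends on the initial centroids, a different seed may produce a different partition.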
Within each cluster, the Apriori algorithm is applied to discover frequent itemsets and generate association rules of the form “{attribute combination} → assignee”. The support and confidence thresholds are set to 0.01 and 0.60, respectively, so that only sufficiently frequent and predictive rules are retained. However, Apriori naturally produces many overlapping rules: some share an identical antecedent, and in others one antecedent is a superset of another. To make the rule set usable for triage, the authors introduce a two-stage redundancy removal process. In the first stage, rules with duplicate antecedents are collapsed by keeping only the rule with the highest confidence. In the second stage, when one rule's antecedent is a subset of another's, the algorithm retains whichever rule has the higher confidence and discards the other, which typically removes a more specific but less reliable rule. This pruning yields a concise, high-quality rule set for each cluster.
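The mining and pruning steps can be sketched as follows. Several points are assumptions rather than details from the paper: a brute-force enumeration of attribute subsets stands in for Apriori's level-wise candidate generation (feasible here because four attributes give at most 15 non-empty subsets per report), the subset comparison in stage two is restricted to rules that predict the same assignee, and a tie in confidence keeps the more general rule. The records and the names `mine_rules` and `prune` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def mine_rules(records, min_support=0.01, min_confidence=0.60):
    """records: list of (frozenset of attribute items, assignee).
    Returns {(antecedent, assignee): (support, confidence)}.
    Brute-force stand-in for Apriori (assumption, see lead-in)."""
    n = len(records)
    antecedent_counts = Counter()
    rule_counts = Counter()
    for items, assignee in records:
        for size in range(1, len(items) + 1):
            for subset in combinations(sorted(items), size):
                key = frozenset(subset)
                antecedent_counts[key] += 1
                rule_counts[(key, assignee)] += 1
    rules = {}
    for (ante, assignee), count in rule_counts.items():
        support = count / n
        confidence = count / antecedent_counts[ante]
        if support >= min_support and confidence >= min_confidence:
            rules[(ante, assignee)] = (support, confidence)
    return rules

def prune(rules):
    """Two-stage redundancy removal.
    Returns {antecedent: (assignee, support, confidence)}."""
    # Stage 1: one rule per antecedent, the most confident one.
    best = {}
    for (ante, assignee), (sup, conf) in rules.items():
        if ante not in best or conf > best[ante][2]:
            best[ante] = (assignee, sup, conf)
    # Stage 2: for nested antecedents with the same assignee (assumption),
    # keep the more confident rule; ties keep the general rule (assumption).
    kept = dict(best)
    for a, b in combinations(list(best), 2):
        if best[a][0] != best[b][0]:
            continue
        if a < b:
            general, specific = a, b
        elif b < a:
            general, specific = b, a
        else:
            continue
        loser = specific if best[general][2] >= best[specific][2] else general
        kept.pop(loser, None)
    return kept

# Hypothetical records from a single cluster, invented for this example.
records = (
    [(frozenset({"severity=critical", "component=Networking"}), "alice")] * 3
    + [(frozenset({"severity=critical", "component=UI"}), "bob")]
    + [(frozenset({"severity=minor", "component=UI"}), "bob")] * 2
)
rules = mine_rules(records)
pruned = prune(rules)
```

On this toy data, seven rules pass the thresholds and three remain after pruning: `component=Networking` predicts alice, while `component=UI` and `severity=minor` each predict bob. Note that with a confidence threshold above 0.5, at most one assignee per antecedent can pass, so stage one only has an effect at lower thresholds.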
The final step extracts the top five assignees per cluster based on rule confidence and frequency, providing a ranked list of candidate developers for any new bug that falls into that cluster. The approach is evaluated on a real‑world dataset comprising 14,696 bug reports from three Mozilla open‑source projects: SeaMonkey, Firefox, and Bugzilla. Standard classification metrics—accuracy, precision, recall, and F1‑score—are computed and compared against baseline methods such as Naïve Bayes, Support Vector Machines, and a previous association‑rule based technique that does not employ clustering or redundancy removal.
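One way to turn a cluster's rule set into a ranked candidate list is sketched below. The summary does not specify the exact ranking procedure, so the scoring used here (best matching rule per assignee, ordered by confidence and then support) is an assumption, as are the rule values and the name `rank_assignees`.

```python
def rank_assignees(bug_items, rules, top_n=5):
    """Rank candidate assignees for a new bug.

    rules: iterable of (antecedent frozenset, assignee, support, confidence).
    A rule matches when its antecedent is contained in the bug's items.
    Each assignee is scored by its best matching rule (assumed scheme).
    """
    best = {}
    for antecedent, assignee, support, confidence in rules:
        if antecedent <= bug_items:
            score = (confidence, support)
            if assignee not in best or score > best[assignee]:
                best[assignee] = score
    ranked = sorted(best, key=lambda a: best[a], reverse=True)
    return ranked[:top_n]

# Hypothetical pruned rules for one cluster, invented for this example.
cluster_rules = [
    (frozenset({"component=Networking"}), "alice", 0.50, 1.00),
    (frozenset({"severity=critical"}), "carol", 0.20, 0.80),
    (frozenset({"component=UI"}), "bob", 0.50, 1.00),
    (frozenset({"os=Linux"}), "alice", 0.30, 0.70),
]
new_bug = frozenset(
    {"severity=critical", "priority=P1", "component=Networking", "os=Linux"}
)
candidates = rank_assignees(new_bug, cluster_rules)
```

Here alice ranks first (her best matching rule has confidence 1.00), carol second, and bob is absent because his only rule requires `component=UI`. A production triage tool would also need a fallback for bugs that match no rule.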
Results show that the proposed method achieves an average accuracy of 78.3 %, outperforming the best baseline (70.1 %). Precision and recall also improve by 5–9 percentage points, with the most notable gains observed for high‑severity, high‑priority bugs. Moreover, the clustering‑plus‑pruning pipeline reduces total mining time by roughly 30 % relative to a naïve Apriori run on the entire dataset, demonstrating scalability benefits.
The paper’s contributions are threefold: (1) a practical method for scaling association‑rule mining to large bug repositories via data partitioning, (2) a systematic technique for eliminating redundant rules to enhance interpretability and decision quality, and (3) extensive empirical validation on authentic open‑source data. Nevertheless, the study has limitations. K‑means is sensitive to the initial centroid selection, which can affect cluster quality; Apriori still faces candidate‑generation explosion in dense clusters; and the focus on the top five assignees may overlook scenarios where workload balancing or developer expertise dynamics are important. Future work is suggested to explore density‑based clustering (e.g., DBSCAN), more efficient mining algorithms such as FP‑Growth, and dynamic assignment models that incorporate developer availability and historical performance.