Using Categorical Features in Mining Bug Tracking Systems to Assign Bug Reports
Most bug assignment approaches utilize text classification and information retrieval techniques. These approaches use the textual contents of bug reports to build recommendation models. The textual contents of bug reports are usually a high-dimensional and noisy source of information. These approaches suffer from low accuracy and high computational needs. In this paper, we investigate whether the categorical fields of bug reports, such as the component to which the bug belongs, are appropriate for representing bug reports instead of the textual description. We build a classification model that uses the categorical features as the representation of the bug report. The experimental evaluation is conducted using three projects, namely NetBeans, Freedesktop, and Firefox. We compare this approach with two machine learning based bug assignment approaches. The evaluation shows that using the textual contents of bug reports is important. In addition, it shows that the categorical features can improve the classification accuracy.
💡 Research Summary
The paper addresses the problem of automatically assigning bug reports to developers, a task traditionally tackled with text‑based machine‑learning approaches. Existing methods typically extract the title, description, and reproduction steps of a bug report, convert them into high‑dimensional TF‑IDF vectors, and feed these vectors into classifiers such as SVM or Naïve Bayes. While effective in many cases, text‑only models suffer from two major drawbacks: (1) the textual data are noisy, containing misspellings, informal language, and redundant information, which reduces classification accuracy; and (2) the high dimensionality of the feature space leads to substantial computational overhead during training and inference.
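The dimensionality concern can be made concrete with a minimal TF-IDF sketch in plain Python. The toy reports, the whitespace tokenizer, and the name `tfidf_vectors` are illustrative assumptions, not the paper's preprocessing; the weighting used here is the basic tf × log(N / df) form.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return (vocabulary, vectors) using tf * log(N / df) weighting."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        length = len(toks) or 1  # guard against empty reports
        vectors.append([
            (counts[term] / length) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors

# Hypothetical bug report summaries.
reports = [
    "editor crashes when opening large file",
    "crash in editor on save",
    "toolbar icon missing after update",
]
vocabulary, vectors = tfidf_vectors(reports)
print(len(vocabulary), "dimensions for", len(reports), "reports")
```

Every distinct token adds a dimension, so the vector length grows with the vocabulary of the whole corpus rather than with the number of fields in a report.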
To explore an alternative, the authors propose using the structured categorical fields that are already present in most bug‑tracking systems. These fields include component, product, version, operating system, and other metadata that developers must fill out when filing a bug. Because the possible values for each field are limited, they can be encoded efficiently (e.g., one‑hot encoding) and provide a low‑dimensional, semantically stable representation of the report.
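As a rough illustration of one-hot encoding for such fields, the sketch below maps each (field, value) pair to one column. The field names, the sample reports, and the helpers `build_columns` and `one_hot` are assumptions made for this example, not taken from the paper.

```python
FIELDS = ("component", "product", "os")

def build_columns(reports, fields=FIELDS):
    """Assign one column index to every (field, value) pair seen in the data."""
    pairs = sorted({(f, r[f]) for r in reports for f in fields})
    return {pair: index for index, pair in enumerate(pairs)}

def one_hot(report, columns, fields=FIELDS):
    """Encode a report as a binary indicator vector over the known columns."""
    vector = [0] * len(columns)
    for f in fields:
        index = columns.get((f, report[f]))
        if index is not None:
            vector[index] = 1
    return vector

# Hypothetical reports with three categorical fields each.
reports = [
    {"component": "editor", "product": "ide", "os": "linux"},
    {"component": "debugger", "product": "ide", "os": "windows"},
    {"component": "editor", "product": "platform", "os": "linux"},
]
columns = build_columns(reports)
encoded = [one_hot(r, columns) for r in reports]
```

The vector length equals the number of distinct field values, which is typically far smaller than a text vocabulary, and each report sets exactly one column per field.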
The experimental evaluation covers three open‑source projects: NetBeans, Freedesktop, and Firefox. For each project the authors collected more than 5,000 bug reports, extracting the textual content, the categorical metadata, and the ground‑truth assignee. Missing categorical values were replaced with an “unknown” token, and all categorical attributes were transformed into binary indicator vectors. Three classic classifiers (Naïve Bayes, Decision Tree, and Random Forest) were trained on the categorical representation. A 10‑fold cross‑validation scheme was used to obtain reliable performance estimates. For comparison, the authors implemented a state‑of‑the‑art text‑only model that uses TF‑IDF weighting and a linear SVM classifier.
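Two of the steps described here, filling missing values with an "unknown" token and training a Naïve Bayes classifier on categorical fields, can be sketched in plain Python. This is a simplified stand-in with Laplace smoothing, not the authors' implementation; the field names, the toy data, and the developer names are hypothetical, and cross-validation is left out for brevity.

```python
import math
from collections import Counter, defaultdict

FIELDS = ("component", "product", "os")

def fill_missing(report):
    """Replace absent or empty categorical values with an 'unknown' token."""
    return {f: report.get(f) or "unknown" for f in FIELDS}

class CategoricalNaiveBayes:
    def fit(self, reports, assignees):
        self.class_counts = Counter(assignees)
        self.value_counts = defaultdict(Counter)  # (assignee, field) -> value counts
        self.field_values = defaultdict(set)      # field -> values seen in training
        for report, assignee in zip(reports, assignees):
            for f, v in fill_missing(report).items():
                self.value_counts[(assignee, f)][v] += 1
                self.field_values[f].add(v)
        return self

    def predict(self, report):
        report = fill_missing(report)
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for assignee, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            for f, v in report.items():
                n_values = len(self.field_values[f]) + 1  # extra slot for unseen values
                seen = self.value_counts[(assignee, f)][v]
                score += math.log((seen + 1) / (count + n_values))  # Laplace smoothing
            if score > best_score:
                best, best_score = assignee, score
        return best

# Hypothetical training data; two reports have a missing "os" value.
train_reports = [
    {"component": "editor", "product": "ide", "os": "linux"},
    {"component": "editor", "product": "ide", "os": None},
    {"component": "debugger", "product": "ide", "os": "windows"},
    {"component": "debugger", "product": "ide"},
]
train_assignees = ["alice", "alice", "bob", "bob"]

model = CategoricalNaiveBayes().fit(train_reports, train_assignees)
prediction = model.predict({"component": "editor", "product": "ide", "os": "windows"})
```

Because "unknown" is treated as an ordinary value, reports with missing fields can still be used for training and prediction without being discarded.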
Results show that a model built solely on categorical features can achieve competitive performance. The Random Forest classifier trained on categorical data reached an average accuracy of about 71 %, compared with 68 % for the text‑only SVM baseline. An analysis of feature importance revealed that the component and product fields contributed the most to predictive power, while version information had a variable impact across projects. When the categorical features were combined with the textual TF‑IDF vectors in a hybrid model, accuracy rose to roughly 77 %, a statistically significant improvement over either single‑source approach. This demonstrates that the structured metadata capture complementary information that text alone misses.
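The summary does not say how the textual and categorical representations were combined. One common option, shown here purely as an assumption, is to concatenate the two feature vectors before training a single classifier. The vocabulary, the column list, the `summary` key, and the function names are hypothetical, and normalized term frequency stands in for TF‑IDF to keep the sketch short.

```python
def text_features(text, vocabulary):
    """Normalized term frequency for each vocabulary term."""
    tokens = text.lower().split()
    return [tokens.count(term) / max(len(tokens), 1) for term in vocabulary]

def categorical_features(report, columns):
    """Binary indicator for each (field, value) column."""
    return [1.0 if report.get(field) == value else 0.0 for field, value in columns]

def hybrid_features(report, vocabulary, columns):
    """Concatenate the text part and the categorical part into one vector."""
    return text_features(report["summary"], vocabulary) + categorical_features(report, columns)

# Hypothetical vocabulary and categorical columns.
vocabulary = ["crash", "editor", "save"]
columns = [("component", "editor"), ("component", "debugger"), ("product", "ide")]

report = {"summary": "Editor crash on save", "component": "editor", "product": "ide"}
features = hybrid_features(report, vocabulary, columns)
```

With concatenation, the classifier sees both sources at once; in practice the two parts may need rescaling so that neither dominates.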
Despite the promising results, the authors acknowledge several limitations. Categorical data alone cannot distinguish subtle semantic differences that are expressed in the description (e.g., two bugs in the same component but caused by different underlying mechanisms). Moreover, the categorical schema is not static: new components or products may be added over time, requiring updates to the one‑hot encoding and potentially retraining the model. Some projects also suffer from ambiguous or inconsistently named components, which adds preprocessing overhead.
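One way to soften the schema-evolution problem, offered here as an assumption rather than something the paper prescribes, is to reserve a catch-all column per field so that values unseen at training time do not change the vector length. The class name `StableOneHotEncoder` and the `<other>` token are hypothetical.

```python
class StableOneHotEncoder:
    """One-hot encoder with a reserved '<other>' column for each field."""

    def __init__(self, fields):
        self.fields = list(fields)
        self.columns = {}

    def fit(self, reports):
        for f in self.fields:
            # Reserve the catch-all slot first, then one column per known value.
            self.columns[(f, "<other>")] = len(self.columns)
            for value in sorted({r[f] for r in reports}):
                self.columns[(f, value)] = len(self.columns)
        return self

    def transform(self, report):
        vector = [0] * len(self.columns)
        for f in self.fields:
            fallback = self.columns[(f, "<other>")]
            vector[self.columns.get((f, report.get(f)), fallback)] = 1
        return vector

# Hypothetical training reports.
reports = [
    {"component": "editor", "product": "ide"},
    {"component": "debugger", "product": "ide"},
]
encoder = StableOneHotEncoder(("component", "product")).fit(reports)

# "profiler" was added to the tracker after the encoder was fitted.
new_report = {"component": "profiler", "product": "ide"}
vector = encoder.transform(new_report)
```

The model can keep serving predictions for new components in the meantime, although it learns nothing specific about them until it is retrained with an updated encoder.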
The paper’s contributions are threefold. First, it highlights the under‑exploited value of bug‑tracking metadata as low‑cost, low‑dimensional features for automated bug assignment. Second, it empirically demonstrates that categorical features can match or surpass text‑only models in accuracy while reducing computational demands. Third, it proposes a hybrid approach that integrates both textual and categorical information, achieving the best performance and suggesting a practical direction for future research. The discussion also provides actionable guidance on handling schema evolution and data quality issues when deploying categorical‑based models in real‑world development environments.