Towards Effective Bug Triage with Software Data Reduction Techniques
📝 Abstract
Software companies spend over 45 percent of their costs on dealing with software bugs. An inevitable step in fixing bugs is bug triage, which aims to correctly assign a developer to a new bug. To decrease the time cost of manual work, text classification techniques are applied to conduct automatic bug triage. In this paper, we address the problem of data reduction for bug triage, i.e., how to reduce the scale and improve the quality of bug data. We combine instance selection with feature selection to simultaneously reduce data scale on the bug dimension and the word dimension. To determine the order of applying instance selection and feature selection, we extract attributes from historical bug data sets and build a predictive model for a new bug data set. We empirically investigate the performance of data reduction on a total of 600,000 bug reports from two large open source projects, namely Eclipse and Mozilla. The results show that our data reduction can effectively reduce the data scale and improve the accuracy of bug triage. Our work provides an approach to leveraging data processing techniques to form reduced and high-quality bug data in software development and maintenance.
📄 Content
IEEE TRANSACTIONS ON JOURNAL NAME, MANUSCRIPT ID 1
Towards Effective Bug Triage with
Software Data Reduction Techniques
Jifeng Xuan, He Jiang, Member, IEEE, Yan Hu, Zhilei Ren,
Weiqin Zou, Zhongxuan Luo, Xindong Wu, Fellow, IEEE
Abstract—Software companies spend over 45 percent of their costs on dealing with software bugs. An inevitable step in fixing bugs is
bug triage, which aims to correctly assign a developer to a new bug. To decrease the time cost of manual work, text
classification techniques are applied to conduct automatic bug triage. In this paper, we address the problem of data reduction for
bug triage, i.e., how to reduce the scale and improve the quality of bug data. We combine instance selection with feature
selection to simultaneously reduce data scale on the bug dimension and the word dimension. To determine the order of applying
instance selection and feature selection, we extract attributes from historical bug data sets and build a predictive model for a
new bug data set. We empirically investigate the performance of data reduction on a total of 600,000 bug reports from two large open
source projects, namely Eclipse and Mozilla. The results show that our data reduction can effectively reduce the data scale and
improve the accuracy of bug triage. Our work provides an approach to leveraging data processing techniques to form
reduced and high-quality bug data in software development and maintenance.
Index Terms—Mining software repositories, application of data preprocessing, data management in bug repositories, bug data
reduction, feature selection, instance selection, bug triage, prediction for reduction orders.
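To make the idea of reducing bug data on two dimensions concrete, the following is a minimal sketch and not the authors' implementation. It treats each bug report as a set of words labeled with a developer, applies a feature selection step (word dimension) and an instance selection step (bug dimension), and lets the caller choose the order. The selection criteria used here (document frequency for words, removal of empty and repeated reports for bugs), the function names, and the toy data are all assumptions chosen for brevity.

```python
from collections import Counter


def tokenize(text):
    """Lowercase a report and split it into a set of distinct words."""
    return set(text.lower().split())


def select_features(reports, k):
    """Word dimension: keep only the k words found in the most reports.

    Document frequency is a stand-in criterion (an assumption of this sketch).
    """
    doc_freq = Counter(word for words, _ in reports for word in words)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(doc_freq, key=lambda w: (-doc_freq[w], w))
    kept = set(ranked[:k])
    return [(words & kept, dev) for words, dev in reports]


def select_instances(reports):
    """Bug dimension: drop empty reports and exact repeats.

    A simple stand-in for an instance selection algorithm (assumption).
    """
    seen = set()
    selected = []
    for words, dev in reports:
        key = (frozenset(words), dev)
        if words and key not in seen:
            seen.add(key)
            selected.append((words, dev))
    return selected


def reduce_data(raw_reports, k, features_first=True):
    """Apply both reduction steps in the requested order."""
    reports = [(tokenize(text), dev) for text, dev in raw_reports]
    if features_first:
        return select_instances(select_features(reports, k))
    return select_features(select_instances(reports), k)


# Hypothetical toy data: (report text, developer who fixed the bug).
RAW = [
    ("editor crash on save", "alice"),
    ("editor crash during save", "alice"),
    ("debugger crash at step", "bob"),
    ("thanks", "bob"),
]

fs_then_is = reduce_data(RAW, k=3, features_first=True)
is_then_fs = reduce_data(RAW, k=3, features_first=False)
print(len(fs_then_is), len(is_then_fs))
```

On this toy data the two orders give different results: selecting words first makes the two editor reports identical and empties the uninformative one, so instance selection can remove them, whereas selecting instances first finds nothing to remove. This is only an illustration of why the order of the two steps can matter.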
1 INTRODUCTION
Mining software repositories is an interdisciplinary
domain, which aims to employ data mining to deal
with software engineering problems [22]. In modern
software development, software repositories are large-scale
databases for storing the output of software development,
e.g., source code, bugs, emails, and specifications.
Traditional software analysis is not completely suitable for the
large-scale and complex data in software repositories [58].
Data mining has emerged as a promising means to handle
software data (e.g., [7], [32]). By leveraging data mining
techniques, mining software repositories can uncover
interesting information in software repositories and solve
real-world software problems.
A bug repository (a typical software repository for
storing details of bugs) plays an important role in managing
software bugs. Software bugs are inevitable, and fixing
bugs is expensive in software development. Software
companies spend over 45 percent of their costs on fixing bugs [39].
Large software projects deploy bug repositories (also called
bug tracking systems) to support information collection and
to assist developers in handling bugs [14], [9]. In a bug
repository, a bug is maintained as a bug report, which records the
textual description of reproducing the bug and is updated
according to the status of bug fixing [64]. A bug repository
provides a data platform to support many types of tasks on
bugs, e.g., fault prediction [7], [49], bug localization [2], and
reopened-bug analysis [63]. In this paper, bug reports in a
bug repository are called bug data.
There are two challenges related to bug data that may
affect the effective use of bug repositories in software
development tasks, namely the large scale and the low
quality. On the one hand, due to the daily-reported bugs, a large
number of new bugs are stored in bug repositories. Taking
an open source project, Eclipse [13], as an example, an
average of 30 new bugs were reported to bug repositories per
day in 2007 [3]; from 2001 to 2010, 333,371 bugs were
reported to Eclipse by over 34,917 developers and users
[57]. It is a challenge to manually examine such large-scale
bug data in software development. On the other hand,
software techniques suffer from the low quality of bug data.
Two typical characteristics of low-quality bugs are noise
and redundancy. Noisy bugs may mislead related
developers [64], while redundant bugs waste the limited time of
bug handling [54].
A time-consuming step of handling software bugs is
bug triage, which aims to assign a correct developer to fix a
new bug [1], [25], [3], [40]. In traditional software
development, new bugs are manually triaged by an expert
developer, i.e., a human triager. Due to the large number of
daily bugs and the lack of expertise covering all the bugs, manual
bug triage is expensive in time cost and low in accuracy. In
manual bug triage in Eclipse, 44 percent of bugs are
assigned by mistake, while the time cost between opening one
bug and its first triaging is 19.3 days on average [25]. To
avoid the expensive cost of manual bug triage, existing
work [1] has proposed an automatic bug triage approach,
which applies text classification techniques to predict
developers for