Towards Better Summarizing Bug Reports with Crowdsourcing Elicited Attributes


Recent years have witnessed growing demand for resolving the large numbers of bug reports that arise in software maintenance. Aiming to reduce the time testers and developers spend perusing bug reports, the task of bug report summarization has attracted considerable research effort. However, no systematic analysis has been conducted on attribute construction, which heavily impacts the performance of supervised algorithms for bug report summarization. In this study, we first conduct a survey of existing methods for attribute construction in mining software repositories. Then, we propose a new method named Crowd-Attribute to infer new, effective attributes from crowd-generated data and develop a new tool, the Crowdsourcing Software Engineering Platform, to facilitate this method. With Crowd-Attribute, we successfully construct 11 new attributes and propose a new supervised algorithm named Logistic Regression with Crowdsourced Attributes (LRCA). To evaluate the effectiveness of LRCA, we build a series of large-scale data sets with 105,177 bug reports. Experiments on both the public SDS data set, with 36 manually annotated bug reports, and the new large-scale data sets demonstrate that LRCA consistently outperforms state-of-the-art algorithms for bug report summarization.


💡 Research Summary

The paper tackles the problem of automatically summarizing bug reports, a task that has become increasingly important as software projects generate large volumes of defect tickets. While many prior works have focused on natural‑language‑processing techniques—such as centroid‑based extraction, graph‑based ranking (LexRank), statistical summarizers (SumBasic), and more recently neural sequence‑to‑sequence models—the authors observe that the performance of supervised summarizers is heavily dependent on the quality and relevance of the features (attributes) used to train them. Existing literature, however, lacks a systematic investigation into how these attributes are constructed; most approaches rely on ad‑hoc expert intuition or simple statistical measures (e.g., TF‑IDF, sentence position).

To fill this gap, the authors first conduct a survey of attribute‑construction practices in mining software repositories, confirming that the community has not yet established a principled methodology. They then introduce Crowd‑Attribute, a novel framework that leverages crowdsourcing to elicit human‑perceived importance signals from a large pool of non‑expert contributors. The process consists of (1) extracting individual sentences from bug reports, (2) presenting them on a dedicated crowdsourcing platform (the Crowdsourcing Software Engineering Platform) where workers rate each sentence on dimensions such as relevance, type (reproduction steps, expected behavior, actual behavior, etc.), presence of code snippets, and overall importance, and (3) aggregating these noisy judgments using a Dawid‑Skene‑style consensus model to produce reliable scores.
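The aggregation step can be illustrated with a minimal binary Dawid‑Skene EM sketch. This is a hypothetical simplification, not the platform's actual consensus model: each worker is modeled by a sensitivity/specificity pair, each item by a posterior over its true binary label, and EM alternates between the two.

```python
import numpy as np

def dawid_skene(labels, n_iter=50):
    """Minimal binary Dawid-Skene EM for aggregating crowd judgments.

    labels: dict mapping (item, worker) -> 0/1 judgment.
    Returns {item: posterior probability that its true label is 1}.
    """
    items = sorted({i for i, _ in labels})
    workers = sorted({w for _, w in labels})
    ii = {v: k for k, v in enumerate(items)}
    wi = {v: k for k, v in enumerate(workers)}

    # Initialise item posteriors with the per-item majority vote.
    post = np.zeros(len(items))
    counts = np.zeros(len(items))
    for (i, w), y in labels.items():
        post[ii[i]] += y
        counts[ii[i]] += 1
    post /= counts

    for _ in range(n_iter):
        # M-step: class prior plus each worker's sensitivity/specificity,
        # with tiny pseudo-counts to keep the rates strictly inside (0, 1).
        prior = post.mean()
        n1 = np.full(len(workers), 1e-6); d1 = np.full(len(workers), 2e-6)
        n0 = np.full(len(workers), 1e-6); d0 = np.full(len(workers), 2e-6)
        for (i, w), y in labels.items():
            p = post[ii[i]]
            n1[wi[w]] += p * y;             d1[wi[w]] += p
            n0[wi[w]] += (1 - p) * (1 - y); d0[wi[w]] += 1 - p
        sens, spec = n1 / d1, n0 / d0
        # E-step: recompute item posteriors from worker reliabilities.
        log1 = np.full(len(items), np.log(prior + 1e-12))
        log0 = np.full(len(items), np.log(1 - prior + 1e-12))
        for (i, w), y in labels.items():
            s, t = sens[wi[w]], spec[wi[w]]
            log1[ii[i]] += np.log(s if y else 1 - s)
            log0[ii[i]] += np.log(1 - t if y else t)
        post = 1.0 / (1.0 + np.exp(log0 - log1))
    return dict(zip(items, post))
```

With three workers, an item judged 1 by two reliable workers and 0 by a noisier one still ends up with a posterior above 0.5, which is the point of weighting workers rather than counting raw votes.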

From this aggregated data the authors derive eleven new attributes, including:

  1. Keyword density normalized by sentence length.
  2. Structural location within the report (e.g., introduction, reproduction, result).
  3. Crowd‑derived importance score (vote‑based weight).
  4. Temporal proximity between the report’s creation time and related commits/patches.
  5. Presence of code snippets.
  6. Syntactic complexity (parse‑tree depth).
  7. Agreement level among crowd labels.
  8. Position of keywords inside the sentence.
  9. Raw sentence length.
  10. Relative length of the sentence compared to the whole report.
  11. Frequency of similar sentences across the corpus.
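A few of the surface-level attributes above (keyword density, raw and relative length, code-snippet presence) might be computed roughly as follows. The function name and the exact definitions are hypothetical illustrations, not the paper's formulas:

```python
import re

def sentence_attributes(sentence, report_sentences, keywords):
    """Sketch of a few surface attributes for one sentence of a bug report.

    sentence: the sentence to score; report_sentences: all sentences in
    the report; keywords: a lowercased set of project keywords.
    """
    tokens = sentence.split()
    n = len(tokens)
    total = sum(len(s.split()) for s in report_sentences) or 1
    kw = sum(1 for t in tokens if t.lower().strip(".,;:") in keywords)
    return {
        "raw_length": n,                          # attribute 9
        "relative_length": n / total,             # attribute 10
        "keyword_density": kw / n if n else 0.0,  # attribute 1
        # Crude code-snippet test: backtick spans or call-like tokens.
        "has_code_snippet": bool(re.search(r"`[^`]+`|\w+\(\)", sentence)),  # attribute 5
    }
```

The crowd-derived attributes (3, 7) would come from the aggregated worker judgments rather than from the text itself.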

These attributes are deliberately designed to capture human intuition about what makes a sentence “summary‑worthy” in the bug‑reporting context, something that pure lexical statistics often miss.

The next contribution is a supervised summarization model named Logistic Regression with Crowdsourced Attributes (LRCA). Logistic regression is chosen for its interpretability and its ability to expose the contribution of each attribute directly. LRCA is trained on a binary classification task: each sentence is labeled as “summary” or “non‑summary” based on existing gold‑standard annotations (or, for the large dataset, proxy labels derived from linked commit messages). The model outputs a probability for each sentence, and the top‑k sentences are selected as the final summary.
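A toy sketch of that pipeline, assuming scikit-learn and an invented three-attribute feature matrix (the paper's real feature vectors have eleven dimensions and come from Crowd-Attribute):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: one row per sentence, columns are attributes
# (e.g. crowd importance, relative length, keyword density).
X_train = np.array([[0.9, 0.2, 0.5], [0.1, 0.8, 0.0],
                    [0.8, 0.3, 0.4], [0.2, 0.9, 0.1]])
y_train = np.array([1, 0, 1, 0])  # 1 = summary sentence, 0 = not

model = LogisticRegression().fit(X_train, y_train)

def summarize(X_report, k=2):
    """Score each sentence of a report and keep the top-k as the summary."""
    probs = model.predict_proba(X_report)[:, 1]
    return np.argsort(probs)[::-1][:k]
```

The top-k selection makes this an extractive summarizer: sentences are ranked by predicted probability and the highest-scoring ones are returned verbatim.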

Experimental evaluation proceeds in two stages. First, the authors use the public SDS dataset, which contains 36 bug reports manually annotated by experts. On this small benchmark, LRCA outperforms baseline extractive methods (Centroid, LexRank, SumBasic) and a state‑of‑the‑art neural summarizer (Seq2Seq‑Attention) across ROUGE‑1, ROUGE‑2, F‑score, and MAP. Second, they construct a massive dataset of 105,177 bug reports collected from GitHub, Bugzilla, and other issue‑tracking systems. Since manual annotation at this scale is infeasible, they generate proxy labels by aligning bug reports with subsequent fixing commits and using the commit messages as a weak supervision signal. Using 5‑fold cross‑validation, LRCA consistently achieves higher scores than all baselines, with an average ROUGE‑1 improvement of roughly 6–8 percentage points.
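ROUGE-1 itself is just unigram overlap between a candidate summary and a reference; a minimal F-score implementation (no stemming or stopword handling, unlike full ROUGE toolkits) looks like:

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """Minimal ROUGE-1 F-score: clipped unigram overlap between texts."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())  # per-word counts clipped by the min
    if not overlap:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 is the same computation over bigrams instead of unigrams.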

A detailed feature‑importance analysis reveals that the crowd‑derived importance score and temporal proximity are the most predictive attributes, confirming the hypothesis that human perception of relevance and the timing relationship between bug reports and fixes are crucial for summarization. The authors also discuss the cost of crowdsourcing: quality control mechanisms (gold questions, consensus modeling) are required to mitigate noisy contributions, but the overall expense is justified by the performance gains.
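Because logistic regression is linear, this kind of attribute-importance analysis can be read directly off the learned coefficients once the features share a common scale. A sketch with invented attribute names and toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical attributes; standardizing first makes the coefficient
# magnitudes comparable across features with different units.
names = ["crowd_importance", "temporal_proximity", "raw_length"]
X = np.array([[0.9, 0.8, 12.0], [0.2, 0.1, 30.0],
              [0.8, 0.9, 10.0], [0.1, 0.2, 28.0]])
y = np.array([1, 0, 1, 0])

Xs = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(Xs, y)

# Rank attributes by absolute coefficient: sign gives the direction,
# magnitude the strength of each attribute's contribution.
ranking = sorted(zip(names, model.coef_[0]),
                 key=lambda p: abs(p[1]), reverse=True)
```

In this toy setup the crowd and temporal attributes get positive weights (they push a sentence toward "summary") while raw length gets a negative one.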

Limitations and future work are openly acknowledged. The reliance on non‑expert crowd workers may introduce bias, especially for highly technical bug reports where domain knowledge matters. The authors propose integrating a small expert verification step and exploring more sophisticated label‑trust models. They also suggest extending Crowd‑Attribute to other SE artifacts such as code review comments, design documents, and requirement specifications, where similar summarization challenges exist.

In summary, this paper makes three primary contributions: (1) a systematic survey of attribute‑construction methods in software repository mining, (2) the Crowd‑Attribute framework that extracts eleven effective, human‑centred features via crowdsourcing, and (3) the LRCA model that demonstrates, on both a small expert‑annotated benchmark and a large‑scale weakly supervised corpus, superior bug‑report summarization performance compared to existing extractive and neural baselines. The work bridges the gap between human intuition and machine learning feature engineering, offering a scalable path toward more accurate and useful bug‑report summaries in real‑world software maintenance.

