Fault Localization Using Textual Similarities
Maintenance is a dominant component of software cost, and localizing reported defects is a significant component of maintenance. We propose a scalable approach that leverages the natural language present in both defect reports and source code to identify files that are potentially related to the defect in question. Our technique is language-independent and does not require test cases. The approach represents reports and code as separate structured documents and ranks source files based on a document similarity metric that leverages inter-document relationships. We evaluate the fault-localization accuracy of our method against both lightweight baseline techniques and reported results from state-of-the-art tools. In an empirical evaluation of 5,345 historical defects from programs totaling 6.5 million lines of code, our approach reduced the number of files inspected per defect by over 91%. Additionally, we qualitatively and quantitatively examine the utility of the textual and surface features used by our approach.
💡 Research Summary
Software maintenance consumes a large portion of the total cost of a software system, and fault localization – the process of identifying the source files that are responsible for a reported defect – is one of the most expensive activities within maintenance. Traditional fault‑localization techniques rely heavily on dynamic information such as test cases, execution traces, or program spectra. However, many real‑world projects lack comprehensive test suites, and the overhead of collecting and processing dynamic data can be prohibitive for large code bases.
The paper introduces a scalable, language‑independent approach that exploits the natural‑language artifacts already present in both defect reports and source code. The central hypothesis is that the textual content of a bug report (title, description, steps to reproduce, environment details) often shares terminology with the parts of the source code that are relevant to the defect (class and method names, comments, string literals, identifiers). By treating each report and each source file as a structured document composed of several fields, the authors can apply information‑retrieval techniques to compute a similarity score that reflects how closely a file’s textual material matches the report’s language.
Document Modeling
- Bug reports are split into fields such as Title, Description, Reproduction Steps, and Environment.
- Source files are decomposed into File‑Name, Package/Path, Class/Method Signatures, Comments, String Literals, and Identifier Tokens.
Each field is tokenized, normalized (lower‑casing, stop‑word removal, stemming), and represented as a TF‑IDF weighted vector. A field‑to‑field weight matrix is defined a priori to capture domain knowledge about which pairs of fields are more semantically related (e.g., high weight for Report‑Keywords ↔ Code‑Comments, moderate weight for Report‑Title ↔ Code‑File‑Name).
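This preprocessing pipeline can be sketched in a few lines of Python. The stop-word list below is an illustrative subset and stemming is omitted for brevity; a real implementation would use a full stop-word list and a stemmer such as Porter's.

```python
import math
import re
from collections import Counter

# Illustrative subset only; a real system would use a complete stop-word list.
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "in", "and"}

def tokenize(text):
    """Lower-case, split on non-alphanumeric characters, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def tf_idf_vectors(field_texts):
    """Build a sparse TF-IDF vector (term -> weight) for each field document."""
    docs = [Counter(tokenize(t)) for t in field_texts]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)  # document frequency
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            term: (count / total) * math.log(n / df[term])  # TF * IDF
            for term, count in doc.items()
        })
    return vectors
```

Terms that appear in every document receive an IDF of zero and thus contribute nothing to similarity, which is the intended behavior for ubiquitous vocabulary.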
Similarity Computation
For a given bug report R and source file S, the similarity is computed as a weighted sum of cosine similarities across all field pairs:
Sim(R,S) = Σ_{i,j} w_{i,j} * cos( v_i(R), v_j(S) )
where v_i(R) and v_j(S) are the TF‑IDF vectors for the i‑th report field and j‑th source field, respectively, and w_{i,j} is the corresponding weight. The authors also incorporate a modest “module cohesion” factor that boosts similarity for files that belong to the same package or directory, reflecting the intuition that defects often affect closely related components.
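A minimal sketch of this scoring function, assuming the field vectors are represented as sparse term-to-weight dictionaries and the weight matrix as a dictionary keyed by field-name pairs (the cohesion factor here is a simple multiplicative boost; the paper's exact formulation may differ):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term->weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity(report_fields, source_fields, weights, cohesion_boost=1.0):
    """Sim(R, S) = sum over field pairs (i, j) of w[i][j] * cos(v_i(R), v_j(S)),
    scaled by an optional module-cohesion factor (hypothetical formulation)."""
    score = 0.0
    for i, report_vec in report_fields.items():
        for j, source_vec in source_fields.items():
            score += weights.get((i, j), 0.0) * cosine(report_vec, source_vec)
    return score * cohesion_boost
```

Field pairs absent from the weight matrix default to zero weight, so only the pairs judged semantically related a priori influence the score.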
The resulting similarity scores are sorted, and the top‑k files are presented to the developer as the most likely locations of the fault. Because the method uses only static textual information, it can be applied to any programming language without custom parsers; the only language‑specific step is the tokenization of identifiers, which can be handled by generic lexical analysis tools.
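That identifier-tokenization step typically amounts to splitting on underscores and camel-case boundaries. A plausible implementation (the exact splitting rules used by the authors are not specified here):

```python
import re

def split_identifier(identifier):
    """Split a source identifier into word tokens, handling snake_case,
    camelCase, and embedded acronyms (e.g. 'parseXMLFile')."""
    parts = re.split(r"[_\W]+", identifier)
    tokens = []
    for part in parts:
        # Break camel-case and acronym boundaries:
        # 'parseXMLFile' -> 'parse', 'XML', 'File'
        tokens += re.findall(
            r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+", part)
    return [t.lower() for t in tokens if t]
```

Because this relies only on lexical conventions shared across mainstream languages, no per-language parser is required.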
Experimental Evaluation
The technique was evaluated on six open‑source projects (including Apache Ant, Eclipse JDT, and Mozilla) comprising a total of 5,345 historical defects and approximately 6.5 million lines of code. The authors compared their approach against three baselines:
- Keyword Matching – a simple search for report keywords in source files.
- File‑Name/Path Search – ranking files based on lexical overlap with the report title.
- State‑of‑the‑Art ML Tool – a recent machine‑learning based fault‑localization system that uses both textual and dynamic features.
Evaluation metrics included Top‑k inclusion rate (the proportion of defects for which the faulty file appears in the top k results), Mean Reciprocal Rank (MRR), and the average number of files a developer would need to inspect before locating the fault.
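These three metrics are straightforward to compute from the ranked lists. A small sketch, assuming each defect's faulty file appears somewhere in its ranking (defects whose faulty file is missing would need separate handling):

```python
def evaluate(rankings, faulty_files, k=5):
    """Compute Top-k inclusion rate, Mean Reciprocal Rank, and the mean
    number of files inspected before reaching the faulty file.

    rankings:     one ranked list of file names per defect
    faulty_files: the actual faulty file for each defect
    """
    # 1-based rank of the faulty file within each defect's ranked list
    ranks = [ranked.index(faulty) + 1
             for ranked, faulty in zip(rankings, faulty_files)]
    n = len(ranks)
    return {
        "top_k": sum(r <= k for r in ranks) / n,
        "mrr": sum(1.0 / r for r in ranks) / n,
        "mean_inspected": sum(ranks) / n,
    }
```

MRR rewards placing the faulty file near the top (rank 1 contributes 1.0, rank 10 only 0.1), while mean files inspected directly models the developer effort the paper reports.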
Results
- The proposed method placed the faulty file within the top‑5 results for 91% of the defects, a substantial improvement over the keyword baseline (≈58%) and the file‑name baseline (≈62%).
- MRR increased from 0.68 (baseline) to 0.84, indicating that relevant files are ranked significantly higher on average.
- The average number of files inspected per defect dropped from ≈30 (baseline) to ≈4, representing a >91% reduction in inspection effort.
- Feature‑importance analysis showed that comments and identifier tokens contributed the most to the similarity score, while file‑name and path information provided modest additional signal.
Strengths and Limitations
The approach’s primary strengths are its language independence, lack of dependence on test suites, and low computational overhead (ranking a defect takes roughly 2.3 seconds on a commodity workstation). It is therefore well‑suited for legacy systems, projects with sparse testing, or environments where rapid triage is essential.
However, the method is highly dependent on the quality of the bug report; poorly written or overly terse reports yield weak textual signals and degrade performance. Moreover, because it does not incorporate dynamic execution data, it may struggle with defects that manifest only under specific runtime conditions (e.g., concurrency bugs, memory corruption). The authors acknowledge these limitations and suggest future work that integrates automatic report summarization, topic modeling of source code, and hybrid combinations with dynamic profiling.
Conclusion
The paper demonstrates that a carefully engineered textual similarity model can dramatically reduce the manual effort required for fault localization, achieving over a 91% reduction in inspected files across a large, diverse set of real‑world defects. By leveraging existing natural‑language artifacts and avoiding language‑specific parsing, the technique offers a practical, scalable solution for many software maintenance scenarios. The authors’ extensive empirical evaluation validates the approach against both lightweight baselines and sophisticated state‑of‑the‑art tools, establishing textual similarity as a powerful, under‑exploited resource in the fault‑localization toolbox. Future extensions that blend static textual cues with dynamic execution information promise to further enhance accuracy and broaden applicability.