Predicting Defective Lines Using a Model-Agnostic Technique


💡 Research Summary

Software Quality Assurance (SQA) teams must prioritize limited resources to locate post‑release defects efficiently. While many studies have built defect prediction models at the file, method, or package level, empirical analysis shows that only 1%–3% of the lines in a defective file actually contain defects. Consequently, file‑level predictions cause developers to inspect up to 99% clean code, wasting effort. To address this granularity gap, the authors propose LINE‑DP, a novel line‑level defect prediction framework that leverages a model‑agnostic Explainable AI technique (LIME).

LINE‑DP operates in three stages. First, a conventional file‑level defect classifier is trained using code‑token features (e.g., token frequency, TF‑IDF). The classifier can be any standard learner such as logistic regression or random forest; the authors use both in their experiments. Second, LIME is applied to each file predicted as defective. LIME perturbs the token vector, fits a locally linear surrogate model, and ranks tokens by their contribution to the defective prediction. These "risky tokens" are interpreted as the linguistic elements that most strongly drive the model's decision. Third, any source‑code line that contains at least one risky token is flagged as a defective line. The intuition is that tokens frequently appearing in historically defective files are likely to appear in the specific lines that will later be changed by bug‑fixing commits.
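The third stage above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the token weights below are hypothetical stand-ins for LIME's output (in LINE‑DP, they come from the locally fitted surrogate model), and the tokenizer is a simple identifier regex.

```python
# Minimal sketch of LINE-DP's third stage: mapping "risky tokens" back to
# source lines. The weights below are hypothetical stand-ins for LIME's
# output; in the paper, LIME ranks tokens by their contribution to the
# file-level classifier's "defective" prediction.
import re

def flag_defective_lines(source_lines, risky_tokens):
    """Flag any line containing at least one risky token."""
    flagged = []
    for lineno, line in enumerate(source_lines, start=1):
        tokens = set(re.findall(r"[A-Za-z_]\w*", line))
        if tokens & risky_tokens:
            flagged.append(lineno)
    return flagged

# Hypothetical LIME weights: keep only tokens that pushed the prediction
# toward "defective" (positive contribution).
lime_weights = {"lock": 0.31, "wait": 0.24, "close": 0.18, "println": -0.05}
risky = {tok for tok, w in lime_weights.items() if w > 0}

code = [
    "public void run() {",
    "    lock.acquire();",
    "    buffer.wait();",
    '    System.out.println("done");',
    "}",
]
print(flag_defective_lines(code, risky))  # -> [2, 3]
```

Note that a token with a negative contribution (here `println`) is not treated as risky, so the logging line is left unflagged even though it appears in a file predicted defective.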

The empirical evaluation covers 32 releases of nine popular Java open‑source projects (ActiveMQ, Camel, Derby, Groovy, HBase, Hive, JRuby, Lucene, Wicket). Defective lines are identified as lines removed by post‑release bug‑fix commits. The study adopts both within‑release (training and testing on the same release) and cross‑release (training on earlier releases, testing on later ones) validation settings. Six baselines are compared: random guessing, an NLP‑based approach, two static analysis tools (Google's ErrorProne and PMD), and two traditional model‑interpretation methods using logistic regression and random forest.
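The cross‑release setting pairs each release with the one that follows it, so the model is always evaluated on data newer than its training data. A small sketch, with hypothetical release labels:

```python
# Sketch of the cross-release validation setting: train on release i,
# test on release i+1. Release names are hypothetical examples; in the
# within-release setting, training and test data instead come from a
# split of the same release.
def cross_release_pairs(releases):
    """Return (train_release, test_release) pairs in chronological order."""
    return [(releases[i], releases[i + 1]) for i in range(len(releases) - 1)]

releases = ["activemq-5.0.0", "activemq-5.1.0", "activemq-5.2.0"]
print(cross_release_pairs(releases))
# -> [('activemq-5.0.0', 'activemq-5.1.0'), ('activemq-5.1.0', 'activemq-5.2.0')]
```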

Results show that LINE‑DP achieves an average recall of 0.61, a false‑alarm rate of 0.47, a top‑20% LOC recall of 0.27, and an initial false alarm (IFA) count of 16. All of these metrics are statistically significantly better than those of the baselines. Moreover, 63% of the lines identified by LINE‑DP belong to common defect categories such as argument changes or condition changes, indicating that the risky tokens indeed capture semantically meaningful defect patterns. The total computation time, including model construction and line identification, is roughly 10 seconds per release (10.68 s within‑release, 8.46 s cross‑release), demonstrating practical feasibility.
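The two effort‑aware metrics above can be made concrete with a short sketch. Given lines ranked by predicted risk, top‑20% LOC recall is the fraction of all defective lines found in the first 20% of lines inspected, and IFA is the number of clean lines inspected before hitting the first defective one. The ranking below is a hypothetical example, not data from the paper.

```python
# Minimal sketch of two effort-aware metrics, computed over a ranked list
# where True marks a defective line. The ranking here is hypothetical.
def top_k_loc_recall(ranked_is_defective, k=0.20):
    """Fraction of all defective lines found in the top k% of ranked lines."""
    n_inspect = int(len(ranked_is_defective) * k)
    found = sum(ranked_is_defective[:n_inspect])
    total = sum(ranked_is_defective)
    return found / total if total else 0.0

def initial_false_alarms(ranked_is_defective):
    """IFA: clean lines inspected before the first defective line."""
    for i, defective in enumerate(ranked_is_defective):
        if defective:
            return i
    return len(ranked_is_defective)

# Hypothetical ranking of 10 lines by predicted risk (True = defective).
ranking = [False, True, True, False, False, True, False, False, False, False]
print(top_k_loc_recall(ranking))      # top 20% = first 2 lines, finds 1 of 3
print(initial_false_alarms(ranking))  # -> 1
```

Lower IFA is better (fewer wasted inspections before the first hit), while higher top‑20% LOC recall is better, which is why the paper reports both alongside recall and the false‑alarm rate.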

The authors discuss several threats to validity. Token‑based features may be language‑specific and could miss structural cues; LIME provides an approximation that may not perfectly reflect the true causal factors; and the line‑level ground truth relies solely on bug‑fix commits, potentially overlooking latent defects. Future work is suggested to incorporate richer syntactic/semantic representations (e.g., abstract syntax trees, data‑flow graphs) and to explore alternative XAI methods such as SHAP or Integrated Gradients for more robust explanations.

In summary, this paper introduces the first use of a model‑agnostic XAI technique for line‑level defect prediction. By translating a file‑level classifier’s decisions into token‑level explanations and mapping them back to source lines, LINE‑DP substantially reduces the inspection effort required by SQA teams while maintaining competitive predictive performance. The work represents a significant step toward finer‑grained, explainable defect prediction and opens avenues for integrating explainability into automated code review and maintenance pipelines.
