Predicting Defective Lines Using a Model-Agnostic Technique
Research Summary
Software Quality Assurance (SQA) teams must prioritize limited resources to locate post-release defects efficiently. While many studies have built defect prediction models at the file, method, or package level, empirical analysis shows that only 1%–3% of the lines in a defective file actually contain defects. Consequently, file-level predictions can lead developers to inspect up to 99% of clean code, wasting effort. To address this granularity gap, the authors propose LINE-DP, a novel line-level defect prediction framework that leverages a model-agnostic Explainable AI technique (LIME).
LINE-DP operates in three stages. First, a conventional file-level defect classifier is trained using code-token features (e.g., token frequency, TF-IDF); the classifier can be any standard learner, such as logistic regression or random forest, and the authors use both in their experiments. Second, LIME is applied to each file predicted as defective: it perturbs the token vector, fits a locally linear surrogate model, and ranks tokens by their contribution to the defective prediction. These "risky tokens" are interpreted as the linguistic elements that most strongly drive the model's decision. Third, any source-code line that contains at least one risky token is flagged as a defective line. The intuition is that tokens frequently appearing in historically defective files are likely to appear in the specific lines that will later be changed by bug-fixing commits.
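The three stages can be sketched in Python. The following is a minimal, self-contained illustration, not the authors' implementation: the file-level classifier is replaced by a toy sigmoid scorer whose vocabulary and weights are made up for the example, and the local surrogate is fitted with plain weighted least squares instead of the `lime` library.

```python
import numpy as np

# Hypothetical token vocabulary and classifier weights (assumptions for
# illustration only; LINE-DP trains a real classifier on token features).
VOCAB = ["strcpy", "malloc", "if", "return", "free", "printf"]
WEIGHTS = np.array([2.0, 1.5, 0.1, -0.5, 1.0, -1.0])

def predict_defect_prob(X):
    """Stage 1 stand-in: sigmoid over token-presence features."""
    return 1.0 / (1.0 + np.exp(-(X @ WEIGHTS - 1.0)))

def lime_rank_tokens(x, n_samples=2000, seed=0):
    """Stage 2, LIME-style: perturb the token vector, query the model,
    fit a weighted linear surrogate, rank tokens by positive contribution."""
    rng = np.random.default_rng(seed)
    mask = rng.integers(0, 2, size=(n_samples, len(x)))  # random on/off switches
    Z = mask * x                                         # perturbed neighbours
    y = predict_defect_prob(Z)
    # Proximity kernel: neighbours that drop fewer tokens count more.
    dist = np.abs(Z - x).sum(axis=1) / max(x.sum(), 1)
    w = np.exp(-dist)
    # Weighted least squares for the surrogate coefficients (plus intercept).
    A = np.hstack([Z, np.ones((n_samples, 1))]) * np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(A, y * np.sqrt(w), rcond=None)
    order = np.argsort(-coef[:-1])                       # drop intercept
    return [VOCAB[i] for i in order if x[i] > 0 and coef[i] > 0]

def flag_defective_lines(lines, risky_tokens):
    """Stage 3: flag any line containing at least one risky token."""
    risky = set(risky_tokens)
    return [i for i, line in enumerate(lines, 1)
            if any(tok in line.split() for tok in risky)]

# Example usage on a 3-line snippet:
source = ["buf = malloc ( 10 )", "strcpy ( buf , s )", "return 0"]
x = np.array([1.0 if t in " ".join(source).split() else 0.0 for t in VOCAB])
risky = lime_rank_tokens(x)[:2]
print(flag_defective_lines(source, risky))  # line numbers holding risky tokens
```

Because the toy classifier assigns its largest positive weights to `strcpy` and `malloc`, the surrogate ranks those as the risky tokens, and the lines containing them are flagged.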
The empirical evaluation covers 32 releases of nine popular Java open-source projects (ActiveMQ, Camel, Derby, Groovy, HBase, Hive, JRuby, Lucene, Wicket). Defective lines are identified as lines removed by post-release bug-fix commits. The study adopts both within-release (training and testing on the same release) and cross-release (training on earlier releases, testing on later ones) validation settings. Six baselines are compared: random guessing, an NLP-based approach, two static analysis tools (Google's ErrorProne and PMD), and two traditional model-interpretation methods using logistic regression and random forest.
Results show that LINE-DP achieves an average recall of 0.61, a false-alarm rate of 0.47, a top-20% LOC recall of 0.27, and an initial false-alarm count of 16; on all of these metrics it is statistically significantly better than the baselines. Moreover, 63% of the lines identified by LINE-DP belong to common defect categories such as argument changes or condition changes, indicating that the risky tokens indeed capture semantically meaningful defect patterns. The total computation time, including model construction and line identification, is roughly 10 seconds per release (10.68 s within-release, 8.46 s cross-release), demonstrating practical feasibility.
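The effort-aware metrics above can be computed generically: rank lines by predicted risk, then measure how many defective lines fall inside a fixed inspection budget, and how many clean lines a developer would read before hitting the first real defect. The sketch below is a standard formulation of top-k% LOC recall and initial false alarms, not necessarily the authors' exact implementation.

```python
def top_k_loc_recall(scores, labels, k=0.20):
    """Fraction of all defective lines found when inspecting only the
    top k fraction of lines, ranked by predicted risk (highest first)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    budget = max(1, int(len(scores) * k))          # lines we can afford to read
    found = sum(labels[i] for i in order[:budget])
    total = sum(labels)
    return found / total if total else 0.0

def initial_false_alarms(scores, labels):
    """Number of clean lines inspected, in ranked order, before reaching
    the first truly defective line."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    false_alarms = 0
    for i in order:
        if labels[i]:
            return false_alarms
        false_alarms += 1
    return false_alarms

# Example: 10 lines, 2 of them defective (labels), with predicted risk scores.
scores = [0.9, 0.1, 0.8, 0.7, 0.2, 0.6, 0.5, 0.4, 0.3, 0.05]
labels = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
print(top_k_loc_recall(scores, labels))   # 0.5: one of two defects in top 20%
print(initial_false_alarms(scores, labels))  # 1 clean line read first
```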
The authors discuss several threats to validity. Token-based features may be language-specific and could miss structural cues; LIME provides an approximation that may not perfectly reflect the true causal factors; and the line-level ground truth relies solely on bug-fix commits, potentially overlooking latent defects. Future work is suggested to incorporate richer syntactic/semantic representations (e.g., abstract syntax trees, data-flow graphs) and to explore alternative XAI methods such as SHAP or Integrated Gradients for more robust explanations.
In summary, this paper introduces the first use of a model-agnostic XAI technique for line-level defect prediction. By translating a file-level classifier's decisions into token-level explanations and mapping them back to source lines, LINE-DP substantially reduces the inspection effort required by SQA teams while maintaining competitive predictive performance. The work represents a significant step toward finer-grained, explainable defect prediction and opens avenues for integrating explainability into automated code review and maintenance pipelines.