Automating Requirements Traceability: Two Decades of Learning from KDD

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

This paper summarizes our experience with using Knowledge Discovery in Data (KDD) methodology for automated requirements tracing, and discusses our insights.


Research Summary

The paper presents a two‑decade retrospective on applying Knowledge Discovery in Data (KDD) techniques to the problem of automated requirements traceability. It begins by emphasizing the critical role of traceability in software engineering—supporting impact analysis, verification, validation, and regulatory compliance—while noting that manual tracing is labor‑intensive, error‑prone, and increasingly untenable for large, evolving systems. The authors argue that a data‑driven, systematic approach is essential to scale traceability efforts.

The core contribution is a fully articulated KDD pipeline tailored to traceability. The first stage, data collection and preprocessing, integrates heterogeneous artifacts such as requirements specifications, design documents, source code, test cases, and issue‑tracker entries. The authors describe a robust preprocessing workflow that includes Unicode normalization, language detection, tokenization, stop‑word removal, stemming/lemmatization, and the construction of domain‑specific ontologies to bridge lexical gaps between natural‑language requirements and technical artifacts.
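The text-cleaning steps named above can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list, the `crude_stem` helper, and the `preprocess` function are hypothetical names invented for this example, and the suffix stripping is only a stand-in for a real stemmer or lemmatizer. Language detection and ontology construction are left out.

```python
import re
import unicodedata

# Tiny illustrative stop-word list (an assumption); real pipelines use fuller lists.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "shall", "be", "is", "in"}


def crude_stem(token: str) -> str:
    """Strip a few common English suffixes (stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Normalize, tokenize, drop stop words, and stem one artifact's text."""
    # NFKC folds compatibility characters such as ligatures and full-width letters.
    normalized = unicodedata.normalize("NFKC", text).lower()
    tokens = re.findall(r"[a-z0-9]+", normalized)
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The system shall log all failed logins.")
```

Applying the same function to requirements and to code comments or test descriptions puts both sides of a candidate link into a shared vocabulary, which is the point of this stage.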

In the feature extraction stage, the paper evaluates classic bag‑of‑words/TF‑IDF representations against modern embedding methods, including Word2Vec, FastText, and several transformer‑based models (BERT, RoBERTa). Experiments demonstrate that a domain‑fine‑tuned BERT model yields the highest semantic similarity scores, improving link prediction accuracy by roughly 8% over TF‑IDF. To capture structural relationships, the authors model the artifact ecosystem as a graph and apply Graph Neural Networks (GCN, GAT), which prove especially effective for many‑to‑many trace links and cyclic dependencies.
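To make the TF‑IDF baseline concrete, the sketch below ranks candidate artifacts for one requirement by cosine similarity over TF‑IDF vectors. It is a plain-Python illustration under stated assumptions, not the paper's code: the function names, the smoothed IDF formula, and the toy documents `R1`, `C1`, and `C2` are all chosen for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document id to a sparse {term: weight} TF-IDF vector."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        # Smoothed IDF (an assumption): ubiquitous terms keep a small positive weight.
        vectors[doc_id] = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(vectors, source_id, target_ids):
    """Return (target_id, score) pairs, most similar first."""
    scored = [(t, cosine(vectors[source_id], vectors[t])) for t in target_ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy corpus: one requirement and two code artifacts.
docs = {
    "R1": ["user", "login", "password"],
    "C1": ["login", "password", "hash"],
    "C2": ["report", "export", "pdf"],
}
ranking = rank_candidates(tfidf_vectors(docs), "R1", ["C1", "C2"])
```

An embedding-based variant would keep `cosine` and `rank_candidates` unchanged and swap `tfidf_vectors` for dense vectors from a language model, which is why lexical and semantic representations are easy to compare within the same pipeline.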

The learning stage explores three paradigms: supervised, semi‑supervised, and unsupervised. When ample labeled trace links are available, ensemble classifiers such as Gradient Boosting and XGBoost achieve the best precision‑recall balance. In low‑label scenarios, semi‑supervised techniques—label propagation, co‑training, and self‑training—extend the sparse ground truth, reducing labeling effort by more than 60 %. Unsupervised clustering (k‑means, DBSCAN) and topic modeling (LDA) are used to generate candidate link clusters for expert review.
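As an illustration of the semi‑supervised idea, here is a minimal label propagation sketch over a similarity graph of candidate links. It assumes each node is a candidate (requirement, artifact) pair, edges carry a similarity weight, and a few nodes have expert labels (1.0 for a true link, 0.0 for a false one). The function name, the graph layout, and the fixed iteration count are assumptions made for this example and may differ from what the authors used.

```python
def propagate_labels(weights, seeds, iterations=50):
    """Spread seed labels over a weighted graph by repeated neighbor averaging.

    weights: {node: {neighbor: weight}}; every neighbor must also be a key.
    seeds:   {node: 0.0 or 1.0} for the expert-labeled candidate links.
    Returns a score in [0, 1] per node; higher means more likely a true link.
    """
    scores = {node: seeds.get(node, 0.5) for node in weights}
    for _ in range(iterations):
        updated = {}
        for node, neighbors in weights.items():
            if node in seeds:
                updated[node] = seeds[node]  # labeled nodes stay clamped
                continue
            total = sum(neighbors.values())
            if total:
                updated[node] = sum(w * scores[m] for m, w in neighbors.items()) / total
            else:
                updated[node] = scores[node]  # isolated node keeps its prior
        scores = updated
    return scores


# Hypothetical chain of four candidate links: a - b - c - d.
graph = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
scores = propagate_labels(graph, seeds={"a": 1.0, "d": 0.0})
```

Unlabeled candidates closer to a confirmed link end up with higher scores, so an analyst can review only those near a decision threshold instead of labeling everything by hand.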

Evaluation goes beyond standard Precision@k and Recall@k by introducing “Traceability Coverage” (the proportion of requirements automatically linked) and “Link Stability” (the persistence of links across version changes). The pipeline is validated on two industrial partners (an aerospace system integrator and a financial transaction platform) and three public datasets (NASA, PROMISE, TRACE). Results show an average F1 score of 0.86, a 15 % gain in multi‑link scenarios when using graph‑based models, and a 70 % reduction in manual effort compared with traditional practices.
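The metrics named above could be computed along the following lines. Precision@k and Recall@k follow their usual definitions. The summary gives only one-line descriptions of Traceability Coverage and Link Stability, so the two formulas below (share of requirements with at least one link, and share of earlier links that survive into the next version) are assumptions for illustration and may differ from the paper's exact definitions.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked candidates that are true links."""
    if k <= 0:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all true links that appear in the top-k candidates."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def traceability_coverage(requirements, links):
    """Assumed definition: share of requirements with at least one link."""
    if not requirements:
        return 0.0
    linked = {req for req, _artifact in links}
    return sum(1 for req in requirements if req in linked) / len(requirements)


def link_stability(old_links, new_links):
    """Assumed definition: share of the old version's links still present."""
    if not old_links:
        return 1.0
    return len(set(old_links) & set(new_links)) / len(set(old_links))


# Hypothetical data: links are (requirement, artifact) pairs.
ranked = ["C1", "C3", "C2", "C4"]
relevant = {"C1", "C2"}
v1_links = {("R1", "C1"), ("R1", "C2"), ("R3", "C4")}
v2_links = {("R1", "C1"), ("R3", "C4"), ("R2", "C5")}
coverage = traceability_coverage(["R1", "R2", "R3", "R4"], v1_links)
stability = link_stability(v1_links, v2_links)
```

Coverage and stability complement the ranking metrics: a tracer can score well on Precision@k for the requirements it handles while leaving many requirements unlinked, or while producing links that churn between releases.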

The authors also discuss practical challenges encountered in real deployments: data bias, noisy or inconsistent labeling, and domain shift when moving between projects. They propose mitigation strategies such as cross‑domain validation, stratified sampling, human‑in‑the‑loop (HITL) correction loops, and fine‑tuning pre‑trained language models on project‑specific corpora.
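One simple way to realize cross‑domain validation is a leave‑one‑project‑out split: train on all projects but one and test on the held‑out project, which exposes domain shift that a random split would hide. The sketch below is only an assumed reading of that strategy; the function name and the project labels are hypothetical.

```python
from collections import defaultdict


def leave_one_project_out(samples):
    """Yield (held_out_project, train_items, test_items) for each project.

    samples: iterable of (project_name, item) pairs, where an item could be
    a labeled candidate link.
    """
    by_project = defaultdict(list)
    for project, item in samples:
        by_project[project].append(item)
    for held_out in sorted(by_project):
        train = [
            item
            for project, items in by_project.items()
            if project != held_out
            for item in items
        ]
        yield held_out, train, by_project[held_out]


# Hypothetical labeled links drawn from three projects.
samples = [("aero", "a1"), ("fin", "f1"), ("aero", "a2"), ("nasa", "n1")]
splits = list(leave_one_project_out(samples))
```

A large gap between within-project and held-out-project scores is a signal that project-specific fine-tuning or human-in-the-loop correction is needed before reuse.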

Finally, the paper outlines future research directions: (1) continual learning mechanisms to keep trace models up‑to‑date as requirements evolve, (2) automated change detection and re‑linking to support dynamic traceability, and (3) explainable AI techniques to make trace predictions transparent to stakeholders. In sum, the work demonstrates that a disciplined KDD approach can deliver both high‑quality automated trace links and measurable productivity gains, bridging the gap between academic research and industrial practice in requirements engineering.

