Automating Requirements Traceability: Two Decades of Learning from KDD
This paper summarizes our experience with using Knowledge Discovery in Data (KDD) methodology for automated requirements tracing, and discusses our insights.
Research Summary
The paper presents a two-decade retrospective on applying Knowledge Discovery in Data (KDD) techniques to the problem of automated requirements traceability. It begins by emphasizing the critical role of traceability in software engineering (supporting impact analysis, verification, validation, and regulatory compliance) while noting that manual tracing is labor-intensive, error-prone, and increasingly untenable for large, evolving systems. The authors argue that a data-driven, systematic approach is essential to scale traceability efforts.
The core contribution is a fully articulated KDD pipeline tailored to traceability. The first stage, data collection and preprocessing, integrates heterogeneous artifacts such as requirements specifications, design documents, source code, test cases, and issue-tracker entries. The authors describe a robust preprocessing workflow that includes Unicode normalization, language detection, tokenization, stop-word removal, stemming/lemmatization, and the construction of domain-specific ontologies to bridge lexical gaps between natural-language requirements and technical artifacts.
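To make the preprocessing steps concrete, here is a minimal sketch using only the Python standard library. It is not the authors' implementation: the stop-word list, the `crude_stem` suffix stripper, and the function names are assumptions chosen for illustration, and language detection and ontology construction are left out.

```python
import re
import unicodedata
from typing import List

# Tiny illustrative stop-word list (an assumption); real projects use fuller, domain-tuned lists.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "shall", "be", "is", "for"}


def crude_stem(token: str) -> str:
    """Strip a few common English suffixes; a stand-in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are not mangled.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> List[str]:
    """Normalize Unicode, lowercase, tokenize, drop stop words, and stem."""
    normalized = unicodedata.normalize("NFKC", text).lower()
    tokens = re.findall(r"[a-z0-9]+", normalized)
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("The system shall log all failed logins."))
```

The same function would be applied to requirements and to target artifacts (code comments, test descriptions) so that both sides share one vocabulary before any similarity is computed.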
In the feature extraction stage, the paper evaluates classic bag-of-words/TF-IDF representations against modern embedding methods, including Word2Vec, FastText, and several transformer-based models (BERT, RoBERTa). Experiments demonstrate that a domain-fine-tuned BERT model yields the highest semantic similarity scores, improving link prediction accuracy by roughly 8% over TF-IDF. To capture structural relationships, the authors model the artifact ecosystem as a graph and apply Graph Neural Networks (GCN, GAT), which prove especially effective for many-to-many trace links and cyclic dependencies.
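As a point of reference for the TF-IDF baseline, the following sketch ranks candidate target artifacts for one requirement by cosine similarity over TF-IDF vectors. It assumes documents are already tokenized; the function names, the smoothed IDF variant, and the artifact identifiers in the example are illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

SparseVector = Dict[str, float]


def tfidf_vectors(docs: Dict[str, List[str]]) -> Dict[str, SparseVector]:
    """Build sparse TF-IDF vectors from pre-tokenized documents."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors: Dict[str, SparseVector] = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        vectors[doc_id] = {
            # Smoothed IDF (an assumed variant) keeps every weight positive.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
    return vectors


def cosine(u: SparseVector, v: SparseVector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(
    req_id: str, target_ids: List[str], vectors: Dict[str, SparseVector]
) -> List[Tuple[str, float]]:
    """Return target artifacts sorted by descending similarity to the requirement."""
    scored = [(t, cosine(vectors[req_id], vectors[t])) for t in target_ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    docs = {
        "REQ-1": ["user", "login", "password"],       # hypothetical requirement
        "auth.py": ["login", "password", "hash"],     # hypothetical source file
        "report.py": ["export", "report", "pdf"],     # hypothetical source file
    }
    print(rank_candidates("REQ-1", ["auth.py", "report.py"], tfidf_vectors(docs)))
```

An embedding-based variant would keep `rank_candidates` unchanged and swap the sparse vectors for dense ones produced by a language model, which is where the lexical-gap advantage reported above would come from.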
The learning stage explores three paradigms: supervised, semi-supervised, and unsupervised. When ample labeled trace links are available, ensemble classifiers such as Gradient Boosting and XGBoost achieve the best precision-recall balance. In low-label scenarios, semi-supervised techniques (label propagation, co-training, and self-training) extend the sparse ground truth, reducing labeling effort by more than 60%. Unsupervised clustering (k-means, DBSCAN) and topic modeling (LDA) are used to generate candidate link clusters for expert review.
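One of the semi-supervised techniques named above, label propagation, can be sketched in a few lines. The version below is a generic clamped propagation over a weighted similarity graph, offered as an illustration rather than the authors' method: the graph construction, the neutral starting score of 0.5, and the node names are all assumptions.

```python
from typing import Dict, Tuple


def propagate_labels(
    weights: Dict[Tuple[str, str], float],
    seeds: Dict[str, float],
    iterations: int = 50,
) -> Dict[str, float]:
    """Spread seed scores (1.0 = true link, 0.0 = not a link) over a similarity graph.

    `weights` maps undirected edges between candidate links to a similarity weight.
    Seed nodes stay clamped to their expert-assigned label on every iteration.
    """
    neighbors: Dict[str, Dict[str, float]] = {}
    for (a, b), w in weights.items():
        neighbors.setdefault(a, {})[b] = w
        neighbors.setdefault(b, {})[a] = w
    for seed in seeds:
        neighbors.setdefault(seed, {})

    # Unlabeled nodes start at a neutral score (an assumption).
    scores = {node: seeds.get(node, 0.5) for node in neighbors}
    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            total = sum(nbrs.values())
            if node in seeds:
                updated[node] = seeds[node]
            elif total > 0:
                # Weighted average of the neighbors' current scores.
                updated[node] = sum(w * scores[n] for n, w in nbrs.items()) / total
            else:
                updated[node] = scores[node]
        scores = updated
    return scores


if __name__ == "__main__":
    # Hypothetical chain of four candidate links; only the two ends are labeled.
    edges = {("c1", "c2"): 1.0, ("c2", "c3"): 1.0, ("c3", "c4"): 1.0}
    print(propagate_labels(edges, seeds={"c1": 1.0, "c4": 0.0}))
```

In a tracing workflow, candidates whose propagated score ends up near 1.0 or 0.0 could be accepted or rejected automatically, while mid-range scores would be queued for expert review.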
Evaluation goes beyond standard Precision@k and Recall@k by introducing "Traceability Coverage" (the proportion of requirements automatically linked) and "Link Stability" (the persistence of links across version changes). The pipeline is validated with two industrial partners (an aerospace system integrator and a financial transaction platform) and on three public datasets (NASA, PROMISE, TRACE). Results show an average F1 score of 0.86, a 15% gain in multi-link scenarios when using graph-based models, and a 70% reduction in manual effort compared with traditional practices.
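The four metrics can be written down compactly. Precision@k and Recall@k follow their standard definitions; the summary gives only verbal definitions of the other two, so the formulas for `traceability_coverage` and especially `link_stability` below are one plausible reading, stated here as assumptions rather than the paper's exact definitions.

```python
from typing import List, Set, Tuple

Link = Tuple[str, str]  # (requirement id, target artifact id)


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked candidates that are true links."""
    if k <= 0:
        return 0.0
    return sum(1 for target in ranked[:k] if target in relevant) / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all true links that appear in the top-k candidates."""
    if not relevant:
        return 0.0
    return sum(1 for target in ranked[:k] if target in relevant) / len(relevant)


def traceability_coverage(requirements: Set[str], links: Set[Link]) -> float:
    """Proportion of requirements that have at least one automatically generated link."""
    if not requirements:
        return 0.0
    linked = {req for req, _ in links} & requirements
    return len(linked) / len(requirements)


def link_stability(old_links: Set[Link], new_links: Set[Link]) -> float:
    """Assumed reading: share of links from the old version still present in the new one."""
    if not old_links:
        return 1.0
    return len(old_links & new_links) / len(old_links)
```

Under this reading, coverage is computed per release and stability per pair of consecutive releases, so a low stability value flags versions where many previously recovered links disappeared.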
The authors also discuss practical challenges encountered in real deployments: data bias, noisy or inconsistent labeling, and domain shift when moving between projects. They propose mitigation strategies such as cross-domain validation, stratified sampling, human-in-the-loop (HITL) correction loops, and fine-tuning pre-trained language models on project-specific corpora.
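Of these mitigations, stratified sampling is the easiest to illustrate. True trace links are typically far rarer than non-links among candidate pairs, so a plain random split can leave a test set with almost no positives. The sketch below splits labeled candidates per class; the function name, the 20% default, and the fixed seed are assumptions for illustration.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple

Labeled = Tuple[object, Hashable]  # (candidate link, label)


def stratified_split(
    items: Sequence[object],
    labels: Sequence[Hashable],
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[Labeled], List[Labeled]]:
    """Split labeled items so each label keeps roughly the same share in train and test."""
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train: List[Labeled] = []
    test: List[Labeled] = []
    for label, group in by_label.items():
        shuffled = list(group)
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend((item, label) for item in shuffled[:n_test])
        train.extend((item, label) for item in shuffled[n_test:])
    return train, test
```

For cross-domain validation, the same idea would be applied with the project identifier as the grouping key, holding out whole projects instead of individual candidates.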
Finally, the paper outlines future research directions: (1) continual learning mechanisms to keep trace models up-to-date as requirements evolve, (2) automated change detection and re-linking to support dynamic traceability, and (3) explainable AI techniques to make trace predictions transparent to stakeholders. In sum, the work demonstrates that a disciplined KDD approach can deliver both high-quality automated trace links and measurable productivity gains, bridging the gap between academic research and industrial practice in requirements engineering.