Requirement Tracing using Term Extraction

Requirements traceability is an essential step in ensuring the quality of software during the early stages of its development life cycle. Requirements tracing usually consists of document parsing, candidate link generation, candidate link evaluation, and traceability analysis. This paper demonstrates the applicability of statistical term extraction metrics to generate candidate links. The approach is applied and validated on two data sets using four filter thresholds, two per data set: 0.2 and 0.25 for MODIS, and 0 and 0.05 for CM1. The method generates requirements traceability matrices between textual requirements artifacts (for example, high-level requirements traced to low-level requirements). It includes ten word frequency metrics, divided into three main groups, for calculating the frequency of terms. The results show that the proposed method yields better results than the traditional TF-IDF method.


💡 Research Summary

The paper addresses the critical problem of requirements traceability, which is essential for ensuring software quality early in the development lifecycle. Traditional traceability approaches rely heavily on TF‑IDF to compute similarity between requirement documents, but TF‑IDF only captures raw term frequency and inverse document frequency, often overlooking nuanced statistical properties of terms that can be decisive for linking high‑level and low‑level requirements.
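To make the baseline concrete, here is a minimal sketch of TF-IDF similarity between tokenized requirement documents. The tokenization, function names, and toy terms are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse TF-IDF weight dicts for a list of tokenized documents."""
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # term frequency normalized by document length, scaled by IDF
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

A candidate link would be proposed whenever the cosine score between a high-level and a low-level requirement exceeds a chosen threshold; note that a term occurring in every document gets an IDF of zero and contributes nothing, which is one of the limitations the paper's richer metrics aim to address.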

To overcome these limitations, the authors propose a statistical term‑extraction framework that generates candidate trace links using ten distinct word‑frequency metrics. These metrics are organized into three principal groups: (1) pure frequency measures such as raw term count, term‑frequency‑per‑document, and document‑frequency proportion; (2) normalized or transformed measures including logarithmic scaling, square‑root scaling, and BM25‑style weighting, which mitigate the dominance of very frequent terms; and (3) filtering and correction measures such as top‑percentile term selection, minimum‑frequency thresholds, and mutual‑information‑based adjustments that prune noisy or irrelevant terms. The final similarity score for a pair of requirements is obtained by a weighted aggregation of the individual metric scores, allowing the method to capture a richer statistical profile of each term.
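The three metric groups and the weighted aggregation can be sketched as follows. This is a hypothetical illustration with one example metric per group and made-up weights; the paper's ten metrics and their exact formulas are not reproduced here.

```python
import math

def term_metrics(term, doc, corpus):
    """Three illustrative per-term statistics, one from each metric group.
    These stand in for the paper's ten metrics; formulas are assumptions."""
    tf = doc.count(term)
    raw = tf / len(doc)                              # group 1: pure frequency
    log_scaled = math.log(1 + tf)                    # group 2: log-transformed frequency
    df = sum(1 for d in corpus if term in d)
    kept = 1.0 if df / len(corpus) >= 0.05 else 0.0  # group 3: min-frequency filter
    return [raw, log_scaled, kept]

def aggregate(scores, weights):
    """Weighted sum of individual metric scores, as in the final similarity."""
    return sum(w * s for w, s in zip(weights, scores))
```

The weights act as the tuning knob discussed later: shifting weight toward the filtering group prunes noise more aggressively, while weight on the transformed group dampens very frequent terms.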

The experimental evaluation uses two publicly available traceability datasets. The MODIS dataset consists of aerospace system requirements, while the CM1 dataset contains software project requirements. For each dataset the authors apply two filter settings: MODIS uses minimum‑frequency thresholds of 0.2 and 0.25, and CM1 uses thresholds of 0 and 0.05. These thresholds control how aggressively low‑frequency terms are excluded before link generation. The proposed method is compared against a baseline TF‑IDF implementation using standard traceability metrics: precision, recall, and F‑measure.
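The threshold filtering and the evaluation metrics described above can be sketched as below. The data structures are assumptions (relative term frequencies as a dict, trace links as sets of requirement-ID pairs); only the threshold values and the precision/recall/F-measure definitions come from the source.

```python
def filter_terms(term_freqs, threshold):
    """Drop terms whose relative frequency falls below the dataset threshold
    (0.2 / 0.25 for MODIS, 0 / 0.05 for CM1 in the paper's setup)."""
    return {t: f for t, f in term_freqs.items() if f >= threshold}

def precision_recall_f(candidate_links, true_links):
    """Standard traceability evaluation over sets of (high, low) link pairs."""
    tp = len(candidate_links & true_links)
    precision = tp / len(candidate_links) if candidate_links else 0.0
    recall = tp / len(true_links) if true_links else 0.0
    denom = precision + recall
    f_measure = (2 * precision * recall / denom) if denom else 0.0
    return precision, recall, f_measure
```

With a threshold of 0, no terms are removed, which explains why CM1's 0 setting serves as an unfiltered control against the 0.05 setting.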

Results show consistent improvements across both datasets. On MODIS the average F‑measure rises from 0.63 (TF‑IDF) to 0.71, and on CM1 it increases from 0.70 to 0.78. The most notable gain is in recall, indicating that the statistical metrics succeed in uncovering relevant links that TF‑IDF misses, particularly those involving rare but semantically important terms. An ablation study further reveals the contribution of each metric group: logarithmic scaling combined with a modest minimum‑frequency filter yields the highest performance, while overly aggressive filtering (high thresholds) leads to a sharp drop because essential low‑frequency terms are discarded.

The authors discuss several practical implications. First, the multi‑metric approach offers a flexible tuning knob: practitioners can adjust the weighting of each metric to match the characteristics of their domain (e.g., highly technical vocabularies versus more generic business language). Second, the method remains computationally lightweight, as all metrics are derived from simple term counts and can be pre‑computed during the parsing phase, making it suitable for integration into continuous‑integration pipelines.

However, the study also acknowledges limitations. The experiments are confined to plain‑text requirement documents; extensions to non‑textual artifacts such as UML diagrams, source code comments, or test cases are not explored. Moreover, the selection of metric weights and thresholds is currently manual; an automated learning component (e.g., using supervised machine‑learning or reinforcement learning) could further enhance adaptability. The authors propose future work in three directions: (a) incorporating domain‑specific lexical resources to enrich term representations, (b) developing an adaptive weight‑learning mechanism that optimizes metric contributions based on validation data, and (c) expanding the framework to support multimodal traceability across heterogeneous artifacts.

In conclusion, the paper demonstrates that leveraging a suite of statistical term‑extraction metrics can produce more accurate and robust requirement traceability matrices than the conventional TF‑IDF baseline. The approach is especially beneficial for large, complex systems where subtle term relationships are critical for maintaining traceability throughout the development process.

