TraceLLM: Leveraging Large Language Models with Prompt Engineering for Enhanced Requirements Traceability

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Requirements traceability, the process of establishing and maintaining relationships between requirements and other software development artifacts, is essential for ensuring system integrity and verifying that requirements are fulfilled throughout the Software Development Life Cycle (SDLC). Traditional approaches, including manual analysis and information retrieval (IR) models, are labor-intensive, error-prone, and limited by low precision. Recently, Large Language Models (LLMs) have demonstrated potential for supporting software engineering tasks through advanced language comprehension. However, a substantial gap exists in the systematic design and evaluation of prompts tailored to extract accurate trace links. This paper introduces TraceLLM, a systematic framework for enhancing requirements traceability through prompt engineering and demonstration selection. Our approach incorporates rigorous dataset splitting, iterative prompt refinement, enrichment with contextual roles and domain knowledge, and evaluation across zero- and few-shot settings. We assess prompt generalization and robustness using eight state-of-the-art LLMs on four benchmark datasets representing diverse domains (aerospace, healthcare) and artifact types (requirements, design elements, test cases, regulations). TraceLLM achieves state-of-the-art F2 scores, outperforming traditional IR baselines, fine-tuned models, and prior LLM-based methods. We also explore the impact of demonstration selection strategies, identifying label-aware, diversity-based sampling as particularly effective. Overall, our findings highlight that traceability performance depends not only on model capacity but also critically on the quality of prompt engineering. In addition, the achieved performance suggests that TraceLLM can support semi-automated traceability workflows in which candidate links are reviewed and validated by human analysts.


💡 Research Summary

The paper introduces TraceLLM, a systematic framework that leverages large language models (LLMs) together with carefully engineered prompts to improve automated requirements traceability. Requirements traceability—linking requirements to design elements, source code, test cases, and regulatory documents—is essential for safety‑critical domains such as aerospace and healthcare, yet traditional manual, information‑retrieval (IR), and conventional machine‑learning approaches suffer from high labor costs, low precision, and heavy reliance on handcrafted features.

TraceLLM addresses three intertwined challenges: (1) the lack of a rigorous methodology for prompt design, (2) the sensitivity of few‑shot performance to the choice of demonstration selection strategy (DSS), and (3) the need to evaluate prompt generalization across diverse LLMs, domains, and artifact types. To this end, the authors formulate four research questions (RQ1–RQ4) focusing on the impact of prompt engineering, cross‑model generalization, demonstration selection, and domain‑artifact transferability.

Methodologically, the study follows a strict data‑splitting protocol (70 % train, 15 % validation, 15 % test) to avoid leakage, and it evaluates eight state‑of‑the‑art LLMs (GPT‑4o, Claude‑3, Gemini‑Pro, LLaMA‑2‑70B, Mistral‑7B, Falcon‑180B, and two open‑source alternatives). Four benchmark datasets are used: (i) aerospace requirements‑design pairs, (ii) healthcare regulatory‑requirements pairs, (iii) test‑case‑requirement pairs, and (iv) code‑documentation mappings.
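The 70/15/15 splitting protocol can be sketched as follows. This is a minimal illustration of a leakage-avoiding split at the pair level, not the authors' released code; the `(source, target, label)` tuple shape and the fixed seed are assumptions for the example.

```python
import random

def split_dataset(pairs, seed=42, ratios=(0.70, 0.15, 0.15)):
    """Shuffle candidate (source, target, label) pairs and split them
    70/15/15 into train/validation/test, mirroring the paper's protocol.
    Splitting at the pair level is an illustrative simplification;
    fully leakage-free splits may also need to group pairs by artifact."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    pairs = list(pairs)
    rng.shuffle(pairs)
    n = len(pairs)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:]
    return train, val, test

# Usage with 100 toy requirement-design pairs:
data = [(f"REQ-{i}", f"DES-{i}", i % 2) for i in range(100)]
train, val, test = split_dataset(data)
# -> 70 / 15 / 15 pairs respectively
```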

Prompt engineering follows a five‑component template: Role, Instruction, Context, Constraints, and Examples. The authors start with a minimal “you are a requirements analyst” instruction, then iteratively enrich the prompt with domain‑specific vocabulary, regulatory excerpts, and explicit output‑format constraints (JSON). Demonstrations are added in a human‑LLM feedback loop, with wording refined until performance stabilizes.
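The five-component template might be assembled as in the sketch below. The function name and the exact wording of each component are illustrative assumptions, not the authors' released template; only the Role/Instruction/Context/Constraints/Examples structure and the JSON output constraint come from the paper.

```python
import json

def build_trace_prompt(source, target, role, domain_context, examples=()):
    """Assemble a traceability prompt from the five components described
    in the paper: Role, Instruction, Context, Constraints, Examples.
    `examples` holds few-shot demonstrations as (source, target, label)."""
    parts = [
        f"Role: {role}",
        "Instruction: Decide whether the target artifact satisfies or "
        "refines the source requirement.",
        f"Context: {domain_context}",
        'Constraints: Answer ONLY with JSON of the form '
        '{"linked": true|false, "rationale": "<one sentence>"}.',
    ]
    for src_ex, tgt_ex, label in examples:
        parts.append(
            f"Example:\nSource: {src_ex}\nTarget: {tgt_ex}\n"
            f"Answer: {json.dumps({'linked': bool(label)})}"
        )
    parts.append(f"Source: {source}\nTarget: {target}\nAnswer:")
    return "\n\n".join(parts)
```

Constraining the output to a fixed JSON schema makes the model's verdict machine-parseable, which matters when thousands of candidate links must be scored automatically.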

For few‑shot settings, four demonstration selection strategies are compared: random, similarity‑based (nearest‑neighbor in embedding space), label‑aware (balanced positive/negative examples), and diversity‑based (cluster‑wise representatives). Empirical results show that a hybrid label‑aware + diversity strategy yields the highest F2 scores (up to 0.71), especially on the regulatory‑requirements dataset where semantic nuance is critical.
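The hybrid label-aware + diversity strategy can be sketched as below. This is an assumed implementation under two stated simplifications: label balance is enforced by splitting the pool into positives and negatives, and diversity is approximated by greedy farthest-point selection in embedding space rather than the cluster-wise procedure the paper may use; the toy tuple embeddings are also illustrative.

```python
def select_demonstrations(candidates, k):
    """Pick k few-shot demonstrations that are label-balanced and
    mutually diverse. `candidates` are (embedding, label, example)
    triples; embeddings here are plain tuples of floats."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def farthest_point(pool, m):
        # Greedy diversity: each pick maximizes its minimum distance
        # to the demonstrations already chosen.
        chosen = [pool[0]]
        while len(chosen) < m and len(chosen) < len(pool):
            best = max(
                (c for c in pool if c not in chosen),
                key=lambda c: min(dist(c[0], s[0]) for s in chosen),
            )
            chosen.append(best)
        return chosen

    pos = [c for c in candidates if c[1] == 1]   # label-aware split
    neg = [c for c in candidates if c[1] == 0]
    half = k // 2
    return farthest_point(pos, half) + farthest_point(neg, k - half)
```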

Performance evaluation demonstrates that TraceLLM consistently outperforms traditional baselines (VSM, LSI, LDA), conventional ML models (SVM, Naïve Bayes), deep‑learning approaches (LSTM, S2Trace), and fine‑tuned transformer models (BERT‑large, RoBERTa‑large). In zero‑shot mode, the best LLM reaches an F2 of 0.62; with five demonstrations, the same model achieves 0.71. Notably, these gains are obtained without any parameter updates, highlighting the power of prompt engineering alone.
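The F2 metric used throughout these comparisons is the F-beta score with beta = 2, which weights recall twice as heavily as precision: a sensible choice for traceability, where a missed link is costlier than a spurious candidate a human analyst can filter out. A minimal stdlib-only implementation:

```python
def f2_score(y_true, y_pred):
    """F-beta score with beta=2 over binary labels (1 = linked).
    F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    beta2 = 4  # beta = 2, squared
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

# Example: 3 true positives, 1 false positive, 1 false negative
# -> precision = recall = 0.75, so F2 = 0.75
f2_score([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0])
```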

The authors discuss threats to validity, including LLM temperature and token‑limit effects, prompt length constraints, and potential overfitting of demonstrations to specific projects. Mitigation strategies involve modular prompt design, publicly released domain‑specific role templates, and balanced demonstration sampling across domains.

Contributions are threefold: (1) a reproducible LLM‑based traceability pipeline with standardized splits and evaluation metrics, (2) a comprehensive study of prompt designs and DSSs that identifies best‑practice patterns, and (3) an open‑source replication package containing code, prompts, and data splits.

In conclusion, TraceLLM empirically validates the hypothesis that “prompt quality is the primary driver of LLM‑based traceability performance.” The framework enables semi‑automated workflows where LLMs generate candidate trace links that are subsequently reviewed by human analysts, thereby reducing manual effort while maintaining high recall and precision. Future work will explore meta‑learning for automatic prompt optimization, continual learning across evolving codebases, and real‑time adaptation to regulatory changes.

