Large Language Models in Software Documentation and Modeling: A Literature Review and Findings
Generative artificial intelligence has attracted significant attention, especially since the introduction of large language models, and its capabilities are being leveraged to solve a variety of software engineering tasks. Because they can both understand and generate natural language, large language models are well suited to processing software documentation artifacts. At the same time, they excel at understanding structured languages, making them promising for working with software programs and models. We conduct a literature review on the use of large language models for software engineering tasks related to documentation and modeling. We analyze articles from four major venues in the area, organize them by the tasks they solve, and provide an overview of the prompting techniques, metrics, approaches to human-based evaluation, and major datasets used.
💡 Research Summary
This paper presents a systematic literature review of how large language models (LLMs) are employed for software documentation and modeling tasks. The authors focus on publications from four leading software engineering venues—IEEE Transactions on Software Engineering (TSE), ACM Transactions on Software Engineering and Methodology (TOSEM), Empirical Software Engineering (EMSE), and the International Conference on Software Engineering (ICSE)—covering the years 2024 and 2025. After a keyword‑driven search (LLM, language model, GPT, BERT, etc.) and manual screening of titles, abstracts, and full texts, 57 papers were selected for analysis.
The review first situates itself within the broader context of LLM‑for‑software‑engineering (LLM4SE) systematic reviews, noting that while many prior surveys cover the entire software development life‑cycle, relatively few concentrate specifically on documentation and modeling. The authors therefore adopt a broad definition of documentation, including tasks such as code summarization, which produce textual artifacts that serve as documentation.
The selected papers are organized into eleven task categories:
- Commit Message Generation (CMG) – Generating natural‑language descriptions of code changes. Notable works include KADE, CommitBART, and OMEGA, as well as numerous studies benchmarking GPT‑4 and other LLMs with zero‑shot prompts. The MCMD dataset is the most frequently used benchmark.
- Issue Tracker Utilization – Classifying or extracting information from GitHub issues, BPMN models, or code‑review comments. Studies employ zero‑shot GPT‑4, prompt‑less models, and multi‑task learning approaches.
- StackOverflow Title and Tag Generation – Generating appropriate titles for posts and recommending tags. Approaches involve fine‑tuned CodeT5, the SOTitle+ method, and PTM4Tag+ for tag recommendation, largely using zero‑shot prompting.
- Sentiment and Emotion Analysis – Detecting sentiment or emotions in developer communications. Comparisons are made between traditional sentiment tools, large LLMs (bLLMs), and smaller fine‑tuned LLMs (sLLMs) using zero‑ and few‑shot prompts.
- Source Code Analysis – Tasks such as data‑flow graph generation, design‑pattern detection, scientific notebook assessment, and binary functionality classification. Various chain‑of‑thought and few‑shot prompting strategies are reported.
- Code Summarization – Producing concise natural‑language summaries of code snippets. A rich set of models (MODE‑X, EA‑CS, StructCodeSum, etc.) and prompting techniques (zero‑shot, few‑shot, chain‑of‑thought) are evaluated on CodeSearchNet and PCSD datasets, with BLEU, ROUGE, METEOR, BERTScore, and SIDE as evaluation metrics.
- Code Commenting, Annotating, Reviewing, and Logging – Generating comments, annotations, review suggestions, or log messages. Works include TG‑CUP, SCGen, SpecGen, and LogEnt, employing both model‑centric and prompt‑centric designs.
- Software Security – Enhancing vulnerability descriptions and generating symbolic models from security protocols using few‑shot prompting.
- Requirements Engineering – Transforming informal requirements (e.g., user stories) into formal specifications, classifying sensitive features, and automating COSMIC size measurement. Approaches such as RECO‑VER, ReFair, CrUISE‑AC, and Fine‑SE are highlighted.
- Technical Document Analysis – Classifying exception‑handling bugs, extracting API relations, summarizing bug reports, and answering questions about library choices. Custom datasets are often used, and evaluation combines automatic metrics with human judgments.
- Software Modeling – Predicting missing model changes and exploring AI‑augmented model‑driven engineering. Only a few papers address this area, suggesting early‑stage research.
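Several of the generative tasks above (code summarization, commit message generation) are scored with n‑gram overlap metrics such as BLEU and ROUGE. As a rough intuition for what these metrics measure, the following sketch computes simplified unigram‑level variants; the surveyed papers use full implementations (e.g., the sacrebleu or rouge-score libraries) with n‑gram orders, brevity penalties, and stemming that this toy version omits.

```python
# Simplified, unigram-only versions of BLEU-style precision and
# ROUGE-style recall, for intuition only (not the full metrics).
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """BLEU-1-style clipped precision: share of candidate tokens in the reference."""
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    overlap = sum(min(c, ref_counts[tok]) for tok, c in Counter(cand).items())
    return overlap / len(cand) if cand else 0.0

def unigram_recall(candidate: str, reference: str) -> float:
    """ROUGE-1-style recall: share of reference tokens covered by the candidate."""
    cand, ref = candidate.split(), reference.split()
    cand_counts = Counter(cand)
    overlap = sum(min(c, cand_counts[tok]) for tok, c in Counter(ref).items())
    return overlap / len(ref) if ref else 0.0

reference = "returns the sum of two integers"
generated = "computes the sum of two numbers"
print(round(unigram_precision(generated, reference), 2))  # 4 of 6 tokens -> 0.67
print(round(unigram_recall(generated, reference), 2))     # 4 of 6 tokens -> 0.67
```

Semantic measures such as BERTScore, also mentioned above, replace exact token matching with embedding similarity, which is why the review treats them as supplementary indicators rather than drop‑in substitutes.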
Across all categories, the dominant prompting technique is zero‑shot prompting, often justified by practical API‑cost considerations or the desire to mimic real‑world usage. Few‑shot and chain‑of‑thought prompting appear in a minority of studies, typically as baselines or experimental enhancements. Classification tasks rely on standard metrics (accuracy, precision, recall, F1, ROC‑AUC), regression tasks on MAE, RMSE, NRMSE, and generative tasks on BLEU, ROUGE, METEOR, with semantic similarity measures (BERTScore, SIDE) used as supplementary indicators. Human evaluation is common: experts, students, or mixed groups of 2–42 participants assess a subset of outputs, often complemented by automatic LLM‑based evaluators (frequently GPT‑4).
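The structural difference between the three prompting styles can be illustrated with commit message generation as a running example. The templates below are hypothetical, written only to show the styles' shapes; the actual prompts used in the surveyed papers vary and are often not fully specified.

```python
# Hypothetical prompt templates contrasting zero-shot, few-shot, and
# chain-of-thought prompting (illustrative only, not from the surveyed papers).

def zero_shot(diff: str) -> str:
    # No demonstrations: the model relies entirely on the instruction.
    return f"Write a one-line commit message for this diff:\n{diff}"

def few_shot(diff: str, examples: list[tuple[str, str]]) -> str:
    # A handful of (diff, message) demonstrations precede the query.
    shots = "\n\n".join(f"Diff:\n{d}\nCommit message: {m}" for d, m in examples)
    return f"{shots}\n\nDiff:\n{diff}\nCommit message:"

def chain_of_thought(diff: str) -> str:
    # Ask the model to reason about the change before answering.
    return ("Explain step by step what this diff changes, "
            f"then write a one-line commit message:\n{diff}")
```

Zero‑shot prompting's dominance in the literature is consistent with the cost argument noted above: it sends the fewest tokens per API call and mirrors how practitioners typically query a model.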
The authors observe that LLM‑based solutions generally outperform prior state‑of‑the‑art methods, delivering statistically significant improvements in most tasks. However, they caution that the “revolution” promised by LLMs may be overstated; many studies merely replace existing pipelines with LLM APIs without fundamentally redesigning the underlying software engineering processes. Moreover, advanced techniques such as multi‑agent systems, domain‑specific language modeling, and integrated end‑to‑end modeling pipelines receive limited attention.
Limitations of the review include its focus on only four venues and a two‑year window, potentially missing relevant preprints or industry reports. The lack of detailed prompt specifications in many papers hampers reproducibility. The authors recommend future work on multi‑modal inputs (code, diagrams, documentation), continuous human‑in‑the‑loop workflows, cost‑effective LLM deployment strategies, and systematic evaluation of LLMs for emerging domain‑specific modeling languages.