A Survey on Large Language Model Impact on Software Evolvability and Maintainability: the Good, the Bad, the Ugly, and the Remedy
Context. Large Language Models (LLMs) are increasingly embedded in software engineering workflows for tasks including code generation, summarization, repair, and testing. Empirical studies report productivity gains, improved comprehension, and reduced cognitive load. However, evidence remains fragmented, and concerns persist about hallucinations, unstable outputs, methodological limitations, and emerging forms of technical debt. How these mixed effects shape long-term software maintainability and evolvability remains unclear.

Objectives. This study systematically examines how LLMs influence the maintainability and evolvability of software systems. We identify which quality attributes are addressed in existing research, the positive impacts LLMs provide, the risks and weaknesses they introduce, and the mitigation strategies proposed in the literature.

Method. We conducted a systematic literature review. Searches across ACM DL, IEEE Xplore, and Scopus (2020 to 2024) yielded 87 primary studies. Qualitative evidence was extracted through a calibrated multi-researcher process. Attributes were analyzed descriptively, while impacts, risks, weaknesses, and mitigation strategies were synthesized using a hybrid thematic approach supported by an LLM-assisted analysis tool with human-in-the-loop validation.

Results. LLMs provide benefits such as improved analyzability, testability, code comprehension, debugging support, and automated repair. However, they also introduce risks, including hallucinated or incorrect outputs, brittleness to context, limited domain reasoning, unstable performance, and flaws in current evaluations, which threaten long-term evolvability.

Conclusion. LLMs can strengthen maintainability and evolvability, but they also pose nontrivial risks to long-term sustainability. Responsible adoption requires safeguards, rigorous evaluation, and structured human oversight.
💡 Research Summary
This paper presents a systematic literature review (SLR) that investigates how large language models (LLMs) affect two fundamental software quality attributes: maintainability and evolvability. The authors searched ACM Digital Library, IEEE Xplore, and Scopus for papers published between 2020 and the end of 2024, initially retrieving 711 records. After applying rigorous inclusion and exclusion criteria, 87 primary studies were selected for in-depth analysis. Evidence extraction was performed by a calibrated multi‑researcher team, and a hybrid thematic analysis—combining human coding with an LLM‑assisted analysis tool—was used to identify themes related to quality attributes, positive impacts, risks, structural weaknesses, and mitigation strategies.
Background and Definitions
Maintainability concerns a system’s support for post‑delivery activities such as bug fixing, performance tuning, and minor enhancements, emphasizing stability and compliance. Evolvability captures a system’s capacity to accommodate substantial, continuous change over its lifetime, focusing on integrity, extensibility, and architectural adaptability. Although the two share several sub‑attributes (changeability, testability, portability), they differ in emphasis: maintainability stresses short‑term correctness, while evolvability stresses long‑term adaptability.
LLM Adoption Landscape
The review shows that LLMs (e.g., GPT‑4, LLaMA‑2, CodeLlama, StarCoder) are employed across the software lifecycle: code generation/completion, bug fixing, summarization/documentation, test‑case generation, and even requirements traceability. Empirical studies report productivity gains, reduced cognitive load, and improved code comprehension. However, most investigations focus on isolated tasks rather than holistic, long‑term quality impacts.
Positive Impacts (“The Good”)
- Analyzability & Debugging Support: LLMs can explain code logic in natural language, helping developers locate defects faster.
- Automated Repair: In automated program repair (APR) contexts, LLM‑generated patches outperform several traditional techniques.
- Testability: LLMs can synthesize test cases with competitive coverage and fault‑detection rates.
- Documentation & Summarization: Automatic generation of comments and high‑level summaries mitigates documentation decay, a known maintainability risk.
- Changeability & Portability: By automating boilerplate generation, LLMs free developers to focus on core logic, facilitating easier modifications and cross‑environment migration.
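To make the documentation point above concrete: before asking an LLM to summarize or comment code, a project can first locate where documentation has decayed. The following is a minimal sketch of such a detector; the `find_undocumented` helper is our own illustrative assumption, not a tool from the surveyed studies.

```python
import ast

def find_undocumented(source: str) -> list[str]:
    """Return names of functions/methods in `source` that lack a docstring.

    These are the candidate targets an LLM summarizer could be asked to
    document, with a human reviewing each suggestion before commit.
    """
    tree = ast.parse(source)
    missing = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if ast.get_docstring(node) is None:
                missing.append(node.name)
    return missing

example = '''
def documented():
    """Has a docstring."""
    return 1

def undocumented():
    return 2
'''

print(find_undocumented(example))  # → ['undocumented']
```

A scan like this keeps the LLM's role narrow and auditable: it only proposes text for gaps a deterministic tool has already identified.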
Negative Impacts (“The Bad”)
- Hallucinations & Incorrect Code: LLMs sometimes produce syntactically plausible but semantically wrong code, introducing hidden defects.
- Context & Prompt Sensitivity: Small variations in prompts or surrounding code lead to divergent outputs, undermining consistency.
- Unstable Performance: Model updates or parameter changes can cause output quality to fluctuate, threatening long‑term reliability.
- Methodological Gaps in Existing Evaluations: Many studies assess only short‑term accuracy or productivity, overlooking maintenance cost, technical debt accumulation, and evolvability implications.
Structural Weaknesses (“The Ugly”)
- Domain Knowledge Deficiency: LLMs lack deep understanding of specialized domains (e.g., safety‑critical, regulated industries), leading to design‑intent violations.
- Reasoning Instability: Complex algorithmic reasoning often fails, compromising integrity and extensibility.
- Training‑Data Biases & Security Risks: Models may reproduce copyrighted code, insecure patterns, or privacy‑violating snippets present in their training corpora.
- Technical Debt Accumulation: Unchecked LLM‑generated artifacts can embed subtle bugs and architectural drift, creating new forms of debt that erode evolvability over time.
Mitigation Strategies (“The Remedy”)
- Human‑in‑the‑Loop Validation: Pair LLM output with static analysis, formal verification, and mandatory code‑review steps before integration.
- Prompt Engineering & Context Management: Develop standardized prompt templates and maintain project‑specific context stores to reduce variability.
- Hybrid Pipelines: Use LLMs as assistive components while retaining traditional, proven techniques for critical path code.
- Model Governance: Implement versioned model deployment, regression testing of model updates, and rigorous data‑curation pipelines to address bias and security concerns.
- Evaluation Standardization: Propose multi‑dimensional benchmarks that include long‑term maintenance scenarios, cross‑domain test suites, and quantitative technical‑debt metrics.
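The first three strategies above can be combined into an automated pre-review gate: an LLM-generated patch is only forwarded to human review if it passes cheap deterministic checks first. The sketch below is our own illustrative assumption of such a gate (syntax check plus a caller-supplied smoke test); real pipelines would add static analysis, security scanning, sandboxing, and mandatory code review.

```python
import ast

def gate_llm_patch(patch_source: str, smoke_test) -> tuple[bool, str]:
    """Accept an LLM-generated patch only if it parses and passes a smoke test.

    A False result routes the patch back to the developer or a human
    reviewer instead of merging it, keeping a human in the loop.
    """
    try:
        ast.parse(patch_source)          # cheap static check: valid syntax
    except SyntaxError as exc:
        return False, f"rejected: syntax error ({exc.msg})"
    namespace: dict = {}
    try:
        exec(patch_source, namespace)    # load the patch in an isolated namespace
        if not smoke_test(namespace):    # behavioral check before human review
            return False, "rejected: smoke test failed"
    except Exception as exc:
        return False, f"rejected: runtime error ({exc})"
    return True, "passed automated gate; forward to human review"

# Example: an LLM-proposed implementation of an 'add' function.
proposed = "def add(a, b):\n    return a + b\n"
ok, verdict = gate_llm_patch(proposed, lambda ns: ns["add"](2, 3) == 5)
print(ok, verdict)  # → True passed automated gate; forward to human review
```

Note that the gate never auto-merges: its only positive outcome is escalation to a human, which reflects the hybrid-pipeline stance of using LLMs as assistive rather than authoritative components.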
Methodological Rigor & Threats to Validity
The authors detail steps taken to limit bias: multiple coders, calibration sessions, cross‑checking of extracted data, and validation of LLM‑assisted thematic clusters by human experts. Threats acknowledged include selection bias (search limited to three databases), publication bias (positive results more likely to appear), and potential inaccuracies introduced by the LLM analysis tool itself. Mitigation measures such as duplicate screening and sensitivity analyses are reported.
Conclusions
The review concludes that while LLMs clearly enhance short‑term productivity, analyzability, and certain aspects of maintainability, they also introduce non‑trivial risks that can undermine long‑term software sustainability. Responsible adoption requires systematic safeguards, rigorous, longitudinal evaluation, and a strong human oversight component. The authors call for future research on governance frameworks, longitudinal studies of technical debt evolution, and standardized metrics that capture both maintainability and evolvability in LLM‑augmented development environments.