Leveraging LLMs to support co-evolution between definitions and instances of textual DSLs: A Systematic Evaluation
Software languages evolve over time for reasons such as feature additions. When grammars evolve, textual instances that originally conformed to them may become outdated. While model-driven engineering provides many techniques for co-evolving models with metamodel changes, these approaches are not designed for textual DSLs and may lose human-relevant information such as layout and comments. This study systematically evaluates the potential of large language models (LLMs) for co-evolving grammars and instances of textual DSLs. Using Claude Sonnet 4.5 and GPT-5.2 across ten case languages with ten runs each, we assess both correctness and preservation of human-oriented information. Results show strong performance on small-scale cases (≥94% precision and recall for instances requiring fewer than 20 modified lines), but performance degrades with scale: Claude maintains 85% recall at 40 lines, while GPT fails on the largest instances. Response time increases substantially with instance size, and grammar evolution complexity and deletion granularity affect performance more than change type. These findings clarify when LLM-based co-evolution is effective and where current limitations remain.
💡 Research Summary
This paper investigates the use of large language models (LLMs) to automatically co‑evolve textual domain‑specific language (DSL) grammars and their concrete instances, a problem that has received little systematic attention. Traditional model‑based approaches require parsing the original instance into an intermediate model (e.g., XMI) and then regenerating text, which inevitably discards human‑oriented information such as comments, whitespace, and indentation. To address this loss, the authors employ two state‑of‑the‑art LLMs, Claude Sonnet 4.5 and GPT‑5.2, to directly analyze grammar differences and rewrite instances while preserving formatting and comments.

The experimental protocol uses ten real‑world Xtext‑based DSLs collected from GitHub, each evaluated with ten independent runs to mitigate nondeterminism. Evaluation metrics include precision and recall for functional correctness, a preservation score for comments and layout, and runtime performance.

Results show that for small‑scale modifications (fewer than 20 lines to change) both models achieve ≥94% precision and recall and retain over 90% of human‑oriented information, dramatically outperforming traditional model‑transformation pipelines, which lose all such information. As instance size grows, performance degrades: Claude maintains about 85% recall at 40 lines, while GPT‑5.2's accuracy drops sharply and fails on the largest cases. Grammar evolution complexity, especially deletion granularity, impacts results more than simple addition or modification. Processing time rises sharply with instance size, with Claude's response time increasing up to 18‑fold for the biggest cases; GPT‑5.2 exhibits even greater variability. Prompt transferability tests reveal that the same prompt can be used across models, but optimal performance still requires model‑specific prompt tuning.
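To make the correctness metrics concrete, the sketch below shows one plausible way to compute line‑level precision and recall for a migrated instance against a hand‑migrated reference. The paper does not publish its exact metric definition, so the set‑based formulation here (treating each changed line as a prediction) is an illustrative assumption, not the authors' implementation.

```python
def line_metrics(original, migrated, reference):
    """Line-level precision/recall of an LLM-migrated DSL instance.

    Assumed definition (not from the paper): a "predicted" line is any
    line the model introduced relative to the original instance; an
    "expected" line is any line the hand-migrated reference introduces.
    """
    orig = set(original)
    predicted = set(migrated) - orig    # lines the model changed/added
    expected = set(reference) - orig    # lines that should have changed
    tp = len(predicted & expected)      # correctly produced lines
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(expected) if expected else 1.0
    return precision, recall


# Hypothetical mini-example: a grammar change requires appending "{}"
# to each state declaration, but the model only migrates the first line.
original = ["state A", "state B"]
reference = ["state A {}", "state B {}"]
migrated = ["state A {}", "state B"]
print(line_metrics(original, migrated, reference))  # (1.0, 0.5)
```

Under this definition, partial migrations hurt recall while spurious rewrites hurt precision, which matches the paper's observation that large instances degrade recall first.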
The study concludes that LLM‑driven co‑evolution is highly effective for small to medium DSL changes, preserving valuable developer context, but current models struggle with large‑scale, complex grammar evolutions and exhibit scalability bottlenecks. Future work is suggested on improving model efficiency, automated prompt optimization, and hybrid multi‑model strategies to broaden applicability.