Do Software Languages Engineers Evaluate their Languages?
Domain-Specific Languages (DSLs) can contribute to increased productivity while reducing the required maintenance effort and programming expertise. We hypothesize that Software Languages Engineering (SLE) developers consistently skip, or relax, language evaluation. Based on experience with engineering other types of software products, we assume that this may lead to the deployment of inadequate languages. The fact that these languages deal with concepts from the problem domain, rather than the solution domain, is not enough to validate several issues at stake, such as their expressiveness, usability, effectiveness, and maintainability, or the domain experts' productivity while using them. We present a systematic review of articles published in top-ranked venues from 2001 to 2008 that report DSL construction, in order to characterize common practice. This work confirms our initial hypothesis and lays the ground for a discussion of how to include a systematic approach to DSL evaluation in the SLE process.
💡 Research Summary
The paper investigates whether practitioners of Software Languages Engineering (SLE) systematically evaluate the Domain‑Specific Languages (DSLs) they create. The authors hypothesize that DSL evaluation is routinely omitted or performed only superficially, potentially leading to the deployment of inadequate languages. To test this hypothesis, they conduct a systematic literature review covering the period 2001‑2008, focusing on the most prestigious venues that publish DSL research: one journal (Journal of Visual Languages and Computing), two conferences (International Conference on Software Language Engineering and International Conference on Model‑Driven Engineering Languages and Systems), and ten workshops dedicated to language engineering and domain‑driven development.
From an initial pool of 641 papers, 242 were identified as potentially relevant based on abstract and conclusion screening. A full‑text analysis then narrowed the set to 36 papers that satisfied the inclusion criteria: (i) reporting the development of at least one DSL, (ii) providing some description of the DSL development process, or (iii) presenting experimental evaluation or usability‑testing techniques for DSLs. The authors explicitly excluded papers that dealt only with supporting infrastructure (e.g., frameworks, code generators, model transformations) without any DSL‑centric evaluation component.
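The two-stage funnel can be sketched as a pair of filters: an abstract/conclusion screen for relevance, then a full-text check against the three inclusion criteria, any one of which suffices. This is a minimal illustration; the paper records, field names, and predicates below are assumptions for the sketch, not data from the study.

```python
def passes_inclusion(paper):
    """Stage 2: a paper is kept if it meets any of the three inclusion criteria."""
    return (paper["reports_dsl"]            # (i) reports development of a DSL
            or paper["describes_process"]   # (ii) describes the DSL development process
            or paper["evaluates_dsl"])      # (iii) presents evaluation/usability testing

def select(papers):
    # Stage 1: screen abstracts and conclusions for potential relevance.
    candidates = [p for p in papers if p["abstract_relevant"]]
    # Stage 2: full-text analysis against the inclusion criteria.
    return [p for p in candidates if passes_inclusion(p)]

# Three illustrative records: kept, dropped at stage 2, dropped at stage 1.
papers = [
    {"abstract_relevant": True,  "reports_dsl": True,
     "describes_process": False, "evaluates_dsl": False},
    {"abstract_relevant": True,  "reports_dsl": False,
     "describes_process": False, "evaluates_dsl": False},
    {"abstract_relevant": False, "reports_dsl": True,
     "describes_process": True,  "evaluates_dsl": True},
]
print(len(select(papers)))  # 1
```

In the review itself, these two filters took the pool from 641 papers to 242 candidates and finally to 36 included papers.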
The review is organized around five research questions (RQs). RQ1 asks whether a paper reports the development of a DSL; 33 of the 36 selected papers (91.7 %) do so. RQ2 probes the level of detail about the DSL construction; only 16 papers (44 %) provide substantive information such as meta‑model specifications, tool chains, or design rationales. RQ3‑RQ5 focus on evaluation: RQ3 checks for any experimental assessment, RQ4 for involvement of end‑users (domain experts) in the assessment, and RQ5 for explicit usability evaluation. The findings are stark: only 5 papers (≈14 %) report any form of experiment, merely 3 involve domain experts directly, and just 2 conduct a formal usability study. Most “evaluation” statements are anecdotal claims of productivity gains or improved maintainability, lacking quantitative data, statistical analysis, or reproducible methodology.
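The headline percentages above follow directly from the reported counts over the 36 included papers; a quick arithmetic check (the RQ labels are paraphrases, not the paper's exact wording):

```python
# Percentages implied by the review's counts over 36 included papers.
total = 36
counts = {
    "RQ1: develop a DSL":          33,   # 91.7%
    "RQ2: detail construction":    16,   # 44.4%
    "RQ3: report an experiment":    5,   # 13.9%, i.e. ~14%
    "RQ4: involve domain experts":  3,
    "RQ5: formal usability study":  2,
}
for label, n in counts.items():
    print(f"{label}: {n}/{total} = {100 * n / total:.1f}%")
```

Rounding 5/36 to one decimal gives 13.9%, matching the "≈14%" figure quoted for RQ3.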
The authors discuss several reasons for this pervasive neglect of evaluation. First, there is a cultural assumption that close collaboration with domain experts during DSL design implicitly validates the language, even though those experts may not be the ultimate users of the generated artifacts. Second, conducting rigorous experiments demands expertise in experimental design, data collection, and statistical analysis—skills that many DSL developers lack and that are perceived as costly. Third, DSLs are often highly specialized, built for a single project or organization, making it difficult to devise generic evaluation frameworks or to compare results across contexts. The paper also notes that many reported productivity improvements (e.g., “3‑10× speed‑up”) are based on testimonials from project managers rather than on independently verified measurements, rendering meta‑analysis impossible.
The review acknowledges its own limitations. The time window ends in 2008, so recent advances in DSL evaluation (e.g., model‑based usability testing, automated metrics for expressiveness) are not captured. The heavy reliance on workshop papers may introduce variability in peer‑review rigor, and the definition of “DSL evaluation” is somewhat broad, possibly excluding relevant studies that use alternative terminology.
Despite these constraints, the systematic approach—transparent keyword search, two‑stage paper selection, and clearly defined inclusion/exclusion criteria—provides a reproducible baseline for future meta‑studies. The authors conclude that the SLE community has largely overlooked systematic DSL evaluation, and they call for concrete actions: (1) development of standardized evaluation models and metrics (e.g., usability scales, productivity benchmarks), (2) guidelines for designing controlled experiments with domain experts and end‑users, and (3) integration of evaluation checkpoints into the DSL lifecycle (design, implementation, deployment, evolution). By institutionalizing such practices, the community can move from anecdotal claims to evidence‑based assertions about DSL benefits, thereby supporting more informed decision‑making regarding DSL adoption, maintenance costs, and return‑on‑investment. The paper thus serves both as a diagnostic of current practice and a roadmap for elevating the rigor of DSL engineering.