Integrated Definition of Abstract and Concrete Syntax for Textual Languages

An understandable concrete syntax and a comprehensible abstract syntax are two central aspects of defining a modeling language. The two representations overlap significantly in structure and information, but may also differ in parts. To avoid discrepancies and problems when handling the language, concrete and abstract syntax need to be defined consistently. This problem will grow as domain-specific languages come into wider use. In this paper we present an extended grammar format that avoids redundancy between concrete and abstract syntax by allowing an integrated definition of both for textual modeling languages. To improve the usability of the abstract syntax, it furthermore integrates meta-modeling concepts like associations and inheritance into a well-understood grammar-based approach. This forms a sound foundation for an extensible grammar and therefore language definition.

💡 Research Summary

The paper addresses a fundamental challenge in the definition of modeling languages: the need to maintain both a concrete syntax (the textual representation that users write) and an abstract syntax (the structural model that tools manipulate). Traditionally, these two artifacts are created independently—concrete syntax is specified using grammar formalisms such as BNF or ANTLR, while abstract syntax is captured in meta‑models like Ecore or UML class diagrams. This separation inevitably leads to redundancy, synchronization problems, and complex transformation pipelines, especially as domain‑specific languages (DSLs) evolve and grow in size.

To eliminate these issues, the authors propose an “Integrated Grammar” format that merges concrete and abstract syntax definitions into a single, cohesive specification. The key idea is to extend a conventional textual grammar with meta‑modeling constructs such as class declarations, attribute definitions, inheritance, and associations. A rule in the integrated grammar simultaneously describes the lexical pattern that appears in source code and the corresponding element in the abstract syntax tree (AST). For example, a rule like class State { name: ID; transitions: Transition*; } defines both the textual tokens that constitute a state declaration and the structural object that will appear in the model.
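
The idea of one rule serving both purposes can be illustrated with a minimal Python sketch. It is not the paper's grammar format: the rule encoding (`STATE_RULE`), the token table, and the helper names `derive_ast_class` and `derive_parser` are assumptions made for illustration. The point is only that a single specification yields both a recognizer for the text and a class for the abstract syntax.

```python
import re
from dataclasses import make_dataclass

# Hypothetical integrated rule: plain strings are literals of the concrete
# syntax; (field, token) pairs are tokens that also become AST fields.
STATE_RULE = ("State", ["state", ("name", "ID"), ";"])

# Assumed token definitions for this sketch.
TOKEN_PATTERNS = {"ID": r"[A-Za-z_]\w*"}


def derive_ast_class(rule):
    """Abstract syntax: build a dataclass with one field per named token."""
    name, parts = rule
    fields = [(p[0], str) for p in parts if isinstance(p, tuple)]
    return make_dataclass(name, fields)


def derive_parser(rule, cls):
    """Concrete syntax: build a recognizer that returns an instance of cls."""
    _, parts = rule
    pieces = []
    for p in parts:
        if isinstance(p, tuple):
            pieces.append(f"(?P<{p[0]}>{TOKEN_PATTERNS[p[1]]})")
        else:
            literal = re.escape(p)
            # Keywords must end at a word boundary ("stateidle" is rejected).
            pieces.append(literal + r"\b" if p.isalnum() else literal)
    pattern = re.compile(r"\s*" + r"\s*".join(pieces) + r"\s*")

    def parse(text):
        match = pattern.fullmatch(text)
        if match is None:
            raise SyntaxError(f"input does not match rule: {text!r}")
        return cls(**match.groupdict())

    return parse


State = derive_ast_class(STATE_RULE)
parse_state = derive_parser(STATE_RULE, State)
```

Calling `parse_state("state idle;")` returns a `State` object whose `name` field is `"idle"`. Because the class and the parser are derived from the same rule, adding a field to the rule changes both at once.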

Because the abstract syntax is embedded directly in the grammar, the parsing process can construct the AST on the fly, without a separate model‑to‑model transformation step. The parser not only validates the concrete syntax but also enforces meta‑model constraints such as reference integrity and inheritance hierarchies. Associations are expressed declaratively (e.g., using a reference keyword), allowing the parser to resolve cross‑references automatically. Inheritance mechanisms enable rule reuse and facilitate language extension: new language constructs can inherit from existing ones, preserving both syntactic and semantic properties.
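
Cross-reference resolution of this kind can be sketched as follows. This is an illustrative Python example under stated assumptions, not the authors' implementation: the toy syntax (`state X;` and `A -event-> B;`), the class names, and the function `parse_machine` are all hypothetical. The sketch collects declarations while reading the text and then links each transition to the declared `State` objects, rejecting dangling references.

```python
import re
from dataclasses import dataclass


@dataclass
class State:
    name: str


@dataclass
class Transition:
    source: State  # resolved reference, not just a name
    event: str
    target: State


# Assumed toy syntax for this sketch.
STATE = re.compile(r"state\s+(?P<name>\w+)")
TRANS = re.compile(r"(?P<src>\w+)\s*-\s*(?P<event>\w+)\s*->\s*(?P<dst>\w+)")


def parse_machine(text):
    """Parse statements, then resolve state names to State objects."""
    states = {}
    pending = []  # (src, event, dst) names awaiting resolution
    for stmt in filter(None, (s.strip() for s in text.split(";"))):
        if m := STATE.fullmatch(stmt):
            states[m["name"]] = State(m["name"])
        elif m := TRANS.fullmatch(stmt):
            pending.append(m.group("src", "event", "dst"))
        else:
            raise SyntaxError(f"unrecognized statement: {stmt!r}")

    transitions = []
    for src, event, dst in pending:
        for ref in (src, dst):
            if ref not in states:
                raise NameError(f"unresolved state reference: {ref}")
        transitions.append(Transition(states[src], event, states[dst]))
    return states, transitions
```

Resolution is deferred until all statements have been read, so a transition may appear before the states it mentions. A parser that resolves references strictly on the fly would need declarations to come first, or a similar fix-up step.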

The paper details the formal syntax of the integrated grammar, explains how it can be compiled into standard parser generators (ANTLR, JavaCC, etc.), and demonstrates its practicality through two case studies. The first case study models a simple state‑machine DSL; the integrated grammar defines states, transitions, and events, and the generated parser produces an EMF‑compatible model directly from source text. The second case study tackles a more complex sequence‑diagram DSL that requires multiple inheritance and rich association structures. In both cases, the authors report a reduction of roughly 30‑40 % in the number of specification lines compared with a traditional split approach, and a two‑fold increase in model generation speed.
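
As a rough illustration of compiling one specification into two artifacts, the sketch below emits an ANTLR-style production and a Python class declaration from the same rule. The rule encoding (`RULES`) and both generator functions are assumptions made for this example; the paper's actual generator and output format are not reproduced here.

```python
# Hypothetical rule spec: plain strings are keywords or punctuation;
# (field, token) pairs are named tokens that also become AST fields.
RULES = {
    "State": ["state", ("name", "ID"), ";"],
    "Transition": [("source", "ID"), "->", ("target", "ID"), ";"],
}


def to_parser_rule(name, parts):
    """Emit an ANTLR-style production with labeled elements."""
    body = " ".join(
        f"{p[0]}={p[1]}" if isinstance(p, tuple) else f"'{p}'" for p in parts
    )
    # Parser rule names conventionally start with a lowercase letter.
    return f"{name[0].lower()}{name[1:]} : {body} ;"


def to_class_source(name, parts):
    """Emit source text for the matching abstract-syntax class."""
    fields = [p[0] for p in parts if isinstance(p, tuple)]
    lines = ["@dataclass", f"class {name}:"]
    lines += [f"    {field}: str" for field in fields]
    return "\n".join(lines)


if __name__ == "__main__":
    for rule_name, rule_parts in RULES.items():
        print(to_parser_rule(rule_name, rule_parts))
        print(to_class_source(rule_name, rule_parts))
```

Both outputs are derived from the same entry in `RULES`, so the grammar production and the class cannot drift apart, which is the redundancy the integrated approach aims to remove.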

Beyond the immediate benefits of reduced redundancy and tighter consistency, the integrated approach also simplifies language evolution. Since concrete and abstract aspects are defined together, adding new constructs or modifying existing ones does not require synchronizing separate specifications or updating transformation scripts. The authors argue that this leads to lower maintenance costs and fewer bugs in DSL toolchains.

The paper concludes by discussing limitations and future work. While the integrated grammar works well for textual DSLs, extending it to graphical languages will require additional research on visual layout meta‑models. Versioning of integrated specifications in large DSL ecosystems poses challenges that the authors suggest could be addressed by modular grammar composition techniques. Performance optimizations, such as caching parsed sub‑trees for incremental parsing, are also identified as promising directions.

In summary, the authors present a sound, extensible framework that unifies concrete and abstract syntax definitions for textual modeling languages. By embedding meta‑modeling concepts directly into grammar rules, they provide a practical solution to the long‑standing problem of syntax redundancy, paving the way for more reliable, maintainable, and scalable DSL development.

📜 Original Paper Content