Guided Grammar Convergence


📝 Original Info

  • Title: Guided Grammar Convergence
  • ArXiv ID: 1503.08476
  • Date: 2015-03-31
  • Authors: Vadim Zaytsev

📝 Abstract

Relating formal grammars is a hard problem that balances between language equivalence (which is known to be undecidable) and grammar identity (which is trivial). In this paper, we investigate several milestones between those two extremes and propose a methodology for inconsistency management in grammar engineering. While conventional grammar convergence is a practical approach relying on human experts to encode differences as transformation steps, guided grammar convergence is a more narrowly applicable technique that infers such transformation steps automatically by normalising the grammars and establishing a structural equivalence relation between them. This allows us to perform a case study with automatically inferring bidirectional transformations between 11 grammars (in a broad sense) of the same artificial functional language: parser specifications with different combinator libraries, definite clause grammars, concrete syntax definitions, algebraic data types, metamodels, XML schemata, object models.

📄 Full Content

# Introduction

Modern grammar theory has shifted its focus from general purpose programming languages to a broader scope of software languages that comprise programming languages, domain specific languages, markup languages, API libraries, interaction protocols, etc. Such software languages are specified by grammars in a broad sense that still rely on the familiar infrastructure of terminals, nonterminals and production rules, but specify a general commitment to grammatical structure found in software systems. In that sense, a type safe program commits to a particular type system; a program that uses a library commits to using its exposed interface; an XML document commits to the structure defined by its schema — failure to commit in any of these cases means errors in the interpretation of the language entity. These, and many other, scenarios can be expressed and resolved in terms of grammar technology, but not all structural commitments profit from a grammatical approach (the most remarkably problematic ones being indentation policies and naming conventions).

One of the problems of multiple implementations of the same language, known for many years, is the coexistence of an abstract syntax definition and a concrete syntax definition. Basically, the abstract syntax defines the kinds of entities that inhabit the language and must be handled by the semantics specification. A concrete syntax shows how to write down language entities and how to read them back. It is not uncommon for a programming language to have several possible concrete syntaxes: for example, any binary operation may use prefix, infix or postfix notation, without any changes to the language semantics. Indeed, we have seen infix dialects of postfix Forth (Forthwrite, InfixForth) and prefix dialects of infix REBOL (Boron). For software languages, the problem is broader: we can speak of one intended language specification and a variety of abstract and concrete syntaxes, data models, class dictionaries, metamodels, ontologies and similar contracts that conform to it.

Our definition of the intended language relies on bidirectional transformations, and in particular on the notation by Meertens, which we redefine here for the sake of completeness and clarity:

For a relation $`R \subseteq S \times T`$, a semi-maintainer is a function $`\updr:S \times T \to T`$, such that $`\forall x\in S, \forall y \in T, \langle x, x \updr y \rangle \in R`$, and $`\forall x\in S, \forall y \in T, \langle x, y \rangle \in R \Rightarrow x \updr y = y`$.

The first property is called correctness and ensures that the update caused by the semi-maintainer restores the relation. The second property is hippocraticness and states that an update has no effect (“does no harm”) if the original pair is already in the relation. Other properties of bidirectional transformations, such as undoability, are often unachievable. A maintainer is a pair of semi-maintainers $`\updr`$ and $`\updl`$. A bidirectional mapping is a relation and its maintainer.

A grammar $`G`$ conforms to the language intended by the master grammar $`M`$, if there exists a bidirectional mapping between instances of their languages.

```math
\begin{align*}
            G \models L(M) \iff\:
            &\exists R \subseteq L(G) \times L(M)\\
            &\exists \updr:L(G)\times L(M) \to L(M)\\
            &\exists \updl:L(G)\times L(M) \to L(G)
\end{align*}
```

Naturally, $`G\models L(G)`$ holds for any grammar.
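The two semi-maintainer laws are easy to check mechanically on a finite example. The sketch below is purely illustrative and not taken from the paper's implementation: the sets `S` and `T`, the relation `R` and the toy update function `upd` are all invented names.

```python
# A minimal sketch, assuming an invented finite relation R ⊆ S × T.
S = {"x1", "x2"}
T = {"y1", "y2"}
R = {("x1", "y1"), ("x2", "y2")}

def upd(x, y):
    """A toy semi-maintainer for R: keep y if <x, y> is already related
    (hippocraticness), otherwise pick some y' with <x, y'> in R
    (correctness)."""
    if (x, y) in R:
        return y
    return next(t for (s, t) in R if s == x)

# Correctness: the updated pair is always in the relation.
assert all((x, upd(x, y)) in R for x in S for y in T)
# Hippocraticness: pairs already in the relation are left unchanged.
assert all(upd(x, y) == y for (x, y) in R)
```

Together with a second semi-maintainer in the opposite direction, such a pair would form a maintainer in the sense of the definition above.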

For example, consider a concrete syntax $`G_c`$ of a programming language used by programmers and an abstract syntax $`M=G_a`$ used by a software reengineering tool. We would need a way to produce abstract syntax trees from parse trees, and a way to propagate changes made by the reengineering tool back to parse trees. If those can be constructed — examples of such algorithms exist — then $`G_c`$ conforms to the language intended by $`G_a`$. As another example, consider an object model used in a tool that stores its objects in an external database (XML or relational): the existence of a bidirectional mapping between entries (trees or tables) in the database and the objects in memory means that they represent the same intended language, even though they use very different ways to describe it and one may be a superlanguage of the other. For a more detailed formalisation and discussion of bidirectional mappings and grammars, the reader is referred elsewhere.

Roadmap. In the following sections, we will briefly present the following milestones of relationships between languages:

§7. Grammar identity: structural equality of grammars

§14. Nominal equivalence: name-based equivalence of grammars

§13. Structural equivalence: name-agnostic footprint-matching equivalence

§10. Abstract normalisation: structural equivalence of normalised grammars

Then, §11 summarises the proposed method and discusses its evaluation.

Finally, §12 concludes the paper by establishing context and contributions.

# Discussion

To summarise, grammar convergence is a technique of relating different grammars in a broad sense of the same intended software language. It relies on the transformations being programmed by an experienced grammar engineer: beyond the required expertise, the process is not incremental — the transformation steps need to be carefully considered and constructed for each new grammar added to the mix. With the definitions from the previous sections, we have described the process of guided grammar convergence, where the master grammar of the intended language is constructed once, and the transformations are inferred for any directly available grammars as well as for the ones possibly added in the future. The process works as follows.

1. Extract pure grammatical knowledge from the grammar source.

2. Use grammar mutations to preprocess the grammars, if necessary.

3. Normalise the grammars by removing all problematic/ambiguous constructs.

4. Start by matching the roots of the connected normalised grammars.

5. Match production rules by prodsig-equivalence; infer new nominal matches by matching equivalent prodsigs. Repeat for all nonterminals.

6. If several matches are possible, explore all of them and fall back in case of failure. If no global nominal resolution scheme can be inferred, fail.

7. Resolve structural differences in the production rules that matched nominally.
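The matching steps above can be sketched as a small backtracking search. The sketch below is an assumption-laden illustration, not the paper's converge::Guided algorithm: the prodsig values and the grammars `prodsigs_g` and `prodsigs_m` are invented stand-ins, and a prodsig is simplified to a plain string.

```python
# Invented prodsigs for a candidate grammar G and a master grammar M;
# nonterminals with equivalent prodsigs are candidate nominal matches.
prodsigs_g = {"expr": "(1,*)", "name": "a", "op": "a"}
prodsigs_m = {"E": "(1,*)", "N": "a", "O": "a"}

def match(todo, mapping):
    """Search for a total nominal mapping G -> M consistent with prodsigs;
    explore all candidates, fall back on failure, return None if no
    global nominal resolution scheme exists."""
    if not todo:
        return mapping
    n, rest = todo[0], todo[1:]
    for m, sig in prodsigs_m.items():
        if sig == prodsigs_g[n] and m not in mapping.values():
            result = match(rest, {**mapping, n: m})
            if result is not None:
                return result  # first globally consistent scheme found
    return None  # every candidate failed: backtrack

scheme = match(list(prodsigs_g), {})
assert scheme == {"expr": "E", "name": "N", "op": "O"}
```

The real method additionally ranks matches by degrees of prodsig equivalence and then resolves remaining structural differences; this sketch only shows the explore-and-fall-back skeleton.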

To evaluate the method of guided grammar convergence, we have applied it to a case study of 11 different grammars of the same intended functional language that was defined and used earlier in order to demonstrate the original grammar convergence method that converged 5 of these grammars. The following grammar sources were used (all of them are available in the repository of SLPS  together with their evolution history and authorship attribution):

adt:
an algebraic data type in Rascal;

antlr:
a parser description in the input language of ANTLR , with semantic actions (in Java) intertwined with EBNF-like productions;

dcg:
a logic program written in the style of definite clause grammars ;

emf:
an Ecore model, automatically generated by Eclipse  from the XML Schema of the xsd source;

jaxb:
an object model obtained by a data binding framework, generated automatically by JAXB  from the XML schema for FL;

om:
a hand-crafted object model (Java classes) for the abstract syntax of FL;

python:
a parser specification in a scripting language, using the PyParsing library ;

rascal:
a concrete syntax specification in the metaprogramming language of Rascal language workbench ;

sdf:
a concrete syntax definition in the notation of SDF with scannerless generalized LR parsing as its parsing model;

txl:
a concrete syntax definition in the notation of the TXL (Turing eXtender Language) transformational framework which, unlike SDF, uses a combination of pattern matching and term rewriting;

xsd:
an XML schema  for the abstract syntax of FL.

The complete case study is too big to be presented here; interested readers are referred to a 40+ page report containing all production rules, signatures, matchings and transformations. The case study was successful: on average, ANF was achieved after 20–30 transformation steps, nominal resolution took up to 9 steps (proportional to the number of nonterminals) and structural resolution needed 0–5 more steps, after which all 11 grammars were converged. ANTLR, DCG and PyParsing used layered definitions and therefore were the only three grammars to require mutation (another 2–6 steps). The case study is available for investigation and replication both in the form of Rascal metaprograms at http://github.com/grammarware/slps (the main algorithm is located in the converge::Guided module, which can be observed and modified at shared/rascal/src/converge/Guided.rsc) and as a PDF report with all grammars and transformations pretty-printed automatically.

Guided grammar convergence is a methodology stemming from the grammarware “technological space”. When looking for similar techniques in other spaces (engaging in “space travel”), the obvious candidates are schema matching and data integration in the field of data modeling and databases; comparison of UML models or metamodels in the context of model-driven engineering; model weaving for product line software development; computation of refactorings from different OO program versions; etc. For example, Del Fabro and Valduriez utilised metamodel properties for automatically producing weaving models. The core difference is that (meta)model weaving ultimately aims at incorporating all the changes into the resulting (meta)model, while guided grammar convergence also makes complete sense when some changes in the details are disregarded. A lower limit on this process is needed (otherwise additional claims on the minimality of inferred transformations are required), and we specify this lower limit as the master grammar. Another difference is that model weaving rarely involves more than two models, while even our little case study of guided convergence had 10+ grammars in it. In general, prodsig-based matching is more lightweight than those methods, since it compares straightforwardly structured prodsigs: it easily wins in performance and implementability, but loses in applicability to complex scenarios.

# Grammar identity

Let us assume that grammars are traditionally defined as quadruples $`G=\langle{\cal N},{\cal T},{\cal P},{\cal S}\rangle`$ where their elements are respectively the sets of nonterminal symbols, terminal symbols, production rules and starting nonterminal symbols.

Grammars $`G`$ and $`G'`$ are identical, if and only if all their components are identical: $`G = G' \iff {\cal N}={\cal N}' \wedge {\cal T}={\cal T}' \wedge {\cal P}={\cal P}' \wedge {\cal S}={\cal S}'`$.

The definition is trivial, and in practice it is commonly weakened in some way. For example, many metalanguages allow the right hand sides of rules from $`{\cal P}`$ to contain disjunction (inner choice), which is known to be commutative, so it is natural to disregard the order of disjunctive clauses when comparing grammars: gdt, the “grammar diff tool” used in convergence case studies, implements exactly that. However, many grammar manipulation technologies, such as PEG or TXL, use ordered choices, so this optimisation can be perceived as premature. For this reason, we will explicitly abandon disjunction in later sections.
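The weakened comparison can be sketched in a few lines. This is an illustrative sketch, not the gdt implementation: the grammar representation (a mapping from nonterminals to lists of top-level alternatives) and the helper name `weaken` are invented.

```python
def weaken(productions):
    """Weakened identity: disregard the order of disjunctive clauses by
    mapping each nonterminal to a frozenset of its alternatives."""
    return {n: frozenset(alts) for n, alts in productions.items()}

# Two invented grammars differing only in the order of inner choices.
g1 = {"expr": ["name", "apply", "cond"], "name": ["str"]}
g2 = {"expr": ["cond", "name", "apply"], "name": ["str"]}

assert g1 != g2                  # not identical: clause order differs
assert weaken(g1) == weaken(g2)  # equal modulo commutative disjunction
```

For ordered-choice formalisms such as PEG or TXL, `weaken` would be incorrect, which is exactly why the paper abandons disjunction instead of baking this optimisation in.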

# Abstract normalisation

In order to apply the methodology based on nonterminal footprints, production signatures and their equivalence relations, we need the input grammars to comply with some assumptions that have been left informal so far. In particular, we can foresee possible problems with names/labels for production rules and subexpressions, terminal symbols (often not a part of the abstract syntax), disjunction (inner choices, also non-factored), separator lists and other metasyntactic sugar, a non-connected nonterminal call graph, an inconsistent style of production rules, etc. If by $`{\cal P}_n\subset{\cal P}`$ we denote the subset of production rules concerning one particular nonterminal $`n\in{\cal N}`$: $`{\cal P}_n=\{p\in{\cal P} \: | \: p = n ::= \alpha,\: \alpha\in({\cal N\cup T})^*\}`$, then we can define the Abstract Normal Form as follows:

A grammar $`G=\langle{\cal N},{\cal T},{\cal P},{\cal S}\rangle`$, where $`{\cal T}=\varnothing`$ and $`{\cal S}=\{s\}`$, is said to be in Abstract Normal Form, if and only if:

$`{\cal N}`$ is decomposable into disjoint sets, such that $`{\cal N}={\cal N}_{+}\cup{\cal N}_{-}\cup{\cal N}_\bot`$

One of them is not empty and includes the root: $`s\in{\cal N}_{+}\cup{\cal N}_{-}`$

Nonterminals from one subset are undefined: $`n\in{\cal N}_\bot \Rightarrow {\cal P}_n=\varnothing`$

Nonterminals from one other subset are defined with exactly one rule:
$`n\in{\cal N}_{-} \Rightarrow |{\cal P}_n|=1,\: {\cal P}_n = \{n::=\alpha\},\: \alpha\in{\cal N}^{+}`$

Nonterminals from the other subset are defined with chain rules:
$`n\in{\cal N}_{+} \Rightarrow \forall p\in{\cal P}_n,\: p= (n::= x),\: x\in{\cal N}`$
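The conditions above can be restated as a small predicate. The sketch below is an assumption: the grammar representation (a mapping `P` from each nonterminal to its list of right-hand sides, each a list of nonterminals, with $`{\cal T}=\varnothing`$) and the function name `in_anf` are invented, and single-rule nonterminals are assigned to $`{\cal N}_{-}`$ when the decomposition is ambiguous.

```python
def in_anf(P, root):
    """Check the ANF conditions: every defined nonterminal has either
    exactly one rule with a nonempty right-hand side (N-) or only chain
    rules (N+); used-but-undefined nonterminals form N_bot."""
    defined = {n for n, rules in P.items() if rules}
    n_minus = {n for n in defined
               if len(P[n]) == 1 and len(P[n][0]) >= 1}
    n_plus = {n for n in defined - n_minus
              if all(len(rhs) == 1 for rhs in P[n])}
    return root in (n_plus | n_minus) and defined == n_plus | n_minus

# An invented grammar in ANF: "s" and "a"/"b" are N-, "choice" is N+,
# "leaf" is undefined and therefore in N_bot.
P = {"s": [["choice"]], "choice": [["a"], ["b"]],
     "a": [["leaf"]], "b": [["leaf"]]}
assert in_anf(P, "s")
# A nonterminal with two non-chain rules violates the decomposition.
assert not in_anf({"s": [["a", "b"], ["a"]]}, "s")
```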

In fact, any grammar can be rewritten to assume this form: in our prototype implementation, this is done by programming a grammar mutation. (A grammar mutation is a general intentional grammar change that cannot be expressed independently of the grammar to which it will be applied. Thus, if “rename” is a parametric grammar transformation operator, then “rename A to B” is a transformation, but “rename all nonterminals to uppercase” is a mutation, equivalent to transformations like “rename a to A” and “rename b to B” depending on the input grammar.) Our prototype, normal::ANF, is a metaprogram in Rascal that is available for inspection as open source. It is in fact a superposition of mutations that address the items from the definition individually: remove labels, desugar separator lists, fold/unfold chain production rules, etc.
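The transformation/mutation distinction quoted above can be made concrete. This is a hedged sketch, not the normal::ANF prototype: the grammar representation (nonterminal to a flat list of referenced symbols) and both function names are invented.

```python
def rename(grammar, old, new):
    """The parametric transformation: rename one nonterminal everywhere,
    both at its definition and at every use site."""
    return {
        (new if n == old else n): [new if s == old else s for s in rhs]
        for n, rhs in grammar.items()
    }

def uppercase_mutation(grammar):
    """The mutation: depending on the input grammar, derive one concrete
    renaming per nonterminal and apply them in sequence."""
    for n in list(grammar):
        grammar = rename(grammar, n, n.upper())
    return grammar

g = {"expr": ["name", "expr"], "name": []}
assert uppercase_mutation(g) == {"EXPR": ["NAME", "EXPR"], "NAME": []}
```

The mutation is thus a grammar-dependent superposition of ordinary transformations, which is what makes it impossible to express as a single fixed transformation step.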

All the rewritings performed by transforming a grammar to its ANF are assumed to be monadic in the sense of not only normalising the grammar, but also yielding a bidirectional grammar transformation chain whose execution would normalise the grammar. (In our implementation, these steps are specified in the  language, primarily because no other bidirectional grammar transformation operator suite exists.) The bidirectional grammar transformation chain can then be coupled to the bidirectional mapping between language instances from [def:intended], with the methodology described by . This is required for traceability: the conversion to ANF is one of the steps to achieve automated convergence, not a one-way preprocessing.

# Conclusion

We knew that language equivalence is undecidable and that grammar identity is trivial. In this paper, we have attempted to reach a useful level of reasoning about language relationships by departing from grammar identity as the “easy” side of the spectrum. This was done in the scope of grammar convergence, when several implementations of the same software language are inspected for compatibility.

A definition of an intended language was provided ([def:intended]) based on a bidirectional transformation between language entities. Then we revisited existing and possible techniques of structural matching that assumed nominal identity ([def:equality]). In order to automatically infer nominal matching, we introduced nonterminal footprints ([def:footprint]), production signatures ([def:signature]) and various degrees of equivalence among them. An extensive normalisation scheme ([def:anf]) was proposed to transform any given grammar into the form most suitable for nominal and then structural matching. When such normalisation is not enough, a more targeted yet still automated approach is needed: grammar mutation strategies make the method robust with respect to different grammar design decisions, such as the use of layers instead of priorities or recursion instead of iteration. Just as all other parts of the proposed process, these mutations operate automatically and do not require human intervention.

A case study was used to evaluate the proposed method of guided grammar convergence. The experiment concerned several implementations of a simple functional language in ANTLR, DCG, Ecore, Java, Python, Rascal, SDF, TXL and XML Schema. The diversity in language processing frameworks — metaprogramming languages, declarative specifications, syntax definitions, algebraic data types, parsing libraries, transformation frameworks, software models, parser definitions — was intentional and aimed at stressing the definition of the intended language and the guided convergence method. Casting all grammars from our case study to ANF allowed us to make the inference quicker and with fewer obstacles, as well as to explain the process more clearly.

All artifacts discussed in this paper are transparently available to the public through a GitHub repository. For each of the sources of the case study, one can inspect the original file, the extracted grammar, the extractor itself, the mutations that have been derived and applied, the normalisations to ANF, the normalised grammar, the nominal resolution and the reasons for each match, as well as the structural resolution steps. One can also investigate the implementation of the method of guided grammar convergence, the algorithm for calculating prodsigs and the process of convergence. Supplementary material contains 40+ pages of the full report, also generated by our prototype.

On the practical side, guided grammar convergence provides a balanced method of grammar manipulation, positioned right between unstructured inline editing (which makes grammar development very much like software development but lacks important properties such as traceability and reproducibility) and strictly exogenous functional transformation (which requires substantially more effort but is robust, repeatable and exposes the intended semantics). Its future role can be seen as a support for grammar product lines that allows both steady adaptation plans for deriving secondary artifacts from the reference grammar, and occasional inline editing of the derived artifacts with subsequent automated restoration of the adaptation scripts. This is a contribution to the field of engineering discipline for grammarware .
