A Common XML-based Framework for Syntactic Annotations
It is widely recognized that the proliferation of annotation schemes runs counter to the need to re-use language resources, and that standards for linguistic annotation are becoming increasingly mandatory. To answer this need, we have developed a framework comprising an abstract model for a variety of different annotation types (e.g., morpho-syntactic tagging, syntactic annotation, co-reference annotation), which can be instantiated in different ways depending on the annotator’s approach and goals. In this paper we provide an overview of the framework, demonstrate its applicability to syntactic annotation, and show how it can contribute to comparative evaluation of parser output and diverse syntactic annotation schemes.
💡 Research Summary
The paper addresses the growing problem of linguistic annotation proliferation, which hampers the reuse of language resources and calls for standardized annotation practices. To meet this need, the authors propose a comprehensive framework that separates an abstract model from concrete instantiations. The abstract model is defined using RDF schemas and consists of two main components: the Data Category Registry (DCR) and the Data Category Specification (DCS). The DCR provides a hierarchical taxonomy of generic linguistic concepts such as “dependent,” “argument,” “subject,” and “object,” while the DCS defines project‑specific subsets, constraints on values, and permissible locations within the annotation hierarchy.
Concrete implementations are realized through a “Dialect Specification” that employs XML schemas, XSLT scripts, and XSL style sheets. This specification allows each project to adopt its own labeling conventions and structural choices while still mapping to the abstract model. Central to the framework is the notion of a Virtual Annotation Markup Language (Virtual AML), which serves as a pivot format for all annotations. The authors introduce a structural skeleton for syntactic markup that uses a small set of XML elements:
- <struct> represents a node in the syntactic tree and can be recursively nested.
- <feat> attaches feature information (e.g., category labels) to a <struct> node; the type attribute identifies the data category.
- <rel> explicitly encodes dependency relations, with attributes such as type, head, dependent, introducer, and initial.
- <alt> and <brack> allow alternative analyses and complex feature structures.
- <seg> points to the primary text using stand‑off annotation (XLink + XPath).
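To make the skeleton concrete, here is a minimal sketch that builds a tiny annotation with Python's standard xml.etree.ElementTree. The element names and the type, head, and dependent attributes come from the summary above; the id attribute, the "cat" and "subj" values, and the make_struct helper are assumptions made for illustration and may not match the paper's actual schema.

```python
import xml.etree.ElementTree as ET


def make_struct(parent, category, struct_id):
    """Create a <struct> node with one <feat type="cat"> child.

    The id attribute and the "cat" type value are illustrative
    assumptions, not names confirmed by the paper.
    """
    if parent is None:
        node = ET.Element("struct", {"id": struct_id})
    else:
        node = ET.SubElement(parent, "struct", {"id": struct_id})
    feat = ET.SubElement(node, "feat", {"type": "cat"})
    feat.text = category
    return node


# A sentence node with two nested constituents.
sentence = make_struct(None, "S", "s0")
np_node = make_struct(sentence, "NP", "s1")
vp_node = make_struct(sentence, "VP", "s2")

# An explicit relation between the two constituents.
ET.SubElement(sentence, "rel", {"type": "subj", "head": "s2", "dependent": "s1"})

xml_text = ET.tostring(sentence, encoding="unicode")
print(xml_text)
```

Nesting <struct> elements carries the constituency information, while the separate <rel> element carries the dependency information, so both views can live in the same document.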
By defining these elements in the abstract AML, any concrete annotation scheme can be automatically transformed into the pivot format via XSLT. The paper demonstrates this process using two well‑known syntactic representations: the phrase‑structure trees of the Penn Treebank (PTB) and a pure dependency annotation scheme described by Carroll, Minnen, and Briscoe. The PTB’s LISP‑like bracketed notation is first converted into the abstract XML skeleton, making implicit relations explicit through <rel> elements. The dependency representation is likewise mapped to the same virtual AML. An XSLT filter can then generate a dependency view from the PTB‑derived AML, enabling direct, label‑agnostic comparison of the two schemes.
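The first step of that conversion, turning bracketed notation into nested <struct> nodes, can be sketched as follows. The paper performs its transformations with XSLT; this Python version is a stand-in for illustration only. The tokenize and parse helpers and the "cat" type value are assumptions, and words are inlined in <seg> for brevity, whereas the framework uses <seg> as a stand-off pointer into the primary text.

```python
import xml.etree.ElementTree as ET


def tokenize(bracketed):
    """Split a PTB-style bracketed string into parentheses and symbols."""
    return bracketed.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens, pos=0):
    """Parse one constituent starting at tokens[pos] == "(".

    Returns the <struct> element and the index of the next unread token.
    """
    assert tokens[pos] == "("
    node = ET.Element("struct")
    feat = ET.SubElement(node, "feat", {"type": "cat"})
    feat.text = tokens[pos + 1]  # the constituent label, e.g. "NP"
    pos += 2
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            node.append(child)
        else:
            # Simplification: the word is stored inline instead of
            # being referenced through a stand-off pointer.
            seg = ET.SubElement(node, "seg")
            seg.text = tokens[pos]
            pos += 1
    return node, pos + 1


tree, _ = parse(tokenize("(S (NP (NNP John)) (VP (VBZ sleeps)))"))
print(ET.tostring(tree, encoding="unicode"))
```

Deriving <rel> elements from such a tree needs scheme-specific rules (for example, head-finding conventions), which this sketch deliberately leaves out.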
This capability resolves several limitations of traditional evaluation methods such as PARSEVAL, which only handle phrase‑structure trees. With the pivot format, parsers that output dependency structures or lexical‑style analyses can be evaluated on the same footing, and the evaluation scripts can extract precisely the needed information without “dumbing down” richer outputs. Moreover, because the framework encourages explicit encoding of all relations—either directly in the XML or via an accompanying schema—gaps and inconsistencies in existing annotation guidelines become visible, prompting more coherent and interoperable scheme design.
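Once two analyses are expressed as relations in the pivot format, a label-agnostic comparison reduces to set operations over (head, dependent) pairs. The sketch below is one possible scoring scheme under that assumption, not the evaluation procedure defined in the paper; the tuple layout, the function names, and the sample relations are all hypothetical.

```python
def unlabeled_pairs(relations):
    """Drop the relation type, keeping only (head, dependent) pairs."""
    return {(head, dependent) for (_rel_type, head, dependent) in relations}


def precision_recall(gold, predicted):
    """Score predicted relations against gold ones, ignoring labels."""
    gold_pairs = unlabeled_pairs(gold)
    predicted_pairs = unlabeled_pairs(predicted)
    matched = len(gold_pairs & predicted_pairs)
    precision = matched / len(predicted_pairs) if predicted_pairs else 0.0
    recall = matched / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


# Hypothetical relations as (type, head, dependent) triples.
gold = [("subj", "sleeps", "John"), ("mod", "sleeps", "soundly")]
predicted = [("ncsubj", "sleeps", "John"), ("dobj", "sleeps", "bed")]

precision, recall = precision_recall(gold, predicted)
print(precision, recall)
```

Because the labels are discarded before matching, the "subj" and "ncsubj" relations count as the same attachment even though the two schemes name it differently.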
The Data Category Registry plays a pivotal role in achieving interoperability. By mapping project‑specific categories onto standardized definitions in the DCR, equivalence (or non‑equivalence) between different label sets is made explicit. This mapping not only facilitates automatic conversion but also supports the development of generic tools for visualization, editing, and extraction across diverse corpora.
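The role of the registry in label comparison can be illustrated with a small lookup. The category names below come from the summary; the parent-child arrangement among them, the two scheme mappings, and the compare helper are assumptions chosen for illustration, not the registry's actual contents.

```python
# Assumed hierarchy over the generic categories named in the summary.
DCR_PARENT = {
    "dependent": None,
    "argument": "dependent",
    "subject": "argument",
    "object": "argument",
}

# Hypothetical project-specific label sets mapped onto the registry.
SCHEME_A = {"SBJ": "subject", "OBJ": "object"}
SCHEME_B = {"ncsubj": "subject", "dobj": "object", "arg": "argument"}


def ancestors(category):
    """Return the category followed by all of its registry ancestors."""
    chain = []
    while category is not None:
        chain.append(category)
        category = DCR_PARENT[category]
    return chain


def compare(label_a, map_a, label_b, map_b):
    """Classify how two project labels relate through the registry."""
    cat_a = map_a.get(label_a)
    cat_b = map_b.get(label_b)
    if cat_a is None or cat_b is None:
        return "unmapped"
    if cat_a == cat_b:
        return "equivalent"
    if cat_a in ancestors(cat_b) or cat_b in ancestors(cat_a):
        return "related"
    return "distinct"


print(compare("SBJ", SCHEME_A, "ncsubj", SCHEME_B))
print(compare("SBJ", SCHEME_A, "arg", SCHEME_B))
```

Routing every comparison through the shared categories means each project maintains one mapping to the registry rather than a separate mapping to every other scheme.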
Finally, the authors note that the XCES (XML Corpus Encoding Standard) component of the framework supplies ready‑made XML support—schemas, XSLT scripts, and documentation—so that individual research groups need not develop XML expertise from scratch. The overall contribution is a flexible yet rigorous architecture that accommodates theoretical diversity while providing a common ground for comparison, evaluation, and tool reuse in syntactic annotation and, by extension, other linguistic annotation types.