Overview of Annotation Creation: Processes & Tools
To appear in James Pustejovsky & Nancy Ide (2016) “Handbook of Linguistic Annotation.” New York: Springer.

Mark A. Finlayson and Tomaž Erjavec
Abstract
Creating linguistic annotations requires more than just a reliable annotation scheme. Annotation can be a complex endeavour potentially involving many people, stages, and tools. This chapter outlines the process of creating end-to-end linguistic annotations, identifying specific tasks that researchers often perform. Because tool support is so central to achieving high-quality, reusable annotations at low cost, the focus is on identifying capabilities that are necessary or useful for annotation tools, as well as common problems these tools present that reduce their utility. Although examples of specific tools are provided in many cases, this chapter concentrates more on abstract capabilities and problems because new tools appear continuously, while old tools disappear into disuse or disrepair. The two core capabilities tools must have are support for the chosen annotation scheme and the ability to work on the language under study. Additional capabilities are organized into three categories: those that are widely provided; those that are often useful but found in only a few tools; and those that have as yet little or no available tool support.
1 Annotation: More than just a scheme
Creating manually annotated linguistic corpora requires more than just a reliable annotation scheme. A reliable scheme, of course, is a central ingredient of successful annotation; but even the most carefully designed scheme will not answer a number of practical questions about how to actually create the annotations, progressing from raw linguistic data to annotated linguistic artifacts that can be used to answer interesting questions or do interesting things. Annotation, especially high-quality annotation of large language datasets, can be a complex process potentially involving many people, stages, and tools, and the scheme only specifies the conceptual content of the annotation. By way of example, the following questions are relevant to a text annotation project and are not answered by a scheme:
How should linguistic artifacts be prepared? Will the originals be annotated directly, or will their textual content be extracted into separate files for annotation? In the latter case, what layout or formatting will be kept (lines, paragraphs, page breaks, section headings, highlighted text)? What file format will be used? How will typographical errors be handled? Will typos be ignored, changed in the original, changed in the extracted content, or encoded as an additional annotation? Who will be allowed to make corrections: the annotators themselves, adjudicators, or perhaps only the project manager?
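One common answer to the typo question is to leave the source text untouched and record each correction as a stand-off annotation anchored by character offsets. The following is a minimal, hypothetical sketch; the `Annotation` class and its field names are illustrative and not taken from any particular annotation tool:

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    start: int   # character offset where the annotated span begins
    end: int     # character offset where the span ends (exclusive)
    tag: str     # annotation type, e.g. "typo-correction"
    value: str   # the corrected form; the original text is left untouched

text = "The cat sat on teh mat."
typo = Annotation(start=15, end=18, tag="typo-correction", value="the")

# The original artifact is preserved; the correction lives only in the
# stand-off annotation, so the raw data remains citable and unmodified.
assert text[typo.start:typo.end] == "teh"
```

Because the correction is stored separately, different project roles (annotator, adjudicator, manager) can be granted or denied permission to create such records without anyone editing the source files.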
How will annotators be provided with artifacts to annotate? How will the order of annotation be specified (if at all), and how will this order be enforced? How will the project manager ensure that each document is annotated the appropriate number of times (e.g., by two different people for double annotation)?
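The double-annotation bookkeeping described above can be automated. Here is a minimal sketch in Python, under the assumption that documents are assigned to rotating pairs of annotators to balance workload; the function name and assignment strategy are illustrative, not a standard:

```python
from itertools import combinations, cycle

def assign_double_annotation(documents, annotators):
    """Assign each document to two distinct annotators, cycling through
    all annotator pairs so the workload is spread roughly evenly."""
    pairs = cycle(combinations(annotators, 2))
    return {doc: next(pairs) for doc in documents}

docs = ["doc1", "doc2", "doc3", "doc4"]
people = ["ana", "ben", "cid"]
plan = assign_double_annotation(docs, people)

# Every document is assigned to exactly two different annotators.
assert all(len(set(pair)) == 2 for pair in plan.values())
```

A real project manager's tool would also need to enforce the plan, e.g. by only releasing a document to the annotators it was assigned to.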
What inter-annotator agreement measures (IAAs) will be computed, and when? Will IAAs be measured continuously, on batches, or on other subsets of the corpus? How will their measurement at the right time be enforced? Will IAAs be used to track annotator training? If so, what level of IAA will be considered to indicate that training has succeeded?
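As a concrete example of one widely used IAA measure, Cohen's kappa compares the observed agreement between two annotators against the agreement expected by chance from each annotator's label distribution. A minimal sketch (illustrative, not tied to any particular annotation tool):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Proportion of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["N", "V", "N", "N", "V", "N"]
b = ["N", "V", "N", "V", "V", "N"]
print(round(cohens_kappa(a, b), 3))  # → 0.667
```

What threshold counts as "training has succeeded" is a project-level decision; commonly cited rules of thumb treat values above roughly 0.8 as strong agreement, but the appropriate level depends on the difficulty of the scheme.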
These questions are only a small selection of those that arise during the practical process of conducting annotation. The first goal of this chapter is to give an overview of the process of annotation from start to finish, pointing out these sorts of questions and subtasks at each stage. We will start with a known conceptual framework for the annotation process, the MATTER framework (Pustejovsky & Stubbs, 2013), and expand upon it. Our expanded framework is not guaranteed to be complete, but it will give the reader a strong flavor of the kinds of issues that arise, so that they can anticipate them in the design of their own annotation project.
The second goal is to explore the capabilities required of annotation tools. Tool support is central to achieving high-quality, reusable annotations at low cost. The focus will be on identifying capabilities that are necessary or useful for annotation tools. Again, this list will not be exhaustive, but it will be fairly representative, as the majority of it was generated by surveying a number of annotation experts about their opinions of available tools. Also listed are common problems that reduce tool utility (gathered during the same survey). Although specific examples of tools will be provided in many cases, the focus will be on more abstract capabilities and problems, because new tools appear all the time while old tools disappear into disuse or disrepair.
Before beginning, it is well to first i