Font Identification in Historical Documents Using Active Learning
📝 Abstract
Identifying the type of font (e.g., Roman, Blackletter) used in historical documents can help optical character recognition (OCR) systems produce more accurate text transcriptions. To this end, we present an active-learning strategy that can significantly reduce the number of labeled samples needed to train a font classifier. Our approach extracts image-based features that exploit geometric differences between fonts at the word level, and combines them into a bag-of-words representation for each page in a document. We evaluate six sampling strategies based on uncertainty, dissimilarity and diversity criteria, and test them on a database containing over 3,000 historical documents with Blackletter, Roman and Mixed fonts. Our results show that a combination of uncertainty and diversity achieves the highest predictive accuracy (89% of test cases correctly classified) while requiring only a small fraction of the data (17%) to be labeled. We discuss the implications of this result for mass digitization projects of historical documents.
📄 Content
Anshul Gupta¹, Ricardo Gutierrez-Osuna¹, Matthew Christy², Richard Furuta¹, Laura Mandell²
¹ Department of Computer Science and Engineering, Texas A&M University
² Initiative for Digital Humanities, Media, and Culture, Texas A&M University
{anshulg, rgutier, mchristy, furuta, mandell}@tamu.edu
Introduction
Digitization provides easy access to most of the documents published in the modern era, from texts to images and video. By comparison, printed historical documents from the hand-press period (roughly 1475-1800), everything from pamphlets to ballads to multi-volume poetry collections, are difficult for all but the most devoted scholars to access. The need to create machine-searchable collections has accelerated work on Optical Character Recognition (OCR) of historical documents.
OCR of historical documents is a challenging task, partly because of the physical condition of the documents and the quality of the scanned images, but also because of the font characteristics. Historical documents from the hand-press period have irregular fonts and show large variations within a single font class, since early printing processes had not been standardized. Blackletter (or Gothic) and Roman are the two main font classes used in early modern printing, but both have evolved into multiple subclasses since the first printed book. Knowing the font type and characteristics of each document in a collection can substantially improve the performance of OCR systems (La Manna et al. 1999, Imani et al. 2011). In large collections, however, hand-tagging each individual document, page and text region according to its font becomes prohibitive. As an example, the Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO) databases, two of the largest collections available, contain over 45 million pages.
To address this problem, we present a font-identification system that can be used to automatically tag individual documents within a large collection according to their fonts¹. Font identification is best formulated as a supervised classification problem, and as such it requires labeled data for model building. Classification models work best when they have sufficient labeled data that represent the diversity of exemplars in the corpus. In our case, however, obtaining large amounts of labeled data from a corpus of 45 million page images, with varied font types, is a daunting task. For this reason, we propose an active-learning approach to optimize the hand-labeling process. Active learning is a mixed-initiative paradigm in which a machine learning (ML) algorithm and a human work together during model building: the ML algorithm suggests a few high-value unlabeled exemplars, these are passed to the human to obtain labels, the model is updated based on the newly labeled exemplars, and the process is repeated until the model converges.
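The active-learning loop described above can be illustrated with a minimal pool-based sketch. It is not the system presented in this paper: the nearest-centroid classifier, the margin-based uncertainty score, and the names `fit`, `predict`, `margin`, `active_learning` and `oracle` are all assumptions made for illustration, and the loop stops after a fixed number of rounds rather than testing for convergence. The `oracle` callable stands in for the human annotator.

```python
import math


def fit(pool, labeled):
    """Nearest-centroid model: mean feature vector per class.

    An illustrative stand-in classifier, not the one used in the paper.
    """
    sums, counts = {}, {}
    for idx, label in labeled.items():
        acc = sums.setdefault(label, [0.0] * len(pool[idx]))
        for k, value in enumerate(pool[idx]):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def predict(model, x):
    """Return the class whose centroid is closest to x."""
    return min(model, key=lambda c: math.dist(model[c], x))


def margin(model, x):
    """Gap between the two nearest centroids; a small gap means high uncertainty."""
    dists = sorted(math.dist(centroid, x) for centroid in model.values())
    return dists[1] - dists[0] if len(dists) > 1 else 0.0


def active_learning(pool, oracle, seed_idx, batch_size=2, rounds=5):
    """Repeatedly query labels for the most uncertain unlabeled samples.

    pool     -- list of feature vectors (one per page, for example)
    oracle   -- callable index -> label; plays the role of the human
    seed_idx -- indices labeled up front to train the first model
    """
    labeled = {i: oracle(i) for i in seed_idx}
    for _ in range(rounds):
        model = fit(pool, labeled)
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # Uncertainty sampling: smallest margin first.
        ranked = sorted(unlabeled, key=lambda i: margin(model, pool[i]))
        for i in ranked[:batch_size]:
            labeled[i] = oracle(i)
    # Refit so the returned model reflects every label collected.
    return fit(pool, labeled), labeled


if __name__ == "__main__":
    # Toy 2-D features; real inputs would be per-page feature vectors.
    pool = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (0.5, 0.45)]
    truth = ["roman", "roman", "blackletter", "blackletter", "roman"]
    model, labeled = active_learning(
        pool, lambda i: truth[i], seed_idx=[0, 2], batch_size=1, rounds=2
    )
    print(sorted(labeled), predict(model, (0.1, 0.1)))
```

This sketch ranks candidates by uncertainty alone. A criterion that also rewards diversity would additionally penalize candidates that lie close to samples already chosen in the same batch.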
The remaining parts of this document are organized as follows. First, we review the characteristics of historical fonts and how they may be exploited for automated classification. Next, we describe the proposed active-learning methodology, including the feature extraction process, the sampling strategies used to select informative unlabeled instances, and the classification model. Then, we p
¹ This work is part of the Early Modern OCR Project (eMOP) at Texas A&M University (http://emop.tamu.edu), whose overarching goals are to produce accurate transcriptions for the ECCO/EEBO collections and create tools (dictionaries, workflows, and databases) that can be used for OCR’ing other collections of early modern texts at libraries and museums elsewhere.