Font Identification in Historical Documents Using Active Learning


📝 Abstract

Identifying the type of font (e.g., Roman, Blackletter) used in historical documents can help optical character recognition (OCR) systems produce more accurate text transcriptions. To this end, we present an active-learning strategy that can significantly reduce the number of labeled samples needed to train a font classifier. Our approach extracts image-based features that exploit geometric differences between fonts at the word level, and combines them into a bag-of-words representation for each page in a document. We evaluate six sampling strategies based on uncertainty, dissimilarity and diversity criteria, and test them on a database containing over 3,000 historical documents with Blackletter, Roman and Mixed fonts. Our results show that a combination of uncertainty and diversity achieves the highest predictive accuracy (89% of test cases correctly classified) while requiring only a small fraction of the data (17%) to be labeled. We discuss the implications of this result for mass digitization projects of historical documents.

📄 Content

Font Identification in Historical Documents Using Active Learning
Anshul Gupta¹, Ricardo Gutierrez-Osuna¹, Matthew Christy², Richard Furuta¹, Laura Mandell²
¹ Department of Computer Science and Engineering, Texas A&M University
² Initiative for Digital Humanities, Media, and Culture, Texas A&M University
{anshulg, rgutier, mchristy, furuta, mandell}@tamu.edu

Introduction Digitization provides easy access to most of the documents published in the modern era, from texts to images and vid- eo. By comparison, printed historical documents— everything from pamphlets to ballads to multi-volume po- etry collections in the hand-press period (roughly 1475- 1800)—are difficult to access by all but the most devoted scholars. The need to create machine-searchable collec- tions has accelerated work on Optical Character Recogni- tion (OCR) of historical documents.
OCR of historical documents is a challenging task, partly due to the physical integrity of the documents and the quality of the scanned images, but also because of the font characteristics. Historical documents in the hand-press period have irregular fonts, and show large variations within a single font class since the early printing processes had not been standardized. Blackletter (or Gothic) and Roman font classes are the two main font types used in early modern printing, but these two font classes have evolved into multiple subclasses since the first printed book. Knowing the font type and characteristics for each document in a collection can substantially improve the performance of OCR systems (La Manna et al. 1999, Imani et al. 2011). In large collections, however, hand-tagging each individual document, page and text region according to its font becomes prohibitive. As an example, the Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO) databases, two of the largest collections available, contain over 45 million pages.
To address this problem, we present a font-identification system that can be used to automatically tag individual documents within a large collection according to their fonts¹. Font identification is best formulated as a supervised classification problem, and as such it requires labeled data for model building. Classification models work best when they have sufficient labeled data that represent the diversity of exemplars in the corpus. In our case, however, obtaining large amounts of labeled data from a corpus of 45 million page images, with varied font types, is a daunting task. For this reason, we propose an active-learning approach to optimize the hand-labeling process. Active learning is a mixed-initiative paradigm where a machine learning (ML) algorithm and a human work together during model building: the ML algorithm suggests a few high-value unlabeled exemplars, these are passed to the human to obtain labels, the model is adapted based on these newly-labeled exemplars, and the process is repeated until the model converges.
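The suggest, label, retrain cycle described above can be illustrated with a minimal pool-based sketch. Everything in it is an assumption made for illustration rather than the system described in this paper: the nearest-centroid classifier is a stand-in for the actual font classifier, least-confidence scoring is just one possible uncertainty criterion, the `oracle` callback plays the role of the human annotator, and a fixed number of rounds replaces a real convergence check. All function and variable names are hypothetical.

```python
import math


def fit_centroids(features, labels):
    """Stand-in classifier: one mean feature vector (centroid) per class."""
    sums, counts = {}, {}
    for x, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def predict_proba(centroids, x):
    """Class probabilities from a softmax over negative centroid distances."""
    scores = {c: -math.dist(x, m) for c, m in centroids.items()}
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def predict(centroids, x):
    """Most probable class for a single feature vector."""
    probs = predict_proba(centroids, x)
    return max(probs, key=probs.get)


def least_confidence(probs):
    """Uncertainty score: 1 minus the probability of the top class."""
    return 1.0 - max(probs.values())


def active_learning(pool, oracle, seed_indices, batch_size=2, rounds=3):
    """Query the oracle for the most uncertain pool items, then retrain."""
    labeled = {i: oracle(i) for i in seed_indices}  # small labeled seed set
    for _ in range(rounds):  # assumption: fixed budget, not a convergence test
        model = fit_centroids([pool[i] for i in labeled], list(labeled.values()))
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # Rank unlabeled items from most to least uncertain under the model.
        unlabeled.sort(
            key=lambda i: least_confidence(predict_proba(model, pool[i])),
            reverse=True,
        )
        for i in unlabeled[:batch_size]:
            labeled[i] = oracle(i)  # the "human" supplies the label
    model = fit_centroids([pool[i] for i in labeled], list(labeled.values()))
    return model, labeled


# Toy pool: two well-separated synthetic "font" clusters along one axis.
pool = [[0.1 * i, 0.0] for i in range(20)] + [[3.0 + 0.1 * i, 0.0] for i in range(20)]
truth = ["roman"] * 20 + ["blackletter"] * 20

model, labeled = active_learning(pool, lambda i: truth[i], seed_indices=[0, 39])
print("queried indices:", sorted(labeled))
print("prediction for a new point:", predict(model, [0.5, 0.0]))
```

In this toy run the queried items cluster around the boundary between the two classes, which is the behavior an uncertainty criterion is meant to produce: labels are requested where the current model is least sure, so only 8 of the 40 items end up labeled.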
The remaining parts of this document are organized as follows. First, we review the characteristics of historical fonts and how they may be exploited for automated classification. Next, we describe the proposed active-learning methodology, including the feature extraction process, the sampling strategies used to select informative unlabeled instances, and the classification model. Then, we p

¹ This work is part of the Early Modern OCR Project (eMOP) at Texas A&M University (http://emop.tamu.edu), whose overarching goals are to produce accurate transcriptions for the ECCO/EEBO collections and create tools (dictionaries, workflows, and databases) that can be used for OCR’ing other collections of early modern texts at libraries and museums elsewhere.

This content is AI-processed based on ArXiv data.
