Recognition of Handwritten Roman Script Using Tesseract Open source OCR Engine


📝 Original Info

  • Title: Recognition of Handwritten Roman Script Using Tesseract Open source OCR Engine
  • ArXiv ID: 1003.5891
  • Date: 2010-03-31
  • Authors: Researchers from original ArXiv paper

📝 Abstract

In the present work, we have used the Tesseract 2.01 open source Optical Character Recognition (OCR) engine, released under Apache License 2.0, to recognize handwriting samples of lower case Roman script. Handwritten isolated and free-flow text samples were collected from multiple users. Tesseract is trained to recognize user-specific handwriting samples of both categories of document pages. On a single-user model, the system is trained with 1844 isolated handwritten characters and tested on 1133 characters taken from the test set. The overall character-level accuracy of the system is observed to be 83.5%. The system fails to segment 5.56% of the characters and erroneously classifies 10.94% of them.

📄 Full Content

Optical Character Recognition (OCR) systems greatly lower the barrier that the keyboard interface places between man and machine, and help in office automation with a large saving of time and human effort. Such systems convert paper-based input text into output coded in ASCII or some other character code, so that the scanned text can be manipulated as desired. For a specific language based on some alphabet, OCR techniques are aimed at either printed text or handwritten text. The present work is aimed at the latter.

Machine recognition of handwritten text is one of the challenging areas of research for the pattern recognition community. In general, OCR systems have potential applications in extracting data from filled-in forms, interpreting handwritten addresses on postal documents for automatic routing, automatic reading of bank cheques, etc. The core component of such application software is an OCR engine, equipped with key functional modules such as line extraction, line-to-word segmentation, word-to-character segmentation, character recognition and word-level lexicon analysis using standard dictionaries. Development of a handwritten OCR engine with high recognition accuracy is still an open problem for the research community. Many research efforts have already been reported [1][2][3][4][5][6][7][8] on different key aspects of handwritten character recognition systems. In the current work, instead of developing a new handwritten OCR engine from scratch, we have used Tesseract 2.01 [9], an open source OCR engine under Apache License 2.0, for recognition of handwritten pages consisting of lower case characters of Roman script. The Tesseract OCR engine provides a high level of character recognition accuracy on poorly printed or poorly copied dense text, but its performance has not been extensively tested on handwritten characters. This has been one of the major motivations behind the work presented in this paper.

In the current work, we have used Tesseract to perform user-specific training on handwriting samples of both isolated and free-flow text, written in lower case Roman script. The performance is evaluated on both categories of document pages to observe character-level and word-level accuracies.
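To make the character-level figures concrete, the short sketch below converts the percentages reported in the abstract (83.5% recognized, 5.56% not segmented, 10.94% misclassified, on 1133 test characters) into approximate character counts. The function and constant names are illustrative, and the counts are back-calculated by rounding; they are not values reported in the paper.

```python
# Figures reported in the abstract for the single-user model.
TEST_CHARACTERS = 1133
ACCURACY_PCT = 83.5
SEGMENTATION_FAILURE_PCT = 5.56
MISCLASSIFICATION_PCT = 10.94


def approximate_count(percent: float, total: int) -> int:
    """Convert a reported percentage into an approximate character count."""
    return round(percent * total / 100)


def breakdown(total: int = TEST_CHARACTERS) -> dict:
    """Return approximate counts for each of the three reported outcomes."""
    return {
        "recognized": approximate_count(ACCURACY_PCT, total),
        "segmentation_failures": approximate_count(SEGMENTATION_FAILURE_PCT, total),
        "misclassified": approximate_count(MISCLASSIFICATION_PCT, total),
    }


if __name__ == "__main__":
    for outcome, count in breakdown().items():
        print(f"{outcome}: about {count} of {TEST_CHARACTERS}")
```

Under these assumptions the three percentages sum to 100%, and the rounded counts (about 946, 63 and 124 characters) add up to the 1133 test characters, which suggests that every test character falls into exactly one of the three outcomes.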

Tesseract is an open source (Apache License 2.0) offline optical character recognition engine, originally developed at Hewlett Packard from 1984 to 1994. It began as a PhD research project at HP Labs, Bristol [10]. In 1995 it was sent to UNLV, where it proved its worth against the commercial engines of the time [11]. Like any standard OCR engine, Tesseract is built on key functional modules such as a line and word finder, a word recognizer, a static character classifier, a linguistic analyzer and an adaptive classifier. However, it does not support document layout analysis, output formatting or a graphical user interface. Currently, Tesseract can recognize printed text in English, Spanish, French, Italian, Dutch, German and various other languages.

To train Tesseract for the English language, 8 data files are required in the tessdata subdirectory. The 8 files used for English are to be generated as follows:

  • tessdata/eng.freq-dawg
  • tessdata/eng.word-dawg
  • tessdata/eng.user-words
  • tessdata/eng.inttemp
  • tessdata/eng.normproto
  • tessdata/eng.pffmtable
  • tessdata/eng.unicharset
  • tessdata/eng.DangAmbigs
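As a quick sanity check after training, one might verify that all eight files listed above are present before running recognition. The following is a minimal sketch, not part of the original work: the helper name `missing_language_files` and the default directory `tessdata` are assumptions made for illustration.

```python
from pathlib import Path

# Suffixes of the eight data files listed above for a Tesseract 2.x language.
REQUIRED_SUFFIXES = (
    "freq-dawg",
    "word-dawg",
    "user-words",
    "inttemp",
    "normproto",
    "pffmtable",
    "unicharset",
    "DangAmbigs",
)


def missing_language_files(tessdata_dir, lang="eng"):
    """Return the names of required data files that are absent from tessdata_dir."""
    tessdata = Path(tessdata_dir)
    return [
        f"{lang}.{suffix}"
        for suffix in REQUIRED_SUFFIXES
        if not (tessdata / f"{lang}.{suffix}").is_file()
    ]


if __name__ == "__main__":
    # "tessdata" is a hypothetical location; adjust it to the local installation.
    missing = missing_language_files("tessdata", lang="eng")
    if missing:
        print("Missing data files:", ", ".join(missing))
    else:
        print("All 8 data files are present.")
```

For a user-specific model, the same check could be run with a different language prefix in place of `eng`, assuming the trained files follow the same naming pattern.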

In the present work, we have used Tesseract version 2.01 for recognition of handwriting samples of both isolated and free-flow text, written in lower case Roman script. Key functional modules of the developed system are discussed in the following subsections.

To collect the dataset for the current experiment, we concentrated on lower case characters of Roman script. Six handwritten document pages were collected from each of four different users, in two types of datasets. In the first set, four pages of isolated handwritten lower case Roman characters were collected, as shown in Fig. 1.

For labeling the training samples for Tesseract, we used a tool named bbTesseract [12]. To generate the training files for a specific user, we need to prepare a box file for each training image using the following command:

The box file is a text file that lists the characters in the training image, in order, one per line, with the coordinates of the bounding box around each character. Tesseract 2.01 has a mode in which it outputs a text file in the required format. When the character set differs from the one Tesseract is currently trained on, the recognized text will naturally be incorrect. In that case we have to edit the file manually (using bbTesseract) to correct the wrong characters. Then we have to rename fontfile.txt to fontfile.box. Fig. 3 shows a screenshot of the bbTesseract tool, used for labeling the training set.
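The manual correction step described above can be pictured with a small sketch that reads box-file lines, relabels one wrongly recognized character and writes the lines back out. It assumes the Tesseract 2.x layout of one entry per line in the form `character left bottom right top`, with coordinates measured from the bottom-left corner of the image; the class `BoxEntry`, the helper functions and the sample coordinates are all hypothetical and are not taken from the paper or from bbTesseract.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List


@dataclass(frozen=True)
class BoxEntry:
    """One labeled character and its bounding box (assumed 2.x box-file layout)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_lines(lines: Iterable[str]) -> List[BoxEntry]:
    """Parse 'char left bottom right top' lines, skipping blank lines."""
    entries = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) < 5:
            raise ValueError(f"line {line_number}: expected 5 fields, got {len(fields)}")
        left, bottom, right, top = (int(value) for value in fields[1:5])
        entries.append(BoxEntry(fields[0], left, bottom, right, top))
    return entries


def relabel(entries: List[BoxEntry], index: int, correct_char: str) -> List[BoxEntry]:
    """Return a copy of entries with the label at index replaced."""
    fixed = list(entries)
    fixed[index] = replace(fixed[index], char=correct_char)
    return fixed


def format_box_lines(entries: Iterable[BoxEntry]) -> List[str]:
    """Serialize entries back into box-file lines."""
    return [f"{e.char} {e.left} {e.bottom} {e.right} {e.top}" for e in entries]


if __name__ == "__main__":
    # Hypothetical output in which a handwritten 'b' was recognized as '6'.
    raw = ["a 10 20 30 50", "6 40 20 60 52"]
    corrected = relabel(parse_box_lines(raw), index=1, correct_char="b")
    print("\n".join(format_box_lines(corrected)))
```

A tool such as bbTesseract performs this kind of relabeling interactively; the sketch only illustrates what the edit amounts to at the file level.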

For training a new handwritten character set for any user, we have to put in the effort to get one

…(Full text truncated)…
