Telugu OCR Framework using Deep Learning

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

In this paper, we address the task of Optical Character Recognition (OCR) for the Telugu script. We present an end-to-end framework that segments the text image, classifies the characters, and extracts lines using a language model. The segmentation is based on mathematical morphology. The classification module, the most challenging task of the three, is a deep convolutional neural network. The language is modelled as a third-degree Markov chain at the glyph level. Telugu script is a complex alphasyllabary and the language is agglutinative, making the problem hard. In this paper we apply the latest advances in neural networks to achieve state-of-the-art error rates. We also review convolutional neural networks in great detail and expound the statistical justification behind the many tricks needed to make Deep Learning work.


💡 Research Summary

This paper presents a complete optical character recognition (OCR) system specifically designed for the Telugu script, a complex alphasyllabary used by over 80 million speakers. The authors address the three classic OCR sub‑tasks—line detection, glyph segmentation, and character classification—by integrating modern deep‑learning techniques with a language model that captures contextual dependencies.

Problem definition and data characteristics
Telugu characters consist of 16 vowels, 37 consonants, and a large set of ligatures, yielding roughly 460 distinct glyphs that appear in printed and handwritten documents. A single glyph may correspond to a whole syllable, and its visual appearance can change depending on its vertical position relative to the baseline (e.g., a vowel‑bearing form versus a vowel‑less form). Consequently, an OCR pipeline must preserve positional metadata while distinguishing a relatively large number of classes.

Line and glyph segmentation
The authors start from a binary image where ink pixels are 1. They compute a row‑ink marginal (the count of ink pixels per row) and rotate the image to maximize the variance of its first derivative, thereby correcting skew. To estimate line spacing, they subtract the mean from the marginal and apply a discrete Fourier transform; the wavelength of the dominant harmonic provides a robust estimate of the baseline‑to‑baseline distance, even in the presence of descenders. Sudden drops in the marginal are marked as baselines, and peaks between baselines are taken as toplines. This yields line‑level sub‑images. Within each line, connected components are extracted using the Leptonica library, normalized to 48 × 48 binary patches, and annotated with their vertical offset from the baseline.
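The skew-correction and line-spacing steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the ideas described (row-ink marginal, variance of its first derivative, dominant DFT harmonic), not the paper's actual code; the function names are ours.

```python
import numpy as np

def estimate_line_spacing(binary_img):
    """Estimate the baseline-to-baseline distance from the row-ink marginal.

    The marginal (count of ink pixels per row) is roughly periodic, with
    period equal to the line spacing; the wavelength of the dominant DFT
    harmonic recovers it robustly, even with descenders present.
    """
    marginal = binary_img.sum(axis=1).astype(float)
    marginal -= marginal.mean()                # subtract the mean (DC term)
    spectrum = np.abs(np.fft.rfft(marginal))
    spectrum[0] = 0.0                          # ignore any residual DC
    k = int(np.argmax(spectrum))               # dominant harmonic index
    return len(marginal) / k                   # wavelength, in rows

def skew_score(binary_img):
    """Deskewing objective: the variance of the marginal's first derivative
    is maximal when text lines are horizontal, so the page is rotated to
    the angle that maximizes this score."""
    marginal = binary_img.sum(axis=1).astype(float)
    return np.diff(marginal).var()
```

In practice one would evaluate `skew_score` over a small grid of candidate rotation angles and keep the best, then segment at sudden drops in the marginal as described above.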

Training data generation
Because no public Telugu OCR dataset exists, the authors synthesize one. They collect a 150 MB corpus of Unicode Telugu text from the web, generate sentences that contain every possible glyph, and render these sentences in four styles (normal, bold, italic, bold‑italic) across fifty different fonts. The same segmentation pipeline is applied to the rendered pages, producing roughly 73 000 labeled glyph images (≈160 renderings per class × 460 classes). Both the images and the code for generating them are released, providing a benchmark for future work.
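The rendering side of this pipeline can be sketched with Pillow. This is a hypothetical simplification: the paper renders full Telugu sentences in fifty fonts and four styles, whereas here a single line is drawn with a default font and binarized so that ink pixels are 1, matching the segmentation code's input convention.

```python
from PIL import Image, ImageDraw, ImageFont

def render_page(text, font, size=(400, 100)):
    """Render one line of text as a binary page image (ink = 1).

    Pages like this are fed back through the same segmentation pipeline
    to harvest labeled glyph patches for training.
    """
    page = Image.new("L", size, color=255)              # white page
    ImageDraw.Draw(page).text((10, 10), text, font=font, fill=0)
    return page.point(lambda p: 1 if p < 128 else 0)    # binarize: ink = 1

# Illustrative usage; a real run would loop over fonts and styles.
font = ImageFont.load_default()
page = render_page("sample text", font)
```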

CNN‑based character classification
The classification module receives the 48 × 48 binary glyph and processes it through a deep convolutional neural network. The network consists of several convolutional layers with small 3 × 3 kernels, zero‑padding to preserve spatial dimensions, and interleaved 2 × 2 max‑pooling layers that provide translational invariance. Regularization techniques—data augmentation (rotation, translation, scaling), dropout, and weight decay—are employed to combat over‑fitting on the synthetic dataset. The final fully‑connected layer has 460 units followed by a softmax, yielding a probability distribution over all possible glyphs. The architecture is described in detail, including the mathematical formulation of convolution, pooling, and activation functions.
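A network of the kind described can be sketched in PyTorch. The channel widths and depth below are illustrative assumptions, not the paper's exact configuration; what it preserves is the stated structure: 48 × 48 single-channel input, zero-padded 3 × 3 convolutions, interleaved 2 × 2 max-pooling, dropout, and a 460-way output layer.

```python
import torch
import torch.nn as nn

class TeluguGlyphCNN(nn.Module):
    """Sketch of a glyph classifier: 3x3 convs, 2x2 pooling, 460-way head."""

    def __init__(self, n_classes=460):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # 48 -> 24
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 24 -> 12
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 12 -> 6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),                      # regularization, as in the paper
            nn.Linear(128 * 6 * 6, n_classes),    # logits; softmax applied at inference
        )

    def forward(self, x):
        # x: (batch, 1, 48, 48) binary glyph patches
        return self.classifier(self.features(x))
```

Training would additionally apply the augmentations mentioned above (small rotations, translations, and scalings) and weight decay on the optimizer.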

Language model for post‑processing
To resolve ambiguities and recover broken glyphs, the authors train a third‑order Markov chain (a 3‑gram model) over glyph sequences using the same web corpus. During inference, the CNN provides per‑glyph posterior probabilities, which are combined with the Markov transition probabilities via Bayesian inference. The most likely glyph sequence for each line is obtained with the Viterbi algorithm, effectively correcting mis‑classifications that are inconsistent with Telugu phonotactics.
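The decoding step can be illustrated with a Viterbi pass over log-probabilities. For brevity this sketch uses a first-order chain; the paper's third-order model works the same way with states expanded to glyph trigrams.

```python
import numpy as np

def viterbi(log_emit, log_trans, log_prior):
    """Most likely glyph sequence for one line.

    log_emit : (T, K) per-position log-posteriors from the CNN
    log_trans: (K, K) log transition probabilities from the n-gram model
    log_prior: (K,)  log prior over the first glyph
    """
    T, K = log_emit.shape
    score = log_prior + log_emit[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # (prev glyph, next glyph)
        back[t] = cand.argmax(axis=0)          # best predecessor per glyph
        score = cand.max(axis=0) + log_emit[t]
    path = [int(score.argmax())]               # backtrack from the best endpoint
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

Combining the CNN posteriors with transition probabilities in log space is exactly the Bayesian fusion described above: glyphs the CNN slightly prefers can be overruled when the transition model makes them phonotactically implausible.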

Experiments and results
The system is evaluated on a held‑out test set that includes a variety of fonts, styles, and simulated noise. Reported character error rates (CER) are substantially lower than those achieved by a Google‑augmented Tesseract OCR engine, and the authors claim performance “near human level.” While exact numerical results are not reproduced in the excerpt, the paper includes comparative plots and discusses the impact of each component (segmentation, CNN, language model) on overall accuracy.

Strengths, limitations, and future work
Key strengths are the fully reproducible data pipeline, the thorough exposition of CNN architecture, and the integration of a language model that leverages Telugu’s syllabic structure. Limitations include a lack of extensive testing on real‑world scanned documents with complex backgrounds, limited discussion of computational efficiency, and no exploration of end‑to‑end training that would jointly optimize segmentation, classification, and language modeling. The authors suggest future directions such as multi‑script extensions, lightweight models for mobile deployment, and unified end‑to‑end networks.

Conclusion
By combining mathematically grounded segmentation, a large synthetically generated training set, a deep convolutional classifier, and a third‑order Markov language model, the authors deliver a Telugu OCR system that outperforms existing commercial solutions. The public release of data and code establishes a valuable benchmark for the community and demonstrates that modern deep‑learning techniques can effectively handle the challenges posed by complex alphasyllabic scripts.

