Optimizing Nepali PDF Extraction: A Comparative Study of Parser and OCR Technologies


This research compares PDF parsing and Optical Character Recognition (OCR) methods for extracting Nepali text from PDFs. PDF parsing offers fast, accurate extraction but struggles with non-Unicode Nepali fonts. OCR, specifically PyTesseract, overcomes these challenges and handles both digital and scanned PDFs. The study finds that while PDF parsers are faster, their accuracy varies with PDF type; OCR engines, PyTesseract in particular, deliver consistent accuracy at the cost of slightly longer extraction times. Given the project's emphasis on Nepali PDFs, PyTesseract emerges as the most suitable library, balancing extraction speed and accuracy.


💡 Research Summary

The paper presents a systematic comparison between PDF parsing and Optical Character Recognition (OCR) techniques for extracting Nepali text from PDF documents. The authors motivate the study by noting that most existing PDF extraction benchmarks focus on high‑resource languages such as English, leaving a gap for low‑resource scripts like Devanagari‑based Nepali, especially when non‑Unicode fonts (e.g., Preeti, Sagarmatha) are used. Their overarching goal is to support a Nepali text‑to‑speech (TTS) pipeline, which requires reliable text extraction from a variety of PDF sources.

Methodology
Five PDFs were selected to represent three distinct categories: (1) digitally created PDFs containing Unicode Nepali characters (PDF 1 and PDF 2), (2) PDFs that use non‑Unicode Nepali fonts, requiring post‑extraction transliteration (PDF 3 and PDF 4), and (3) a scanned PDF where the text is embedded as an image (PDF 5). Four open‑source Python libraries were evaluated: two parsers—PyMuPDF and PyPDF2—and two OCR engines—PyTesseract and EasyOCR. For OCR, the workflow involved converting each PDF page to an image using PyMuPDF, then feeding the image to the OCR engine. Extraction time (seconds) and character‑level accuracy (percentage of correctly recognized Unicode characters) were measured against manually verified ground truth.

Results
Unicode PDFs: Both parsers achieved near‑perfect accuracy (≈100 %) with execution times under half a second (PyMuPDF: 0.006–0.007 s, PyPDF2: 0.370–0.474 s). PyTesseract was slightly slower (≈0.7 s) and yielded 98 % accuracy, while EasyOCR required more than 14 s and produced ~97 % accuracy.

Non‑Unicode Font PDFs: Parsers extracted Latin‑character equivalents, necessitating a one‑to‑one mapping step that reduced overall accuracy to 86–96 %. PyTesseract maintained 99.8 % accuracy with a modest time penalty (≈1 s), and EasyOCR achieved comparable accuracy (≈97–99 %) but at a much higher cost (≈17–20 s).
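The one-to-one mapping step for non-Unicode fonts amounts to a lookup table from legacy-font glyphs to Unicode Devanagari. The sketch below is illustrative only: the three Preeti entries shown are a toy subset, and a production table needs hundreds of entries plus context-sensitive rules for conjuncts and vowel signs, which is exactly why the parsers' post-mapping accuracy drops to 86–96 %:

```python
# Illustrative subset of a Preeti -> Unicode Devanagari mapping.
# (Toy entries for demonstration; a real table is far larger.)
PREETI_TO_UNICODE = {
    "s": "क",
    "v": "ख",
    "u": "ग",
}


def transliterate(text: str, table: dict[str, str] = PREETI_TO_UNICODE) -> str:
    """Replace each mapped glyph; pass unmapped characters through unchanged."""
    return "".join(table.get(ch, ch) for ch in text)
```

Under this toy table, `transliterate("suv")` yields `"कगख"`; any character outside the table survives unchanged, which is one source of the residual errors the paper reports.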

Image‑Embedded PDF: Parsers failed entirely. PyTesseract extracted text with 97.7 % accuracy in 1.75 s, whereas EasyOCR took 16.6 s for a similar accuracy (97.4 %).

The authors visualized these findings in tables and a comparative graph, highlighting the clear trade‑off: parsers are extremely fast but their performance degrades when faced with non‑Unicode fonts or image‑based content; OCR provides consistent accuracy across all PDF types but incurs additional processing time.
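The character-level accuracy metric behind these comparisons can be sketched as below. Note this is our own naive positional formulation; the summary does not specify whether the authors used positional matching or an edit-distance-based alignment:

```python
def char_accuracy(extracted: str, ground_truth: str) -> float:
    """Percentage of ground-truth characters reproduced, compared position by position.

    A simple sketch: positional comparison, not edit-distance alignment,
    so insertions or deletions early in the text penalize everything after them.
    """
    if not ground_truth:
        return 0.0
    matches = sum(a == b for a, b in zip(extracted, ground_truth))
    return 100.0 * matches / len(ground_truth)
```

For example, a five-character ground truth with one wrong character scores 80 % under this definition.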

Discussion
The study emphasizes that for a Nepali‑centric project, the robustness of OCR outweighs the speed advantage of parsers. PyTesseract emerges as the most balanced solution, delivering >99 % accuracy on problematic fonts and images while keeping extraction time within a few seconds—acceptable for batch processing in a TTS pipeline. EasyOCR, while supporting GPU acceleration, proved significantly slower on CPU‑only environments, making it a secondary option. The authors also note that a hybrid strategy (using parsers for clean Unicode PDFs and OCR for the rest) could further optimize overall throughput.

Conclusion and Future Work
The paper concludes that PDF parsing is suitable only for a narrow subset of Nepali PDFs (Unicode, digitally generated). OCR, particularly PyTesseract, offers a universal approach capable of handling non‑Unicode fonts and scanned documents with high fidelity. The modest increase in processing time is justified by the dramatic gain in reliability. The authors release their code and dataset publicly, facilitating reproducibility and encouraging further research. Future directions include training a Nepali‑specific OCR model, exploring feature‑based font recognition, and integrating parser‑OCR pipelines to achieve both speed and accuracy.

Overall, the work provides a valuable benchmark for developers working with low‑resource language PDFs and makes a strong case for adopting PyTesseract as the default extraction engine in Nepali text‑to‑speech and related applications.

