OCRTurk: A Comprehensive OCR Benchmark for Turkish

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

Document parsing is now widely used in applications such as large-scale document digitization, retrieval-augmented generation, and domain-specific pipelines in healthcare and education. Benchmarking these models is crucial for assessing their reliability and practical robustness. Existing benchmarks mostly target high-resource languages and offer limited coverage of low-resource settings such as Turkish. Moreover, existing studies on Turkish document parsing lack a standardized benchmark that reflects real-world scenarios and document diversity. To address this gap, we introduce OCRTurk, a Turkish document parsing benchmark covering multiple layout elements and document categories at three difficulty levels. OCRTurk consists of 180 Turkish documents drawn from academic articles, theses, slide decks, and non-academic articles. We evaluate seven OCR models on OCRTurk using element-wise metrics. Across difficulty levels, PaddleOCR achieves the strongest overall results, leading most element-wise metrics except figures and attaining high Normalized Edit Distance scores on the easy, medium, and hard subsets. We also observe performance variation by document type: models perform well on non-academic documents, while slide decks are the most challenging.


💡 Research Summary

OCRTurk introduces the first comprehensive OCR benchmark specifically designed for Turkish document parsing. The authors collected 180 real‑world PDF pages from four distinct sources—academic articles, non‑academic reports, theses, and educational slide decks—ensuring an equal representation of each category (45 pages each). To capture varying structural complexity, the dataset is further divided into three difficulty levels: easy (text‑only), medium (text plus a single element such as a table, equation, or figure), and hard (text combined with multiple tables, equations, and figures). This results in 60 pages per difficulty tier, providing a balanced testbed for assessing both basic text recognition and more complex layout understanding.

Each page was manually converted into a unified Markdown representation: plain text remains as raw text, tables are encoded in HTML, equations in LaTeX, and figures as PNG references. Two annotators performed character‑level verification against the original PDFs, and repetitive headers/footers were cropped to avoid bias from models that ignore such elements. The final benchmark contains 279 layout items (92 equations, 130 tables, 57 figures) that passed stringent consistency checks.
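Concretely, a ground-truth page in this unified representation might look like the following invented fragment (the Turkish text, table values, and file name are illustrative only, not taken from the dataset):

```markdown
Eğitim alanındaki dijitalleşme çalışmaları son yıllarda hız kazanmıştır.

<table>
  <tr><th>Yıl</th><th>Katılımcı Sayısı</th></tr>
  <tr><td>2022</td><td>120</td></tr>
</table>

$$E = mc^2$$

![Şekil 1: Örnek grafik](figures/sekil1.png)
```

Here plain paragraphs stay as raw text, the table is encoded in HTML, the equation in LaTeX, and the figure as a PNG reference, matching the annotation scheme described above.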

The evaluation framework extracts the four element types from the Markdown files and applies post‑processing (e.g., Turkish‑specific character normalization, removal of Markdown headings). For text, two metrics are used: Normalized Edit Distance (NED) and Turkish Character Sensitivity (TCS), which specifically penalizes errors on Turkish‑specific characters (ç, ğ, ı, ö, ş, ü, and their capital forms). Tables are assessed with Tree Edit Distance based Similarity (TEDS) and a table‑specific NED, capturing both structural and content fidelity. Equations are evaluated using BLEU (n‑gram overlap on LaTeX tokens), Character Detection Matching (CDM), and NED. Figures are currently measured only by presence and correct tagging.
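The two text metrics can be sketched in a few lines of Python. The Levenshtein-based NED below follows the standard definition (1 = perfect match); the TCS variant shown, which scores only the Turkish-specific characters, is an illustrative guess at the paper's formulation rather than its exact implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

TURKISH_CHARS = set("çğıöşüÇĞİÖŞÜ")

def ned(pred: str, gold: str) -> float:
    """Normalized Edit Distance similarity: 1.0 means an exact match."""
    if not pred and not gold:
        return 1.0
    return 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold))

def tcs(pred: str, gold: str) -> float:
    """Hypothetical TCS: NED computed over the subsequences of
    Turkish-specific characters only (the paper's exact formula may differ)."""
    p = "".join(c for c in pred if c in TURKISH_CHARS)
    g = "".join(c for c in gold if c in TURKISH_CHARS)
    return ned(p, g)
```

Filtering to the Turkish-specific subsequence makes TCS insensitive to errors on shared Latin characters, so a model that confuses `ı`/`i` or drops diacritics is penalized even when its overall NED is high.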

Seven state‑of‑the‑art OCR systems—PaddleOCR, EasyOCR, Tesseract, Google Vision OCR, Azure OCR, Amazon Textract, and ABBYY FineReader—were benchmarked under identical conditions. Across all difficulty levels, PaddleOCR achieved the highest overall performance, leading in most element‑wise scores except for figure extraction. It recorded an impressive text NED of 0.92 and a TCS of 0.95, indicating strong handling of Turkish orthography. Table TEDS scores for PaddleOCR exceeded 0.88, demonstrating reliable structural reconstruction. However, equation recognition remains a weak point: all models obtained BLEU scores below 0.45, reflecting the difficulty of accurately generating LaTeX syntax. Performance varied by document type: non‑academic reports yielded the best results due to cleaner layouts, while slides were the most challenging because of dense multi‑column tables, small fonts, and mixed visual elements.

The authors release the full dataset, annotation files, and evaluation scripts on GitHub (https://github.com/metunlp/ocrturk), enabling reproducibility and future extensions. They acknowledge current limitations, such as the modest size of 180 pages and the lack of a quantitative metric for figure quality. Future work includes expanding the corpus with more diverse fonts and backgrounds, developing dedicated figure evaluation metrics, and integrating multimodal large language models to improve end‑to‑end document understanding. OCRTurk fills a critical gap in low‑resource language OCR research, providing a realistic, multi‑element benchmark that can drive the development of more robust Turkish OCR systems and inspire similar efforts for other under‑represented languages.

