📝 Original Info
- Title: Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs
- ArXiv ID: 2512.09874
- Date: 2025-12-10
- Authors: Pius Horn, Janis Keuper (Offenburg University; University of Mannheim)
📝 Abstract
Correctly parsing mathematical formulas from PDFs is critical for training large language models and building scientific knowledge bases from academic literature, yet existing benchmarks either exclude formulas entirely or lack semantically-aware evaluation metrics. We introduce a novel benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. A key methodological contribution is pioneering LLM-as-a-judge for semantic formula assessment, combined with a robust two-stage matching pipeline that handles parser output inconsistencies. Through human validation on 250 formula pairs (750 ratings from 30 evaluators), we demonstrate that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.78) compared to CDM (r=0.34) and text similarity (r≈0). Evaluating 20+ contemporary PDF parsers (including specialized OCR models, vision-language models, and rule-based approaches) across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities. Our findings provide crucial insights for practitioners selecting parsers for downstream applications and establish a robust, scalable methodology that enables reproducible evaluation of PDF formula extraction quality. Code and benchmark data: https://github.com/phorn1/pdf-parse-bench
📄 Full Content
Benchmarking Document Parsers on
Mathematical Formula Extraction from PDFs
Pius Horn¹ (ORCID 0009-0004-1911-1138) and Janis Keuper¹,² (ORCID 0000-0002-1327-1243)
¹ Institute for Machine Learning and Analytics (IMLA), Offenburg University, Offenburg, Germany (pius.horn@hs-offenburg.de)
² University of Mannheim, Mannheim, Germany
Abstract. Correctly parsing mathematical formulas from PDFs is critical for training large language models and building scientific knowledge bases from academic literature, yet existing benchmarks either exclude formulas entirely or lack semantically-aware evaluation metrics. We introduce a novel benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. A key methodological contribution is pioneering LLM-as-a-judge for semantic formula assessment, combined with a robust two-stage matching pipeline that handles parser output inconsistencies. Through human validation on 250 formula pairs (750 ratings from 30 evaluators), we demonstrate that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.78) compared to CDM (r=0.34) and text similarity (r≈0). Evaluating 20+ contemporary PDF parsers (including specialized OCR models, vision-language models, and rule-based approaches) across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities. Our findings provide crucial insights for practitioners selecting parsers for downstream applications and establish a robust, scalable methodology that enables reproducible evaluation of PDF formula extraction quality.
Keywords: PDF Document Parsing · Mathematical Formula Extraction · LLM-based Evaluation · OCR Benchmarking.
1 Introduction
Text extraction from PDFs is critical for training LLMs and building scientific knowledge bases. Major training datasets rely heavily on parsed PDFs: S2ORC contains 8.1 million PDF-parsed papers (using Grobid [20]) versus only 1.5 million with LaTeX sources [19], forming the foundation of scientific components in corpora like Dolma [36]. However, parsing quality significantly impacts downstream performance [35]. Creating a scientific corpus from 625K papers required over 5,000 A100 GPU hours solely for correcting PDF parsing errors [16], without even evaluating formula accuracy. Moreover, parsing limitations have left 80% of papers from major publishers such as ACM absent from widely-used corpora like PILE and S2ORC [35]. Beyond LLM training, robust PDF parsing would enable
better knowledge base construction, RAG systems, and semantic search [44], and could make scientific content accessible to assistive technologies, given that only 3.2% of scholarly PDFs meet accessibility standards [12]. Despite this impact, systematic evaluation of PDF parsers for mathematical formula extraction remains understudied.
The difficulty of PDF parsing stems from fundamental design characteristics of the format. PDFs were designed primarily for visual presentation and printing, not semantic content representation. Most PDF documents are untagged and lack basic high-level logical structural information [6], making content extraction and reuse particularly challenging. Mathematical formulas are harder still due to their extensive symbol sets, their two-dimensional structure in which spatial positioning conveys meaning (superscripts, subscripts, fractions), and the need to convert visual arrangements into structured formats like LaTeX [44]. Consequently, no publicly available PDF parser can consistently convert documents into plain text without errors, especially for mathematical content.
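The conversion into LaTeX is ambiguous in the other direction too: one rendered formula admits many source encodings, which is why plain string similarity is a weak evaluation metric. A standard-library sketch (the variant strings are my own examples, not from the paper):

```python
import difflib

# Three LaTeX encodings that all render identically as "x squared":
variants = ["x^{2}", "x^2", "{x}^{2}"]

# String-level similarity treats them as different, even though a reader
# (or an LLM judge) would call them semantically identical.
for other in variants[1:]:
    ratio = difflib.SequenceMatcher(None, variants[0], other).ratio()
    print(f"{variants[0]!r} vs {other!r}: similarity = {ratio:.2f}")  # < 1.0
```

This mismatch between string distance and semantic equivalence is consistent with the abstract's finding that text similarity correlates near zero with human judgment.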
To address this need, a growing number of approaches exist for converting PDFs to text [44], ranging from classic rule-based parsers to general-purpose vision-language models (VLMs) and specialized document OCR models. These approaches differ in model size (affecting hardware requirements and runtime), accessibility (open versus closed models), and architectural design (rule-based versus vision-based). Comparative studies have shown that parsing quality varies significantly across different document types and content characteristics [1], with formulas and symbolic expressions posing particular challenges. The choice of parser is therefore critical for applications such as LLM training and scientific knowledge extraction.
Despite the critical importance of parser selection for downstream applications, systematic benchmarking of PDF parsers for mathematical content extraction remains lacking. This paper addresses this gap through the following key contributions:
– We introduce a synthetic PDF generation approach with precise ground truth, overcoming limitations of manually annotated or source-derived benchmarks.
– We develop an LLM-based matching pipeline that reliably aligns parsed formulas with ground truth
…(Full text truncated)…
This content is AI-processed based on ArXiv data.