📝 Original Info
- Title: Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs
- ArXiv ID: 2512.09874
- Date: 2025-12-10
- Authors: Pius Horn, Janis Keuper (Offenburg University; University of Mannheim)
📝 Abstract
Correctly parsing mathematical formulas from PDFs is critical for training large language models and building scientific knowledge bases from academic literature, yet existing benchmarks either exclude formulas entirely or lack semantically-aware evaluation metrics. We introduce a novel benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. A key methodological contribution is pioneering LLM-as-a-judge for semantic formula assessment, combined with a robust two-stage matching pipeline that handles parser output inconsistencies. Through human validation on 250 formula pairs (750 ratings from 30 evaluators), we demonstrate that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.78) compared to CDM (r=0.34) and text similarity (r≈0). Evaluating 20+ contemporary PDF parsers (including specialized OCR models, vision-language models, and rule-based approaches) across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities. Our findings provide crucial insights for practitioners selecting parsers for downstream applications and establish a robust, scalable methodology that enables reproducible evaluation of PDF formula extraction quality. Code and benchmark data: https://github.com/phorn1/pdf-parse-bench
📄 Full Content
Benchmarking Document Parsers on
Mathematical Formula Extraction from PDFs
Pius Horn¹ (ORCID 0009-0004-1911-1138) and Janis Keuper¹,² (ORCID 0000-0002-1327-1243)
¹ Institute for Machine Learning and Analytics (IMLA), Offenburg University, Offenburg, Germany (pius.horn@hs-offenburg.de)
² University of Mannheim, Mannheim, Germany
Abstract. Correctly parsing mathematical formulas from PDFs is critical for training large language models and building scientific knowledge bases from academic literature, yet existing benchmarks either exclude formulas entirely or lack semantically-aware evaluation metrics. We introduce a novel benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. A key methodological contribution is pioneering LLM-as-a-judge for semantic formula assessment, combined with a robust two-stage matching pipeline that handles parser output inconsistencies. Through human validation on 250 formula pairs (750 ratings from 30 evaluators), we demonstrate that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.78) compared to CDM (r=0.34) and text similarity (r≈0). Evaluating 20+ contemporary PDF parsers (including specialized OCR models, vision-language models, and rule-based approaches) across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities. Our findings provide crucial insights for practitioners selecting parsers for downstream applications and establish a robust, scalable methodology that enables reproducible evaluation of PDF formula extraction quality.
Keywords: PDF Document Parsing · Mathematical Formula Extraction · LLM-based Evaluation · OCR Benchmarking.
1 Introduction
Text extraction from PDFs is critical for training LLMs and building scientific knowledge bases. Major training datasets rely heavily on parsed PDFs: S2ORC contains 8.1 million PDF-parsed papers (using Grobid [20]) versus only 1.5 million with LaTeX sources [19], forming the foundation of scientific components in corpora like Dolma [36]. However, parsing quality significantly impacts downstream performance [35]. Creating a scientific corpus from 625K papers required over 5,000 A100 GPU hours solely for correcting PDF parsing errors [16], without even evaluating formula accuracy. Moreover, parsing limitations have left 80% of papers from major publishers such as ACM absent from widely-used corpora like PILE and S2ORC [35]. Beyond LLM training, robust PDF parsing would enable
better knowledge base construction, RAG systems, and semantic search [44], and could make scientific content accessible to assistive technologies, given that only 3.2% of scholarly PDFs meet accessibility standards [12]. Despite this impact, systematic evaluation of PDF parsers for mathematical formula extraction remains understudied.
The difficulty of PDF parsing stems from fundamental design characteristics of the format. PDFs were designed primarily for visual presentation and printing, not semantic content representation. Most PDF documents are untagged and lack basic high-level logical structural information [6], making content extraction and reuse particularly challenging. Mathematical formulas are harder still due to their extensive symbol sets, their two-dimensional structure in which spatial positioning conveys meaning (superscripts, subscripts, fractions), and the need to convert visual arrangements into structured formats like LaTeX [44]. Consequently, no publicly available PDF parser can consistently convert documents into plain text without errors, especially for mathematical content.
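The conversion into LaTeX is ambiguous in the other direction too: one rendered formula admits many source encodings, which is why plain string similarity is a weak evaluation metric. A standard-library sketch (the variant strings are my own examples, not from the paper):

```python
import difflib

# Three LaTeX encodings that all render identically as "x squared":
variants = ["x^{2}", "x^2", "{x}^{2}"]

# String-level similarity treats them as different, even though a reader
# (or an LLM judge) would call them semantically identical.
for other in variants[1:]:
    ratio = difflib.SequenceMatcher(None, variants[0], other).ratio()
    print(f"{variants[0]!r} vs {other!r}: similarity = {ratio:.2f}")  # < 1.0
```

This mismatch between string distance and semantic equivalence is consistent with the abstract's finding that text similarity correlates near zero with human judgment.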
To address this need, a growing number of approaches exist for converting PDFs to text [44], ranging from classic rule-based parsers to general-purpose vision-language models (VLMs) and specialized document OCR models. These approaches differ in model size (affecting hardware requirements and runtime), accessibility (open versus closed models), and architectural design (rule-based versus vision-based). Comparative studies have shown that parsing quality varies significantly across different document types and content characteristics [1], with formulas and symbolic expressions posing particular challenges. The choice of parser is therefore critical for applications such as LLM training and scientific knowledge extraction.
Despite the critical importance of parser selection for downstream applications, systematic benchmarking of PDF parsers for mathematical content extraction remains lacking. This paper addresses this gap through the following key contributions:
– We introduce a synthetic PDF generation approach with precise ground truth, overcoming limitations of manually annotated or source-derived benchmarks.
– We develop an LLM-based matching pipeline that reliably aligns parsed formulas with ground truth
…(Full text truncated)…
This content is AI-processed based on ArXiv data.