Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion
This report evaluates PDF-to-Markdown conversion using recent Vision-Language Models (VLMs) on challenging French documents. Document parsing is a critical step for Retrieval-Augmented Generation (RAG) pipelines, where transcription and layout errors propagate to downstream retrieval and grounding. Existing benchmarks often emphasize English or Chinese and can over-penalize benign formatting and linearization choices (e.g., line breaks, list segmentation, alternative table renderings) that are largely irrelevant for downstream use. We introduce a French-focused benchmark of difficult pages selected via model-disagreement sampling from a corpus of 60,000 documents, covering handwritten forms, complex layouts, dense tables, and graphics-rich pages. Evaluation is performed with unit-test-style checks that target concrete failure modes (text presence, reading order, and local table constraints) combined with category-specific normalization designed to discount presentation-only variance. Across 15 models, we observe substantially higher robustness for the strongest proprietary models on handwriting and forms, while several open-weights systems remain competitive on standard printed layouts.
💡 Research Summary
The paper presents a comprehensive benchmark for converting French PDF pages into Markdown using recent Vision‑Language Models (VLMs). Recognizing that PDF‑to‑Markdown conversion is a critical preprocessing step for Retrieval‑Augmented Generation (RAG) pipelines—where transcription and layout errors can cascade downstream—the authors identify a gap in existing evaluation suites, which largely focus on English or Chinese and rely on global string‑based metrics that over‑penalize harmless formatting variations.
To address this, the authors construct a French‑focused dataset of “hard” pages drawn from a corpus of roughly 60,000 documents sourced from CCPDF and Gallica. Difficulty is defined operationally: two baseline VLMs (dots‑ocr and MinerU2.5) are run on every page, and the edit distance between their outputs serves as a proxy for disagreement. Pages with the highest disagreement are selected, ensuring a concentration of operational failure modes such as missing text, wrong reading order, and structural degradation. The final benchmark covers six challenge categories: tiny long text, multi‑column layouts, dense tables, handwritten manuscripts, semi‑structured forms (with manually written entries), and graphics‑rich pages (excluding mathematical formulas).
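The disagreement-sampling step can be sketched as follows. This is a minimal illustration, not the authors' code: the `disagreement` proxy here uses `difflib`'s similarity ratio in place of a true edit distance, and `transcribe_a`/`transcribe_b` stand in for the two baseline VLM wrappers.

```python
from difflib import SequenceMatcher

def disagreement(a: str, b: str) -> float:
    """Proxy for normalized edit distance: 1 - similarity of two transcripts."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def select_hard_pages(pages, transcribe_a, transcribe_b, k):
    """Transcribe every page with two baseline models, rank pages by
    inter-model disagreement, and keep the top-k hardest ones."""
    scored = [(disagreement(transcribe_a(p), transcribe_b(p)), p) for p in pages]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [p for _, p in scored[:k]]
```

Pages where both models agree (disagreement near 0) are discarded, concentrating the benchmark on pages where at least one model likely fails.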
Instead of conventional character error rate or Levenshtein distance, the evaluation adopts a unit‑test‑style framework. Three test families are defined: TextPresenceTest (asserts that a specific span appears somewhere in the Markdown), TextOrderTest (checks that a sequence of spans follows the expected reading order), and TableTest (verifies local table structure constraints). Each test inherits from a BasePDFTest class that performs category‑specific normalization before comparison. Normalization steps include markdown/HTML cleanup, Unicode canonicalization, optional ASCII projection, alphanumeric filtering, layout‑insensitive spacing control, and fine‑grained masking. By tuning these steps per category, the benchmark discounts superficial differences (line breaks, emphasis markers, French guillemets, etc.) while remaining sensitive to genuine content errors.
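The test-family design described above can be sketched in a few lines. The class names (`BasePDFTest`, `TextPresenceTest`, `TextOrderTest`) come from the paper; the specific normalization regexes below are illustrative assumptions, not the benchmark's actual rules.

```python
import re
import unicodedata

class BasePDFTest:
    """Shared normalization applied before any comparison (steps are a sketch)."""
    def normalize(self, text: str) -> str:
        text = unicodedata.normalize("NFKC", text)        # Unicode canonicalization
        text = re.sub(r"[*_#|`>-]", " ", text)            # crude markdown cleanup
        text = re.sub(r"[^0-9A-Za-zÀ-ÿ ]", " ", text)     # alphanumeric filter (keeps accents)
        return re.sub(r"\s+", " ", text).strip().lower()  # layout-insensitive spacing

class TextPresenceTest(BasePDFTest):
    """Asserts that a given span appears somewhere in the Markdown output."""
    def __init__(self, span: str):
        self.span = span
    def run(self, markdown: str) -> bool:
        return self.normalize(self.span) in self.normalize(markdown)

class TextOrderTest(BasePDFTest):
    """Checks that spans occur in the expected reading order."""
    def __init__(self, spans):
        self.spans = spans
    def run(self, markdown: str) -> bool:
        doc, pos = self.normalize(markdown), 0
        for span in map(self.normalize, self.spans):
            idx = doc.find(span, pos)
            if idx < 0:
                return False
            pos = idx + len(span)
        return True
```

Because both sides of each comparison pass through the same normalization, line breaks, emphasis markers, and punctuation choices cannot cause a failure, while a missing or reordered span still does.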
The experimental pipeline is built on the open‑source vlmparse library, which provides a unified wrapper for both proprietary APIs (e.g., Gemini, OpenAI GPT) and open‑source OCR/VLM systems (OlmOCR, LightOnOCR, dots‑ocr, DeepSeek‑OCR, PaddleOCR‑VL, etc.). All models are run on a single NVIDIA A100 GPU (80 GB) with 32 parallel threads and a 500‑second timeout per page, ensuring a fair comparison of both accuracy (unit‑test pass rate) and throughput (seconds per page). The benchmark evaluates 15 models in total.
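A harness with these execution parameters (parallel workers, per-page timeout) might look like the sketch below. The `convert` callable stands in for any vlmparse-style model wrapper; the actual vlmparse API is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def run_benchmark(convert, pages, workers=32, timeout=500):
    """Run a page->markdown converter over all pages in parallel.
    A page that exceeds the timeout is recorded as empty output,
    so its unit tests simply fail rather than blocking the run."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {page: pool.submit(convert, page) for page in pages}
        for page, fut in futures.items():
            try:
                results[page] = fut.result(timeout=timeout)
            except FutureTimeout:
                results[page] = ""
    return results
```

Counting a timed-out page as an empty transcription keeps accuracy and throughput comparable across slow and fast models under the same 500-second budget.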
Results show a clear stratification. Gemini 3 Pro Preview achieves the highest overall pass rate (0.76), closely followed by Gemini 3 Flash Preview (0.74). Among open‑weights models, Chandra attains the best score (0.66). Handwritten text and form fields are the most discriminative categories: Gemini models retain moderate competence (≈0.60–0.72), whereas most open models drop to near‑zero, with granite‑docling and MinerU2.5 scoring virtually nothing on handwritten content. In contrast, multi‑column text and dense tables are handled reasonably well by many models, with several open‑source systems exceeding 0.80 pass rates, indicating that layout analysis for printed material is relatively mature. Throughput varies inversely with model size; lightweight models such as granite‑docling process a page in under 0.9 s, while Chandra requires about 4.3 s. DPI sensitivity experiments reveal a non‑monotonic trend: performance degrades up to 100 DPI due to decoding instability in smaller models, then improves at higher resolutions as legibility increases.
The discussion emphasizes that the remaining challenges lie chiefly in handwriting and form processing, where visual variability and low‑resolution glyphs still confound most VLMs. Additionally, some models exhibit repetitive or looping generation on long, dense pages, reducing overall coverage despite otherwise accurate local OCR. The authors argue that their unit‑test framework, combined with category‑specific normalization, offers a more operationally relevant assessment than global edit‑distance scores, and they release the benchmark code and test suite for community use.
In summary, this work fills a notable gap by providing a French‑language, layout‑rich PDF‑to‑Markdown benchmark that aligns evaluation with downstream RAG requirements. It demonstrates that while proprietary VLMs excel on complex handwritten and form documents, open‑source alternatives can already match them on standard printed layouts, highlighting clear avenues for future research in handwriting robustness, graphic captioning, and automated normalization tuning.