Layout-Aware Text Editing for Efficient Transformation of Academic PDFs to Markdown
Academic documents stored in PDF format can be transformed into plain-text structured markup languages to enhance accessibility and enable scalable digital library workflows. Markup languages allow for easier updates and customization, making academic content more adaptable to diverse uses, such as linguistic corpus compilation. Such documents, typically delivered in PDF format, contain complex elements including mathematical formulas, figures, headers, and tables, as well as densely laid-out text. Existing end-to-end decoder transformer models can transform screenshots of documents into markup language. However, these models exhibit significant inefficiencies: their token-by-token decoding from scratch wastes many inference steps regenerating dense text that could be copied directly from the PDF files. To solve this problem, we introduce EditTrans, a hybrid editing-generation model whose features allow identifying a queue of to-be-edited text from a PDF before starting to generate markup language. EditTrans contains a lightweight classifier fine-tuned from a Document Layout Analysis model on 162,127 pages of documents from arXiv. In our evaluations, EditTrans reduced the transformation latency by up to 44.5% compared to end-to-end decoder transformer models, while maintaining transformation quality. Our code and reproducible dataset production scripts are open-sourced.
💡 Research Summary
The paper tackles the inefficiency inherent in current PDF‑to‑Markdown conversion pipelines, which typically rely on end‑to‑end decoder transformers that generate every token from scratch. While such models (e.g., Nougat, Kosmos‑2.5, OlmOCR) can handle complex visual elements like formulas and tables, they spend a large amount of computation regenerating dense plain text that could be copied directly from the source PDF.
EditTrans is introduced as a hybrid editing‑generation framework that first identifies which parts of a PDF page are copyable and which require regeneration. The authors fine‑tune a lightweight classifier derived from the ERNIE‑Layout document layout analysis model on a corpus of 162,127 arXiv pages (approximately 20 million text spans). The classifier assigns each span one of three labels: KEEP (copy as‑is), DELETE (omit), or INSERT_LEFT (insert a generation trigger before the span). This step leverages spatial features (bounding boxes, font size, position) to predict editability, effectively turning layout information into a per‑span decision to copy, omit, or hand off to generation.
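The effect of the three labels can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the `Span` type, the `build_edit_queue` function, and the `("copy", text)` / `("generate", "")` queue entries are hypothetical names chosen here, and the sketch assumes that an INSERT_LEFT span is itself copied after its generation trigger.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Label(Enum):
    KEEP = "KEEP"                # copy the span as-is
    DELETE = "DELETE"            # omit the span from the output
    INSERT_LEFT = "INSERT_LEFT"  # a generation trigger goes before the span


@dataclass
class Span:
    """Hypothetical container for one labeled text span from a PDF page."""
    text: str
    label: Label


# Hypothetical queue entry: ("copy", text) or ("generate", "").
QueueItem = Tuple[str, str]


def build_edit_queue(spans: List[Span]) -> List[QueueItem]:
    """Turn labeled spans into an ordered list of copy/generate actions."""
    queue: List[QueueItem] = []
    for span in spans:
        if span.label is Label.DELETE:
            continue  # dropped spans never reach the output
        if span.label is Label.INSERT_LEFT:
            # Assumption: the decoder is asked to generate markup here,
            # and the span itself is still copied afterwards.
            queue.append(("generate", ""))
        queue.append(("copy", span.text))
    return queue


spans = [
    Span("3", Label.DELETE),                  # e.g. a page number
    Span("Introduction", Label.INSERT_LEFT),  # markup may be needed before it
    Span("Academic documents are", Label.KEEP),
]
queue = build_edit_queue(spans)
```

In this sketch only the `generate` entries would require decoder steps, while `copy` entries reuse text already present in the PDF, which is the source of the latency savings the paper reports.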
After labeling, the system builds an edit queue. KEEP spans are appended directly to the output sequence, while INSERT_LEFT spans are preceded by a special marker (