dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Document layout parsing serves as a critical gateway for Artificial Intelligence (AI) to access and interpret the world's vast stores of structured knowledge. This process, which encompasses layout detection, text recognition, and relational understanding, is particularly crucial for empowering next-generation Vision-Language Models. Current methods, however, rely on fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage the synergies of joint training. In this paper, we introduce dots.ocr, a single Vision-Language Model that, for the first time, demonstrates the advantages of jointly learning three core tasks within a unified, end-to-end framework. This is made possible by a highly scalable data engine that synthesizes a vast multilingual corpus, empowering the model to deliver robust performance across a wide array of tasks spanning diverse languages, layouts, and domains. The efficacy of our unified paradigm is validated by state-of-the-art performance on the comprehensive OmniDocBench. Furthermore, to catalyze research in global document intelligence, we introduce XDocParse, a challenging new benchmark spanning 126 languages. On this benchmark, dots.ocr achieves state-of-the-art performance, delivering an approximately 10% relative improvement and demonstrating strong multilingual capability.


💡 Research Summary

Document layout parsing is a foundational capability for AI systems that need to ingest and understand the world’s massive stores of structured knowledge. It traditionally comprises three tightly coupled sub‑tasks: layout detection (localizing visual elements such as paragraphs, figures, and tables), content recognition (reading the text or symbols inside each element), and relational understanding (inferring logical connections like reading order). Existing pipelines treat these sub‑tasks as isolated stages, which leads to error propagation and prevents the model from exploiting the natural synergy among them. Moreover, most publicly available datasets and benchmarks are heavily English‑centric, leaving the majority of the world’s languages under‑represented.
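The three sub-tasks can be made concrete with a small sketch of what a unified parse result might look like. The schema below is purely illustrative (it is not dots.ocr's actual output format): each element carries its detected category and box, its recognized content, and its position in the inferred reading order.

```python
# Hypothetical sketch of a unified document-parse record combining all
# three sub-tasks; field names and schema are illustrative only.
from dataclasses import dataclass

@dataclass
class LayoutElement:
    category: str        # layout detection: element type (paragraph, figure, ...)
    bbox: tuple          # layout detection: (x0, y0, x1, y1) in image coordinates
    content: str         # content recognition: text/symbols inside the box
    reading_index: int   # relational understanding: position in reading order

def to_reading_order(elements):
    """Return element contents sorted by the inferred reading order."""
    return [e.content for e in sorted(elements, key=lambda e: e.reading_index)]

page = [
    LayoutElement("paragraph", (40, 90, 560, 200), "Document layout parsing...", 1),
    LayoutElement("title", (40, 30, 560, 70), "dots.ocr", 0),
]
print(to_reading_order(page))  # ['dots.ocr', 'Document layout parsing...']
```

A pipeline approach would compute these three fields in separate stages; the point of the unified formulation is that one model emits all of them jointly.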

The paper introduces dots.ocr, the first Vision-Language Model (VLM) that jointly learns all three sub-tasks in a single end-to-end pass. The authors reformulate multilingual document parsing as an autoregressive generation problem: given an image I, the model outputs a structured sequence S = (s₁, …, s_T) that encodes layout, content, and reading order, generated token by token as p(S | I) = ∏ₜ p(sₜ | I, s₍<t₎).
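A toy sketch of this autoregressive decoding may help. The "model" below is a stub that returns fixed next-token log-probabilities; a real system would call the VLM conditioned on the image, and the token vocabulary here is hypothetical, not dots.ocr's.

```python
# Toy illustration of autoregressive generation of a structured sequence S,
# factorized as log p(S | I) = sum_t log p(s_t | I, s_<t).
# next_token_logprobs is a hypothetical stand-in for the VLM's decoder.

VOCAB = ["<bbox>", "<text>", "<eos>"]

def next_token_logprobs(image, prefix):
    """Stub decoder: emit two tokens, then strongly prefer <eos>."""
    if len(prefix) >= 2:
        return {"<bbox>": -3.0, "<text>": -3.0, "<eos>": -0.1}
    return {"<bbox>": -0.2, "<text>": -0.3, "<eos>": -4.0}

def greedy_decode(image, max_len=16):
    """Greedily generate S and accumulate its total log-probability."""
    seq, logp = [], 0.0
    for _ in range(max_len):
        dist = next_token_logprobs(image, seq)
        tok = max(dist, key=dist.get)   # pick the most likely next token
        logp += dist[tok]               # add log p(s_t | I, s_<t)
        seq.append(tok)
        if tok == "<eos>":
            break
    return seq, logp

seq, logp = greedy_decode(image=None)
print(seq)   # ['<bbox>', '<bbox>', '<eos>']
print(logp)  # -0.5
```

The value of this formulation is that layout boxes, recognized text, and ordering all live in one output sequence, so a single training objective covers all three sub-tasks.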

