MeDocVL: A Visual Language Model for Medical Document Understanding and Parsing
Medical document OCR is challenging: layouts are complex, terminology is domain-specific, annotations are noisy, and downstream applications demand strict field-level exact matching. Existing OCR systems and general-purpose vision-language models often fail to parse such documents reliably. We propose MeDocVL, a post-trained vision-language model for query-driven medical document parsing. Our framework combines Training-driven Label Refinement, which constructs high-quality supervision from noisy annotations, with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning to achieve robust and precise extraction. Experiments on medical invoice benchmarks show that MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.
💡 Research Summary
MeDocVL addresses the pressing challenge of high‑precision medical document understanding, where complex layouts, domain‑specific terminology, and noisy annotations make traditional OCR pipelines and generic vision‑language models (VLMs) unreliable. The authors first identify three core problems: (1) OCR errors and layout mis‑recognition that propagate downstream, (2) the inability of existing multimodal models to guarantee field‑level exact matching, and (3) the prevalence of noisy or incomplete supervision in industrial settings. To overcome these, they propose a three‑component post‑training framework that is backbone‑agnostic and specifically designed for query‑driven document parsing.
The first component, Training‑driven Label Refinement (TLR), starts from a small “Clean Data” set annotated by experts. Using OCR engines and multimodal large language models (MLLMs), the system generates structured pseudo‑labels that contain both correct fields and systematic errors typical of the upstream models. These pseudo‑labels are transformed into instruction‑style prompts that expose the error patterns to an Annotation Refinement Model. The refinement model is trained to map noisy pseudo‑labels to corrected outputs by directly comparing with expert annotations. Importantly, TLR does not aim to eliminate all noise; instead, it distills the characteristic bias of OCR/MLLM pipelines, producing a “Refined Data” distribution that retains realistic variability while suppressing systematic errors. The TLR pipeline consists of (i) pseudo‑label construction and prompt synthesis, (ii) correction distillation training, and (iii) large‑scale refinement and filtering applied to massive industrial corpora.
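The pseudo-label-to-correction pairing at the heart of TLR can be pictured as a simple pair-construction routine. This is a minimal illustration only: the prompt wording, field names, and dictionary format below are hypothetical, not taken from the paper.

```python
def build_refinement_example(image_id, pseudo_label, expert_label):
    """Pair a noisy OCR/MLLM pseudo-label with its expert correction,
    phrased as an instruction so a refinement model can learn the mapping
    from systematic upstream errors to corrected outputs."""
    prompt = (
        f"Document: {image_id}\n"
        f"Noisy extraction: {pseudo_label}\n"
        "Correct every field, preserving values that are already right."
    )
    return {"prompt": prompt, "target": expert_label}

# One clean-data pair yields one correction-distillation example.
# The OCR confusions shown (0/O, O/0) are typical systematic errors.
example = build_refinement_example(
    "invoice_0001",
    {"patient": "J0HN DOE", "total": "12O.50"},  # noisy pseudo-label
    {"patient": "JOHN DOE", "total": "120.50"},  # expert annotation
)
```

The trained refinement model is then applied at scale to the industrial corpus, producing the Refined Data distribution described above.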
The second component, Noise‑aware Hybrid Post‑training (NHP), leverages the Refined Data in a staged learning process. First, reinforcement learning (RL) is applied using a token‑wise variant of GRPO (Group Relative Policy Optimization). This objective assigns fine‑grained rewards to each token based on exact‑match status, optional confidence weighting, and reference regularization, thereby encouraging the model to focus on reliable tokens and mitigating the impact of residual noise. After the RL stage, supervised fine‑tuning (SFT) on the Clean Data consolidates extraction behavior, output formatting, and domain specialization. This hybrid approach balances robustness (through RL on noisy but structured supervision) with stability (through SFT on high‑quality data).
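The staged RL-then-SFT schedule might look like the following toy loop. The function names and the simple two-pass structure are assumptions for illustration; the actual training recipe (optimizers, schedules, batching) is not specified in this summary.

```python
def noise_aware_hybrid_posttrain(model, refined_data, clean_data,
                                 rl_step, sft_step):
    """Hypothetical two-stage schedule: RL with a token-wise reward on the
    refined (still mildly noisy) data first, then SFT on the small clean
    set to consolidate extraction behavior and output formatting."""
    for batch in refined_data:      # Stage 1: noise-tolerant RL
        model = rl_step(model, batch)
    for batch in clean_data:        # Stage 2: supervised consolidation
        model = sft_step(model, batch)
    return model

# Trace the stage ordering with stub step functions.
log = []
def rl_step(model, batch):
    log.append(("rl", batch))
    return model
def sft_step(model, batch):
    log.append(("sft", batch))
    return model

noise_aware_hybrid_posttrain("model", [1, 2], [3], rl_step, sft_step)
```

The ordering matters: running RL first lets the reward signal absorb residual noise before SFT locks in formatting on clean data.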
The third innovation is the token‑wise GRPO objective itself. Unlike conventional sequence‑level RL, which rewards entire output strings, the token‑wise objective evaluates each token individually, aligning the learning signal with the strict field‑level exact‑matching requirements of medical invoices. By integrating token masking, confidence weighting, and a regularization term that penalizes deviation from reference annotations, it provides a nuanced gradient that drives the model toward precise field extraction while remaining tolerant to ambiguous or missing tokens.
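One way to picture a token-wise reward combining exact-match scoring, masking, confidence weighting, and reference regularization is the toy function below. The +1/-1 scoring, the mask convention, and the length-based regularizer are all hypothetical simplifications; the paper's actual reward formulation is not given in this summary.

```python
def tokenwise_reward(pred_tokens, ref_tokens, confidences=None,
                     mask=None, reg_weight=0.1):
    """Toy token-wise reward: +1 for an exact token match, -1 otherwise,
    optionally scaled by a per-token confidence; masked positions (e.g.
    ambiguous or missing tokens) contribute zero; a regularization term
    penalizes length deviation from the reference."""
    n = max(len(pred_tokens), len(ref_tokens))
    rewards = []
    for i in range(n):
        if mask is not None and i < len(mask) and not mask[i]:
            rewards.append(0.0)  # masked token: no learning signal
            continue
        p = pred_tokens[i] if i < len(pred_tokens) else None
        r = ref_tokens[i] if i < len(ref_tokens) else None
        base = 1.0 if p == r else -1.0
        w = confidences[i] if confidences and i < len(confidences) else 1.0
        rewards.append(base * w)
    reg = -reg_weight * abs(len(pred_tokens) - len(ref_tokens))
    return sum(rewards) + reg
```

A fully correct prediction scores +1 per token; masking an unreliable position removes its (possibly misleading) penalty from the gradient.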
Experiments are conducted on publicly released medical invoice benchmarks, as well as a newly curated dataset with refined annotations and evaluation protocols. Controlled noise injection studies demonstrate that TLR reduces annotation noise by roughly 30% and that each component of NHP (RL and SFT) contributes additive gains. Compared with conventional OCR pipelines (e.g., Tesseract‑based systems) and strong VLM baselines such as Monkey‑Series, mPLUG‑DocOwl2, and olmOCR, MeDocVL consistently achieves higher Exact Match and Field‑Level Accuracy scores, improving the state‑of‑the‑art by 4–6 percentage points. Moreover, the model supports natural‑language queries, enabling users to retrieve arbitrary fields without a predefined schema, thereby combining OCR‑level fidelity with the flexibility of large language models.
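The two reported metrics are straightforward to compute. Below is a minimal sketch of Field-Level Accuracy (fraction of individual fields matched exactly) and document-level Exact Match (fraction of documents with every field correct); the field names in the example are hypothetical.

```python
def field_metrics(preds, refs):
    """preds, refs: parallel lists of {field: value} dicts per document.
    Returns (field_level_accuracy, exact_match)."""
    total_fields = correct_fields = exact_docs = 0
    for pred, ref in zip(preds, refs):
        doc_ok = True
        for field, value in ref.items():
            total_fields += 1
            if pred.get(field) == value:  # strict string equality
                correct_fields += 1
            else:
                doc_ok = False
        exact_docs += doc_ok
    return correct_fields / total_fields, exact_docs / len(refs)

# Two toy invoices: the first fully correct, the second with one wrong field.
preds = [{"total": "120.50", "patient": "JOHN"}, {"total": "99"}]
refs = [{"total": "120.50", "patient": "JOHN"}, {"total": "98"}]
field_acc, exact_match = field_metrics(preds, refs)
```

Exact Match is the stricter of the two, since a single wrong field fails the whole document.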
The authors also emphasize the framework’s generality: MeDocVL is instantiated with Qwen2.5‑VL as the backbone, fine‑tuned via parameter‑efficient adapters (e.g., LoRA), but the same TLR‑NHP‑GRPO pipeline can be applied to any multimodal transformer. The released code and the refined benchmark aim to foster reproducible research and facilitate extensions to other high‑stakes domains such as financial statements, legal contracts, and insurance claims.
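A LoRA-style adapter keeps the backbone weight frozen and trains only a low-rank update. The dependency-free toy forward pass below (plain Python lists instead of tensors, with an assumed `alpha/r` scaling as in the original LoRA formulation) illustrates why this is parameter-efficient.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha=1.0):
    """y = W x + (alpha / r) * B (A x).
    W (d_out x d_in) stays frozen; only A (r x d_in) and B (d_out x r)
    are trained, i.e. r * (d_in + d_out) parameters instead of
    d_in * d_out for a full fine-tune."""
    r = len(A)                       # adapter rank
    base = matvec(W, x)              # frozen backbone path
    delta = matvec(B, matvec(A, x))  # low-rank update path
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# Rank-1 adapter on a 2x2 identity backbone; B initialized to zero,
# so the adapted layer starts out identical to the backbone.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B_zero = [[0.0], [0.0]]
y0 = lora_forward(W, A, B_zero, [2.0, 3.0])
```

Initializing `B` to zero (standard LoRA practice) guarantees the adapter is a no-op at the start of training.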
In summary, MeDocVL introduces a novel, data‑centric post‑training paradigm that (1) refines noisy supervision through learnable label correction, (2) blends reinforcement learning with supervised fine‑tuning to handle residual noise, and (3) employs a token‑wise reward mechanism aligned with exact‑match requirements. This combination yields a query‑driven, end‑to‑end medical document parser that matches or exceeds dedicated OCR systems while preserving the adaptability of modern vision‑language models, setting a new benchmark for precision‑critical document understanding.