UNIKIE-BENCH: Benchmarking Large Multimodal Models for Key Information Extraction in Visual Documents
Key Information Extraction (KIE) from real-world documents remains challenging due to substantial variations in layout structures, visual quality, and task-specific information requirements. Recent Large Multimodal Models (LMMs) have shown promising potential for performing end-to-end KIE directly from document images. To enable a comprehensive and systematic evaluation across realistic and diverse application scenarios, we introduce UNIKIE-BENCH, a unified benchmark designed to rigorously evaluate the KIE capabilities of LMMs. UNIKIE-BENCH consists of two complementary tracks: a constrained-category KIE track with scenario-predefined schemas that reflect practical application needs, and an open-category KIE track that extracts any key information explicitly present in the document. Experiments on 15 state-of-the-art LMMs reveal substantial performance degradation under diverse schema definitions, long-tail key fields, and complex layouts, along with pronounced performance disparities across different document types and scenarios. These findings underscore persistent challenges in grounding accuracy and layout-aware reasoning for LMM-based KIE. All code and datasets are available at https://github.com/NEUIR/UNIKIE-BENCH.
💡 Research Summary
The paper introduces UNIKIE‑BENCH, a comprehensive benchmark designed to evaluate the key information extraction (KIE) capabilities of large multimodal models (LMMs) under realistic and diverse document scenarios. Existing KIE benchmarks are limited by narrow domain coverage, heterogeneous schemas, and reliance on OCR‑based pipelines, which hinder consistent end‑to‑end assessment of LMMs. UNIKIE‑BENCH addresses these gaps by offering two complementary tracks: a constrained‑category track with scenario‑specific predefined schemas, and an open‑category track that requires models to extract any explicitly present key information without prior schema knowledge.
In the constrained‑category track, documents are organized across three domains (business transactions, public services, regulated records) and eleven real‑world application scenarios, totaling 4,472 images with an average of 3–9 fields per document. The authors curate data from publicly available datasets, manually assign each document to a scenario, and map or re‑annotate the original labels to fit a unified schema‑guided structured prediction format. This track reflects practical extraction requirements where the schema is known in advance.
The open‑category track comprises 1,661 documents spanning two languages (English, Chinese) and four document types (receipt, form, invoice, contract). Each document is paired with a unique schema derived from its content. To generate realistic samples, the authors employ an LLM‑driven pipeline: they first prompt an LMM with representative examples to produce a textual description of typical content, then ask a large language model to generate executable HTML that encodes the described layout and hierarchy. Rendering the HTML yields the final images, to which lightweight visual noise (blur, distortion) is added for authenticity. Ground‑truth key‑value pairs are obtained via OCR followed by human correction and hierarchical organization.
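The generation pipeline above can be sketched in a few lines. In this sketch the two model calls (the LMM describing typical content, and the LLM emitting HTML) are replaced by fixed templates, so every name and field here is illustrative rather than taken from the authors' code; a real run would render the HTML to an image and apply light blur or distortion before annotation.

```python
# Sketch of the open-category generation pipeline, with model calls
# replaced by fixed stand-ins. Field names are hypothetical.

def describe_content(doc_type: str) -> dict:
    """Stand-in for step 1: an LMM prompted with representative examples
    produces a textual description of typical content. We return a
    fixed sample instead of calling a model."""
    samples = {
        "receipt": {"merchant": "Corner Cafe", "date": "2024-03-01",
                    "total": "12.50"},
    }
    return samples[doc_type]

def generate_html(fields: dict) -> str:
    """Stand-in for step 2: an LLM generates executable HTML encoding
    the described layout and hierarchy. Here a simple table template
    plays that role; rendering it would yield the final image."""
    rows = "".join(
        f"<tr><td>{k}</td><td>{v}</td></tr>" for k, v in fields.items()
    )
    return f"<html><body><table>{rows}</table></body></html>"

fields = describe_content("receipt")
html = generate_html(fields)
```

Ground truth for each synthetic document is then recovered as in the paper: OCR on the rendered image, followed by human correction and hierarchical organization of the key-value pairs.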
UNIKIE‑BENCH adopts a schema‑guided structured prediction formulation, where a model receives both the document image and the schema (fields F and relations R) and produces a single structured output assigning values to all fields in one inference step. This contrasts with the prevalent QA‑style formulation that requires a separate query for each field and fails to capture inter‑field relationships.
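The contrast between the two formulations can be made concrete with a small sketch. The schema layout, prompt wording, and parsing below are assumptions for illustration, not the benchmark's actual interface: one prompt carries the full schema (fields F and relations R) and the model answers with a single structured object, whereas QA-style extraction would issue one query per field.

```python
# Minimal sketch of schema-guided structured prediction.
# Schema contents and prompt text are hypothetical examples.
import json

schema = {
    "fields": ["invoice_number", "issue_date", "total_amount"],
    # Relations let the schema express inter-field constraints that
    # per-field QA prompts cannot capture.
    "relations": [("issue_date", "before", "total_amount")],
}

def build_prompt(schema: dict) -> str:
    """One inference step covers all fields; a QA-style baseline would
    instead run len(schema['fields']) separate queries."""
    return ("Extract the following fields from the document image and "
            "return JSON with exactly these keys: "
            + ", ".join(schema["fields"]))

def parse_output(raw: str, schema: dict) -> dict:
    """Keep only schema fields; anything missing maps to None."""
    predicted = json.loads(raw)
    return {f: predicted.get(f) for f in schema["fields"]}

# Simulated model response, for illustration only:
raw = ('{"invoice_number": "INV-001", "issue_date": "2024-03-01", '
       '"total_amount": "99.00"}')
result = parse_output(raw, schema)
```

The single structured output also makes scoring uniform: every field in F receives exactly one predicted value per document.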
The benchmark is compared against prior datasets (OCRBench, CC‑OCR, etc.) across four dimensions: support for end‑to‑end evaluation, presence of multiple tracks, taxonomy richness, and dataset size. UNIKIE‑BENCH uniquely provides end‑to‑end evaluation, both tracks, a hierarchical taxonomy (3 domains, 11 scenarios), and the largest combined size (6,133 documents).
Experiments evaluate fifteen state‑of‑the‑art LMMs, including MiniCPM‑V, Qwen2.5‑VL, and Kosmos‑2.5. Key findings:
- Schema complexity – Performance drops sharply as the number of fields increases; documents with eight or more fields see an average F1 decline of ~12 percentage points.
- Long‑tail fields – Rare fields such as “contact” or “email” achieve below‑30% accuracy, indicating poor grounding for infrequent entities.
- Layout complexity & visual noise – Multi‑column, table‑rich, or image‑embedded layouts reduce overall F1 by ~15 points; added blur or distortion further amplifies errors.
- Domain/Scenario disparity – Simple receipts and invoices maintain >80% F1, whereas contracts and complex forms fall to ~55% F1, highlighting uneven generalization across document types.
- Schema grounding – Models often confuse field names with surrounding text, revealing a gap between visual‑text perception and structured semantic mapping.
- Open‑category performance – Without predefined schemas, most models achieve only 30–40% F1; English documents with dense schemas suffer the most, dropping below 20% in some cases.
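The F1 figures above can be reproduced in spirit with an exact-match micro-F1 over predicted (field, value) pairs, a common KIE metric. This is only a sketch under a strict-matching assumption; the benchmark's official scorer may normalize values (whitespace, casing, number formats) before comparison.

```python
# Exact-match micro-F1 over (field, value) pairs. Any string mismatch
# counts as an error; real scorers often normalize values first.
def kie_f1(pred: dict, gold: dict) -> float:
    tp = sum(1 for k, v in pred.items() if gold.get(k) == v)
    fp = len(pred) - tp                      # predicted pairs not in gold
    fn = sum(1 for k in gold if pred.get(k) != gold[k])  # gold pairs missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = {"merchant": "Corner Cafe", "total": "12.50", "date": "2024-03-01"}
pred = {"merchant": "Corner Cafe", "total": "12.50", "date": "2024-03-02"}
score = kie_f1(pred, gold)  # 2 of 3 fields exactly correct
```

With two of three fields correct, precision and recall are both 2/3, giving an F1 of 2/3.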
The authors conclude that while LMMs excel at raw text recognition, they still lack robust mechanisms for schema‑aware reasoning and layout‑sensitive extraction. They advocate for dedicated pre‑training objectives that jointly optimize visual encoders and schema‑conditioned decoders, data augmentation targeting long‑tail fields and complex layouts, and multi‑task learning to improve structural generalization.
Overall, UNIKIE‑BENCH provides a unified, realistic, and scalable platform for rigorously benchmarking LMM‑based KIE systems, exposing current limitations and guiding future research toward more reliable document understanding.