KH-FUNSD: A Hierarchical and Fine-Grained Layout Analysis Dataset for Low-Resource Khmer Business Document

Reading time: 5 minute
...

📝 Original Info

  • Title: KH-FUNSD: A Hierarchical and Fine-Grained Layout Analysis Dataset for Low-Resource Khmer Business Document
  • ArXiv ID: 2512.11849
  • Date: 2025-12-04
  • Authors: Nimol Thuon, Jun Du

📝 Abstract

Automated document layout analysis remains a major challenge for low-resource, non-Latin scripts. Khmer is a language spoken daily by over 17 million people in Cambodia, receiving little attention in the development of document AI tools. The lack of dedicated resources is particularly acute for business documents, which are critical for both public administration and private enterprise. To address this gap, we present \textbf{KH-FUNSD}, the first publicly available, hierarchically annotated dataset for Khmer form document understanding, including receipts, invoices, and quotations. Our annotation framework features a three-level design: (1) region detection that divides each document into core zones such as header, form field, and footer; (2) FUNSD-style annotation that distinguishes questions, answers, headers, and other key entities, together with their relationships; and (3) fine-grained classification that assigns specific semantic roles, such as field labels, values, headers, footers, and symbols. This multi-level approach supports both comprehensive layout analysis and precise information extraction. We benchmark several leading models, providing the first set of baseline results for Khmer business documents, and discuss the distinct challenges posed by non-Latin, low-resource scripts. The KH-FUNSD dataset and documentation will be available at URL.

💡 Deep Analysis

Figure 1

📄 Full Content

2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) KH-FUNSD: A Hierarchical and Fine-Grained Layout Analysis Dataset for Low-Resource Khmer Business Document Nimol Thuon∗and Jun Du∗† ∗National Engineering Research Center of Speech and Language Information Processing (NERC-SLIP) University of Science and Technology of China, Hefei, Anhui, 230027, China E-mail: tnimol@mail.ustc.edu.cn, jundu@ustc.edu.cn Abstract—Automated document layout analysis remains a ma- jor challenge for low-resource, non-Latin scripts. Khmer is a language spoken daily by over 17 million people in Cambodia, receiving little attention in the development of document AI tools. The lack of dedicated resources is particularly acute for business documents, which are critical for both public administration and private enterprise. To address this gap, we present KH- FUNSD, the first publicly available, hierarchically annotated dataset for Khmer form document understanding, including receipts, invoices, and quotations. Our annotation framework features a three-level design: (1) region detection that divides each document into core zones such as header, form field, and footer; (2) FUNSD-style annotation that distinguishes questions, answers, headers, and other key entities, together with their relationships; and (3) fine-grained classification that assigns specific semantic roles, such as field labels, values, headers, footers, and symbols. This multi-level approach supports both comprehensive layout analysis and precise information extraction. We benchmark several leading models, providing the first set of baseline results for Khmer business documents, and discuss the distinct challenges posed by non-Latin, low-resource scripts. The KH-FUNSD dataset and documentation will be available at https://github.com/back-kh/KH-FUNSD. I. INTRODUCTION The extraction of structured information from form-like doc- uments is fundamental to digital transformation across sectors [1], [2], [3]. Automated document layout analysis, understand- ing both the spatial and semantic structure of documents such as receipts and invoices, enables efficient data entry, large-scale digitization, business analytics, and regulatory compliance. Recent advances in layout-aware deep learning models, such as LayoutLM and its successors, have significantly improved performance in this field [4], [5], [6]. However, these models and their associated benchmarks have been developed primar- ily for high-resource, Latin-script languages, leaving non-Latin and low-resource scripts underrepresented. Khmer, the official and most widely spoken language of Cambodia, is used daily by over 17 million people [7], [8], [9]. As Cambodia undergoes rapid digitalization, the demand for Khmer-language AI tools and resources has become increas- ingly urgent. Business documents such as receipts, invoices, †Corresponding author and quotations play a critical role in commerce, public ad- ministration, taxation, and archival systems, yet remain largely underserved by existing document analysis technologies [10]. Most current tools and datasets are tailored to English or other high-resource languages, making it challenging to di- rectly apply state-of-the-art models to Khmer. The script itself presents unique challenges, including complex ligatures, stacked characters, and the lack of whitespace between words, all of which significantly hinder OCR and layout analysis [11], [12]. Furthermore, the absence of publicly available annotated datasets for Khmer business documents further limits both research progress and practical deployment [13], [14]. To address these challenges, we introduce KH-FUNSD, the first publicly available, hierarchically annotated dataset and evaluation benchmark for Khmer business documents. Our annotation scheme adopts a three-level framework, consisting of region-level detection, FUNSD-style entity linking, and fine-grained semantic labeling, which together capture both structural and semantic aspects of business documents. The main contributions of this work are as follows: • We introduce KH-FUNSD, the first dataset and bench- mark for Khmer Form Document Understanding, with comprehensive region annotations, FUNSD-style Q&A, and hierarchical annotations for layout analysis. • We provide baseline evaluations using state-of-the-art models (YOLO, DETR, LayoutLM) for region detection and semantic role prediction. • We make all dataset, guidelines publicly available to support further research and development in this field. This multi-level annotation approach enables robust struc- tural layout modeling and precise semantic information extrac- tion. By providing the first standardized evaluation for Khmer receipts and invoices, our work aims to bridge the gap in low-resource document AI and promote inclusive research for underrepresented languages across Southeast Asia. II. RELATED WORK A. Layout Analysis and Document Understanding Doc

📸 Image Gallery

Figure1.jpg Figure2.jpg

Reference

This content is AI-processed based on open access ArXiv data.

Start searching

Enter keywords to search articles

↑↓
ESC
⌘K Shortcut