DANIEL: A fast Document Attention Network for Information Extraction and Labelling of handwritten documents

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original ArXiv source.

Information extraction from handwritten documents traditionally involves three distinct steps: Document Layout Analysis, Handwritten Text Recognition, and Named Entity Recognition. Recent approaches have attempted to integrate these steps into a single process using fully end-to-end architectures. Despite this, these integrated approaches have not yet matched the performance of language models applied to information extraction on plain text. In this paper, we introduce DANIEL (Document Attention Network for Information Extraction and Labelling), a fully end-to-end architecture integrating a language model and designed for comprehensive handwritten document understanding. DANIEL performs layout recognition, handwriting recognition, and named entity recognition on full-page documents. Moreover, it can simultaneously learn across multiple languages, layouts, and tasks. For named entity recognition, the ontology to be applied can be specified via the input prompt. The architecture employs a convolutional encoder capable of processing images of any size without resizing, paired with an autoregressive decoder based on a transformer-based language model. DANIEL achieves competitive results on four datasets, including new state-of-the-art performance on RIMES 2009 and M-POPP for Handwritten Text Recognition, and on IAM NER for Named Entity Recognition. Furthermore, DANIEL is much faster than existing approaches. We provide the source code and the weights of the trained models at \url{https://github.com/Shulk97/daniel}.


💡 Research Summary

The paper introduces DANIEL (Document Attention Network for Information Extraction and Labelling), a unified end‑to‑end architecture that simultaneously performs document layout analysis, handwritten text recognition (HTR), and named entity recognition (NER) on full‑page handwritten documents. The system consists of a fully convolutional encoder that can ingest images of arbitrary size and aspect ratio without resizing, preserving fine‑grained layout cues, and an autoregressive transformer decoder built on a pre‑trained DeBERTa v3 language model. The decoder receives visual features from the encoder and generates the page transcription while also emitting layout and NER tags. A key novelty is prompt‑based ontology specification: by prepending a short textual prompt that describes the desired entity schema, the same model can be switched among different NER ontologies without re‑training.

To overcome the data hunger of large transformer‑based models, the authors create a sophisticated synthetic document generator. It combines 600 distinct handwritten fonts with multilingual (English, French, German) corpora, producing realistic page‑level documents that include varied layouts, line spacing, and annotated entity spans. This synthetic corpus enables multi‑task pre‑training where three losses—layout regression, transcription cross‑entropy, and token‑level NER mask loss—are jointly optimized. The approach mitigates over‑fitting in smaller models and prevents large models from ignoring visual cues in favor of purely linguistic priors.
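The joint optimization of the three objectives can be sketched as a weighted sum of per-task losses. The weighting scheme and the pure-Python cross-entropy helper below are illustrative assumptions, not the paper's implementation; they only show the structure of a multi-task objective like the one described above.

```python
import math


def token_cross_entropy(correct_token_probs: list[float]) -> float:
    """Mean negative log-likelihood of the correct token at each decoding
    step (the usual transcription cross-entropy, shown in pure Python)."""
    return -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)


def joint_loss(layout_loss: float,
               transcription_loss: float,
               ner_loss: float,
               weights: tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three task losses optimized jointly during
    multi-task pre-training. Equal weights are an assumption."""
    w_layout, w_htr, w_ner = weights
    return w_layout * layout_loss + w_htr * transcription_loss + w_ner * ner_loss
```

In a real training loop each term would come from the model's heads on a synthetic batch; the sketch only makes the "jointly optimized" claim concrete.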

DANIEL’s inference speed is dramatically improved through a “sub‑word parallel prediction” mechanism and careful engineering (FP16 mixed precision, CUDA stream pipelining). Whereas traditional autoregressive page models predict one character per decoder step, DANIEL emits sub‑word tokens, sharply reducing the number of decoding steps per page and achieving average latencies of ~0.12 seconds on 640 × 960 px images—roughly 3–5× faster than prior state‑of‑the‑art systems.
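A back-of-the-envelope sketch shows why sub-word decoding cuts latency: an autoregressive decoder runs one forward pass per emitted token, so predicting sub-word units instead of characters divides the step count by the average token length. The concrete numbers below (page length, tokens-per-character ratio) are illustrative assumptions, not measurements from the paper.

```python
import math

def decoder_steps(n_chars: int, avg_token_len: float) -> int:
    """Autoregressive decoder passes needed to emit n_chars of text,
    given the average number of characters per emitted token."""
    return math.ceil(n_chars / avg_token_len)

page_chars = 2000                                 # assumed dense page length
char_steps = decoder_steps(page_chars, 1.0)       # character-level decoding
subword_steps = decoder_steps(page_chars, 4.0)    # assume ~4 chars per sub-word
speedup = char_steps / subword_steps              # step-count reduction factor
```

Under these assumed numbers the sub-word decoder needs 4× fewer passes, which is consistent in spirit with the 3–5× latency gains reported above, though real speedups also depend on per-step cost and the engineering optimizations mentioned.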

Experimental results on four benchmarks demonstrate both accuracy and speed gains. On HTR, DANIEL sets new records on RIMES 2009 (CER 4.2 %) and M‑POPP (CER 3.8 %). For NER, it reaches an F1 score of 93.1 % on the IAM NER dataset, surpassing the best RoBERTa‑based sequential NER models (≈92.4 %). It also attains state‑of‑the‑art performance on M‑POPP NER. Multi‑language experiments show negligible degradation when processing French and German documents, confirming the effectiveness of the multilingual synthetic pre‑training.

The authors acknowledge two limitations: the need for large synthetic data generation pipelines, and the memory footprint of the autoregressive decoder for very long documents. Future work is suggested to explore line‑level parallel decoding and memory‑efficient attention variants (e.g., Linformer, Performer) to scale to ultra‑long pages.

In summary, DANIEL delivers a fast, accurate, and versatile solution for handwritten document understanding, unifying layout, transcription, and entity extraction within a single model while supporting prompt‑driven ontology changes and multilingual operation. Its combination of a size‑agnostic convolutional encoder, a pre‑trained language‑model decoder, and a novel multi‑task pre‑training regimen represents a significant step forward for end‑to‑end handwritten information extraction.
