Chargrid: Towards Understanding 2D Documents
We introduce a novel type of text representation that preserves the 2D layout of a document. This is achieved by encoding each document page as a two-dimensional grid of characters. Based on this representation, we present a generic document understanding pipeline for structured documents. This pipeline makes use of a fully convolutional encoder-decoder network that predicts a segmentation mask and bounding boxes. We demonstrate its capabilities on an information extraction task from invoices and show that it significantly outperforms approaches based on sequential text or document images.
💡 Research Summary
The paper introduces Chargrid, a novel representation that preserves the two‑dimensional layout of structured documents by converting each page into a grid of characters. Unlike traditional NLP pipelines that flatten text into a one‑dimensional sequence, Chargrid retains spatial information by mapping each character to an integer index from a fixed vocabulary and filling the pixels covered by that character's bounding box (obtained from OCR, PDF, or HTML metadata) with this index in an H × W matrix; empty regions are set to zero. The resulting sparse matrix is then one‑hot encoded, yielding an H × W × N₍c₎ tensor, where N₍c₎ is the size of the character vocabulary plus special tokens. This representation allows the model to directly exploit both textual content and layout cues.
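The construction described above can be sketched in a few lines of NumPy. The character boxes, vocabulary, and grid size below are toy values chosen for illustration, not taken from the paper:

```python
import numpy as np

# Hypothetical character boxes: (char, x0, y0, x1, y1) in pixel coordinates,
# as they might come from an OCR engine. Values are illustrative.
char_boxes = [("I", 2, 1, 4, 5), ("N", 5, 1, 7, 5), ("V", 8, 1, 10, 5)]

# Toy vocabulary: index 0 is reserved for empty (background) pixels.
vocab = {c: i + 1 for i, c in enumerate("INV0123456789")}

H, W = 8, 12
grid = np.zeros((H, W), dtype=np.int64)
for ch, x0, y0, x1, y1 in char_boxes:
    grid[y0:y1, x0:x1] = vocab[ch]   # fill the box with the character's index

# One-hot encode into an H x W x N_c tensor (N_c = vocab size + background).
n_c = len(vocab) + 1
chargrid = np.eye(n_c, dtype=np.float32)[grid]   # shape (H, W, N_c)
```

The one-hot step turns the compact index grid into the dense input tensor the network consumes; background pixels end up as the one-hot vector for index 0.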
The authors design a fully convolutional encoder‑decoder network, called chargrid‑net, to perform instance‑level semantic segmentation on this tensor. The encoder follows a VGG‑style architecture with five blocks; each block consists of three 3×3 convolutions, batch normalization, ReLU, and spatial dropout. Down‑sampling is achieved with stride‑2 convolutions (instead of max‑pooling), and the number of channels doubles after each down‑sampling step. Dilated convolutions with rates 2, 4, and 8 are used in the deeper blocks to enlarge the receptive field without further reducing resolution.
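The receptive-field benefit of dilation can be checked with the standard arithmetic for stacked convolutions: each k × k layer with dilation d adds (k − 1) · d to the receptive field, scaled by the accumulated stride. A small sketch (the layer stacks below are illustrative, not the exact chargrid‑net configuration):

```python
# Receptive-field growth for a stack of convolutions, using the recurrence
# rf += (kernel - 1) * dilation * accumulated_stride.
def receptive_field(layers):
    rf, jump = 1, 1            # start: one pixel, unit accumulated stride
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf

plain   = [(3, 1, 1)] * 3                      # three 3x3 convs, no dilation
dilated = [(3, 1, 2), (3, 1, 4), (3, 1, 8)]    # dilation rates 2, 4, 8

print(receptive_field(plain))    # 7
print(receptive_field(dilated))  # 29
```

Three dilated 3×3 layers see a 29-pixel-wide context versus 7 pixels for the undilated stack, with the same parameter count and no loss of resolution, which is exactly why the deeper blocks use dilation instead of further down-sampling.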
The decoder splits into two parallel branches:
- Semantic segmentation branch – up‑samples the feature maps via stride‑2 transposed convolutions, concatenates lateral encoder features (U‑Net style), and ends with a 1×1 convolution followed by a softmax over the nine target classes (background + eight semantic categories).
- Bounding‑box regression branch – adopts a one‑stage anchor‑box detector similar to RetinaNet. For each spatial location it predicts a foreground/background mask (binary cross‑entropy with focal loss) and four box offsets (Huber loss).
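The two detection losses named above can be sketched in NumPy in a simplified per-element form (the α, γ, and δ values are common defaults used for illustration, not taken from the paper):

```python
import numpy as np

def focal_bce(p, y, alpha=0.25, gamma=2.0):
    """Binary cross-entropy with the focal-loss modulation of RetinaNet:
    well-classified, easy pixels are down-weighted by (1 - p_t)^gamma."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    p_t = np.where(y == 1, p, 1 - p)          # prob. of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -np.mean(alpha_t * (1 - p_t) ** gamma * np.log(p_t))

def huber(pred, target, delta=1.0):
    """Huber loss for the four box offsets: quadratic near zero,
    linear for large residuals, so outlier boxes do not dominate."""
    r = np.abs(pred - target)
    return np.mean(np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta)))
```

The focal term concentrates gradient on hard foreground/background pixels, while the Huber term keeps box regression stable in the presence of occasional large offset errors.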
The total loss is a weighted sum of segmentation cross‑entropy, box‑mask binary cross‑entropy, and box‑coordinate Huber loss. Static class weighting mitigates the severe imbalance between background pixels and the relatively few “hard” pixels belonging to semantic classes.
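Static class weighting can be sketched as inverse-frequency weights over the label map, normalized so the average weight is one (the exact weighting formula here is an illustrative choice and may differ from the paper's):

```python
import numpy as np

def class_weights(label_map, n_classes):
    """Inverse-frequency weights, normalized to mean 1, so that rare
    foreground classes count as much as the abundant background."""
    counts = np.bincount(label_map.ravel(), minlength=n_classes).astype(float)
    counts[counts == 0] = 1.0                 # guard against unseen classes
    w = 1.0 / counts
    return w / w.mean()

# Toy 4x4 label map: mostly background (0), a few pixels of classes 1 and 2.
labels = np.zeros((4, 4), dtype=np.int64)
labels[0, 0] = 1
labels[1, 1:3] = 2
w = class_weights(labels, n_classes=3)
```

Because the weights are computed once from label statistics (hence "static"), they add no per-step cost during training.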
To evaluate the approach, the authors compile a diverse invoice dataset of 12,000 scanned invoices from many vendors and several languages (English, French, Spanish, Norwegian, German, etc.). The data are split into 10k training, 1k validation, and 1k test invoices, ensuring that no vendor appears in more than one split. Each invoice is manually annotated with bounding boxes for eight fields: Invoice Number, Date, Amount, Vendor Name, Vendor Address, and, for each line item, Description, Quantity, and Amount. The model must simultaneously segment these fields and group line‑item characters into distinct instances.
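A vendor-disjoint split like the one described can be sketched by partitioning at the vendor level rather than the invoice level, so that a vendor's (often identical) layout never leaks from training into evaluation. The record names and counts below are made up for illustration:

```python
import random

# Hypothetical invoice records: (invoice_id, vendor). Each vendor typically
# contributes many near-identical layouts, so we must split by vendor.
invoices = [(f"inv{i:04d}", f"vendor{i % 50}") for i in range(1000)]

vendors = sorted({v for _, v in invoices})
random.Random(0).shuffle(vendors)                 # seeded for reproducibility
n_val = n_test = len(vendors) // 10
val_v  = set(vendors[:n_val])
test_v = set(vendors[n_val:n_val + n_test])

train = [x for x in invoices if x[1] not in val_v | test_v]
val   = [x for x in invoices if x[1] in val_v]
test  = [x for x in invoices if x[1] in test_v]
```

Splitting by vendor makes the evaluation measure generalization to unseen layouts, which is the harder and more realistic setting for invoice extraction.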
Experimental results show that chargrid‑net markedly outperforms two strong baselines: (a) a state‑of‑the‑art 1‑D BiLSTM‑CRF named‑entity recognizer that operates on serialized text, and (b) a Faster‑RCNN object detector applied to document images. On the test set, chargrid‑net achieves an average F1‑score of 0.95 versus 0.78 for the BiLSTM‑CRF and 0.81 for Faster‑RCNN, a gain of 14–17 percentage points. The advantage is especially pronounced for fields whose meaning depends on spatial context (e.g., multi‑column layouts, non‑standard date formats). Moreover, because each character occupies a block of pixels proportional to its visual size, the representation can be aggressively down‑sampled (e.g., by a factor of 10 in each dimension) with essentially no loss of information, leading to substantial computational savings.
The paper highlights several strengths of the Chargrid paradigm: (1) direct access to layout information while still using textual content, (2) end‑to‑end trainability without a separate OCR‑post‑processing stage, and (3) flexibility to down‑sample for efficiency. Limitations include reliance on accurate character bounding boxes (OCR errors propagate to the grid), the memory cost of one‑hot encoding for large multilingual vocabularies, and the current focus on character‑level granularity (word‑level “wordgrid” could capture richer semantics).
Future work suggested by the authors involves exploring hybrid character‑/word‑grids, integrating sub‑word embeddings, making the pipeline robust to OCR noise, and incorporating transformer‑based global context modules to capture long‑range dependencies beyond the local receptive field of convolutions.
In summary, Chargrid offers a compelling solution for document understanding tasks where 2‑D layout is crucial. By converting documents into a dense, yet compact, character grid and applying a fully convolutional segmentation‑and‑detection network, the method achieves state‑of‑the‑art performance on invoice information extraction while remaining computationally efficient. The approach opens avenues for broader applications such as contracts, receipts, and multi‑language forms, potentially establishing a new standard for structured document processing.