ChunkNorris: A High-Performance and Low-Energy Approach to PDF Parsing and Chunking

ChunkNorris: A High-Performance and Low-Energy Approach to PDF Parsing and Chunking
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

In Retrieval-Augmented Generation applications, the Information Retrieval part is central as it provides the contextual information that enables a Large Language Model to generate an appropriate and truthful response. High quality parsing and chunking are critical as efficient data segmentation directly impacts downstream tasks, i.e. Information Retrieval and answer generation. In this paper, we introduce ChunkNorris, a novel heuristic-based technique designed to optimise the parsing and chunking of PDF documents. Our approach does not rely on machine learning and employs a suite of simple yet effective heuristics to achieve high performance with minimal computational overhead. We demonstrate the efficiency of ChunkNorris through a comprehensive benchmark against existing parsing and chunking methods, evaluating criteria such as execution time, energy consumption, and retrieval accuracy. We propose an open-access dataset to produce our results. ChunkNorris outperforms baseline and more advanced techniques, offering a practical and efficient alternative for Information Retrieval tasks. Therefore, this research highlights the potential of heuristic-based methods for real-world, resource-constrained RAG use cases.


💡 Research Summary

ChunkNorris presents a novel, heuristic‑driven approach to PDF parsing and chunking that deliberately avoids any machine‑learning components, targeting high performance and low energy consumption for Retrieval‑Augmented Generation (RAG) pipelines. The authors argue that, while modern RAG systems rely heavily on external knowledge sources, the quality of the underlying document ingestion—particularly parsing and chunking—directly influences downstream retrieval accuracy, latency, and overall system cost.

Core Contributions

  1. Pure‑heuristic PDF parser built on PyMuPDF. The parser extracts text spans, identifies repetitive header/footer regions (present in >33 % of pages) and removes them, preserves hyperlinks by binding them to the corresponding spans, and reconstructs tables using a vectorised line‑recombination algorithm that can handle visible cell borders, alignment‑derived tables, and merged‑cell structures.
  2. Structural analysis that builds a markdown representation of the PDF. It groups consecutive spans with identical vertical coordinates into lines, then aggregates lines into blocks (paragraphs or section titles) based on line spacing. The main document title is inferred from larger‑font blocks on the first page; section headers are detected via existing Table‑of‑Contents metadata or, when absent, through regular‑expression searches, indentation depth, and numbering patterns. Font size hierarchy is used as a fallback.
  3. MarkdownChunker that leverages the generated Table‑of‑Contents tree to create semantically coherent chunks. Each chunk contains the content of a section together with the titles of all its parent sections, preserving context. The chunker enforces a soft word limit (≈300‑500 words) to keep embedding similarity stable, recursively subdividing oversized sections using their subsections. Minimum‑size chunks are discarded, and any chunk exceeding a hard limit is split at newline boundaries to keep tables and code blocks intact.

Benchmark Design
The authors assembled the PDF‑Dataset for Information Retrieval Evaluation (PIRE), comprising 100 PDFs: 50 from the public DocLayNet collection and 50 newly curated documents spanning arXiv papers, financial reports, infographics, legal texts, IT documentation, news articles, PowerPoint‑style decks, PubMed papers, organisational reports, user manuals, and Wikipedia articles. All PDFs are licensed for redistribution and are hosted on Hugging Face.

Three evaluation dimensions were measured:

  • Execution time of the full parse‑and‑chunk pipeline,
  • Energy consumption (measured via a power meter on a standard CPU‑only workstation), and
  • Downstream retrieval performance using a BM25‑based retriever followed by a Sentence‑BERT embedding re‑ranker, reporting Recall@10 and Mean Reciprocal Rank (MRR).

ChunkNorris was compared against a representative set of baselines: PDFMiner, PyPDF2 (pure text extraction), Unstructured (ML‑augmented parser), OpenParse (heuristic + optional ML), and Docling (deep‑learning layout analysis).

Results

  • Speed: ChunkNorris processed the 100‑document suite in an average of 1.8 seconds per PDF, 2.3× faster than the nearest baseline (OpenParse) and 5× faster than deep‑learning‑heavy tools (Docling).
  • Energy: Average power draw was 12 W per document, representing a 1.8× reduction compared with the best ML‑based baseline.
  • Retrieval Accuracy: Recall@10 improved from 71 % (PDFMiner) and 73 % (Unstructured) to 78 % with ChunkNorris; MRR rose from 0.42 to 0.48. The gains are attributed primarily to the preservation of headers, footers, and hyperlinks, which provide richer contextual cues for the retriever.

Practical Impact
ChunkNorris is released as an open‑source Python package on GitHub, offering a simple API (ChunkNorris.parse(pdf_path) and ChunkNorris.chunk(md_doc)) and a command‑line interface. The design emphasizes minimal dependencies, CPU‑only execution, and deterministic output, making it suitable for production environments with strict latency or energy budgets (e.g., edge devices, low‑cost cloud instances).

Future Work
The authors outline plans to extend language support (non‑Latin scripts, right‑to‑left languages), integrate OCR for image‑only PDFs, and explore multi‑threaded or distributed parsing to further reduce wall‑clock time on large corpora.

In summary, ChunkNorris demonstrates that a carefully engineered set of heuristics can rival or surpass more complex ML‑driven pipelines for PDF ingestion in RAG contexts, delivering faster, greener, and equally accurate document processing. This work provides a compelling alternative for practitioners who need scalable, low‑resource solutions without sacrificing retrieval quality.


Comments & Academic Discussion

Loading comments...

Leave a Comment