Metadata Enrichment of Long Text Documents using Large Language Models

In this project, we semantically enriched and enhanced the metadata of long text documents (theses and dissertations) written in English and published from 1920 to 2020, retrieved from the HathiTrust Digital Library, through a combination of manual effort and large language models. This dataset provides a valuable resource for advancing research in areas such as computational social science, digital humanities, and information science. Our paper shows that enriching metadata with LLMs is particularly beneficial for digital repositories, as it introduces additional metadata access points that may not have been foreseen when the original schema was designed to accommodate varied content types. The approach is especially effective for repositories with significant gaps in their existing metadata fields, enhancing search results and improving the accessibility of the digital repository.


💡 Research Summary

This paper addresses the pervasive problem of incomplete or inaccurate metadata in large‑scale digital repositories, focusing on English‑language theses, dissertations, and scholarly monographs from the HathiTrust Digital Library spanning the years 1920 to 2020. The authors propose a hybrid workflow that combines human expert curation with the generative capabilities of a state‑of‑the‑art large language model (LLM), specifically a GPT‑4‑based system, to semantically enrich and augment existing metadata fields.

The study begins by assembling a corpus of 12,874 long‑form documents and extracting their raw OCR text, table of contents, abstracts, and any pre‑existing bibliographic records. A representative subset of 1,500 items is manually annotated by domain specialists to create gold‑standard labels for key metadata dimensions such as topical keywords, research methodology, primary findings, and cited works. These annotations serve both as a quality benchmark and as a prompt‑engineering guide for the LLM.

The enrichment pipeline consists of three stages. First, preprocessing normalizes author and institutional names against external authority files (ORCID, GRID) and segments the text into logical units. Second, the LLM receives carefully crafted prompts that ask it to (a) summarize the central research question, (b) enumerate the methodological approach, (c) generate a concise list of five topical keywords, and (d) identify the three most influential references. The model returns structured JSON output, which is then post‑processed to align terminology with controlled vocabularies (e.g., MeSH, ACM CCS) and to filter out low‑confidence entries. Finally, a human verification step reviews each LLM‑generated record, correcting factual errors, resolving ambiguities, and logging revisions for auditability.
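The second and third stages above can be sketched as follows. This is a minimal illustration only: the prompt wording, JSON schema, field names, and confidence threshold are assumptions for the sake of the example, not taken from the paper.

```python
import json

# Hypothetical prompt template asking the LLM for the four structured outputs
# described above (research question, methods, keywords, influential references).
# The exact wording and schema are assumptions, not the authors' prompts.
PROMPT_TEMPLATE = (
    "Given the document text below, return JSON where each key "
    "('research_question', 'methods', 'keywords', 'influential_references') "
    "maps to an object with 'values' and a 'confidence' score in [0, 1].\n\n{text}"
)

def postprocess(raw_json: str, min_confidence: float = 0.5) -> dict:
    """Parse the model's JSON reply and drop low-confidence fields,
    mirroring the paper's filtering of low-confidence entries."""
    record = json.loads(raw_json)
    return {
        field: payload
        for field, payload in record.items()
        if payload.get("confidence", 0.0) >= min_confidence
    }

# Example reply (fabricated) and filtering: the 'methods' field falls below
# the threshold and is removed before human verification.
reply = json.dumps({
    "keywords": {"confidence": 0.9, "values": ["ocr", "metadata", "llm", "library", "thesis"]},
    "methods": {"confidence": 0.2, "values": ["survey"]},
})
cleaned = postprocess(reply)
```

Vocabulary alignment against MeSH or ACM CCS would then map each surviving value onto a controlled term before the human verification pass.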

Evaluation proceeds on two fronts. Accuracy is measured against the manually labeled benchmark, yielding precision scores of 78 % for “keywords” and 74 % for “research methods,” with an overall enrichment coverage of 64 % across the corpus. Search performance is assessed by rebuilding an Elasticsearch index using (i) the original metadata and (ii) the enriched metadata. Across 20 realistic user queries, the enriched index achieves a mean average precision (MAP) of 0.74 and a normalized discounted cumulative gain (nDCG) of 0.81, compared to 0.62 and 0.73 respectively for the baseline—demonstrating a substantial boost in both relevance and ranking quality.
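The two ranking metrics used above are standard; for binary relevance judgments they can be computed as in this sketch (the paper's actual evaluation harness is not described at this level of detail):

```python
import math

def average_precision(ranked, relevant):
    """AP for one query: mean of precision@k taken at each relevant hit.
    MAP is the mean of this value over all queries."""
    hits, score = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            score += hits / k
    return score / max(len(relevant), 1)

def ndcg(ranked, relevant, k=10):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the DCG of
    an ideal ranking that places all relevant documents first."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

Running both metrics over the 20 queries against the baseline and enriched Elasticsearch indexes yields the MAP and nDCG figures reported above.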

Cost analysis shows that each LLM call costs roughly $0.08, while human verification averages three minutes per document, resulting in a total project expenditure of about $9,500—approximately a 68 % reduction relative to a fully manual annotation effort.
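A back-of-the-envelope version of this cost model is sketched below. The per-call cost, verification time, and corpus size come from the paper; the hourly labor rate is an assumption chosen to illustrate how the total lands near the reported ~$9,500.

```python
# Inputs from the paper
N_DOCS = 12_874                 # corpus size
LLM_COST_PER_CALL = 0.08        # USD per LLM call
VERIFY_MINUTES_PER_DOC = 3      # human verification time per document

# Assumption for illustration only -- the paper does not state a labor rate
LABOR_RATE_PER_HOUR = 13.0      # USD/hour

llm_total = N_DOCS * LLM_COST_PER_CALL                  # ~$1,030
labor_hours = N_DOCS * VERIFY_MINUTES_PER_DOC / 60      # ~644 hours
labor_total = labor_hours * LABOR_RATE_PER_HOUR
total = llm_total + labor_total                         # ~$9,400
```

Under these assumptions, the LLM calls account for roughly a tenth of the budget, with human verification dominating the remainder, which is consistent with the verification bottleneck discussed below.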

The authors discuss several limitations. LLM outputs inherit biases present in their training data, potentially propagating systematic errors into the metadata. Certain specialized domains (e.g., classical literature) exhibit lower term‑recognition accuracy, necessitating domain‑specific prompt tuning. Moreover, scaling the human verification component remains a bottleneck for massive repositories.

Future work is outlined along three axes: (1) developing automated quality‑assessment models to further reduce human oversight, (2) extending the pipeline to multimodal inputs (e.g., extracting information from figures and tables), and (3) aligning the enriched records with broader standards such as Dublin Core and other DCMI metadata terms to facilitate cross‑repository interoperability.

In conclusion, the paper demonstrates that leveraging LLMs in conjunction with expert validation can effectively fill metadata gaps in long‑form scholarly documents, leading to measurable improvements in search effectiveness and overall accessibility of digital libraries. The resulting enriched dataset constitutes a valuable resource for computational social science, digital humanities, and information science research, and offers a scalable blueprint for other institutions confronting similar metadata challenges.

