Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability
Large language models (LLMs) use data to learn about the world in order to produce meaningful correlations and predictions. As such, the nature, scale, quality, and diversity of the datasets used to train these models, or to support their work at inference time, have a direct impact on their quality. The rapid development and adoption of LLMs of varying quality has brought into focus the scarcity of publicly available, high-quality training data and revealed an urgent need to ground the stewardship of these datasets in sustainable practices with clear provenance chains. To that end, this technical report introduces Institutional Books 1.0, a large collection of public domain books originally digitized through Harvard Library’s participation in the Google Books project, beginning in 2006. Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively-documented dataset of historic texts. This analysis covers the entirety of Harvard Library’s collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages for a total of approximately 250 billion tokens. As part of this initial release, the OCR-extracted text (original and post-processed) as well as the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242B tokens, identified as being in the public domain have been made available. This report describes this project’s goals and methods as well as the results of the analyses we performed, all in service of making this historical collection more accessible and easier for humans and machines alike to filter, read and use.
💡 Research Summary
The technical report introduces Institutional Books 1.0, a newly released, publicly available corpus derived from Harvard Library's participation in the Google Books digitization effort that began in 2006. The original scan collection comprised 1,075,899 volumes spanning more than 250 languages and an estimated 250 billion tokens. After a rigorous copyright‑status assessment, 983,004 volumes, approximately 242 billion tokens, were identified as being in the public domain and are now released as a fully documented dataset.
The authors describe a three‑stage processing pipeline. First, bibliographic metadata from Harvard's MARC records, the Google Books API, and internal catalogues was merged, de‑duplicated, and normalized, yielding fields such as ISBN, publication year, author, language, and page count. Second, raw OCR output generated by Google's Tesseract‑based engine was subjected to extensive quality improvement. A custom normalization workflow, built on pre‑trained multilingual language models, performed sentence‑boundary reconstruction, removal of headers and footers, correction of non‑standard characters, and language‑specific spelling fixes. Both the original OCR text and the cleaned version are provided, allowing researchers to choose the level of preprocessing that best fits their tasks. Third, a public‑domain verification step cross‑referenced each volume's publication date with country‑specific copyright statutes (e.g., 70‑year term in the United States, Europe, and South Korea). Automated filtering flagged uncertain cases, which were then manually reviewed by librarians and legal experts to ensure high confidence in the final selection.
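The third stage, date-based public-domain triage with a manual-review escape hatch, can be sketched roughly as follows. This is a minimal illustration, not the report's actual implementation: the `Volume` fields, the 95-year term, and the 5-year "borderline" margin are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional

CURRENT_YEAR = 2025

@dataclass
class Volume:
    title: str
    pub_year: Optional[int]  # None when the record lacks a parseable date
    language: str

def public_domain_status(vol: Volume, term_years: int = 95) -> str:
    """Coarse first-pass triage: 'clear', 'uncertain', or 'excluded'.

    Volumes with no usable date, or with dates near the cutoff, are
    flagged 'uncertain' and routed to human reviewers rather than
    being auto-included in the release.
    """
    if vol.pub_year is None:
        return "uncertain"
    cutoff = CURRENT_YEAR - term_years
    if vol.pub_year <= cutoff - 5:   # comfortably before the cutoff
        return "clear"
    if vol.pub_year <= cutoff:       # borderline: manual review
        return "uncertain"
    return "excluded"
```

The design point is that automation only makes the easy calls; anything ambiguous falls through to the librarian and legal review the summary describes.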
The released package includes: (1) text files for each volume in UTF‑8, containing both raw OCR and post‑processed content; (2) a JSONL metadata file where each line stores bibliographic, source, and generated attributes; and (3) CSV statistics summarizing language distribution, publication year trends, and genre breakdowns. English accounts for roughly 55% of the token count, while French, German, and Spanish together contribute about 27%. Publications from the late 19th to early 20th centuries dominate (≈68% of the corpus), and the collection covers a broad spectrum of subjects: science, philosophy, literature, history, and more.
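Because each JSONL line is one self-describing record, the metadata file can be filtered in a single streaming pass. The sketch below shows the idea against two in-memory sample records; the field names (`barcode`, `language`, `pub_year`) are illustrative assumptions, not the dataset's actual schema.

```python
import json
from io import StringIO

# Hypothetical two-record metadata file, stood in for by a StringIO.
sample = StringIO("\n".join([
    json.dumps({"barcode": "hvd.001", "language": "eng", "pub_year": 1893}),
    json.dumps({"barcode": "hvd.002", "language": "fre", "pub_year": 1910}),
]))

def filter_volumes(lines, language=None, year_range=None):
    """Stream JSONL records, yielding those that match every filter."""
    for line in lines:
        rec = json.loads(line)
        if language and rec.get("language") != language:
            continue
        if year_range and not (year_range[0] <= rec.get("pub_year", -1) <= year_range[1]):
            continue
        yield rec

hits = list(filter_volumes(sample, language="eng", year_range=(1880, 1900)))
```

Streaming line by line keeps memory flat, which matters when the real file describes nearly a million volumes.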
Key strengths highlighted by the authors are the transparency of the entire pipeline, the dual‑format text provision that supports both raw‑data experiments and downstream applications, and the rich, searchable metadata that facilitates fine‑grained filtering by language, era, or subject. Limitations are also acknowledged: OCR accuracy varies dramatically across scripts, with Latin‑based languages achieving >95% character accuracy, whereas Chinese, Arabic, and Hebrew remain below 80% due to historical printing quality and script complexity. The copyright‑clearing process, while thorough, cannot guarantee absolute correctness because of ambiguous legal interpretations and incomplete historical records; consequently, any commercial use, especially for large‑scale language model pre‑training, should be preceded by independent legal review.
Institutional Books 1.0 therefore represents one of the largest, most diverse public‑domain text corpora currently available, offering substantial value for digital humanities research, historical text analysis, and as a high‑quality pre‑training resource for large language models. The authors outline future work that includes integrating state‑of‑the‑art deep‑learning OCR models to further reduce error rates, automating more aspects of copyright determination, and expanding the collection to cover additional languages and genre niches. By emphasizing sustainable data stewardship, clear provenance, and open documentation, the project sets a benchmark for responsible dataset creation in the rapidly evolving AI ecosystem.