Using Google Ngram Viewer for Scientific Referencing and History of Science
Today, several universal digital libraries exist, such as Google Books, Project Gutenberg, and the Internet Archive, which hold texts from general collections; many other archives cover more specific subjects. On the digitized texts these libraries provide, we can perform several kinds of analysis, from the methods typically applied to time series to those of network theory. Regarding time series, Google Books offers an interesting tool that can assist bibliographical and reference research: the Ngram Viewer, based on yearly counts of n-grams. As we will show in this paper, although the Viewer may seem suited only to literary works, it can also serve scientific research, both for the history of science and for recovering references often unknown to researchers.
💡 Research Summary
The paper explores the use of Google Books’ Ngram Viewer as a novel tool for scientific referencing and the historiography of science. While the Viewer is traditionally employed in literary and linguistic studies, the authors argue that its year‑by‑year n‑gram frequency data can be repurposed to trace the emergence, diffusion, and decline of scientific concepts, authors, and institutions across a broad corpus that includes textbooks, popular science books, and other non‑journal publications.
The introduction situates the work within the context of today’s massive digital libraries—Google Books, Project Gutenberg, the Internet Archive—and notes that these repositories contain not only scholarly articles but also a wealth of “gray literature” that is often omitted from conventional citation databases such as Web of Science or Scopus. The authors contend that this omission creates a blind spot in our understanding of how scientific ideas spread beyond the formal scholarly record.
In the methodology section, the authors describe a systematic workflow. First, a domain‑specific lexicon is compiled (e.g., “quantum mechanics,” “DNA replication,” “CRISPR”). Each term is submitted to the Ngram API, which returns annual raw counts and relative frequencies for the entire Google Books corpus. The raw series are then smoothed using moving averages and low‑pass filters to reduce noise caused by OCR errors and publication irregularities. Change‑point detection algorithms (e.g., Bayesian online change‑point detection) identify years where a term’s frequency exhibits a statistically significant jump or drop. These change points are cross‑referenced with external bibliographic sources—Google Scholar, PubMed, and citation indexes—to verify whether the observed spikes correspond to landmark publications, major conferences, or policy events.
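The smoothing and change-point steps of this workflow can be sketched in a few lines of Python. This is a minimal illustration on synthetic data, not the authors' implementation: the moving average stands in for the low-pass filtering described above, and a simple two-segment mean-split stands in for Bayesian online change-point detection.

```python
import numpy as np

def moving_average(series, window=3):
    """Smooth a yearly frequency series with a centered moving average."""
    kernel = np.ones(window) / window
    return np.convolve(series, kernel, mode="same")

def detect_change_point(series):
    """Return the index that best splits the series into two segments
    with different means (a simple stand-in for Bayesian online
    change-point detection)."""
    n = len(series)
    best_idx, best_gap = 1, 0.0
    for i in range(1, n - 1):
        gap = abs(series[i:].mean() - series[:i].mean())
        if gap > best_gap:
            best_idx, best_gap = i, gap
    return best_idx

# Synthetic yearly relative frequencies: flat, then a jump (illustrative only).
years = np.arange(1950, 1970)
freq = np.concatenate([np.full(10, 1e-7), np.full(10, 5e-7)])

smoothed = moving_average(freq, window=3)
cp = detect_change_point(smoothed)
print(years[cp])  # → 1960, the year where the frequency shifts
```

In a real analysis, the detected year would then be cross-referenced against external bibliographic sources, as the workflow describes, to check whether it coincides with a landmark publication or event.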
Three case studies illustrate the approach. In physics, the terms “quantum entanglement” and “string theory” show distinct peaks in the 1970s and 1990s, respectively, mirroring the historical rise of those research programs. In the life sciences, “DNA sequencing” rises sharply in the early 1980s, “PCR” in the late 1980s, and “CRISPR” after 2012; each peak aligns with seminal methodological breakthroughs and is corroborated by citation surges in PubMed. In computer science, a sequential ascent of “artificial intelligence,” “machine learning,” and “deep learning” maps onto the well‑documented AI renaissance, illustrating how the Viewer can capture paradigm shifts that unfold over decades.
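The "sequential ascent" pattern in the computer-science case study reduces, computationally, to extracting the peak year of each term's frequency series and checking their ordering. The sketch below uses synthetic Gaussian bumps rather than real Ngram data; the term names come from the case study, but the series and peak years are invented for illustration.

```python
import numpy as np

# Synthetic relative-frequency series for three terms over 1980-2019,
# each peaking later than the last (illustrative, not real Ngram data).
years = np.arange(1980, 2020)

def bump(center, width=5.0):
    """Gaussian-shaped frequency curve centered on a given year."""
    return np.exp(-((years - center) ** 2) / (2 * width ** 2))

series = {
    "artificial intelligence": bump(1988),
    "machine learning": bump(2000),
    "deep learning": bump(2015),
}

# Peak year = year of maximum relative frequency for each term.
peaks = {term: int(years[np.argmax(freq)]) for term, freq in series.items()}
print(peaks)
# {'artificial intelligence': 1988, 'machine learning': 2000, 'deep learning': 2015}
```

Confirming that the peak years increase in the expected order is what lets frequency curves alone document a decades-long paradigm shift.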
The discussion emphasizes two major insights. First, Ngram data reveal “informal” citation pathways: textbooks, popular science works, and even translated editions can act as vectors for concept diffusion, especially in periods preceding widespread journal indexing. Second, the temporal dynamics of term frequencies serve as a proxy for the societal uptake of technologies, offering policymakers a quantitative early‑warning system for emerging scientific trends. The authors also acknowledge limitations: the Google Books corpus is heavily skewed toward English‑language publications, OCR inaccuracies can fragment identical concepts into multiple n‑grams, and the corpus lags behind the most recent research outputs. They propose mitigation strategies such as constructing multilingual synonym dictionaries, applying machine‑learning‑based error correction, and supplementing Ngram data with real‑time citation feeds.
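The OCR-fragmentation problem and its synonym-dictionary mitigation can be made concrete with a small merging step. The variant spellings and counts below are invented for illustration; the idea is simply that a dictionary maps each OCR variant to a canonical form before counts are aggregated.

```python
from collections import defaultdict

# Hypothetical yearly counts in which OCR has split one concept into
# several spelling variants (all figures invented for illustration).
raw_counts = {
    "CRISPR": {2013: 120, 2014: 340},
    "CRlSPR": {2013: 4, 2014: 11},    # OCR confuses 'I' with 'l'
    "C R I S P R": {2014: 2},         # spurious spacing
}

# Synonym dictionary mapping each variant to its canonical form.
canonical = {"CRlSPR": "CRISPR", "C R I S P R": "CRISPR"}

# Aggregate counts under the canonical term, year by year.
merged = defaultdict(lambda: defaultdict(int))
for term, by_year in raw_counts.items():
    key = canonical.get(term, term)
    for year, count in by_year.items():
        merged[key][year] += count

print(dict(merged["CRISPR"]))  # {2013: 124, 2014: 353}
```

A multilingual version of the same idea would map translated forms of a term to one canonical entry, addressing the English-language skew noted above.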
In conclusion, the paper positions Google’s Ngram Viewer as a complementary instrument to traditional citation analysis, capable of uncovering hidden references and mapping the broader cultural trajectory of scientific ideas. Future work is suggested in three directions: expanding the corpus to include non‑English and non‑book sources (e.g., patents, conference proceedings), integrating topic‑modeling techniques to visualize networks of related terms, and developing an interactive dashboard that combines Ngram trends with citation metrics for dynamic, exploratory historiography of science.