Mapping the Web of Science, a large-scale graph and text-based dataset with LLM embeddings

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Large text datasets, such as publications, websites, and other text-based media, carry two distinct types of features: (1) the text itself and the information it conveys through its semantics, and (2) its relationships to other texts through links, references, or shared attributes. While the latter can be described as a graph structure and handled by a range of established algorithms for classification and prediction, the former has recently gained new potential through the use of LLM embedding models. To demonstrate these possibilities and their practicability, we investigate the Web of Science dataset, containing roughly 56 million scientific publications, through the lens of our proposed embedding method, revealing a self-structured landscape of texts.


💡 Research Summary

The paper presents a comprehensive study that integrates large‑scale text embeddings derived from large language models (LLMs) with traditional citation‑graph analysis to map the structure of scientific knowledge contained in the Web of Science (WoS) database. The authors focus on two complementary dimensions of scholarly data: (1) the semantic content of each publication, captured by the abstract, and (2) the relational information encoded in citations, references, and metadata. While citation‑based methods have long been the standard for clustering and classifying scientific literature, recent advances in LLM embeddings promise richer semantic representations that can complement graph‑based approaches.

To explore this synergy, the authors selected a random subsample of 48,076 records from the WoS collection up to 2024 (approximately 0.1 % of the full 56 million‑record corpus). After rigorous preprocessing—removing entries lacking essential fields (publication year, authors, journal) or an abstract of at least 100 characters—the final dataset comprised 41,901 papers. Two open‑source embedding models were employed via the Ollama framework: (i) mxbai‑embed‑large (1024‑dimensional vectors) and (ii) nomic‑embed‑text v1.5 (768‑dimensional vectors). Both models were applied to the abstracts, producing normalized vectors on which cosine similarity served as the distance metric.
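The core comparison step—cosine similarity on normalized abstract embeddings—can be sketched in a few lines. This is a minimal illustration, not the authors' code: the four-dimensional vectors below are hypothetical stand-ins for the 1024-dimensional mxbai-embed-large (or 768-dimensional nomic-embed-text) outputs, which in the actual pipeline would be produced by an Ollama-served model.

```python
import math

def normalize(vec):
    """Scale a vector to unit length, so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length embedding vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Hypothetical low-dimensional stand-ins for real abstract embeddings.
emb_a = [0.2, 0.1, 0.9, 0.3]
emb_b = [0.25, 0.05, 0.85, 0.35]
print(round(cosine_similarity(emb_a, emb_b), 3))
```

Because the model outputs are already normalized, similarity reduces to a dot product; the explicit normalization above just makes the sketch robust to unnormalized input.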

The authors first examined the intrinsic dimensionality of the embedding spaces using principal component analysis (PCA). The variance explained by each component indicated that the information is spread across all dimensions, suggesting that aggressive dimensionality reduction would sacrifice granularity needed for fine‑grained clustering. Next, they sampled 100,000 random pairs of papers to compare embedding distances with graph‑based distances measured as the shortest path length in the citation network. Pearson correlation coefficients of 0.455 (mxbai) and 0.337 (nomic) demonstrated a positive, albeit moderate, alignment between semantic similarity and citation proximity. This result supports the hypothesis that the two modalities capture overlapping but distinct aspects of scholarly relatedness.
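The pair-sampling comparison can be reproduced in outline with a plain Pearson correlation over (embedding distance, path length) pairs. The data below is synthetic—random path lengths with noisy proportional distances—used only to show the computation; the paper's actual values (0.455 and 0.337) come from 100,000 sampled pairs of real papers.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Synthetic stand-ins for sampled paper pairs: citation-graph shortest-path
# lengths and embedding distances that loosely track them.
random.seed(0)
path_lengths = [float(random.randint(1, 6)) for _ in range(1000)]
emb_distances = [0.1 * d + random.gauss(0, 0.15) for d in path_lengths]
print(round(pearson(emb_distances, path_lengths), 3))
```

A moderate positive coefficient here, as in the paper, indicates that semantically distant abstracts tend to sit farther apart in the citation graph, without the two measures being redundant.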

For a higher‑level view of scientific fields, the authors computed a weighted centroid for each of the 255 subject categories defined by WoS. The weight for a paper was the inverse of its number of subject labels, thereby mitigating the influence of highly multi‑disciplinary records. Visualizing these centroids revealed distinct “clouds” corresponding to natural sciences, social sciences, and humanities, each with internal sub‑clusters reflecting finer disciplinary structures. Kernel density estimates for selected large subjects illustrated varying spread patterns, highlighting that some fields are tightly clustered while others are more dispersed, consistent with their interdisciplinary nature.
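The weighting scheme—each paper contributes with weight 1 / (number of its subject labels)—can be sketched as follows. The function name, data layout, and the two-dimensional toy embeddings are illustrative assumptions; only the inverse-label-count weighting itself comes from the paper.

```python
def weighted_centroids(papers):
    """
    Compute one centroid per subject category, weighting each paper's
    embedding by 1 / (number of subject labels it carries), so that
    highly multi-disciplinary records contribute less to each centroid.
    `papers` is a list of (embedding, subject_labels) tuples.
    """
    sums, weights = {}, {}
    for embedding, subjects in papers:
        w = 1.0 / len(subjects)
        for s in subjects:
            acc = sums.setdefault(s, [0.0] * len(embedding))
            for i, x in enumerate(embedding):
                acc[i] += w * x
            weights[s] = weights.get(s, 0.0) + w
    return {s: [x / weights[s] for x in acc] for s, acc in sums.items()}

# Hypothetical 2-d embeddings with WoS-style subject labels.
papers = [
    ([1.0, 0.0], ["Physics"]),
    ([0.0, 1.0], ["Physics", "Mathematics"]),  # multi-label: half weight each
    ([0.2, 0.8], ["Mathematics"]),
]
print(weighted_centroids(papers))
```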

Crucially, the study proposes a hybrid methodology that combines embedding‑based distances with graph‑based distances to form a composite similarity measure. The authors argue that such a measure could improve topic classification of individual papers and overall clustering performance, as the two sources of information are methodologically independent and their outliers are likely uncorrelated. While the current work focuses on demonstrating feasibility and providing exploratory analyses, future directions include (a) optimizing the weighting scheme for the hybrid distance, (b) scaling the approach to the full 56 million‑record dataset, and (c) integrating supervised learning to predict subject probabilities for unlabeled documents.
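One simple way to realize such a composite measure is a convex combination of the two normalized distances. The paper does not fix a concrete formula or weighting, so everything below—the linear mixing, the `alpha` weight, and the path-length cap `max_path`—is an assumed sketch of the idea, not the authors' method.

```python
def hybrid_distance(emb_dist, graph_dist, alpha=0.5, max_path=6):
    """
    Composite dissimilarity mixing a cosine-based embedding distance
    (in [0, 2] for normalized vectors) with a citation-graph shortest-path
    length, capped at `max_path` and scaled to [0, 1]. `alpha` is an
    assumed tunable weight; the paper leaves its optimization to future work.
    """
    emb_norm = emb_dist / 2.0                        # cosine distance -> [0, 1]
    graph_norm = min(graph_dist, max_path) / max_path
    return alpha * emb_norm + (1.0 - alpha) * graph_norm

# A paper pair that is semantically close but three citation hops apart.
print(round(hybrid_distance(0.4, 3), 3))
```

Because the two modalities are methodologically independent, an outlier in one (e.g., a review article citing across many fields) need not be an outlier in the other, which is the argument for combining them.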

In summary, the paper validates that LLM embeddings can be effectively applied to massive scientific corpora, that they correlate meaningfully with citation‑graph structures, and that a combined text‑graph framework holds promise for more accurate and nuanced mapping of the scientific landscape. The work contributes a reproducible pipeline, open‑source model choices, and a set of empirical findings that lay groundwork for next‑generation bibliometric tools.

