Making the complete OpenAIRE citation graph easily accessible through compact data representation
The OpenAIRE graph contains a large citation graph dataset, with over 200 million publications and over 2 billion citations. The current graph is available as a dump with metadata, which totals ~TB uncompressed. This makes it hard to process on conventional computers. To make this network more accessible to the community, we provide a processed OpenAIRE graph that is downscaled to 32 GB while preserving the full graph structure. We also offer the processed data in a very simple format that allows straightforward further manipulation, and we provide a Python pipeline that can be used to process future releases of the OpenAIRE graph.
💡 Research Summary
The paper addresses the problem of accessing the massive OpenAIRE citation graph, which in its raw form contains over 200 million publications and more than 2 billion citation edges, occupying roughly 2.5 TB of uncompressed JSON data. A dump of this size is impractical for most researchers to process, especially those without access to high‑performance computing resources. To lower this barrier, the authors present a processed version of the graph that fits into about 32 GB of storage while preserving the complete citation structure.
The methodology consists of several clearly defined steps. First, only the publication and relation files from the full dump are retained, and within the relations only those of type “Cites” are extracted. Publication records are flattened from nested JSON into tabular form, keeping a minimal set of fields (node identifier, DOI, title, authors, abstract, date, container, citation count, language). Two CSV files are produced: a compact “publications.csv” (≈6 GB) containing just the internal node ID and DOI, and a richer “publications_large.csv” (≈179 GB) that includes all metadata.
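The flattening step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the nested record layout, the field names (`id`, `pids`, `title`, `authors`) and the helpers `flatten_record` and `records_to_csv` are hypothetical and chosen only to show how nested JSON lines can be reduced to one flat CSV row per publication.

```python
import csv
import io
import json

# Hypothetical nested records; the real OpenAIRE dump layout may differ.
RAW_LINES = [
    json.dumps({
        "id": "openaire::abc",
        "pids": [{"scheme": "doi", "value": "10.1000/xyz1"}],
        "title": "First paper",
        "authors": [{"name": "A. Author"}, {"name": "B. Author"}],
    }),
    json.dumps({
        "id": "openaire::def",
        "pids": [],
        "title": "Second paper",
        "authors": [],
    }),
]

FIELDS = ["id", "doi", "title", "authors"]


def flatten_record(record):
    """Turn one nested publication record into a flat dict of strings."""
    # Take the first DOI if one is present, otherwise leave the field empty.
    doi = next(
        (p["value"] for p in record.get("pids", []) if p.get("scheme") == "doi"),
        "",
    )
    # Collapse the author list into a single delimited string.
    authors = "; ".join(a["name"] for a in record.get("authors", []))
    return {
        "id": record["id"],
        "doi": doi,
        "title": record.get("title", ""),
        "authors": authors,
    }


def records_to_csv(lines):
    """Parse JSON lines one at a time and write them as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for line in lines:
        writer.writerow(flatten_record(json.loads(line)))
    return buffer.getvalue()


csv_text = records_to_csv(RAW_LINES)
print(csv_text)
```

Processing one line at a time keeps memory use independent of the input size, which matters when the source is far larger than RAM.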
A crucial transformation is the remapping of the original OpenAIRE identifiers to 32‑bit integer node IDs. The authors build a hash table that provides a bidirectional mapping between the original IDs and the new integer IDs, enabling the citation edges to be stored as simple pairs of int32 values. This reduces the edge list to a single CSV file (“citations.csv”) of 39 GB, where each row is “sourceNodeID,targetNodeID”. The edge extraction and ID substitution are performed with PySpark, allowing distributed processing of the multi‑terabyte source data while keeping memory consumption modest.
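The remapping idea can be sketched in plain Python on a toy example. This is an illustration under stated assumptions, not the authors' PySpark implementation: the identifiers are made up, the helpers `build_id_maps` and `remap_edges` are hypothetical, and an edge that references an unknown publication simply raises a `KeyError`.

```python
from array import array


def build_id_maps(original_ids):
    """Assign consecutive integers to original IDs in first-seen order.

    Returns a dict (original ID -> integer) and a list (integer -> original ID),
    which together give a bidirectional mapping.
    """
    id_to_int = {}
    int_to_id = []
    for oid in original_ids:
        if oid not in id_to_int:
            id_to_int[oid] = len(int_to_id)
            int_to_id.append(oid)
    return id_to_int, int_to_id


def remap_edges(edges, id_to_int):
    """Convert (source, target) string pairs into two integer arrays.

    Type code "i" is a C int, which is 4 bytes on common platforms.
    Unknown IDs raise KeyError (an assumption made for this sketch).
    """
    sources, targets = array("i"), array("i")
    for src, dst in edges:
        sources.append(id_to_int[src])
        targets.append(id_to_int[dst])
    return sources, targets


# Made-up identifiers standing in for the long original string IDs.
publication_ids = ["openaire::aaa", "openaire::bbb", "openaire::ccc"]
citations = [
    ("openaire::aaa", "openaire::bbb"),
    ("openaire::ccc", "openaire::aaa"),
]

id_to_int, int_to_id = build_id_maps(publication_ids)
sources, targets = remap_edges(citations, id_to_int)

# Each edge is now a pair of small integers that can be written as "source,target".
for s, t in zip(sources, targets):
    print(f"{s},{t}")

# The reverse lookup recovers the original identifier.
print(int_to_id[sources[1]])
```

A signed 32‑bit integer can represent values up to 2,147,483,647, so a graph with a few hundred million nodes fits comfortably in that range.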
Quality control scripts verify that no publications or citations are lost during the conversion and that all selected fields are present in the output. Memory usage is measured by loading each CSV into a pandas DataFrame; both the compact publications file and the citations file require only about 16 GB of RAM, making them feasible to handle on a typical workstation.
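A back‑of‑the‑envelope calculation shows why a figure of about 16 GB is plausible for the edge list. This is a rough estimate, not the authors' measurement: it assumes exactly 2 billion edges (the summary says "more than 2 billion"), two 4‑byte integers per edge, and no overhead from the DataFrame index or the loading process. The helper `edge_list_bytes` is hypothetical.

```python
def edge_list_bytes(num_edges, bytes_per_id=4):
    """Raw size of an edge list stored as two fixed-width integer columns."""
    return num_edges * 2 * bytes_per_id


NUM_EDGES = 2_000_000_000  # assumed lower bound on the number of citations

raw_bytes = edge_list_bytes(NUM_EDGES)
raw_gb = raw_bytes / 1e9      # decimal gigabytes
raw_gib = raw_bytes / 2**30   # binary gibibytes

# The same edge list with 64-bit IDs, for comparison.
raw_bytes_int64 = edge_list_bytes(NUM_EDGES, bytes_per_id=8)

print(f"int32 edge list: {raw_gb:.1f} GB ({raw_gib:.1f} GiB)")
print(f"int64 edge list: {raw_bytes_int64 / 1e9:.1f} GB")
```

Under these assumptions the 32‑bit representation needs half the memory of a 64‑bit one, which is why the choice of integer width matters when loading the file.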
The authors make the resulting datasets publicly available via Zenodo and release the full processing pipeline (Python scripts, PySpark jobs, README) on a public Git repository. This ensures reproducibility and enables users to apply the same workflow to future OpenAIRE releases.
In the broader context, the paper compares OpenAIRE to other citation resources such as OpenAlex, Crossref, and OpenCitations, noting that while OpenAIRE offers a richer knowledge graph (including projects, grants, and other entities), its sheer size and limited deduplication have hindered accessibility. By extracting only the citation subgraph and providing a lightweight representation, the authors bridge this gap, facilitating research in bibliometrics, the sociology of science, network science, and graph‑neural‑network training. The dataset can also serve as a basis for dynamic graph models, as the authors suggest, because the citation network evolves over time.
Limitations are acknowledged. The “publications_large.csv” remains sizable (179 GB), which still requires substantial storage for users needing full metadata. The integer ID mapping is a one‑time transformation; users who need to cross‑reference back to the original OpenAIRE IDs must retain the hash table. Moreover, the current release is a static snapshot; incremental updates would require re‑running the pipeline. Future work could explore columnar storage formats (Parquet, ORC) for further compression, implement incremental processing, and possibly integrate additional entity types while maintaining the compact representation.
Overall, the paper delivers a practical, open‑source solution that dramatically reduces the resource requirements for working with the OpenAIRE citation graph, thereby expanding its usability across a wide range of scientific disciplines.